ocbe-uio / rBAPS

R implementation of the BAPS software for Bayesian Analysis of Population Structure
http://www.helsinki.fi/bsg/software/BAPS/
GNU General Public License v3.0

Replace greedyMix() with rhierbaps::load_fasta() #16

Closed wleoncio closed 3 years ago

wleoncio commented 3 years ago

Basic tasks

Consequences

This solution would eliminate the need to fix issues #13 and #15, and would also close #2.

Details

From https://github.com/ocbe-uio/rBAPS/projects/2#card-64016651:

fastbaps and rhierbaps handle haploid data in FASTA format, so they'd need to be expanded to handle diploid data in BAPS format (which is one of the greedyMix functionalities). The loading functions for those packages are thankfully simple enough to pick up and develop on, but I'd need to spend some time studying the methodology behind the different formats (I'm afraid simply copying missing code from BAPS will just move the issues I'm having to a different place).

On the different file formats handled by greedyMix and load_fasta

Since the BAPS data format was created in the early days of modern population genetics, it was born out of convenience and is now an anomaly, like most things from those days. For diploid organisms, I think we should commit only to using common modern data formats, such as VCF (variant call format) and SAM, and perhaps we should keep the GENEPOP format, since that program is still being used for diploid population genetics and it's convenient for older types of markers such as microsatellites. Preprocessed data can also be skipped. There are ready-made tools for handling VCF and SAM files in R, see e.g. here.

The important difference between haploid and diploid data is that the former has one allele (coded by some integer for these analyses) per locus per individual (=sample), whereas the latter has two alleles, so diploid data gives a data matrix with two rows per sample. In BAPS models these rows are assumed to be i.i.d. samples from an underlying model, conditional on any clustering of samples (and the allele frequencies of the clusters).
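To make the row layout described above concrete, here is a minimal, language-neutral sketch in Python. It is not rBAPS, rhierbaps or BAPS code: the function names `to_allele_rows` and `cluster_allele_counts`, the nested-list input layout, and the toy allele codes are all assumptions made for illustration. It shows one row per sample for haploid data, two rows per sample for diploid data, and why treating rows as i.i.d. given a clustering means only the per-cluster allele counts matter.

```python
from collections import Counter


def to_allele_rows(genotypes, ploidy):
    """Flatten per-sample genotypes into one matrix row per allele copy.

    genotypes: one entry per sample; each entry is a list of `ploidy` rows,
    and each row holds one integer allele code per locus (hypothetical layout).
    Returns (rows, row_to_sample), where row_to_sample[i] is the index of the
    sample that row i belongs to.
    """
    rows, row_to_sample = [], []
    for sample_idx, copies in enumerate(genotypes):
        if len(copies) != ploidy:
            raise ValueError(f"sample {sample_idx} has {len(copies)} rows, expected {ploidy}")
        for copy in copies:
            rows.append(list(copy))
            row_to_sample.append(sample_idx)
    return rows, row_to_sample


def cluster_allele_counts(rows, row_to_sample, sample_cluster):
    """Count alleles per (cluster, locus), ignoring which row of a sample they came from."""
    counts = {}
    for row, sample_idx in zip(rows, row_to_sample):
        cluster = sample_cluster[sample_idx]
        for locus, allele in enumerate(row):
            counts.setdefault((cluster, locus), Counter())[allele] += 1
    return counts


# Toy data: 2 samples, 3 loci, integer allele codes (made up for illustration).
haploid = [[[1, 2, 1]], [[1, 3, 2]]]
diploid = [[[1, 2, 1], [2, 2, 1]], [[1, 3, 2], [1, 3, 1]]]

hap_rows, hap_index = to_allele_rows(haploid, ploidy=1)  # one row per sample
dip_rows, dip_index = to_allele_rows(diploid, ploidy=2)  # two rows per sample

# Both diploid samples assigned to the same cluster 0.
dip_counts = cluster_allele_counts(dip_rows, dip_index, sample_cluster=[0, 0])
```

Keeping a separate row-to-sample index is one possible design choice: it lets a clustering be defined over samples while the counts are still accumulated over individual rows.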

Packages to potentially use for handling established data formats

Genepop

From https://github.com/ocbe-uio/rBAPS/projects/2#card-64017129:

VCF, SAM

https://cran.r-project.org/web/packages/vcfR/vignettes/intro_to_vcfR.html

wleoncio commented 3 years ago

Implementing this will close issues #12, #13 and #15, which are related to the superseded function greedyMix().

wleoncio commented 3 years ago

load_fasta() should stay untouched, as it does one thing and it does it well. Other functions will be created to handle VCF, SAM and Genepop files, possibly using the packages suggested above.