lvclark / polyRAD

Genotype Calling with Uncertainty from Sequencing Data in Polyploids 🍌🍓🥔🍠🥝

Export to and import from updog #12

Open lvclark opened 3 years ago

lvclark commented 3 years ago

It would be nice to have some convenience functions to convert between a RADdata object and the input and output of updog. This would allow users to take advantage of the file import and export options in polyRAD, while performing the genotype calling itself in updog (more accurate than polyRAD in some cases but much slower).

If you would like to add this feature and make a pull request, just comment here and I will give any help and guidance that I can. In particular see the multidog and format_multidog functions in updog. See also the checklist for pull requests.

nk183 commented 3 years ago

@lvclark can I work on this issue?

lvclark commented 3 years ago

@nk183 Sure, thanks for doing this!

Going from polyRAD to updog

Going from updog to polyRAD

multidog outputs a list of two items, snpdf and inddf. inddf has most of the information that is needed:

Misc

Documentation of the RADdata class: https://github.com/lvclark/polyRAD/wiki/RADdata

From your profile I'm not sure if you're new to R... If it is a new language to you, be aware that explicit loops tend to be slow because every iteration carries interpreter overhead. Many functions and operations can process an entire vector/matrix/array at once, however, so use those where you can.

If you are new to bioinformatics, what we're trying to accomplish is genotype calling, which in a diploid is basically determining whether an individual is AA, Aa, or aa at a particular site (AKA locus/marker/SNP/gene) in the genome. What we have is a random sample of DNA sequence reads, where the locus has usually been sequenced multiple times. The "read depth" is the number of times we see the sequence for a given allele or a given locus. Using that read depth, along with information about the population of individuals being studied, we can use Bayesian statistics to get the posterior probability that each of the genotypes AA, Aa, and aa is the true genotype. In polyploids it is more complicated; for example, a tetraploid could be AAAA, AAAa, AAaa, Aaaa, or aaaa.
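To make the idea concrete, here is a minimal sketch of that calculation for one individual at one biallelic locus. It is written in Python only for illustration and is not how polyRAD or updog do it: it assumes a plain binomial likelihood, a fixed sequencing error rate, and a flat prior unless one is supplied, whereas the real packages use population-informed priors and richer error models. The function name and parameters are hypothetical.

```python
from math import comb


def genotype_posteriors(ref_reads, total_reads, ploidy, error_rate=0.01, prior=None):
    """Return posterior probabilities for 0..ploidy copies of allele A.

    Hypothetical helper: binomial likelihood with a fixed per-read error
    rate; the prior defaults to flat (an assumption, not a package default).
    """
    if prior is None:
        prior = [1.0 / (ploidy + 1)] * (ploidy + 1)
    weighted = []
    for copies in range(ploidy + 1):
        frac = copies / ploidy
        # Chance that a single read shows allele A, allowing for miscalled reads
        p = frac * (1 - error_rate) + (1 - frac) * error_rate
        likelihood = (
            comb(total_reads, ref_reads)
            * p ** ref_reads
            * (1 - p) ** (total_reads - ref_reads)
        )
        weighted.append(likelihood * prior[copies])
    total = sum(weighted)
    # Normalise so the probabilities over all genotypes sum to 1
    return [w / total for w in weighted]


# Tetraploid with 5 of 20 reads showing allele A
post = genotype_posteriors(ref_reads=5, total_reads=20, ploidy=4)
best_copies = max(range(len(post)), key=lambda k: post[k])
```

With 5 of 20 reads carrying A, most of the posterior mass lands on one copy of A (Aaaa), which matches the intuition that a quarter of the reads should come from a single-copy allele in a tetraploid.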

Updog only supports two alleles per locus. polyRAD supports any number of alleles per locus, but treats them as "pseudo-biallelic", where each allele is treated as a marker and each read either belongs to that allele or does not. Hence multiple markers in updog might correspond to a single marker in polyRAD.
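As a rough illustration of the pseudo-biallelic idea, the sketch below turns the read depths of a multiallelic locus into one "for/against" pair per allele, which is the two-count shape a biallelic genotype caller expects. It is in Python for illustration only; the function name, allele labels, and depths are made up, and this is not a claim about how polyRAD stores its data internally.

```python
def pseudo_biallelic(allele_depths):
    """Map {allele: reads} at one locus to {allele: (reads_for, reads_against)}.

    Hypothetical helper: every allele becomes its own marker, and each read
    at the locus either belongs to that allele or does not.
    """
    locus_depth = sum(allele_depths.values())
    return {
        allele: (depth, locus_depth - depth)
        for allele, depth in allele_depths.items()
    }


# One individual, one locus with three alleles (illustrative numbers)
depths = {"hap1": 12, "hap2": 5, "hap3": 3}
markers = pseudo_biallelic(depths)
```

Here the single three-allele locus yields three two-count markers, so one polyRAD marker maps to several updog markers.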