millanek / Dsuite

Fast calculation of Patterson's D (ABBA-BABA) and the f4-ratio statistics across many populations/species

Mixed haploid diploid datasets #33

Open tbringloe opened 3 years ago

tbringloe commented 3 years ago

Hi, I've been going over the formulas for the various statistics calculated by Dsuite. Evidently they are all tailored towards biallelic sites. I am therefore wondering whether I am violating any assumptions by running these stats on a mixed haploid and diploid species-level dataset (there is no way around it: we are sampling different life-history stages in species that alternate generations). At least two of the species for which we have haploid data appear to be playing a role in hybridization patterns, so I am keen to include them.

In the VCF file, the haploid species are genotyped as haploid. Dsuite gives no warnings and appears to calculate everything appropriately, and the results appear to make biological sense (i.e., elevated D and f4-ratio values reflecting edges in a network that show shared genetic information at odds with ILS). So how is the haploid data treated, particularly in calculations that appear to explicitly demand biallelic sites (such as the f4-ratio)?

I'd really appreciate any insight on this before I read too much into the results.

Trev

millanek commented 3 years ago

Hi Trev

I would have to see a little of your VCF and the SETS.txt file to be sure how this is processed.

I think that all should be fine as long as haploid and diploid individuals are not mixed within any single "Species" or population specified in the SETS.txt file. That is, each species/population should be composed either entirely of haploid individuals or entirely of diploid individuals.
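For anyone wanting to check this condition before running Dsuite, here is a minimal sketch in Python. It is not part of Dsuite. It assumes an uncompressed VCF with `GT` as the first FORMAT subfield, and a SETS.txt with one sample name and one species/population name per line, separated by whitespace. The function names and the sample and species names in the example are hypothetical.

```python
from collections import defaultdict


def gt_ploidy(sample_field):
    """Count alleles in the GT subfield: '0' -> 1, '0/1' or '0|1' -> 2.

    Note: a missing call written as '.' counts as ploidy 1, so a diploid
    sample with '.' instead of './.' would be reported as haploid.
    """
    gt = sample_field.split(":")[0]
    return len(gt.replace("|", "/").split("/"))


def ploidy_by_population(vcf_lines, sets_lines):
    """Map each population in SETS.txt to the set of ploidies seen in the VCF."""
    # Assumed SETS.txt layout: sample name, whitespace, population name.
    pop_of = {}
    for line in sets_lines:
        parts = line.split()
        if len(parts) >= 2:
            pop_of[parts[0]] = parts[1]

    ploidies = defaultdict(set)
    samples = []
    for line in vcf_lines:
        if line.startswith("##"):
            continue  # meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns start after FORMAT
            continue
        for sample, field in zip(samples, fields[9:]):
            pop = pop_of.get(sample)
            if pop is not None:
                ploidies[pop].add(gt_ploidy(field))
    return dict(ploidies)


def mixed_populations(ploidies):
    """Return the populations in which more than one ploidy was observed."""
    return sorted(pop for pop, seen in ploidies.items() if len(seen) > 1)


# Hypothetical toy data: g1 and g2 are haploid, s1 and s2 are diploid.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tg1\tg2\ts1\ts2",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0\t1\t0/1\t1/1",
    "chr1\t200\t.\tC\tT\t.\tPASS\t.\tGT:DP\t1:12\t0:9\t0|0\t0/1",
]
example_sets = ["g1\tSpeciesA", "g2\tSpeciesA", "s1\tSpeciesB", "s2\tSpeciesB"]

print(mixed_populations(ploidy_by_population(example_vcf, example_sets)))
```

With real data, the two arguments could be open file handles for the VCF and SETS.txt. An empty result would mean no species/population mixes haploid and diploid genotypes, which is the condition described above.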

Hope this makes sense

Milan