millanek / Dsuite

Fast calculation of Patterson's D (ABBA-BABA) and the f4-ratio statistics across many populations/species
160 stars 26 forks source link

Dtrios error in calculating f4-ratio when using genotype probabilities, but not with genotype calls #67

Closed zjnolen closed 1 year ago

zjnolen commented 1 year ago

I am trying to use Dtrios (r50) on a dataset with relatively low depth, so am interested in using the -g / --use-genotype-probabilities option for calculating allele frequencies, since we are using likelihood based analyses for the rest of our workflow. Using the same VCF without the -g option works successfully, however, once -g is added, I receive an error:

> ./Build/Dsuite Dtrios -g -o dsuite_test test.8000.vcf SETS.txt
There are 8 sets (populations/species) excluding the Outgroup
Going to calculate D and f4-ratio values for 56 trios
Done permutations
The VCF contains 5242 variants
Going to use block size of 261 variants to get 20 Jackknife blocks
Assertion failed: (p_S3a >= 0), function DminMain, file Dmin.cpp, line 426.
[1]    63164 abort      ./Build/Dsuite Dtrios -g -o dsuite_test test.8000.vcf SETS.txt

Looking at the line mentioned in Dmin.cpp and seeing it was a part of the Fstats section, I added --no-f4-ratio to confirm if it was related and the tool ran successfully:

> ./Build/Dsuite Dtrios -g --no-f4-ratio -o dsuite_test test.8000.vcf SETS.txt
There are 8 sets (populations/species) excluding the Outgroup
Going to calculate D values for 56 trios
Done permutations
The VCF contains 5242 variants
Going to use block size of 261 variants to get 20 Jackknife blocks
Done processing VCF. Preparing output files...

I've also tested this on a VCF for another dataset that was produced through a different variant caller and the same behavior occurs, though it is not always p_S3a in the error, it can also be p_S2a. I've tested a few previous versions and it seems anything from r44 onward causes this error for me.

Thanks for any help on this!

Best, Zach

rtfcoimbra commented 1 year ago

I'm running into the exact same issue...

$ Dsuite Dtrios -c -g -k 100 -o dsuite -n oag-m1 -t oag.m1.r2.rooted.nwk snps.sampled.vcf sample_sets.txt
There are 14 sets (populations/species) excluding the Outgroup
Going to calculate D and f4-ratio values for 364 trios
Done permutations
The VCF contains 462697 variants
Going to use block size of 4625 variants to get 100 Jackknife blocks
Dsuite: Dmin.cpp:426: int DminMain(int, char**): Assertion `p_S3a >= 0' failed.
Aborted (core dumped)

I used bcftools for genotype calling and vcflib to randomly sample sites from the VCF file.

millanek commented 1 year ago

Hi Zach and rtfcoimbra

Maybe this is something I should look into. I would need from you a dataset that I can use to reproduce the error. Would you be willing to share this with me?

All the best Milan

rtfcoimbra commented 1 year ago

Hi @millanek,

Thank you for your quick response.

Sure. Here is a smaller version of the VCF file, the sample sets file, and tree file that I used with the previous command. test_files.tar.gz

Best, Raphael

millanek commented 1 year ago

Hi Zach, Raphael

Yes, there was an issue related to previous improvements in f4-ratio calculations. This worked well for genotypes, but it broke the -g (genotype likelihoods/probabilities) option. This is now fixed in Dsuite v0.5 r52.

Pleasxe test and let me know if it works well for you. Thanks for reporting and so contributing to Dsuite ;).

Milan

rtfcoimbra commented 1 year ago

Hi Milan,

I've just tested the update and it worked with no issues now.

Thank you for the fix!

Best, Raphael

zjnolen commented 1 year ago

Hi Milan,

I've given it a test as well now and it also is working on our VCFs. Thanks so much for looking into and fixing this!

Best, Zach