Closed marqueda closed 5 years ago
Hello David
Thank you for this report. The error occurred because the default jackknife block size for calculating the p-values was 20,000 SNPs and your dataset contained only ~3,000 sites. Therefore the p-values could not be calculated and the program terminated with an error. The solution would be to reduce the block size to something like 100 SNPs or even 50 SNPs (using the -j option).
In response to your report, I implemented exception handling for this, so that, instead of crashing, the program informs the user what the issue is, like below:
WARNING: Not enough blocks to calculate jackknife!!
Could not calculate p-values for the trio: A_calliptera Alticorpus_geoffreyi Copadichromis_cf_trewavasae
You should probably decrease the the jackknife block size (-j option)
...
p-value could not be claculated for 62196 trios
You should definitely decrease the the jackknife block size!!!
Dear Milan,
Thank you very much for your super fast response and for implementing the warning on this issue! Very much appreciated!
Just as additional information for other users: I think you can also run into this issue even though you have a lot of SNPs in your dataset, in case that you have one individual (or one group with individuals) containing a lot of missing data, so that in the end for the jackknife procedure, less than -j sites remain for the particular comparison(s) involving that individual.
Best wishes, David
Yes David, that's correct, the number of sites available for calculating D is different for each trio, due to different amounts of missing data and also other factors - e.g. sites where all individuals in the trio have the same allele are not used. The program will write a warning about any trios for which the p-value can't be calculated due to insufficient number of useable SNPs. The p-value in the output files then appears as 'nan'.
As said, the solution is to use a lower value for the -j option...
Yes indeed, thank you for confirming that. I am looking forward to use your tool for many future datasets! Thanks again.
You're welcome. I will be happy to hear any future feedback, feature requests, etc.
Milan
Dear Milan,
Thank you for publishing this software, which I hope will be very useful for many! I just tried to run it on our cluster after installation with gcc/4.9.2, and with zlib/1.2.8 modules loaded. I run into a potential bug with different datasets, including the settings file and a shortened version of the Malinsky_et_al_2018_LakeMalawiCichlids_scaffold_0.vcf.gz file associated with the bioRxiv preprint (the first 3,012 sites). However, with the full file Malinsky_et_al_2018_LakeMalawiCichlids_scaffold_0.vcf.gz, the program worked like a charm.
I always used the following command(s):
Dsuite Dtrios Malinsky_et_al_2018_LakeMalawiCichlids_scaffold_0.vcf.gz sets.txt
Dsuite Dtrios Malinsky_et_al_2018_LakeMalawiCichlids_scaffold_0_subset3012.vcf.gz sets.txt
The STDOUT/STDERR then looked like this:
Could this be a bug in the code or a problem with accepting gzipped / bgzipped or uncompressed VCF files or with my installation? Thank you in advance for your help with resolving this issue!
Best wishes, David