'std::length_error' bug?

marqueda commented 5 years ago

Dear Milan,

Thank you for publishing this software, which I hope will be very useful for many! I just tried to run it on our cluster after installation with gcc/4.9.2, and with zlib/1.2.8 modules loaded. I run into a potential bug with different datasets, including the settings file and a shortened version of the Malinsky_et_al_2018_LakeMalawiCichlids_scaffold_0.vcf.gz file associated with the bioRxiv preprint (the first 3,012 sites). However, with the full file Malinsky_et_al_2018_LakeMalawiCichlids_scaffold_0.vcf.gz, the program worked like a charm.

I always used the following command(s): Dsuite Dtrios Malinsky_et_al_2018_LakeMalawiCichlids_scaffold_0.vcf.gz sets.txt Dsuite Dtrios Malinsky_et_al_2018_LakeMalawiCichlids_scaffold_0_subset3012.vcf.gz sets.txt

The STDOUT/STDERR then looked like this:

Going to calculate 62196 Dmin values
Done permutations
Processed 10000 variants in 35.78secs
GettingCounts 0 calculation 0secs
Processed 20000 variants in 73.1secs
GettingCounts 0 calculation 0.01secs
Processed 30000 variants in 108.12secs
GettingCounts 0 calculation 0secs
(...)
Done processing VCF. Preparing output files...
terminate called after throwing an instance of 'std::length_error'
  what():  vector::_M_default_append
Aborted

Could this be a bug in the code or a problem with accepting gzipped / bgzipped or uncompressed VCF files or with my installation? Thank you in advance for your help with resolving this issue!

Best wishes, David

millanek commented 5 years ago

Hello David

Thank you for this report. The error occurred because the default jackknife block size for calculating the p-values was 20,000 SNPs and your dataset contained only ~3,000 sites. Therefore the p-values could not be calculated and the program terminated with an error. The solution would be to reduce the block size to something like 100 SNPs or even 50 SNPs (using the -j option).

In response to your report, I implemented exception handling for this, so that, instead of crashing, the program informs the user what the issue is, like below:

WARNING: Not enough blocks to calculate jackknife!!
Could not calculate p-values for the trio: A_calliptera Alticorpus_geoffreyi Copadichromis_cf_trewavasae
You should probably decrease the the jackknife block size (-j option)

...
p-value could not be claculated for 62196 trios
You should definitely decrease the the jackknife block size!!!

marqueda commented 5 years ago

Dear Milan,

Thank you very much for your super fast response and for implementing the warning on this issue! Very much appreciated!

Just as additional information for other users: I think you can also run into this issue even though you have a lot of SNPs in your dataset, in case that you have one individual (or one group with individuals) containing a lot of missing data, so that in the end for the jackknife procedure, less than -j sites remain for the particular comparison(s) involving that individual.

Best wishes, David

millanek commented 5 years ago

Yes David, that's correct, the number of sites available for calculating D is different for each trio, due to different amounts of missing data and also other factors - e.g. sites where all individuals in the trio have the same allele are not used. The program will write a warning about any trios for which the p-value can't be calculated due to insufficient number of useable SNPs. The p-value in the output files then appears as 'nan'.

millanek commented 5 years ago

As said, the solution is to use a lower value for the -j option...

marqueda commented 5 years ago

Yes indeed, thank you for confirming that. I am looking forward to use your tool for many future datasets! Thanks again.

millanek commented 5 years ago

You're welcome. I will be happy to hear any future feedback, feature requests, etc.

Milan

millanek / Dsuite

'std::length_error' bug? #1