luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License

vcf filtering failed - VCF file is too big #177

Closed davidecarlson closed 3 years ago

davidecarlson commented 3 years ago

Describe the bug

The variant calling step appeared to work properly, but the filtering step fails. Here are the last several lines of the debug log:

[2021-05-08 07:23:06] <DEBG> Writing completed task HiC_scaffold_998:0-23282 that finished in 38s
[2021-05-08 07:23:06] <DEBG> Writing 18 calls to output
[2021-05-08 07:23:06] <DEBG> Writing completed task HiC_scaffold_999:0-23242 that finished in 17s
[2021-05-08 07:23:06] <INFO>                     -             100%          13h 57m                 -
[2021-05-08 07:23:06] <DEBG> Merging 6877 temporary VCF files
[2021-05-08 07:27:47] <INFO> Starting Call Set Refinement (CSR) filtering
[2021-05-08 07:27:47] <DEBG> Encountered an error whilst filtering, attempting to cleanup
[2021-05-08 07:27:48] <EROR> A program error has occurred:
[2021-05-08 07:27:48] <EROR>
[2021-05-08 07:27:48] <EROR>     Encountered an exception during calling 'VCF file
[2021-05-08 07:27:48] <EROR>     /gpfs/projects/HollisterGroup/datahome/rad-seq/gbs_combined/octopus-temp-3/1_30.octopus.unfiltered.vcf
[2021-05-08 07:27:48] <EROR>     is too big'. This means there is a bug and your results are
[2021-05-08 07:27:48] <EROR>     untrustworthy.
[2021-05-08 07:27:48] <EROR>
[2021-05-08 07:27:48] <EROR> To help resolve this error run in debug mode and send the log file to
[2021-05-08 07:27:48] <EROR> https://github.com/luntergroup/octopus/issues.
[2021-05-08 07:27:48] <INFO> ------------------------------------------------------------------------

Version

$ octopus --version
octopus version 0.7.4 (ee37c643)
Target: x86_64 Linux 3.10.0-1160.24.1.el7.x86_64
SIMD extension: AVX2
Compiler: GNU 10.2.0
Boost: 1_76

Command

Command line to install octopus:

$ octopus/scripts/install.py --forests

Command line to run octopus:

$ /gpfs/software/octopus/octopus \
--reference $REF \
--keep-temporary-files \
--threads 40 \
--debug \
--disable-denovo-variant-discovery \
--source-candidates-file variants_list_1_30.txt \
--reads-file bamlist_1_30.txt \
--filter-expression 'MQ < 20 | MP < 20 | AF < 0.05 | SB > 0.98 | BQ < 15 | DP < 1' \
--output octopus_out/1_30.octopus.vcf

Additional context

This is a follow-up to #176. I added another 10 samples to the analysis and re-ran the joint calling with the source variants previously discovered in these samples. I'm not sure what specifically the error message is referring to when it says the unfiltered VCF is "too big". The bgzipped unfiltered VCF file is ~100 MB in size. I assume that's too big to attach, but if you want to take a look at it, I'm happy to send it via another route.

Any thoughts?

Thanks for all your help! Dave

davidecarlson commented 3 years ago

I suspect the following block of code from vcf_reader.cpp is throwing the error:

    if (file_type == ".vcf") {
        auto vcf_file_size = fs::file_size(file_path);
        if (vcf_file_size > 1e9) { // 1GB
            throw std::runtime_error {"VCF file " + file_path.string() + " is too big"};
        }

Is there a particular reason why this 1 GB VCF file size limit is in place?

Should I disable variant filtering within Octopus and do it manually with another tool (e.g., vcftools/bcftools)? Thanks! Dave

dancooke commented 3 years ago

This is due to using uncompressed VCF as the output format - if you change the output to octopus_out/1_30.octopus.vcf.gz then you won't get this error.
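For reference, here is a sketch of the re-run with compressed output. It reuses the options from the command in the original report and changes only `--output`. The helper `compressed_vcf_path` and the wrapper `run_joint_calling` are hypothetical names used for illustration and are not part of Octopus. The wrapper assumes `REF` is set and the input list files exist, as in the original run.

```shell
# Hypothetical helper (not part of Octopus): append .gz to a plain .vcf path.
compressed_vcf_path() {
    case "$1" in
        *.vcf) printf '%s.gz\n' "$1" ;;
        *)     printf '%s\n' "$1" ;;   # already compressed, or not a .vcf path
    esac
}

# Wrapper around the command from the report; only --output differs.
# Assumes REF is set and the list files exist, as in the original run.
run_joint_calling() {
    out=$(compressed_vcf_path "octopus_out/1_30.octopus.vcf")
    /gpfs/software/octopus/octopus \
        --reference "$REF" \
        --keep-temporary-files \
        --threads 40 \
        --debug \
        --disable-denovo-variant-discovery \
        --source-candidates-file variants_list_1_30.txt \
        --reads-file bamlist_1_30.txt \
        --filter-expression 'MQ < 20 | MP < 20 | AF < 0.05 | SB > 0.98 | BQ < 15 | DP < 1' \
        --output "$out"
}
```

Calling `run_joint_calling` then writes to octopus_out/1_30.octopus.vcf.gz instead of the uncompressed path.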

The motivation behind this was to avoid situations where users might accidentally provide very large unindexed source VCF files as this requires linear (slow) searches for subregions. I would agree that this is not a very elegant way of achieving this - nor is the error very informative! I'll leave this issue open to remind me to do something better. Moreover, the VCF is streamed for filtering so I should prevent this from triggering here.
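To tell in advance whether an uncompressed VCF would trip this check, the size test from the snippet quoted earlier in the thread can be mirrored in shell. This is only a sketch: `vcf_exceeds_limit` is a hypothetical name, the default threshold of 1000000000 bytes is taken from the `1e9` in that snippet, and matching on the `.vcf` suffix is an assumption about how the file type is determined.

```shell
# Hypothetical pre-flight check (not part of Octopus).
# Succeeds (exit status 0) when the path ends in .vcf and the file is larger
# than the limit in bytes; the default mirrors the 1e9 from the quoted snippet.
vcf_exceeds_limit() {
    file=$1
    limit=${2:-1000000000}
    case "$file" in
        *.vcf) ;;              # only plain .vcf paths are checked
        *) return 1 ;;
    esac
    [ -f "$file" ] || return 1
    # Arithmetic expansion strips any padding that wc may add around the count.
    size=$(( $(wc -c < "$file") ))
    [ "$size" -gt "$limit" ]
}
```

For example, `vcf_exceeds_limit octopus-temp-3/1_30.octopus.unfiltered.vcf && echo "too big"` would flag a temporary file kept with `--keep-temporary-files` if it is over the limit.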

Note to self: consider adding a warning when uncompressed VCF is selected as the output format.

davidecarlson commented 3 years ago

Thanks, Dan! Writing compressed output fixed the issue, like you said. Best, Dave

dancooke commented 3 years ago

Changed exception to warning in case of large uncompressed input (c611a2c9e727d8f77f48124d5f9d07972aa191e5). Also added warning for uncompressed output (94283bec5a562ad19ac809acc63338417ad606fc).