luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License

very low -B, safe? && core dump 'what(): BadRegionCompare:' #237

Open iamh2o opened 2 years ago

iamh2o commented 2 years ago

A question and a Bug Report

Is setting -B very low going to negatively impact var calling?

CHR9 @ 30x

With `-B 50m`:

```
[2022-06-18 13:49:25] Finished calling 24,421,782bp, total runtime 3m 33s
[2022-06-18 13:49:25] Removed 6 temporary files
```

With `-B 1G` ... a lot longer.

Bug Deets

We have been experimenting with parallelizing octopus by running per chromosome, then combining the resulting VCFs. This has worked out quite well, with no problems on the GIAB samples, but as soon as I ran clinical samples, this crash began to occur ~5% of the time.
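The per-chromosome split described above could be wrapped roughly as follows. This is a minimal sketch, not the poster's actual Snakemake pipeline: the file names, output layout, and the reduced flag set are all assumptions, and the resulting per-chromosome VCFs would still need to be concatenated afterwards (e.g. with `bcftools concat`).

```python
import shlex

def octopus_commands(bam, reference, outdir, chromosomes):
    """Build one octopus invocation per chromosome.

    Each command restricts calling to a single contig with -T while
    still being given the complete BAM, mirroring the setup in the
    report. Flags here are illustrative, not the full set used above.
    """
    cmds = []
    for chrom in chromosomes:
        cmds.append([
            "octopus",
            "--reference", reference,
            "--reads", bam,
            "-T", chrom,                 # restrict calling to one contig
            "-B", "50M",                 # small read buffer, as in the report
            "-o", f"{outdir}/{chrom}.vcf.gz",
        ])
    return cmds

cmds = octopus_commands("sample.bam", "human_g1k_v37.fasta", "out", ["1", "2", "3"])
print(shlex.join(cmds[2]))
```

Each command list can be handed to `subprocess.run` (one per worker), and the per-chromosome outputs merged once all jobs finish.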

```
[2022-06-15 20:24:37]       3:38230000            19.5%           15m 8s             1h 2m
[2022-06-15 20:24:54]       3:39229900            20.0%          15m 24s             1h 1m
[2022-06-15 20:25:13]       3:40276483            20.5%          15m 44s             1h 1m
terminate called after throwing an instance of 'octopus::BadRegionCompare'
  what():  BadRegionCompare: comparing regions on different contigs: :40347621-40347771 & 3:40347621-40347771
/bin/bash: line 32: 41877 Aborted                 singularity exec -B /locus:/locus:rw resources/locus/data/external_data/research_experiments/PHYTO__RESOURCES/snakemake_environments/containers/octopus_amd_native resources/octopus/octopus_f --temp-directory-prefix $oct_tmpd -T $oochrm_mod --reference resources/locus/data/external_data/research_experiments/PHYTO__RESOURCES/genomic_data/organism_refrences/H_sapiens/b37Invitae/human_g1k_v37_modified.fasta/human_g1k_v37_modified.fasta --reads RU35772_EX4953_SQ6101_0.bwa2b.sort.bam --forest-model individual.v0.7.0.v2.forest --max-indel-errors 36 --max-open-read-files 9000000 --max-haplotypes 300 --bad-region-tolerance NORMAL $variable_args -X 40G -B 50M --threads 86 --sequence-error-model novaseq.4a38e55.model --annotations AD ADP AF AFB ARF BQ CRF DP FRF GC GQ MC MF MQ MQ0 QD GQD QUAL SB STRL STRP --duplicate-read-detection-policy AGGRESSIVE --min-forest-quality 0 --skip-regions-file b37_all_gaps_octo.txt
[Thu Jun 16 03:25:24 2022]
Error in rule octopus:
```
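The `what()` message suggests one of the two regions lost its contig name (`:40347621-40347771` versus `3:40347621-40347771`, identical coordinates). A hypothetical sketch of the kind of invariant that message implies, written in Python for illustration only (octopus's actual C++ internals are not shown in this thread):

```python
class BadRegionCompare(Exception):
    pass

def parse_region(s):
    """Split 'contig:begin-end' into its parts. An empty contig
    (as in the crash above) parses without error but is meaningless."""
    contig, _, span = s.partition(":")
    begin, _, end = span.partition("-")
    return contig, int(begin), int(end)

def is_before(a, b):
    """Positional comparison only makes sense on the same contig."""
    ca, begin_a, end_a = parse_region(a)
    cb, begin_b, end_b = parse_region(b)
    if ca != cb:
        raise BadRegionCompare(
            f"comparing regions on different contigs: {a} & {b}")
    return end_a <= begin_b

try:
    is_before(":40347621-40347771", "3:40347621-40347771")
except BadRegionCompare as e:
    print(e)  # reproduces the message pattern from the log above
```

Under this reading, the interesting question is not the comparison itself but how a region with an empty contig name was constructed in the first place.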

I am currently handling this by backing off on some of the variant-calling settings to be more permissive on the second or third attempt, which generally solves the issue.
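The retry-with-relaxed-settings workaround could be sketched like this. The specific flag values in the ladder are hypothetical (the report does not say which settings are backed off), so treat them purely as an illustration of the pattern:

```python
# Progressively more permissive flag sets to try after a crash.
# The first entry matches the failing command above; the later
# entries are hypothetical relaxations, not the poster's actual ones.
FALLBACK_FLAGS = [
    {"--max-haplotypes": "300", "--bad-region-tolerance": "NORMAL"},
    {"--max-haplotypes": "200", "--bad-region-tolerance": "HIGH"},
    {"--max-haplotypes": "100", "--bad-region-tolerance": "UNLIMITED"},
]

def flags_for_attempt(attempt):
    """Return the flag set for a 1-based attempt number,
    clamping to the most permissive set once the ladder runs out."""
    idx = min(attempt - 1, len(FALLBACK_FLAGS) - 1)
    return FALLBACK_FLAGS[idx]
```

A Snakemake rule could consume this via its `attempt` variable and `retries`, rebuilding the command line with `flags_for_attempt(attempt)` on each retry.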

I am supplying octopus the complete BAM and subsetting using -L, so reads outside the specified region should be visible when needed.

**Version**

```
$ octopus --version
octopus version 0.7.4 (develop 4b334d59)
Target: x86_64 Linux 5.4.0-96-generic
SIMD extension: AVX2
Compiler: GNU 9.4.0
Boost: 1_71
```

**Command**

I have a recent develop pull compiled specifically for our AMD hardware. I hacked the Python install script (which, by the way, does not set the Boost_Include (sic.) env var, or at least I needed to add it for what I was attempting).

**Additional context**

The reference is b37, and I can point you to all of the files offline if you'd like to run this yourself.

thanks-- jem

dancooke commented 2 years ago

This is very low-coverage WGS, correct? Probably what's happening is that larger values of -B (including the default) result in very large calling windows that prevent parallelisation; in the extreme case the read buffer is large enough to fit the entire chromosome into memory, so no parallelisation occurs (if only calling on that chromosome). Ideally you want the largest possible calling windows that don't hurt CPU utilisation, i.e. without processing fewer calling-window tasks in parallel than the number of provided CPU cores. I've recently started implementing an additional layer of multithreading at the calling-task level, which should improve utilisation even when there are fewer tasks than cores; that situation is common, even under normal workloads, right at the start and end of runs. However, that still won't fully compensate for having fewer calling tasks than cores, so for low-coverage data, lowering -B is currently your best bet. I could also potentially add an option to explicitly cap the size of calling-task windows (currently hard-capped at 25,000,000bp).
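A back-of-envelope illustration of the trade-off: the number of parallelisable calling tasks is roughly chromosome length divided by window size, with the window clamped between the 5,000bp lower bound and the 25,000,000bp hard cap quoted in this thread. The mapping from -B to window size is deliberately simplified here (an assumption, not octopus's actual heuristic):

```python
def n_calling_tasks(chrom_len_bp, window_bp):
    """Clamp the window to the bounds stated above, then count how
    many window tasks the chromosome splits into. More tasks than
    cores means full CPU utilisation is at least possible."""
    window_bp = max(5_000, min(25_000_000, window_bp))
    return -(-chrom_len_bp // window_bp)  # ceiling division

chrom = 141_000_000  # roughly chr9 on b37
# A huge window (e.g. from a very large -B) yields few tasks...
print(n_calling_tasks(chrom, 200_000_000))  # clamped to the 25Mbp cap -> 6
# ...while smaller windows give plenty of tasks for 86 threads.
print(n_calling_tasks(chrom, 1_000_000))    # 141
```

Six tasks cannot keep 86 threads busy, which matches the observation that lowering -B made the chr9-only run dramatically faster.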

Having very small windows can potentially hurt variant calling since it inhibits phasing opportunities (i.e., limits haplotype lengths). There's a lower-bound cap of 5,000bp, which prevents this from being too much of an issue, but I'd still expect more windowing artefacts than with larger calling windows. The size of potential deletions/complex substitutions is also capped by the size of the calling window.