Open iamh2o opened 2 years ago
This is very low coverage WGS, correct? Probably what's happening is that larger values of -B (including the default) result in very large calling windows that prevent parallelisation. In the extreme case, the read buffer is large enough to fit the entire chromosome into memory, so no parallelisation occurs (if only calling on that chromosome). Ideally you want the largest possible calling windows that don't hurt CPU utilisation, i.e. that don't leave fewer calling-window tasks running in parallel than the number of CPU cores provided. I've recently started implementing an additional layer of multithreading at the calling-task level, which should improve utilisation even when there are fewer tasks than cores. That situation is common, even under normal workloads, right at the start and end of runs. However, that still won't fully compensate for having fewer calling tasks than cores, so for low-coverage data, lowering -B is currently your best bet. I could also potentially add an option to explicitly cap the size of calling task windows (currently hard-capped at 25,000,000bp).
Having very small windows can potentially hurt variant calling, since you inhibit phasing opportunities (i.e., limit haplotype lengths). There's a lower-bound cap of 5,000bp, which prevents this from being too much of an issue, but I'd still expect more windowing artefacts than with larger calling windows. The size of potential deletions and complex substitutions is also capped by the size of the calling window.
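To make the trade-off between window size and core utilisation concrete, here is a minimal Python sketch. It is a simplified model, not octopus's actual scheduler: it assumes the region is tiled by fixed-size windows and that each calling task occupies one core. The function names and the chromosome length are illustrative assumptions; only the 5,000bp and 25,000,000bp caps come from the reply above. It also does not model how a given -B value translates into a window size in bp, since that depends on coverage.

```python
import math

# Window-size caps quoted in the reply above.
MIN_WINDOW_BP = 5_000
MAX_WINDOW_BP = 25_000_000


def calling_tasks(region_bp: int, window_bp: int) -> int:
    """Number of fixed-size windows needed to tile a region (simplified model)."""
    # Clamp the requested window size to the documented caps.
    window_bp = max(MIN_WINDOW_BP, min(window_bp, MAX_WINDOW_BP))
    return math.ceil(region_bp / window_bp)


def peak_core_utilisation(region_bp: int, window_bp: int, cores: int) -> float:
    """Fraction of cores that can be busy at once, assuming one core per task."""
    return min(calling_tasks(region_bp, window_bp), cores) / cores


if __name__ == "__main__":
    chr9_bp = 141_000_000  # approximate chromosome length, for illustration only
    for window in (25_000_000, 5_000_000, 1_000_000, 5_000):
        tasks = calling_tasks(chr9_bp, window)
        util = peak_core_utilisation(chr9_bp, window, cores=32)
        print(f"window={window:>10,} bp  tasks={tasks:>6,}  peak utilisation={util:.0%}")
```

Under this model, a single chromosome tiled by maximum-size windows yields only a handful of tasks, so most cores on a large machine sit idle. Shrinking the windows raises the task count quickly, at the cost of the phasing and windowing effects described above.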
A question and a Bug Report
Is setting -B very low going to negatively impact var calling?
which I had usually been setting to -X 40G -B 10G, and it seemed to perform as others reported, speed-wise. But I ran a parameter space scan of these two options, and was surprised to find that octopus ran extremely fast with lower -B values, down to the minimum of 50M, which would process human b37 chr9 in ~4 min, whereas -B set to 200M could run for an hour or two. I do not see anything warning about using too low a -B, but this seems too good to be true. The variant calling concordance with the GIAB samples is not obviously hurt, but I'm still a little uneasy. What is your opinion re: using low -B? Or, what might the potential impacts be?
CHR9 @ 30x
Bug Deets
We have been experimenting with parallelizing octopus by running per chromosome, then combining the resulting VCFs. This has worked out quite well, with no problems on the GIAB samples, but as soon as I ran clinicals, this crash began to occur ~5% of the time.
I am currently handling this by backing off on some of the var calling settings to be more permissive on the second or third attempt, which generally solves the issue.
I am supplying octopus the complete BAM, and subsetting using -L, so there should be visibility to reads outside the specified region if needed.
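For anyone wanting to reproduce the workaround, the retry-with-more-permissive-settings approach described above can be sketched roughly as follows. This is only an illustration in Python: `run_with_fallbacks`, the settings dictionaries, and the `run` callable are hypothetical placeholders, not part of octopus, and the actual per-chromosome invocation is left to the caller.

```python
from typing import Callable, Optional, Sequence


def run_with_fallbacks(
    settings_by_attempt: Sequence[dict],
    run: Callable[[dict], bool],
) -> Optional[int]:
    """Try each settings dict in order; return the index of the first success.

    `run` is a caller-supplied function that launches one per-chromosome job
    with the given settings and returns True if it completed without crashing.
    Returns None if every attempt failed.
    """
    for attempt, settings in enumerate(settings_by_attempt):
        if run(settings):
            return attempt
    return None


if __name__ == "__main__":
    # Hypothetical settings, ordered from strictest to most permissive.
    attempts = [{"profile": "strict"}, {"profile": "relaxed"}, {"profile": "permissive"}]

    # Stand-in runner that "crashes" on the strict profile only.
    def fake_run(settings: dict) -> bool:
        return settings["profile"] != "strict"

    print(run_with_fallbacks(attempts, fake_run))
```

Recording which attempt succeeded per chromosome makes it easier to see later which regions needed the relaxed settings.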
Version
Additional context
b37, and I can point you to all of the files offline if you'd like to run it yourself.
thanks-- jem