ahstram closed this issue 4 years ago.
Hi @ahstram,
Yes, this is a problem that I'm currently working on. In the cases I've seen so far it's due to these regions being incorrectly flagged as "bad" and therefore ignored. This feature is intended to avoid wasting large amounts of time trying to analyse uncallable regions (i.e. regions with high mapping error rates). It can be disabled by setting `--bad-region-tolerance UNLIMITED`.
It would be good to know more about your specific case to see whether it matches up with what I already know.
Cheers,
Dan
@dancooke, I will edit my original post with more details....
Does the issue you're aware of always result in Octopus printing messages like this?
[2019-12-16 20:02:45] No variants were considered in 7:100636970-100644183 as the region was considered uncallable
[2019-12-16 20:02:46] Skipping region 7:98408875-98408907 as there are too many haplotypes
I ask because I have gone through and noticed that the high-depth variants that seem to be missed are not in the regions flagged with warnings in stdout.
I've tried a number of methods to get Octopus to call the "known" variants in question, such as playing with `--fast`/`--very-fast` mode (and using neither), turning Octopus's internal downsampling on and off, and downsampling the BAMs before Octopus parses them, yet nothing seems to get Octopus to call the variants in question. The only success I've had is with subsetting the BAM file around the known high-depth variant (keeping only reads within ±1000bp of the variant). We typically use Octopus on chromosomal BAMs (for example, input BAMs containing all reads mapped to chromosome_7), which is where the issue seems to develop.
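For reference, the subsetting workaround above amounts to building a small region window around each known variant. This is a minimal sketch of that window calculation (the function name `subset_region` is hypothetical, not part of Octopus); the resulting region string is the kind of 1-based interval you would pass to a tool such as `samtools view`:

```python
def subset_region(chrom: str, pos: int, flank: int = 1000) -> str:
    """Build a 1-based region string covering +/- flank bp around a variant,
    clamped so the interval never starts before position 1."""
    start = max(1, pos - flank)
    return f"{chrom}:{start}-{pos + flank}"

# Window around one of the positions mentioned in the warnings above:
print(subset_region("7", 100636970))  # -> "7:100635970-100637970"
```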
Just updated my original post with the Octopus parameters used...
@ahstram Is this targeted or capture data, and what is the target sequencing depth?
The feature that kicks out 'bad' regions currently only emits a warning if the detected 'bad' region is the entire calling region (which is automatically determined when using `--threads`). If you find that setting `--bad-region-tolerance UNLIMITED` resolves the issue then it will almost certainly be this feature that's the problem.
@dancooke it's targeted sequencing data, most of the exome is ~1000x targeted, while the "high coverage" problematic regions have a target of ~5000x.
I'm doing a test run with `--bad-region-tolerance UNLIMITED` now; I'll let you know in about an hour if that brings our known variants back!
Hi @dancooke, yes it appears that `--bad-region-tolerance UNLIMITED` causes the variants in question to show up, however at a serious increase in runtime, presumably because "uncallable" regions are no longer being skipped. On the one chromosome I ran, runtime increased from 1h55m to 4h36m.
Do you have any suggestions for keeping the sensitivity improvements of `--bad-region-tolerance UNLIMITED` in high-coverage regions without seriously increasing runtime?
@dancooke -- also, are there any previous versions of Octopus which don't suffer from this issue that I may want to fall back on?
@ahstram There's always been some sort of functionality to do this that's been modified steadily; it's difficult to point you to one specific commit that will solve your issue. Unfortunately, it's a very tricky problem. The current algorithm first builds a set of summary statistics about the input reads for each sample/contig by taking random samples across each contig. These are then used to calculate the chance that a particular region is "bad". One of the factors considered is read depth: excessively high read depths compared with the contig average can indicate poorly mapped regions. Obviously this relies on the target depth distribution being reasonably unimodal, which is not the case for your data since you have two target coverages. In truth, I hadn't considered this possibility. One thing I'm currently trying is to discount the depth component of the scoring function if the depth distribution indicates targeted sequencing or is non-unimodal. For now you could try setting `--bad-region-tolerance HIGH`.
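To illustrate why a bimodal target depth trips up depth-based filtering, here is a simplified sketch, not Octopus's actual scoring code: flag any region whose depth greatly exceeds the contig-wide mean (the function `flag_bad_regions` and the fold-change cutoff are hypothetical simplifications of the idea described above):

```python
import statistics

def flag_bad_regions(region_depths: dict, fold_cutoff: float = 2.0) -> set:
    """Flag regions whose depth exceeds fold_cutoff times the contig-wide
    mean depth. A crude stand-in for depth-based 'bad region' scoring."""
    mean_depth = statistics.mean(region_depths.values())
    return {r for r, d in region_depths.items() if d > fold_cutoff * mean_depth}

# Unimodal design (~1000x everywhere): nothing looks anomalous.
uniform = {"r1": 980, "r2": 1010, "r3": 995, "r4": 1020, "r5": 990, "r6": 1005}
print(flag_bad_regions(uniform))  # -> set()

# Bimodal design (~1000x plus a few ~5000x "boosted" targets): the
# legitimately deep targets now look like mapping artefacts and get flagged.
bimodal = dict(uniform, boosted1=5000, boosted2=4800)
print(flag_bad_regions(bimodal))  # -> {'boosted1', 'boosted2'}
```

With two target coverages, the deeper tier sits far above the pooled mean, so a depth-outlier heuristic mistakes real signal for mapping noise, which matches the behaviour reported in this thread.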
@dancooke Thanks for explaining. I've kicked off another test run with `--bad-region-tolerance HIGH` to see whether this makes the runtimes more acceptable while keeping the high-depth calls.
Another possibility I'm exploring is running Octopus a second time for each contig with the `--regions-file` flag set to focus specifically on the higher-depth targeted regions, then merging the two runs' VCF files together. I am using `--fast` mode, so I'm not too concerned about missing haplotype information. Do you see any issues with that idea?
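The merge step described above can be sketched as a simple union of the two call sets, keeping every record from the primary run and adding records the deep-region run found at sites the primary run missed. This is a toy illustration with records reduced to (chrom, pos, ref, alt) tuples rather than full VCF lines, and `union_calls` is a hypothetical helper, not an existing tool:

```python
def union_calls(primary: list, secondary: list) -> list:
    """Union two call sets: keep all primary-run records and add any
    secondary-run records at sites the primary run did not call."""
    seen = set(primary)
    merged = list(primary) + [rec for rec in secondary if rec not in seen]
    # Sort by chromosome then position, as a VCF would be.
    return sorted(merged, key=lambda rec: (rec[0], rec[1]))

run_whole_chrom = [("7", 100, "A", "T"), ("7", 500, "G", "C")]
run_deep_targets = [("7", 500, "G", "C"), ("7", 250, "C", "G")]
print(union_calls(run_whole_chrom, run_deep_targets))
# -> [('7', 100, 'A', 'T'), ('7', 250, 'C', 'G'), ('7', 500, 'G', 'C')]
```

In practice the same union could be done on real VCFs with standard tooling; the point here is just that duplicates at shared sites are dropped and the deep-region-only call survives.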
@dancooke Does Octopus use the entire BAM file when calculating summary statistics, even when told to limit its analysis to certain regions via `--regions-file`? My current thinking is to run multiple instances of Octopus, one per target depth, in the hope that it will only calculate summary statistics for the regions of the BAM file specified by `--regions-file`, which in theory should fix the issue.
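The one-instance-per-depth-tier strategy amounts to partitioning the targeted intervals by their expected depth so that each Octopus run sees a roughly unimodal depth distribution. A minimal sketch of that partitioning (the function `split_regions_by_depth`, the 2000x threshold, and the depth annotations are all hypothetical; in practice each tier's intervals would be written to its own `--regions-file`):

```python
def split_regions_by_depth(targets: dict, threshold: int = 2000):
    """Partition targeted intervals into standard and boosted depth tiers.
    `targets` maps (chrom, start, end) -> expected depth."""
    standard = sorted(iv for iv, depth in targets.items() if depth < threshold)
    boosted = sorted(iv for iv, depth in targets.items() if depth >= threshold)
    return standard, boosted

targets = {
    ("7", 100, 200): 1000,   # standard ~1000x target
    ("7", 300, 400): 5000,   # boosted ~5000x target
    ("7", 900, 950): 1000,
}
standard, boosted = split_regions_by_depth(targets)
print(standard)  # -> [('7', 100, 200), ('7', 900, 950)]
print(boosted)   # -> [('7', 300, 400)]
```

Each tier then gets its own Octopus run, so the per-run summary statistics are computed over a single coverage mode.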
@ahstram The statistics are computed only for the input regions, so your strategy should work.
Would you be able to try the latest develop branch commit with your normal settings to see if there's any improvement?
@dancooke I've tried re-running the full chromosome with commit 4e7e5d5 using my normal settings (i.e., without `--bad-region-tolerance UNLIMITED`). I did notice that the runtime seems a bit faster, but the variant I'm looking for, which appears at approximately 2% allele frequency at 2000x sequencing depth, is not detected. I also tried running with the `-t` flag set to my "boosted" targeted regions, but that didn't cause the variant to be picked up, likely because some regions outside of our "boosted" targeted regions seem to have similarly high coverage.
Can you confirm that this issue should only cause variants in high depth areas of targeted sequencing to be missed? I am thinking that I can just supplement the Octopus output with a bcftools mpileup output for high-depth regions in order to improve overall sensitivity.
@ahstram It should only affect high-depth regions. Is this paired or tumour-only calling? Any chance you'd be able to send a sample of the data (a whole contig should be sufficient so long as it includes the kind of coverage variation that is typical for your experiment)?
Hi @dancooke, yes I can send you an example dataset. Can you please provide SFTP server details?
Hi @ahstram, sorry for the delay. How big is the dataset?
Closing due to inactivity. Please re-open if needed.
Hi @dancooke ,
I am experiencing an issue where Octopus seems to miss variant calls in very high-depth regions (2000-5000x and higher).
The parameters typically used are:
I should also note that we typically run Octopus in `--fast` mode with `--threads 16`.
Have you experienced such an issue? Are there any parameters that may limit this?
I will post a more detailed bug report later today.
Thanks,
Alex