luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License

Octopus seems to miss calls in high depth regions #102

Closed: ahstram closed this issue 4 years ago

ahstram commented 4 years ago

Hi @dancooke ,

I am experiencing an issue where Octopus seems to miss variant calls in very high-depth regions (2000-5000x and higher).

The parameters typically used are:

  caller=cancer
  disable-downsampling=yes
  min-kmer-prune=5
  min-bubble-score=20
  somatic-snv-mutation-rate=5e-3
  somatic-indel-mutation-rate=1e-04
  min-expected-somatic-frequency=0.0004
  min-credible-somatic-frequency=0.0001
  min-somatic-posterior=0.05
  max-somatic-haplotypes=3
  target-working-memory=50G
  tumour-germline-concentration=5

I should also note that we typically run Octopus in --fast mode with --threads 16.
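
For concreteness, a typical invocation looks roughly like the sketch below. It assumes the key=value options above sit in a config file passed with --config; the file names (octopus.config, ref.fa, chr7.bam, chr7.vcf.gz) are placeholders, not our actual paths.

  # per-chromosome run; octopus.config holds the key=value options listed above
  octopus --config octopus.config \
    --fast --threads 16 \
    -R ref.fa \
    -I chr7.bam \
    -o chr7.vcf.gz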

Have you experienced such an issue? Are there any parameters that may limit this?

I will post a more detailed bug report later today.

Thanks,

Alex

dancooke commented 4 years ago

Hi @ahstram,

Yes, this is a problem that I'm currently working on. In the cases I've seen so far it's due to these regions being incorrectly flagged as "bad" and therefore ignored. This feature is intended to avoid wasting large amounts of time trying to analyse uncallable regions (i.e. regions with high mapping error rates). It can be disabled by setting --bad-region-tolerance UNLIMITED.
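
If you're driving Octopus from a config file like the one in your first post, that should just be the corresponding key=value line (same convention as your other options):

  bad-region-tolerance=UNLIMITED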

It would be good to know more about your specific case, to see if it matches up with what I already know.

Cheers,
Dan

ahstram commented 4 years ago

@dancooke, I will edit my original post with more details....

Does the issue you're aware of always result in Octopus printing messages like this?

  [2019-12-16 20:02:45] No variants were considered in 7:100636970-100644183 as the region was considered uncallable
  [2019-12-16 20:02:46] Skipping region 7:98408875-98408907 as there are too many haplotypes

I ask because I have gone through the output and noticed that the high-depth variants that seem to be missed are not in the regions flagged by these warnings in stdout.

I've tried a number of things to get Octopus to call the "known" variants in question, such as playing with --fast/--very-fast mode (and using neither), turning Octopus's internal downsampling on and off, and downsampling the BAMs before Octopus reads them, yet nothing gets Octopus to call the variants. The only success I've had is with subsetting the BAM file around the known high-depth variant, so that only reads within ±1000 bp of the variant appear in the BAM. We typically run Octopus on chromosomal BAMs (for example, input BAMs containing all reads mapped to chromosome_7), which is where the issue seems to arise.
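
(For anyone reproducing this, the subsetting can be done with samtools along these lines; the coordinates and file names below are placeholders around a hypothetical variant position, not the real one.)

  # keep only reads within ±1000 bp of the variant of interest, then index
  samtools view -b chr7.bam 7:123455000-123457000 > chr7.subset.bam
  samtools index chr7.subset.bam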

ahstram commented 4 years ago

Just updated my original post with the Octopus parameters used...

dancooke commented 4 years ago

@ahstram Is this targeted or capture data, and what is the target sequencing depth?

The feature that kicks out 'bad' regions currently only emits a warning if the detected 'bad' region is the entire calling region (which is automatically determined when using --threads). If you find setting --bad-region-tolerance UNLIMITED resolves the issue then it will almost certainly be this feature that's the problem.

ahstram commented 4 years ago

@dancooke It's targeted sequencing data: most of the exome is targeted at ~1000x, while the problematic "high coverage" regions are targeted at ~5000x.

I'm doing a test run with --bad-region-tolerance UNLIMITED now, I'll let you know in about an hour if that brings our known variants back!

ahstram commented 4 years ago

Hi @dancooke, yes, it appears that --bad-region-tolerance UNLIMITED does cause the variants in question to show up, although with a serious increase in runtime, presumably because "uncallable" regions are no longer being skipped. On the one chromosome I ran, runtime increased from 1h55m to 4h36m.

Do you have any suggestions on how we might keep the sensitivity improvement that --bad-region-tolerance UNLIMITED gives in high-coverage regions without seriously increasing runtime?

ahstram commented 4 years ago

@dancooke -- also, are there any previous versions of Octopus which don't suffer from this issue that I may want to fall back on?

dancooke commented 4 years ago

@ahstram There's always been some functionality to do this, and it has been modified steadily, so it's difficult to point you to one specific commit that will solve your issue. Unfortunately, it's a very tricky problem.

The current algorithm first builds a set of summary statistics about the input reads for each sample/contig by taking random samples across each contig. These are then used to calculate the chance that a particular region is "bad". One of the factors considered is read depth: excessively high read depth compared with the contig average can indicate poorly mapped regions. Obviously this relies on the target depth distribution being reasonably unimodal, which is not the case for your data since you have two target coverages. In truth, I hadn't considered this possibility.

One thing I'm currently trying is to discount the depth component of the scoring function if the depth distribution indicates targeted sequencing, or is otherwise non-unimodal. For now, you could try setting --bad-region-tolerance HIGH.
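
This isn't something Octopus exposes directly, but if you want to eyeball how bimodal your per-base depth distribution is, a crude histogram with samtools and awk does the job (targets.bed and sample.bam are placeholders):

  # bin per-base depths over the targets into 250x buckets and count them
  samtools depth -a -b targets.bed sample.bam \
    | awk '{ bin = int($3 / 250) * 250; count[bin]++ } END { for (b in count) print b, count[b] }' \
    | sort -n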

ahstram commented 4 years ago

@dancooke Thanks for explaining. I've kicked off another test run with --bad-region-tolerance HIGH to see whether it makes the runtime more acceptable while keeping the high-depth calls.

Another possibility I'm exploring is running Octopus a second time for each contig with --regions-file set to focus specifically on the higher-depth targeted regions, and then merging the VCF files from the two Octopus runs (sketched below). I am using --fast mode, so I'm not too concerned about missing haplotype information. Do you see any issues with that idea?
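
Roughly what I have in mind, assuming high_depth.bed lists the ~5000x targets, the outputs are bgzipped VCFs, and the other options match our normal run (file names are placeholders):

  # pass 1: the usual per-chromosome run
  octopus --config octopus.config --fast --threads 16 \
    -R ref.fa -I chr7.bam -o chr7.main.vcf.gz
  # pass 2: the high-depth targets only
  octopus --config octopus.config --fast --threads 16 \
    -R ref.fa -I chr7.bam --regions-file high_depth.bed -o chr7.highdepth.vcf.gz
  # merge the two call sets (indexes are required for concat -a);
  # duplicates from overlapping regions could be dropped with e.g. bcftools norm -d exact
  bcftools index -t chr7.main.vcf.gz
  bcftools index -t chr7.highdepth.vcf.gz
  bcftools concat -a chr7.main.vcf.gz chr7.highdepth.vcf.gz \
    | bcftools sort -Oz -o chr7.merged.vcf.gz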

ahstram commented 4 years ago

@dancooke Does Octopus use the entire BAM file when calculating summary statistics, even when told to limit its analysis to certain regions via --regions-file? My current thinking is to run separate instances of Octopus for each of the target depths, in the hope that it will only calculate summary statistics for the regions specified by --regions-file, which, in theory, should fix the issue.

dancooke commented 4 years ago

@ahstram The statistics are computed only for the input regions, so your strategy should work.

Would you be able to try the latest develop branch commit with your normal settings to see if there's any improvement?
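
In case it helps, grabbing the develop branch is just the following; build/install it however the project README prescribes for your system.

  git clone https://github.com/luntergroup/octopus.git
  cd octopus
  git checkout develop
  # then build as described in the README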

ahstram commented 4 years ago

@dancooke I've tried re-running the full chromosome with commit 4e7e5d5 using my normal settings (i.e., without --bad-region-tolerance UNLIMITED). The runtime does seem to be a bit faster, but the variant I'm looking for, which appears at approximately 2% allele frequency at 2000x sequencing depth, is not detected. I also tried running with the -t flag set to our "boosted" targeted regions, but that didn't cause the variant to be picked up, likely because some regions outside of our "boosted" targets seem to have similarly high coverage.

Can you confirm that this issue should only cause variants in high-depth areas of targeted sequencing to be missed? I'm thinking that I could supplement the Octopus output with bcftools mpileup output for the high-depth regions to improve overall sensitivity.
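
Roughly what I have in mind for the supplement, assuming high_depth.bed marks the high-depth targets (file names are placeholders); bcftools call's model isn't tuned for ~2% somatic frequencies, so its thresholds would need care:

  # naive pileup-based calls restricted to the high-depth targets
  bcftools mpileup -f ref.fa -R high_depth.bed -d 10000 chr7.bam \
    | bcftools call -mv -Oz -o chr7.highdepth.pileup.vcf.gz
  bcftools index -t chr7.highdepth.pileup.vcf.gz
  # then concatenate with the Octopus VCF as in the earlier sketch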

dancooke commented 4 years ago

@ahstram It should only affect high-depth regions. Is this paired or tumour-only calling? Any chance you'd be able to send a sample of the data? A whole contig should be sufficient, so long as it includes the kind of variation in coverage that is typical for your experiment.

ahstram commented 4 years ago

Hi @dancooke, yes I can send you an example dataset. Can you please provide SFTP server details?

dancooke commented 4 years ago

Hi @ahstram, sorry for the delay. How big is the dataset?

dancooke commented 4 years ago

Closing due to inactivity. Please re-open if needed.