lima1 / PureCN

Copy number calling and variant classification using targeted short read sequencing
https://bioconductor.org/packages/devel/bioc/html/PureCN.html
Artistic License 2.0
128 stars 32 forks source link

Mutect2 drops MBQ to 20 in overlapped reads #320

Open sleyn opened 1 year ago

sleyn commented 1 year ago

Hello,

Just encountered the issue using Mutect2 (tried several versions) with PureCN 2.2.0. PureCN has failed for most samples with majority of variants filtered out by BQ<25 filter.

I've figured out what happen: I have a 150x paired end data with very short insert size (median insert size compared to read length). In this case majority of nucleotides in the data are in the overlapped parts of reads (on both forward and reverse read). In this case Mutect2 starts to be concerned about PCR errors. If the nucleotide came from PCR error and it is supported by two reads it could have very high variant quality. So for its error model Mutect2 adjust base quality (by default to min(20, original_base_quality) in overlapped parts of the reads. It results in drop of MBQ. 20 is a half of 40 which is Phred score of 10^-4 - default PCR error rate in Mutect2 (could be changed by --pcr-snv-qual parameter). In my case just most of variants had MBQ=20 and therefore were filtered out by PureCN.

Some links about this issue: https://github.com/broadinstitute/gatk/issues/4958 - quite old discussion, but Mutect2 code follows the same logic. https://gatk.broadinstitute.org/hc/en-us/community/posts/18479740841755-Mutect2-overlapping-reads-behavior-with-pcr-snv-qual-parameter - My question to GATK team about using --pcr-snv-qual.

I'm not sure what should be the best way to deal with it in case of PureCN: 1) Cut overalled part of the reads. Should it also improve coverage calculations by Coverage.R? 2) Turn off PCR error correction for setting --pcr-snv-qual hight. 3) Make MBQ filter in PureCN adjustable.

Best, Semen

lima1 commented 1 year ago

Oh. That explains some of the recent reports I got. I'll add the MBQ filter to PureCN.R. 20 is really low, but I see their concerns about PCR errors and inflated quality scores. Maybe replacing the 25 PureCN default with a dynamic one? Like min(xx_mbq_percentile, 25)? Not sure what the best unfiltered MBQ percentile would be though.

sleyn commented 1 year ago

It is a hard question as Mutect2 output seems does not have any information how distinguish where low quality came from - from sequencing errors or from Mutect2 PCR error adjustments.

lima1 commented 1 year ago

What I'll do is to skip the filtering based on MBQ when too many variants are filtered out. Then the user can either lower the cutoff or do the filtering upstream.

lima1 commented 1 year ago

I decided to just throw a warning for now. The linked commit adds --min-base-quality to PureCN.R which you can set to 20.