Open sleyn opened 1 year ago
Oh. That explains some of the recent reports I got. I'll add the MBQ filter to PureCN.R. 20 is really low, but I see their concerns about PCR errors and inflated quality scores. Maybe replacing the 25 PureCN default with a dynamic one? Like min(xx_mbq_percentile, 25)? Not sure what the best unfiltered MBQ percentile would be though.
It is a hard question as Mutect2 output seems does not have any information how distinguish where low quality came from - from sequencing errors or from Mutect2 PCR error adjustments.
What I'll do is to skip the filtering based on MBQ when too many variants are filtered out. Then the user can either lower the cutoff or do the filtering upstream.
I decided to just throw a warning for now. The linked commit adds --min-base-quality to PureCN.R which you can set to 20.
Hello,
Just encountered the issue using Mutect2 (tried several versions) with PureCN 2.2.0. PureCN has failed for most samples with majority of variants filtered out by BQ<25 filter.
I've figured out what happen: I have a 150x paired end data with very short insert size (median insert size compared to read length). In this case majority of nucleotides in the data are in the overlapped parts of reads (on both forward and reverse read). In this case Mutect2 starts to be concerned about PCR errors. If the nucleotide came from PCR error and it is supported by two reads it could have very high variant quality. So for its error model Mutect2 adjust base quality (by default to min(20, original_base_quality) in overlapped parts of the reads. It results in drop of MBQ. 20 is a half of 40 which is Phred score of 10^-4 - default PCR error rate in Mutect2 (could be changed by
--pcr-snv-qual
parameter). In my case just most of variants had MBQ=20 and therefore were filtered out by PureCN.Some links about this issue: https://github.com/broadinstitute/gatk/issues/4958 - quite old discussion, but Mutect2 code follows the same logic. https://gatk.broadinstitute.org/hc/en-us/community/posts/18479740841755-Mutect2-overlapping-reads-behavior-with-pcr-snv-qual-parameter - My question to GATK team about using
--pcr-snv-qual
.I'm not sure what should be the best way to deal with it in case of PureCN: 1) Cut overalled part of the reads. Should it also improve coverage calculations by Coverage.R? 2) Turn off PCR error correction for setting
--pcr-snv-qual
hight. 3) Make MBQ filter in PureCN adjustable.Best, Semen