google / deepvariant

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.
BSD 3-Clause "New" or "Revised" License

Highest mapping quality = 42 in bowtie2 #809

Closed Wenfei-Xian closed 2 months ago

Wenfei-Xian commented 2 months ago

Hello, since the highest mapping quality in bowtie2 is 42, does it affect the mapping quality channel in DeepVariant?

AndrewCarroll commented 2 months ago

Hi @Wenfei-Xian,

A max MAPQ score of 42 will likely have some effect, but I expect not an enormous one. I suspect that MAPQ at the lower end of the range matters more: if the scores are well calibrated, the difference between PHRED=42 and PHRED=60 is a very small additional absolute error probability.
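To make that intuition concrete, here is a minimal sketch of the arithmetic, assuming MAPQ follows the standard Phred definition (error probability = 10^(-Q/10)) and is well calibrated. The function name `mapq_to_error_prob` is made up for this illustration.

```python
def mapq_to_error_prob(mapq: float) -> float:
    """Convert a Phred-scaled MAPQ to the probability that the mapping is wrong."""
    return 10 ** (-mapq / 10)


# Top of the range: capping at 42 instead of 60.
p42 = mapq_to_error_prob(42)
p60 = mapq_to_error_prob(60)
high_end_gap = p42 - p60

# Lower end of the range, for comparison: MAPQ 10 vs MAPQ 20.
p10 = mapq_to_error_prob(10)
p20 = mapq_to_error_prob(20)
low_end_gap = p10 - p20

print(f"MAPQ 42 -> {p42:.2e}, MAPQ 60 -> {p60:.2e}, gap {high_end_gap:.2e}")
print(f"MAPQ 10 -> {p10:.2e}, MAPQ 20 -> {p20:.2e}, gap {low_end_gap:.2e}")
```

Under that assumption, the gap between 42 and 60 is on the order of 1e-5 in absolute terms, while the gap between 10 and 20 is about 0.09. This only describes the nominal probabilities; it says nothing about how a trained model reacts to MAPQ values it has rarely seen.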

I have some bowtie mapped reads handy for a GIAB sample. I think I can conduct a quick experiment to see if that intuition is right.

AndrewCarroll commented 2 months ago

Hi @Wenfei-Xian

I finished the experiments. There is certainly a noticeable effect from MAPQ limits, more than I expected. For my experiment, I rewrote the BAM file, setting the MAPQ to 60 for any read with a MAPQ of 36 or higher. (I observed 44 as the highest MAPQ value, and more variability in MAPQ values than seen with BWA.)

| Experiment | SNP Recall | SNP Precision | SNP F1 | Indel Recall | Indel Precision | Indel F1 |
|---|---|---|---|---|---|---|
| Default BAM | 0.9673 | 0.9967 | 0.9817 | 0.9717 | 0.9956 | 0.9835 |
| MAPQ 36+ -> 60 | 0.9758 | 0.9964 | 0.9859 | 0.9829 | 0.9960 | 0.9894 |

This implies you will get better performance with DeepVariant if you set those higher MAPQ values to 60. Note that, in general, DeepVariant hasn't been trained with Bowtie2 data, and you'd likely get better overall performance by retraining it on such data.
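For anyone wanting to try the same MAPQ rewrite, here is a rough sketch that works on SAM text lines using only the Python standard library. It is not the script used for the experiment above, which was not posted. The threshold of 36 and the target of 60 come from the comment above; the function names `raise_high_mapq` and `rewrite_sam` are hypothetical, and leaving MAPQ 255 untouched (the SAM value for "not available") is my own assumption.

```python
from typing import Iterable, Iterator


def raise_high_mapq(line: str, threshold: int = 36, new_mapq: int = 60) -> str:
    """Return a SAM line with MAPQ set to new_mapq when MAPQ >= threshold.

    Header lines (starting with '@') and malformed lines are returned unchanged.
    """
    if line.startswith("@"):
        return line
    body = line.rstrip("\n")
    newline = line[len(body):]  # preserve the original line ending, if any
    fields = body.split("\t")
    if len(fields) < 11:  # a SAM alignment record has 11 mandatory fields
        return line
    mapq = int(fields[4])  # MAPQ is the 5th column
    # Assumption: 255 means "mapping quality not available", so leave it alone.
    if threshold <= mapq < 255:
        fields[4] = str(new_mapq)
    return "\t".join(fields) + newline


def rewrite_sam(lines: Iterable[str], threshold: int = 36, new_mapq: int = 60) -> Iterator[str]:
    """Apply raise_high_mapq to every line of a SAM stream."""
    for line in lines:
        yield raise_high_mapq(line, threshold, new_mapq)


# Tiny in-memory example with made-up reads.
example = [
    "@HD\tVN:1.6\tSO:coordinate\n",
    "read1\t0\tchr1\t100\t42\t4M\t*\t0\t0\tACGT\tIIII\n",
    "read2\t0\tchr1\t200\t35\t4M\t*\t0\t0\tACGT\tIIII\n",
    "read3\t0\tchr1\t300\t36\t4M\t*\t0\t0\tACGT\tIIII\n",
]
rewritten = list(rewrite_sam(example))
for record in rewritten:
    print(record, end="")
```

On a real file, the same function could be applied to the text output of `samtools view -h` and the result converted back to BAM. Optional tags and all other fields pass through unchanged, so the coordinate sort order is preserved.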

Wenfei-Xian commented 2 months ago

Hello Andrew,

Many thanks!

Best, Wenfei