jts / nanopolish

Signal-level algorithms for MinION data
MIT License
550 stars 160 forks source link

Better intuition for the meaning of QUAL #1109

Closed smsaladi closed 9 months ago

smsaladi commented 9 months ago

We (@boospooky and I) are trying to better understand the meaning of the QUAL scores that are calculated by nanopolish. A while back @jts explained that they are not "well calibrated" and that it would be better to filter variants based on SupportFraction.

However, while looking more deeply into the artic pipeline, we see that variants are accepted as follows:

        if qual / total_reads < 3:
            return False

https://github.com/artic-network/fieldbioinformatics/commit/3ac65e755ecd5eda523d6615b54f31d5d8e0c20d#diff-356e807121a2f2a33814ad0994b7880291a0f9fbe3dbf6a3363cb08943e49fdfL16-L17

Given that that @nickloman wrote that code, we think it's likely there is some piece of intuition that we're missing in understanding the meaning of nanopolish's QUAL.

HMMs output scores seem to be summed up to give QUAL. Can we think of QUAL as the sum of read-quality scores given by the basecaller, e.g. minknow? Something else? https://github.com/jts/nanopolish/blob/28e774088b4a780b86066571896348e790f4113c/src/common/nanopolish_variant.cpp#L784-L797

Many thanks for the input here.

jts commented 9 months ago

This is a great question and I'll try to give an intuitive answer. Nanopolish internally uses an HMM which evaluates the probability of observing the reads from the variant sequence and also from the reference sequence. I'll call these P(R|V) and P(R|G) (V = variant, G = reference genome sequence). The QUAL field is simply:

sum over R log(P(R|V)) - log(P(R|G))

The log(P(R|V)) - log(P(R|G)) term is essentially how much stronger (or not) the variant is supported by a read. Since we sum over all reads, this value will grow linearly with read depth. Very high "QUAL" values can be observed for extreme depth, even if any individual read very weakly supports the variant allele. The spurious support could be caused by model mis-training/specification, alignment issues, base modifications, etc, which is why I cautioned against treating the value nanopolish puts in QUAL as a true quality score (which should be the probability the variant call is incorrect).

The filter in the artic pipeline measures how strongly any individual read supports the variant, which is fairly good at filtering out spurious variants that have very high QUAL scores due to extreme depth (which is common for virus amplicon sequencing)

I hope that makes sense and helps. Jared

smsaladi commented 9 months ago

Absolutely very helpful! And thanks for your quick response!