nanoporetech / medaka

Sequence correction provided by ONT Research
https://nanoporetech.com
Other
391 stars 73 forks source link

how to add Base Quality (BQ) tags to vcf output in medaka_haploid_variant #462

Closed avilella closed 8 months ago

avilella commented 9 months ago

Is your feature request related to a problem? Please describe. I would like to add the Base Quality (BQ) tags to the vcf output of medaka_haploid_variant.

Describe the solution you'd like Add a flag to the medaka_haploid_variant tool so that BQ can be included in the vcf output.

Describe alternatives you've considered None, but I can see how BQ is a reserved tag in the medaka code.

Additional context The BQ tag would give more context about how sure we are about the data available for a given position.

cjw85 commented 9 months ago

Hi Albert!

BQ is described in the VCF spec. as:

BQ : RMS base quality at this position

I'm going to guess that means root-mean-squared (sorry I'm completely ignorant of how this field is normally used). The per-base qualities is not something that medaka deals in; there's no infrastructure through the code to manage them through to the production of a VCF. So this would be quite an under taking and without evidence that its actually useful, its not something that's likely to be implemented.

Going deeper for a second. There have been numerous attempts over the years to encorporate base qualities into variant calling. None of these have ever had a meaninful impact on the veracity of variant calls; which leads me further to suspect that providing a summary metric such as BQ isn't going to be all that useful either.

avilella commented 9 months ago

Thanks for the detailed answer. I've seen in another ticket that there is a way of obtaining the 'gvcf' entries after having run medaka_haploid_calling, so that covers most of the ground I was interested in. It seems like the QUAL of a gvcf entry would really be the same as the BQ for that position.

On Wed, Oct 4, 2023 at 2:40 PM Chris Wright @.***> wrote:

Hi Albert!

BQ is described in the VCF spec. as:

BQ : RMS base quality at this position

I'm going to guess that means root-mean-squared (sorry I'm completely ignorant of how this field is normally used). The per-base qualities is not something that medaka deals in; there's no infrastructure through the code to manage them through to the production of a VCF. So this would be quite an under taking and without evidence that its actually useful, its not something that's likely to be implemented.

Going deeper for a second. There have been numerous attempts over the years to encorporate base qualities into variant calling. None of these have ever had a meaninful impact on the veracity of variant calls; which leads me further to suspect that providing a summary metric such as BQ isn't going to be all that useful either.

— Reply to this email directly, view it on GitHub https://github.com/nanoporetech/medaka/issues/462#issuecomment-1746902467, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABGSN3OS35NZN3H7SLXXK3X5VRMNAVCNFSM6AAAAAA5SRTXXOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONBWHEYDENBWG4 . You are receiving this because you authored the thread.Message ID: @.***>

cjw85 commented 9 months ago

The QUAL entry is derived from the scores output by the RNN. They are our best estimate of the reliability of a variant call, and so can be used to filter lower scoring calls, if you're into that sort of thing.

If my understanding of BQ is correct, then it is unrelated to QUAL (for any variant caller).

avilella commented 9 months ago

Thanks. QUAL is enough. Is there a document explaining how the QUAL is calculated?

On Wed, Oct 4, 2023 at 2:56 PM Chris Wright @.***> wrote:

The QUAL entry is derived from the scores output by the RNN. They are our best estimate of the reliability of a variant call, and so can be used to filter lower scoring calls, if you're into that sort of thing.

If my understanding of BQ is correct, then it is unrelated to QUAL (for any variant caller).

— Reply to this email directly, view it on GitHub https://github.com/nanoporetech/medaka/issues/462#issuecomment-1746931058, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABGSN47XXSYIKMHRWWWDDDX5VTIFAVCNFSM6AAAAAA5SRTXXOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONBWHEZTCMBVHA . You are receiving this because you authored the thread.Message ID: @.***>

cjw85 commented 9 months ago

The best description is, of course, the code:

https://github.com/nanoporetech/medaka/blob/master/medaka/labels.py#L942

A a simple level it is derived as a ratio of the RNN scores for the reference allele and the reference allele.