openvar / variantValidator

Public repository for VariantValidator project
GNU Affero General Public License v3.0
69 stars 21 forks source link

Combining nearby allelic variants, or not #467

Open leicray opened 1 year ago

leicray commented 1 year ago

Describe the bug This issue is prompted by a query from a user regarding how the variant NC_000017.10:g.[7576880del;7576883T>G] ought to be handled. VariantValidator validates allelic variants as separate entities but ought to consider the distance between the two variants and combine the descriptions if the variants are close to one another.

At present, VariantValidator correctly projects the individual variants into the TP53 MANE Select transcript NM_000546.6 yielding the respective variant descriptions NM_000546.6:c.966del and NM_000546.6:c.963A>C.

The current HGVS variant nomenclature recommendations say: "... two variants separated by one or more nucleotides should preferably be described individually and not as a "delins" (unless they together affect one amino acid)".

Variant NM_000546.6:c.963A>C affects amino acid 321 (lysine) and NM_000546.6:c.966del affects amino acid 322 (proline). The two variants are three nucleotides apart and they affect two different amino acids. Hence, they do not meet the HGVS criteria for being combined into a single variant description.

However, the variant can also be written in vcf style as chr17-7576879-GTGGT-GGGG and VariantValidator projects this to the single variant description NM_000546.6:c.963_966delinsCCC which results in NP_000537.3:p.(Lys321AsnfsTer24). Interestingly, the amino acid level description indicates that only one amino acid is affected, hence the two variant descriptions may be validly combined.

This requires considerable discussion.

ifokkema commented 1 year ago

My initial thoughts were that discussing this will open up Pandora's box - two very large insertions only five bases apart are likely one even larger insertion. A duplication of a large stretch with a few nucleotides changed in the inserted sequence will automatically be described as a very large insertion, but it probably shouldn't. There are no proper rules for this.

Then again, I looked a bit closer at your example.

However, the variant can also be written in vcf style as chr17-7576879-GTGGT-GGGG and VariantValidator projects this to the single variant description NM_000546.6:c.963_966delinsCCC which results in NP_000537.3:p.(Lys321AsnfsTer24). Interestingly, the amino acid level description indicates that only one amino acid is affected, hence the two variant descriptions may be validly combined.

VV describes this as one variant because one variant was given. The resulting protein description, however, does not indicate only one amino acid is affected. Instead, the fs protein level description states that only the first changed amino acid should be mentioned. There are a lot more affected due to the frameshift. So I think these two variants are correctly kept apart.

I doubt it will help to implement logic into VV that splits variants when possible. I expect it to complicate the code a lot and likely, it will result in a lot of false positives. Also, I'm not sure if the nomenclature is really "fixed" for this situation - in other words, there may be changes in the (near) future.

Peter-J-Freeman commented 1 year ago

What VV does currently is to merge any 2 variants that can normalize to adjacent positons. Each variant is shifted 5 prime and 3 prime, and if they happen to become adjacent, they are merged.

VV describes this as one variant because one variant was given. The resulting protein description, however, does not indicate only one amino acid is affected. Instead, the fs protein level description states that only the first changed amino acid should be mentioned. There are a lot more affected due to the frameshift. So I think these two variants are correctly kept apart.

Correct, if a single variant description is submitted, VV will make sense of it assuming it is a single variant. We do not yet have code that splits variants.

I doubt it will help to implement logic into VV that splits variants when possible. I expect it to complicate the code a lot and likely, it will result in a lot of false positives. Also, I'm not sure if the nomenclature is really "fixed" for this situation - in other words, there may be changes in the (near) future.

I know the mutalyzer team are proud of their splitting tool, but I agree that this could be a very large overhead for little gain. For example, a single variant description is input from a VCF file and we can convert it into a simple and reproducible description that does not rely on alignment algorithms and maths to split it into 2 or more descriptions. I guess another relevant question is, what is the variation between different variant callers when it comes to outputting the VCF for each of these descriptions? If they all do pretty much the same thing and VV always spits out a reproducible answer that is correct in terms of the reference sequences, then is the answer actually incorrect based on a meer technicality????

ifokkema commented 1 year ago

I guess another relevant question is, what is the variation between different variant callers when it comes to outputting the VCF for each of these descriptions?

Actually, they don't all do the same. GATK is terrible at detecting indels. Or well, I should be more precise, whatever configuration and setup of GATK our diagnostic centers use, is terrible at detecting indels. We'd get a substitution with directly adjoined a deletion or an insertion. The BAM file shows that the alleles are the same and both variants are found in the same reads, so this should have been a single deletion-insertion. This happens quite a lot, we have a very low amount of deletion-insertion variants called.

Peter-J-Freeman commented 1 year ago

I'm not sure whether GATK is bad at indels, or it's just that no one follows the correct procedures for variant quality score recalibration. It's a little complicated and often omitted because labs tend to like to focus on SNVs

The thing is though that we can only deal with what we are passed. If we are passed an allele description where the indel would normalize to the SNP, VV will merge into an indel. If they are passed separately, we can't.

If a large indel is passed that should be split into an indel and a SNP, then we do not have the code to do this. I'm really not sure what to do with this one. Do we invest time in writing a splitting algorithm? I'm not yet fully convinced on the benefit.

ifokkema commented 1 year ago

The thing is though that we can only deal with what we are passed. If we are passed an allele description where the indel would normalize to the SNP, VV will merge into an indel. If they are passed separately, we can't.

Yeah, and LOVD isn't sending allele descriptions. We could, since we know the allele and we know the positions, but that would be something we should then handle on our side. Merging two variants should be optional, not mandated (since a variant has multiple fields that we'd all need to merge), so would require a decision-making interface on our side.

If a large indel is passed that should be split into an indel and a SNP, then we do not have the code to do this. I'm really not sure what to do with this one. Do we invest time in writing a splitting algorithm? I'm not yet fully convinced on the benefit.

In a perfect world - sure. But honestly, I think too many other things are just more important than this issue, even if it's just the fact that they're more common issues.