openvar / variantValidator

Public repository for VariantValidator project
GNU Affero General Public License v3.0

Transcript fusion r. description #594

Open leicray opened 3 months ago

leicray commented 3 months ago

Describe the bug

A user has submitted the variant description NM_001987.5:c.?_33::NM_021947.3:c.-4_? which fails to validate and which triggers an ERROR email to the sysadmins.

Transforming the description to NM_001987.5:c.?_33::NM_021947.3:c.-4_? also fails, this time without triggering the ERROR email but with the warning: NM_001987.5:c.?_33::NM_021947.3:c.-4_?: char 14: expected one of (, *, or a digit. This is strange, as character 14 is the dot immediately after the c.
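One possible reading of the "char 14" message, sketched below, is that the parser reports zero-based offsets. That is an assumption, not something confirmed about VV or its parser. Counting from 1, character 14 is the dot after the c; counting from 0, offset 14 is the first ?, which would match "expected one of (, *, or a digit".

```python
# Assumption: the parser may report a 0-based offset; this only inspects the string.
desc = "NM_001987.5:c.?_33::NM_021947.3:c.-4_?"
offset = 14

one_based = desc[offset - 1]   # character 14 when counting from 1
zero_based = desc[offset]      # character at offset 14 when counting from 0

print(f"1-based char 14: {one_based!r}")
print(f"0-based char 14: {zero_based!r}")
```

If the offset is zero-based, the warning is pointing at the unknown position ? rather than at the dot.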

The user appears to want to express a presumed gene-fusion variant at the RNA level, which might be reasonable if the RNA has been analysed. However, the presence of two instances of ? suggests that RNA sequencing has not been carried out.

Taken at face value, the variant description suggests a fusion such that the 5' end of NM_001987.5 up to nucleotide 33 is fused to the 5' UTR of NM_021947.3 immediately before nucleotide -4.

ifokkema commented 3 months ago

I don't see the difference between the first and second variant descriptions; according to my diff, they are the same.

The page on RNA fusion shows the same format, by the way:

when only the sequence adjacency and not the entire transcript has been analysed, the format NM_152263.2:r.?_775::NM_002609.3:r.1580_? should be used.
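As a rough illustration of the structure of that format, the sketch below splits a fusion description on :: and pulls out the accession, the coordinate type, and the two range endpoints on each side. The pattern and the split_fusion helper are hypothetical and written only for this example. They are not VV's parser, and they cover only the simple ?_N and N_? range shapes discussed in this thread.

```python
import re

# Hypothetical pattern for one side of a fusion: accession, c./r. type,
# and a start_end range where either end may be the unknown marker "?".
SIDE = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+\.\d+):(?P<kind>[cr])\."
    r"(?P<start>\?|[-*]?\d+)_(?P<end>\?|[-*]?\d+)$"
)

def split_fusion(description):
    """Split 'A::B' into two dicts of named parts (illustrative helper only)."""
    parts = description.split("::")
    if len(parts) != 2:
        raise ValueError("expected exactly one '::' fusion separator")
    sides = []
    for part in parts:
        match = SIDE.match(part)
        if match is None:
            raise ValueError(f"cannot parse fusion side: {part!r}")
        sides.append(match.groupdict())
    return sides

five_prime, three_prime = split_fusion(
    "NM_152263.2:r.?_775::NM_002609.3:r.1580_?"
)
print(five_prime)
print(three_prime)
```

The same helper also accepts the c. form from the original report, NM_001987.5:c.?_33::NM_021947.3:c.-4_?, giving an end of 33 on the first side and a start of -4 on the second.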

leicray commented 3 months ago

Mea culpa. I muddled the original description during editing and then pasted the c. version into the first instance in the message when trying to fix it.

It looks like the syntax is correct but the variant description is not being parsed correctly by VV.

ifokkema commented 3 months ago

No worries! Interestingly, though, it seems the HGVS nomenclature website is missing information on what to do on the protein level. As such, I'm not sure what VV can do beyond checking if the given positions (NM_001987.5:c.33 and NM_021947.3:c.-4) exist. There seems to be no protein-level annotation for this... unless I'm not looking well enough. We can obviously make something up, but I doubt that's the idea here, especially since this notation doesn't even tell us whether the adjacent nucleotides NM_001987.5:c.32 and NM_021947.3:c.-3 have been sequenced. I'm also not sure what to do with frameshifts, etc.