mutalyzer / mutalyzer2

HGVS variant nomenclature checker
https://mutalyzer.nl

The genbank parser fails for certain references #468

Closed mihailefter closed 5 years ago

mihailefter commented 5 years ago

When trying to add mRNA and CDS features to their corresponding gene lists, if there is no gene feature present in the record with the same name as found in the sub-features, the parser breaks.

For UD_150167851083 such a case occurs:

gene    <1..13127
        /gene="UGT1A"
        /gene_synonym="GNT1; UGT; UGT1; UGT1A@"
        /db_xref="HGNC:HGNC:12529"

mRNA    join(<6862..6993,7677..7764,8048..8267,12090..13127)
        /gene="UGT1A1"
        /gene_synonym="BILIQTL1; GNT1; HUG-BR1; UDPGT; UDPGT 1-1;
        UGT1; UGT1A"
        /db_xref="HGNC:HGNC:12530"
        /db_xref="MIM:191740"

Most likely the mRNA should be added to the gene, but the /gene qualifier value is different between the two.

ifokkema commented 5 years ago

Note that these are different entities!

UGT1A: HGNC, NCBI

UGT1A1: HGNC, NCBI

mihailefter commented 5 years ago

HGNC also lists them as separate entries. However, UGT1A is present in the /gene_synonym qualifier of the mRNA.

ifokkema commented 5 years ago

Synonyms change and can get reassigned to other genes. In this case, there is an obvious relationship between the two, but they are not the same.

However, I just saw that UGT1A1 does actually have a /gene tag in UD_150167851083...?

mihailefter commented 5 years ago

Indeed, the UGT1A1 gene is there, but it is located in the file after the mRNA feature. The bug arises because the parser tries, in a single iteration over the record features, (1) to generate the gene, mRNA and CDS objects, and (2) to add the latter (mRNA and CDS) to the former (gene). When the mRNA is reached, its gene object does not exist yet.
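To illustrate the ordering problem, here is a minimal sketch that uses plain tuples instead of the parser's real objects. The `features` list and both function names are hypothetical and are not taken from the Mutalyzer code; the sketch only assumes that sub-features are looked up by their /gene qualifier.

```python
# Hypothetical, simplified feature list: (feature type, /gene qualifier).
# The UGT1A1 mRNA precedes its gene feature, as in UD_150167851083.
features = [
    ("gene", "UGT1A"),
    ("mRNA", "UGT1A1"),
    ("gene", "UGT1A1"),
]


def single_pass(features):
    """Create genes and attach sub-features in one iteration."""
    genes = {}
    for kind, name in features:
        if kind == "gene":
            genes[name] = []
        else:
            # Raises KeyError when the gene has not been seen yet.
            genes[name].append(kind)
    return genes


def two_pass(features):
    """Create all genes first, then attach the mRNA/CDS features."""
    genes = {name: [] for kind, name in features if kind == "gene"}
    for kind, name in features:
        if kind != "gene":
            genes[name].append(kind)
    return genes
```

With this input, `single_pass(features)` raises a `KeyError` for `"UGT1A1"`, while `two_pass(features)` attaches the mRNA regardless of where the gene feature appears in the record.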

Still, the code would also fail if no UGT1A1 gene feature were present at all, so whenever mRNA and CDS objects are added to a gene object, the parser should first check that the gene actually exists.
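A rough sketch of such a check, again with hypothetical names and plain tuples rather than the parser's actual classes. It assumes that a sub-feature without a matching gene should be reported and skipped; whether skipping is the right policy is a separate decision.

```python
import logging

logger = logging.getLogger(__name__)


def attach_subfeatures(features):
    """Group mRNA/CDS features under their genes, collecting orphans.

    `features` is a list of (feature type, /gene qualifier) tuples.
    Returns (genes, orphans), where `orphans` holds the sub-features
    whose /gene value matches no gene feature in the record.
    """
    genes = {name: [] for kind, name in features if kind == "gene"}
    orphans = []
    for kind, name in features:
        if kind == "gene":
            continue
        if name in genes:
            genes[name].append(kind)
        else:
            # No gene feature with this name: report it instead of failing.
            logger.warning("No gene feature %r for %s feature", name, kind)
            orphans.append((kind, name))
    return genes, orphans
```

For a record that contains only the UGT1A gene and the UGT1A1 mRNA, this returns the mRNA in `orphans` and leaves UGT1A without sub-features, so the mRNA is not silently attached to a different gene.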