openvar / variantValidator

Public repository for VariantValidator project
GNU Affero General Public License v3.0
67 stars 21 forks source link

gene2transcripts and gene2transcripts_v2 don't like HGNC IDs. #578

Closed ifokkema closed 5 months ago

ifokkema commented 6 months ago

Describe the bug API endpoints gene2transcripts and gene2transcripts_v2 allow for genes to be passed as "HGNC:2197". That's great for genes that have recently changed their symbols, and I'm going to use this now. However, the "HGNC:" addition is required but undocumented. If sent as "2197", calls return an HTTP 500. It actually took me some time to realize I needed to add "HGNC:" and I was preparing this bug report as an "it doesn't work" when I realized what the required format was.

To Reproduce Steps to reproduce the behavior:

  1. Sending a gene symbol works and lists the transcripts. https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts_v2/COL1A1/False/refseq/GRCh37?content-type=application%2Fjson
  2. Sending "HGNC:2197" works and lists the transcripts. https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts_v2/HGNC:2197/False/refseq/GRCh37?content-type=application%2Fjson
  3. Sending just the numeric ID doesn't work and returns an HTTP 500. https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts_v2/2197/False/refseq/GRCh37?content-type=application%2Fjson

Expected behavior

Thank you!

EDIT

Peter-J-Freeman commented 5 months ago

This will be a documentation change @ifokkema. I will not just accept the numeric value as it may get confused witht the numeric value of NIH gene IDs.

Peter-J-Freeman commented 5 months ago

Also; when using MT-ATP6, VV uses "MT" as the letter for chromosome "M".

This is annotation provided by Ensembl directly. Not from us. Ensembl need to be more responsible for their standards. We correct as much as we can :)

Peter-J-Freeman commented 5 months ago

OK, all these are fixed. Goint to close, but need to update the server still @ifokkema , so please nudge me next weeek

ifokkema commented 5 months ago

This will be a documentation change @ifokkema. I will not just accept the numeric value as it may get confused witht the numeric value of NIH gene IDs.

Makes sense! And a doc fix is just fine!

This is annotation provided by Ensembl directly. Not from us. Ensembl need to be more responsible for their standards. We correct as much as we can :)

Weeiiiird! OK, thanks!

OK, all these are fixed. Goint to close, but need to update the server still @ifokkema , so please nudge me next weeek

Excellent, thanks a lot! There's no rush, but when you do update the server, please let me know and I'll have another look!

Peter-J-Freeman commented 5 months ago

I want to say weird, but Ensembl do this sort of thing

There was a bug though. The gene symbol was coming out as MT not MT-ATP6 which is now fixed. Also, some slight changes that will happen now that I fixed the code once we update the databases in the next few days.

Hope to release the new software version next week

ifokkema commented 2 weeks ago

Hi Pete! Just to be sure:

Peter-J-Freeman commented 2 weeks ago

r.e. HGNC:7414, looks like there is another issue that is causing MT to migrate into the db instead of MT- something. I will look at this. See if I can patch rather than do a new release.

The numeric HGNC entry should return an error. But do we want it to. The main reason we may want to add HGNC is that we may in the furue WANT TO use other numeric gene searches???

ifokkema commented 2 weeks ago

r.e. HGNC:7414, looks like there is another issue that is causing MT to migrate into the db instead of MT- something. I will look at this. See if I can patch rather than do a new release.

Cool, thanks!

The numeric HGNC entry should return an error. But do we want it to. The main reason we may want to add HGNC is that we may in the furue WANT TO use other numeric gene searches???

Personally, in LOVD, I consider all numeric references to genes as HGNC IDs. My logic is simply that the HGNC hands out the gene symbols and they name the genes. They're the representative source, so I use their IDs. I actually don't know why they prefix their numeric IDs with "HGNC:" as I've never seen other resources prefix their numeric IDs. I do see the benefit, of course, as it identifies the ID. However, I also see a downside, as it causes inconsistent use of the prefix and, therefore, ambiguity in the ID. Either way, I show NCBI gene IDs, but I don't use them as keys or so. So they don't clash in LOVD. NCBI gene IDs are only used for linking to the NCBI. If you want to keep the possibility open to use multiple numeric identifiers, by all means, don't accept the numeric input. However, I would recommend returning an error rather than an HTTP 500.

Peter-J-Freeman commented 2 weeks ago

The only other I am aware of if GenBank gene IDs. But we do not currently use these. So I see no issue with dropping the HGNC from the input really

Peter-J-Freeman commented 2 weeks ago

OK, the code is fixed, but I think it will need another database build for HGNC:7414. This is not a quick process. I need to liase with @John-F-Wagstaff.

John-F-Wagstaff commented 2 weeks ago

We may still want to allow users to include 'HGNC:', even if we do allow just the plain number, as others including the NCBI do (in their genbank records for example you get '/db_xref="HGNC:HGNC:25180"'). Also the number of users that write things down without context, unless prompted, is large enough that I would prefer to keep the 'HGNC:' prefix on the output too.

The only transcript we include currently for this in the underlying VVTA is ENST00000361899.2. The RefSeq record that can be found in the HGNC record for this is a 'YP_' with a DBSOURCE of " REFSEQ: accession NC_012920.1", it is currently a "PROVISIONAL REFSEQ" and has no associated transcript. We don't include any protein sequences without transcripts so this is missed out.

I am intending to build a new version of the VVTA soon, @Peter-J-Freeman should I bump this up the priority queue?

Peter-J-Freeman commented 2 weeks ago

I think new versions of all db's is needed. I found some more errors in validator. New line characters in some fields. Explains why the updates weren't successful! 🙄

John-F-Wagstaff commented 2 weeks ago

The RefSeq alignments have moved to https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/historical/GRCh38/current/ but have not updated since 2023-09-18 . Hence me not moving too fast on this, new RefSeq transcript data is all well and good but without alignments to go with it not much use for the alignment database, I can't load identifiers that don't have alignments.

I can get you updated RefSeqGene data, HGNC gene data for 2024/06 and, Ensembl Release 112(which was done in may) though. I will try to get it done before Friday.

Have you seen any issues in the VVTA data other than the Ensembl stuff that we already patched?

Peter-J-Freeman commented 2 weeks ago

Nope, no additional errors I am aware of. Just the patch to make sure of. The RefSeqGene FE data will be handled via the validator database. We just need to make sure the sequences are in SeqRepo which I believe they are now. Perhaps we need to contact RefSeq and find out why they stopped producing alignments. We need this data and it is vital. Surely they need it too

Peter-J-Freeman commented 2 weeks ago

Also @John-F-Wagstaff once done, please can I have the file of all transcript IDs in VVTA. Thanks

leicray commented 2 weeks ago

The link to the RefSeq alignments just does not look right. The URL path includes the directory historical and that simply feels wrong. Why would the current alignments by classified as historical?

I wonder if, instead, the current data are to be found in Homo_sapiens.gene_info.gz

John-F-Wagstaff commented 2 weeks ago

@leicray They archived the current alignments and put a link in the ftp to this location, after starting to produce this newer file set in parallel for a while. It is called historical because it includes all historic transcript variants back to a certain date cut off, as well as the current data. Yes the naming is bad. I have checked elsewhere and the RefSeq annotation pipeline last ran to completion on a human genome at that date too, so there should not be newer alignment data either way. Unfortunately the gene_info files only includes map locations per whole gene, in the form of 19q13.43 which does not work for us.

@Peter-J-Freeman I will get the transcript ID's to you as soon as the database is finished.

ifokkema commented 2 weeks ago

Thanks, guys!

We may still want to allow users to include 'HGNC:', even if we do allow just the plain number, as others including the NCBI do (in their genbank records for example you get '/db_xref="HGNC:HGNC:25180"'). Also the number of users that write things down without context, unless prompted, is large enough that I would prefer to keep the 'HGNC:' prefix on the output too.

Oh, yes, never remove allowed input or change the formatting of a variable in the output in a "live" API that doesn't have versioning! I'm personally OK with adding additional fields to a JSON API, as I assume that existing implementations won't crash if additional data is returned. Other implementers are more strict and even increase the version number when adding fields. In any case, allowing more diverse input doesn't change existing implementations ever, so IMO never requires an increment of the version number.

Peter-J-Freeman commented 2 weeks ago

The updated code will accept "HGNC:1234" or "1234" and return the same result.

Just not pushing yet because having a few database difficulties :)

Peter-J-Freeman commented 2 weeks ago

We will be updating the version numbers for all tools because recent changes to the VV engine required breaking changes, and I like to keep all major versions of all tools the same. May not be engineering correct, but prevents my brain fron hurting

Peter-J-Freeman commented 1 week ago

Sending a numeric HGNC ID still returns an HTTP 500 - is this intended, or should it show a warning/error now?

Now working and active on the server @ifokkema

on my system, this

import json
import VariantValidator
vval = VariantValidator.Validator()
gene = '7414'
select_transcripts = None
g_and_t = vval.gene2transcripts(gene, validator=vval, select_transcripts=select_transcripts, transcript_set="ensembl")
print(json.dumps(g_and_t, sort_keys=True, indent=4, separators=(',', ': ')))

will now return

{
    "current_name": "mitochondrially encoded ATP synthase membrane subunit 6",
    "current_symbol": "MT-ATP6",
    "hgnc": "HGNC:7414",
    "previous_symbol": "MTATP6,RP",
    "requested_symbol": "MT-ATP6",
    "transcripts": [
        {
            "annotations": {
                "chromosome": "MT",
                "db_xref": {
                    "CCDS": null,
                    "ensemblgene": "ENSG00000198899",
                    "hgnc": "HGNC:7414",
                    "ncbigene": null,
                    "select": "Ensembl"
                },
                "ensembl_select": true,
                "mane_plus_clinical": false,
                "mane_select": false,
                "map": "mitochondria",
                "note": "mitochondrially encoded ATP synthase membrane subunit 6",
                "refseq_select": false,
                "variant": "201"
            },
            "coding_end": 681,
            "coding_start": 1,
            "description": "ATP6-201",
            "genomic_spans": {},
            "length": 681,
            "reference": "ENST00000361899.2",
            "translation": "ENSP00000354632.2"
        }
    ]
}

We will roll out new database builds ASAP to make this work on the server. This is to show what a patch would look like, but we want to make a full db release

Peter-J-Freeman commented 1 week ago

Hmm, seems I need to fix the alignments. They are missing!!! Will look into this since it works for other genes e.g. COL1A1

Peter-J-Freeman commented 1 week ago

Now also fixed, but again, will not work until the dbs are recreated. Will take a few weeks

ifokkema commented 6 days ago

We will be updating the version numbers for all tools because recent changes to the VV engine required breaking changes, and I like to keep all major versions of all tools the same. May not be engineering correct, but prevents my brain fron hurting

I meant the API version, e.g., /api/v1/method?arguments vs /api/v2/method?arguments. The versions in the meta data of the output are a different thing. I meant that as long as the endpoint isn't versioned, stuff from the output shouldn't be removed, input requirements shouldn't be changed, but additions to the output are generally OK.

Now also fixed, but again, will not work until the dbs are recreated. Will take a few weeks

Excellent, thanks!