populationgenomics / seqr

web-based analysis tool for rare disease genomics
GNU Affero General Public License v3.0

VEP version used for production seqr #159

Open cassimons opened 2 years ago

cassimons commented 2 years ago

CPG seqr currently displays gene symbols that are several years out of date.

Good examples are any of the *ARS genes, e.g. (old > new): AARS > AARS1, YARS > YARS1.

Taking the AARS1 example, Ensembl 105 uses the current symbol AARS1, while Ensembl 95 uses the older form AARS.

CPG seqr currently displays the old symbol AARS and does not accept the new symbol AARS1 in gene lists.

To my knowledge, we updated the seqr loading pipeline several months ago to use an up-to-date version of VEP (>= 105). Am I misunderstanding what/when we updated VEP, or are the gene symbols being sourced from a different place that we have failed to update (e.g. the gene/transcript info tables in Postgres)?

lgruen commented 2 years ago

I just checked the contents of the Elasticsearch index for this variant:

GET /validation-genome-2022_0810_2358_474tt/_search
{"query": {"bool": {"filter": [{"term": {"variantId": "16-70252728-T-A"}}]}}}

The sortedTranscriptConsequences do indeed contain the new gene symbols (AARS1):

"sortedTranscriptConsequences" : [
            {
              "biotype" : "protein_coding",
              "canonical" : 1,
              "cdna_start" : 3007,
              "cdna_end" : 3007,
              "codons" : "aAg/aTg",
              "gene_id" : "ENSG00000090861",
              "gene_symbol" : "AARS1",
              "hgvsc" : "ENST00000261772.13:c.2900A>T",
              "hgvsp" : "ENSP00000261772.8:p.Lys967Met",
              "transcript_id" : "ENST00000261772",
              "amino_acids" : "K/M",
              "lof" : null,
              "lof_filter" : null,
              "lof_flags" : null,
              "lof_info" : null,
              "polyphen_prediction" : "possibly_damaging",
              "protein_id" : "ENSP00000261772",
              "protein_start" : 967,
              "sift_prediction" : "deleterious_low_confidence",
              "consequence_terms" : [
                "missense_variant"
              ],
              "domains" : null,
              "major_consequence" : "missense_variant",
              "category" : "missense",
              "hgvs" : "p.Lys967Met",
              "major_consequence_rank" : 11,
              "transcript_rank" : 0
            },
    ...

Similarly, the "mainTranscript_gene_symbol" : "AARS1" also looks good.

So maybe it's indeed coming from the Postgres table.
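
For anyone repeating this check, here is a minimal Python sketch of the same lookup. It only builds the request body used above and pulls the gene symbols out of a hit's `_source`. The function names and the trimmed `example_source` are illustrative and not part of seqr, and actually sending the request to a cluster is left out.

```python
import json


def build_variant_query(variant_id):
    """Build the Elasticsearch request body that filters on one variantId."""
    return {"query": {"bool": {"filter": [{"term": {"variantId": variant_id}}]}}}


def gene_symbols_from_hit(source):
    """Collect the distinct gene symbols stored on a hit's _source document."""
    symbols = set()
    # Per-transcript annotations written by the loading pipeline.
    for consequence in source.get("sortedTranscriptConsequences") or []:
        symbol = consequence.get("gene_symbol")
        if symbol:
            symbols.add(symbol)
    # Top-level symbol for the main transcript, if present.
    main_symbol = source.get("mainTranscript_gene_symbol")
    if main_symbol:
        symbols.add(main_symbol)
    return sorted(symbols)


# Trimmed copy of the hit shown above (illustrative, most fields omitted).
example_source = {
    "mainTranscript_gene_symbol": "AARS1",
    "sortedTranscriptConsequences": [
        {
            "gene_id": "ENSG00000090861",
            "gene_symbol": "AARS1",
            "transcript_id": "ENST00000261772",
        }
    ],
}

print(json.dumps(build_variant_query("16-70252728-T-A")))
print(gene_symbols_from_hit(example_source))
```

If the index only ever returns the new symbol here, the stale symbol shown in the UI has to come from somewhere other than the variant annotations.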

lgruen commented 2 years ago

Maybe we need to run update_gencode.py? (Note that currently the version is limited to 32 -- not sure why.)

It gets called for a list of versions in update_all_reference_data.py.

@illusional Not sure how adventurous you're feeling, but you could try increasing that limit, adding Gencode 39 to that list, and running ./manage.py update_all_reference_data --use-cached-omim?

cassimons commented 2 years ago

My guess (like yours) would be that the limit is there to tie the Gencode version to the relevant VEP version? If so, then Gencode 39 is what we want if we are still on VEP 105. It would be great if we could give this a go.
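
As a rough sanity check on that pairing, the sketch below looks up the Gencode release that matches a VEP/Ensembl release. The table is hand-written and deliberately tiny (Ensembl 98 corresponds to GENCODE 32, Ensembl 105 to GENCODE 39). The names are hypothetical, and nothing here reflects how update_gencode.py actually applies its limit.

```python
# Assumption: hand-maintained pairs of Ensembl/VEP release -> human GENCODE release.
# Only the two releases relevant to this thread are listed.
ENSEMBL_TO_GENCODE = {
    98: 32,   # the current upper limit mentioned above
    105: 39,  # the VEP release the loading pipeline is believed to use
}


def gencode_for_vep(vep_release):
    """Return the matching GENCODE release, or None if it is not in the table."""
    return ENSEMBL_TO_GENCODE.get(vep_release)


def reference_data_is_in_sync(vep_release, loaded_gencode_release):
    """True when the loaded Gencode release matches the VEP release in use."""
    expected = gencode_for_vep(vep_release)
    return expected is not None and expected == loaded_gencode_release


print(gencode_for_vep(105))
print(reference_data_is_in_sync(105, 32))
```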

illusional commented 2 years ago

Hey @cassimons, can you confirm that this gene symbol has been updated in seqr-staging:validation? I can't search for AARS anymore, but can for AARS1. If you're happy with this, I can push to seqr-prod.

cassimons commented 2 years ago

Thanks @illusional! Yes this seems to be working as expected to me. Go for Prod 🚀