opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Investigating the source of invalid gene identifiers in the genetics pipeline #2286

Closed. DSuveges closed this issue 3 years ago

DSuveges commented 3 years ago

A small number of disease/target evidence strings are failing validation when ingested by the Open Targets Platform ETL because of invalid target identifiers (39 genes and ~200 evidence strings).

The full list of genes and their sources can be found in the attached table under issue #1554.

DSuveges commented 3 years ago

What happened to these gene identifiers?

The following table summarises the history of the Ensembl gene identifiers that were failing the platform ETL check:

targetFromSourceId evidenceCount status version
ENSG00000269881 48 retired 97
ENSG00000130489 45 retired 97
ENSG00000254462 43 retired 99
ENSG00000283932 42 retired 97
ENSG00000278272 36 retired 97
ENSG00000270898 34 retired 97
ENSG00000278500 32 retired 97
ENSG00000261833 23 retired 97
ENSG00000274267 22 retired 97
ENSG00000204683 20 retired 101
ENSG00000285258 20 retired 100
ENSG00000267545 20 retired 97
ENSG00000286169 18 retired 102
ENSG00000263264 17 retired 103
ENSG00000181013 14 retired 98
ENSG00000243444 14 retired 98
ENSG00000241978 13 retired 98
ENSG00000268861 13 retired 103
ENSG00000183729 11 retired 101
ENSG00000260300 8 retired 98
ENSG00000130723 8 retired 103
ENSG00000286094 6 retired 101
ENSG00000183791 5 retired 101
ENSG00000284041 5 retired 97
ENSG00000267697 5 retired 100
ENSG00000213865 4 retired 97
ENSG00000274897 4 retired 101
ENSG00000285441 3 retired 101
ENSG00000260869 2 retired 97
ENSG00000277669 2 retired 97
ENSG00000213029 2 retired 100
ENSG00000281028 2 retired 97
ENSG00000274744 2 retired 101
ENSG00000262621 2 retired 98
ENSG00000278674 2 retired 101
ENSG00000272949 1 retired 97
ENSG00000286261 1 retired 102
ENSG00000277726 1 retired 103
ENSG00000255863 1 retired 97

All failing IDs turn out to have been retired from Ensembl at some point. The earliest release in which any of them was retired is v97, which suggests that the gene set for the last Genetics Portal release was based on Ensembl v96, released in April 2019.
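The version inference is simple arithmetic: an ID retired in release N was last present in release N - 1, so the earliest retirement among the failing IDs bounds the source release from above. A minimal sketch using a handful of rows copied from the table (the function and variable names are illustrative and not part of any pipeline; the result is the newest release consistent with the data, so an older source release cannot be ruled out by this check alone):

```python
# Gene ID mapped to the Ensembl release in which it was retired
# (a subset of the table in this thread).
retired_in = {
    "ENSG00000269881": 97,
    "ENSG00000254462": 99,
    "ENSG00000204683": 101,
    "ENSG00000286169": 102,
    "ENSG00000263264": 103,
}


def latest_possible_source_release(retirements: dict[str, int]) -> int:
    """Return the newest Ensembl release in which every listed ID still existed."""
    if not retirements:
        raise ValueError("no retired identifiers supplied")
    # An ID retired in release N was last present in release N - 1.
    return min(retirements.values()) - 1


print(latest_possible_source_release(retired_in))  # 96
```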

DSuveges commented 3 years ago

Where is the data coming from?

I traced the source of this information to the V2G pipeline, more specifically to its Snakemake configuration:

# Gene information
genes: gs://genetics-portal-input/luts/19.06_gene_symbol_synonym_map.json

The date in the file name (19.06) is consistent with the suspected v96 origin of the gene set. The file is a simple JSON file holding gene-related information: genomic location plus gene names and synonyms.

{
  "gene_id": "ENSG00000205002",
  "gene_name": "AARD",
  "gene_synonyms": [
    "LOC441376",
    "C8orf85"
  ],
  "gene_chrom": "8",
  "gene_start": 116938199,
  "gene_end": 116944487
}

Joining this dataset with the retired gene identifiers gives a perfect match: every retired ID is present in this file.

targetFromSourceId evidenceCount status version gene_name gene_chrom gene_start gene_end
ENSG00000269881 48 retired 97 AC004754.1 16 249547 269943
ENSG00000130489 45 retired 97 SCO2 22 50523568 50525606
ENSG00000254462 43 retired 99 TMX2-CTNND1 11 57712605 57791586
ENSG00000283932 42 retired 97 AL121722.1 20 22560553 22584261
ENSG00000278272 36 retired 97 HIST1H3C 6 26045411 26045821
ENSG00000270898 34 retired 97 GPR75-ASB3 2 53670293 53860160
ENSG00000278500 32 retired 97 AC009336.2 2 176151085 176173097
ENSG00000261833 23 retired 97 AC104151.1 16 76553417 76819624
ENSG00000274267 22 retired 97 HIST1H3B 6 26031650 26032060
ENSG00000204683 20 retired 101 C10orf113 10 21125763 21146559
ENSG00000285258 20 retired 100 ATXN7 3 63864557 64003462
ENSG00000267545 20 retired 97 AC005779.2 19 45179822 45202444
ENSG00000286169 18 retired 102 AHRR 5 271670 438291
ENSG00000263264 17 retired 103 AC119396.1 19 7348943 7383385
ENSG00000181013 14 retired 98 C17orf47 17 58541587 58544368
ENSG00000243444 14 retired 98 PALM2 9 109640788 109951476
ENSG00000241978 13 retired 98 AKAP2 9 110048598 110172512
ENSG00000268861 13 retired 103 AC008878.3 19 7382834 7472477
ENSG00000183729 11 retired 101 NPBWR1 8 52938431 52941117
ENSG00000260300 8 retired 98 AC009119.2 16 83908132 83951445
ENSG00000130723 8 retired 103 PRRC2B 9 131373636 131500197
ENSG00000286094 6 retired 101 AC026740.3 5 716808 766919
ENSG00000183791 5 retired 101 ELOA3 18 47028202 47030078
ENSG00000284041 5 retired 97 AC073111.3 7 150368790 150396915
ENSG00000267697 5 retired 100 LUZP6 7 135927274 135927450
ENSG00000213865 4 retired 97 C8orf44 8 66667615 66685564
ENSG00000274897 4 retired 101 PANO1 11 797511 799185
ENSG00000285441 3 retired 101 SOD2 6 159679119 159762529
ENSG00000260869 2 retired 97 AC002310.4 16 30534227 30558279
ENSG00000277669 2 retired 97 AC009086.2 16 29663279 29695144
ENSG00000213029 2 retired 100 SPHAR 1 229304857 229305504
ENSG00000281028 2 retired 97 AC104662.2 4 25160663 25277306
ENSG00000274744 2 retired 101 ELOA3D 18 46962768 46964408
ENSG00000262621 2 retired 98 AC025283.2 16 3382113 3397745
ENSG00000278674 2 retired 101 ELOA3B 18 47022287 47023927
ENSG00000272949 1 retired 97 AC093668.2 7 102483344 102543764
ENSG00000286261 1 retired 102 AC022137.3 19 53431984 53461862
ENSG00000277726 1 retired 103 AL109811.3 1 11012662 11030528
ENSG00000255863 1 retired 97 AC073610.2 12 48921963 48939663
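
The join behind this table amounts to a lookup by gene_id. A minimal sketch on in-memory records, assuming the lookup file has already been parsed into a list of dicts; the function name join_failing_with_lut and the three-record sample are illustrative only, with values copied from this thread:

```python
# Records in the shape of the lookup file (values copied from this thread).
gene_lut = [
    {"gene_id": "ENSG00000205002", "gene_name": "AARD", "gene_chrom": "8",
     "gene_start": 116938199, "gene_end": 116944487},
    {"gene_id": "ENSG00000130489", "gene_name": "SCO2", "gene_chrom": "22",
     "gene_start": 50523568, "gene_end": 50525606},
    {"gene_id": "ENSG00000285441", "gene_name": "SOD2", "gene_chrom": "6",
     "gene_start": 159679119, "gene_end": 159762529},
]

# Failing identifiers with their evidence counts (subset of the table).
failing = {"ENSG00000130489": 45, "ENSG00000285441": 3}


def join_failing_with_lut(failing_ids, lut_records):
    """Inner-join failing IDs with the lookup table; also list IDs with no match."""
    by_id = {rec["gene_id"]: rec for rec in lut_records}
    matched = [
        {"targetFromSourceId": gid, "evidenceCount": count, **by_id[gid]}
        for gid, count in failing_ids.items()
        if gid in by_id
    ]
    unmatched = sorted(set(failing_ids) - set(by_id))
    return matched, unmatched


matched, unmatched = join_failing_with_lut(failing, gene_lut)
print([row["gene_name"] for row in matched])  # ['SCO2', 'SOD2']
print(unmatched)  # []
```

An empty unmatched list is what "perfect match" means here: every failing ID is accounted for by the lookup file.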

There's one caveat to this analysis: I cannot tell for sure where else this file is used. However, given how simple the file is, I would suggest adding an extra rule to the pipeline that generates gene_symbol_synonym_map.json.
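
As a rough idea of what such a generation step could do, here is a minimal sketch of the writing half only. It assumes the gene records have already been fetched from whichever Ensembl release the pipeline targets, and that the file holds one JSON object per line; both are assumptions, as are the names write_gene_lut and LUT_FIELDS. Only the field names come from the example record in this thread.

```python
import json
import tempfile
from pathlib import Path

# Fields of the lookup file, as shown in the example record in this thread.
LUT_FIELDS = ("gene_id", "gene_name", "gene_synonyms",
              "gene_chrom", "gene_start", "gene_end")


def write_gene_lut(genes, out_path):
    """Write one JSON object per line, keeping only the lookup-table fields.

    `genes` is a hypothetical iterable of dicts; fetching them from Ensembl
    is out of scope for this sketch.
    """
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as handle:
        for gene in genes:
            missing = [field for field in LUT_FIELDS if field not in gene]
            if missing:
                raise KeyError(f"{gene.get('gene_id', '?')} lacks fields: {missing}")
            handle.write(json.dumps({f: gene[f] for f in LUT_FIELDS}) + "\n")
    return out_path


# Usage with the example record from this thread plus an extra field
# that is dropped on write.
example = {
    "gene_id": "ENSG00000205002", "gene_name": "AARD",
    "gene_synonyms": ["LOC441376", "C8orf85"], "gene_chrom": "8",
    "gene_start": 116938199, "gene_end": 116944487,
    "note": "extra field, dropped on write",
}
with tempfile.TemporaryDirectory() as tmp:
    path = write_gene_lut([example], Path(tmp) / "gene_symbol_synonym_map.json")
    lines = path.read_text(encoding="utf-8").splitlines()
    written = [json.loads(line) for line in lines]

print(written[0]["gene_name"], sorted(written[0]) == sorted(LUT_FIELDS))  # AARD True
```

Regenerating the file from a pinned release on every run would keep the identifiers in step with the rest of the pipeline instead of relying on a static 19.06 snapshot.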

Jeremy37 commented 3 years ago

It's worth noting that there may be other locations where a specific version of the gene mapping has been used. E.g. this is in Miguel's config file:

ensembl { lut = ${input}"/lut/homo_sapiens_core_96_38_genes.json" }

It's located here: gs://genetics-portal-data/lut/homo_sapiens_core_96_38_genes.json

This file is also used in the L2G pipeline: https://github.com/opentargets/genetics-l2g-scoring/blob/master/1_feature_engineering/1_prepare_inputs.py