Closed DSuveges closed 3 years ago
The following table summarises the history of the Ensembl gene identifiers that failed the Platform ETL check:
targetFromSourceId | evidenceCount | status | version |
---|---|---|---|
ENSG00000269881 | 48 | retired | 97 |
ENSG00000130489 | 45 | retired | 97 |
ENSG00000254462 | 43 | retired | 99 |
ENSG00000283932 | 42 | retired | 97 |
ENSG00000278272 | 36 | retired | 97 |
ENSG00000270898 | 34 | retired | 97 |
ENSG00000278500 | 32 | retired | 97 |
ENSG00000261833 | 23 | retired | 97 |
ENSG00000274267 | 22 | retired | 97 |
ENSG00000204683 | 20 | retired | 101 |
ENSG00000285258 | 20 | retired | 100 |
ENSG00000267545 | 20 | retired | 97 |
ENSG00000286169 | 18 | retired | 102 |
ENSG00000263264 | 17 | retired | 103 |
ENSG00000181013 | 14 | retired | 98 |
ENSG00000243444 | 14 | retired | 98 |
ENSG00000241978 | 13 | retired | 98 |
ENSG00000268861 | 13 | retired | 103 |
ENSG00000183729 | 11 | retired | 101 |
ENSG00000260300 | 8 | retired | 98 |
ENSG00000130723 | 8 | retired | 103 |
ENSG00000286094 | 6 | retired | 101 |
ENSG00000183791 | 5 | retired | 101 |
ENSG00000284041 | 5 | retired | 97 |
ENSG00000267697 | 5 | retired | 100 |
ENSG00000213865 | 4 | retired | 97 |
ENSG00000274897 | 4 | retired | 101 |
ENSG00000285441 | 3 | retired | 101 |
ENSG00000260869 | 2 | retired | 97 |
ENSG00000277669 | 2 | retired | 97 |
ENSG00000213029 | 2 | retired | 100 |
ENSG00000281028 | 2 | retired | 97 |
ENSG00000274744 | 2 | retired | 101 |
ENSG00000262621 | 2 | retired | 98 |
ENSG00000278674 | 2 | retired | 101 |
ENSG00000272949 | 1 | retired | 97 |
ENSG00000286261 | 1 | retired | 102 |
ENSG00000277726 | 1 | retired | 103 |
ENSG00000255863 | 1 | retired | 97 |
All failed IDs turn out to have been retired at some point in the past. The earliest release in which any of them was retired is v97, which suggests that the gene set for the last Genetics Portal release was based on Ensembl v96, released in April 2019.
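The reasoning above can be reproduced in a few lines of Python. This is only a minimal sketch: it assumes the table is available as (gene ID, retirement release) pairs, the sample holds just four rows copied from the table, and `infer_base_release` and `retirements_per_release` are hypothetical helper names.

```python
from collections import Counter

# A few (targetFromSourceId, version) rows copied from the table above.
retired = [
    ("ENSG00000269881", 97),
    ("ENSG00000254462", 99),
    ("ENSG00000204683", 101),
    ("ENSG00000263264", 103),
]


def infer_base_release(rows):
    """Return the release just before the earliest retirement.

    Assumption: an ID retired in release N was still valid in N - 1,
    so a gene set containing it is no newer than min(N) - 1.
    """
    earliest = min(version for _, version in rows)
    return earliest - 1


def retirements_per_release(rows):
    """Count how many IDs were retired in each release."""
    return Counter(version for _, version in rows)


print(infer_base_release(retired))  # 96
```

Note that this only gives an upper bound on the release; the file date discussed below is what makes v96 the likely origin.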
I traced the source of this information to the V2G pipeline, more specifically its Snakemake configuration:
# Gene information
genes: gs://genetics-portal-input/luts/19.06_gene_symbol_synonym_map.json
The creation date of the file (19.06) is consistent with the suspected v96 origin of the file set. This is a simple JSON file with some gene-related information: genomic location and gene names/synonyms.
{
"gene_id": "ENSG00000205002",
"gene_name": "AARD",
"gene_synonyms": [
"LOC441376",
"C8orf85"
],
"gene_chrom": "8",
"gene_start": 116938199,
"gene_end": 116944487
}
When joining this dataset with the retired gene identifiers, we get a perfect match: all retired IDs can be found in this file.
targetFromSourceId | evidenceCount | status | version | gene_name | gene_chrom | gene_start | gene_end |
---|---|---|---|---|---|---|---|
ENSG00000269881 | 48 | retired | 97 | AC004754.1 | 16 | 249547 | 269943 |
ENSG00000130489 | 45 | retired | 97 | SCO2 | 22 | 50523568 | 50525606 |
ENSG00000254462 | 43 | retired | 99 | TMX2-CTNND1 | 11 | 57712605 | 57791586 |
ENSG00000283932 | 42 | retired | 97 | AL121722.1 | 20 | 22560553 | 22584261 |
ENSG00000278272 | 36 | retired | 97 | HIST1H3C | 6 | 26045411 | 26045821 |
ENSG00000270898 | 34 | retired | 97 | GPR75-ASB3 | 2 | 53670293 | 53860160 |
ENSG00000278500 | 32 | retired | 97 | AC009336.2 | 2 | 176151085 | 176173097 |
ENSG00000261833 | 23 | retired | 97 | AC104151.1 | 16 | 76553417 | 76819624 |
ENSG00000274267 | 22 | retired | 97 | HIST1H3B | 6 | 26031650 | 26032060 |
ENSG00000204683 | 20 | retired | 101 | C10orf113 | 10 | 21125763 | 21146559 |
ENSG00000285258 | 20 | retired | 100 | ATXN7 | 3 | 63864557 | 64003462 |
ENSG00000267545 | 20 | retired | 97 | AC005779.2 | 19 | 45179822 | 45202444 |
ENSG00000286169 | 18 | retired | 102 | AHRR | 5 | 271670 | 438291 |
ENSG00000263264 | 17 | retired | 103 | AC119396.1 | 19 | 7348943 | 7383385 |
ENSG00000181013 | 14 | retired | 98 | C17orf47 | 17 | 58541587 | 58544368 |
ENSG00000243444 | 14 | retired | 98 | PALM2 | 9 | 109640788 | 109951476 |
ENSG00000241978 | 13 | retired | 98 | AKAP2 | 9 | 110048598 | 110172512 |
ENSG00000268861 | 13 | retired | 103 | AC008878.3 | 19 | 7382834 | 7472477 |
ENSG00000183729 | 11 | retired | 101 | NPBWR1 | 8 | 52938431 | 52941117 |
ENSG00000260300 | 8 | retired | 98 | AC009119.2 | 16 | 83908132 | 83951445 |
ENSG00000130723 | 8 | retired | 103 | PRRC2B | 9 | 131373636 | 131500197 |
ENSG00000286094 | 6 | retired | 101 | AC026740.3 | 5 | 716808 | 766919 |
ENSG00000183791 | 5 | retired | 101 | ELOA3 | 18 | 47028202 | 47030078 |
ENSG00000284041 | 5 | retired | 97 | AC073111.3 | 7 | 150368790 | 150396915 |
ENSG00000267697 | 5 | retired | 100 | LUZP6 | 7 | 135927274 | 135927450 |
ENSG00000213865 | 4 | retired | 97 | C8orf44 | 8 | 66667615 | 66685564 |
ENSG00000274897 | 4 | retired | 101 | PANO1 | 11 | 797511 | 799185 |
ENSG00000285441 | 3 | retired | 101 | SOD2 | 6 | 159679119 | 159762529 |
ENSG00000260869 | 2 | retired | 97 | AC002310.4 | 16 | 30534227 | 30558279 |
ENSG00000277669 | 2 | retired | 97 | AC009086.2 | 16 | 29663279 | 29695144 |
ENSG00000213029 | 2 | retired | 100 | SPHAR | 1 | 229304857 | 229305504 |
ENSG00000281028 | 2 | retired | 97 | AC104662.2 | 4 | 25160663 | 25277306 |
ENSG00000274744 | 2 | retired | 101 | ELOA3D | 18 | 46962768 | 46964408 |
ENSG00000262621 | 2 | retired | 98 | AC025283.2 | 16 | 3382113 | 3397745 |
ENSG00000278674 | 2 | retired | 101 | ELOA3B | 18 | 47022287 | 47023927 |
ENSG00000272949 | 1 | retired | 97 | AC093668.2 | 7 | 102483344 | 102543764 |
ENSG00000286261 | 1 | retired | 102 | AC022137.3 | 19 | 53431984 | 53461862 |
ENSG00000277726 | 1 | retired | 103 | AL109811.3 | 1 | 11012662 | 11030528 |
ENSG00000255863 | 1 | retired | 97 | AC073610.2 | 12 | 48921963 | 48939663 |
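The join behind the table above can be sketched as follows. This is an illustration rather than the code I actually ran: it assumes the gene map has already been loaded into a list of dictionaries with the field names shown in the JSON example, and `find_missing_ids` and `annotate` are hypothetical helper names. The sample rows are copied from the table.

```python
# Two gene records using the field names from the JSON example
# (synonyms omitted here because they are not needed for the join).
gene_records = [
    {"gene_id": "ENSG00000130489", "gene_name": "SCO2",
     "gene_chrom": "22", "gene_start": 50523568, "gene_end": 50525606},
    {"gene_id": "ENSG00000285258", "gene_name": "ATXN7",
     "gene_chrom": "3", "gene_start": 63864557, "gene_end": 64003462},
]

# Retired identifiers with their evidence counts, as in the first table.
retired_rows = [
    {"targetFromSourceId": "ENSG00000130489", "evidenceCount": 45,
     "status": "retired", "version": 97},
    {"targetFromSourceId": "ENSG00000285258", "evidenceCount": 20,
     "status": "retired", "version": 100},
]


def find_missing_ids(rows, records):
    """Return retired IDs that do NOT appear in the gene map."""
    known = {rec["gene_id"] for rec in records}
    return sorted({row["targetFromSourceId"] for row in rows} - known)


def annotate(rows, records):
    """Left-join retired rows onto gene records by gene ID."""
    by_id = {rec["gene_id"]: rec for rec in records}
    extra = ("gene_name", "gene_chrom", "gene_start", "gene_end")
    joined = []
    for row in rows:
        rec = by_id.get(row["targetFromSourceId"], {})
        joined.append({**row, **{key: rec.get(key) for key in extra}})
    return joined


# An empty list means every retired ID was found in the gene map.
print(find_missing_ids(retired_rows, gene_records))  # []
```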
There's one caveat to this analysis: I cannot tell for sure where else this file is used. However, given the simplicity of the file, I would suggest adding an extra rule to the pipeline to generate gene_symbol_synonym_map.json.
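As a rough illustration of what such a rule might run, the sketch below maps one gene lookup result onto the record layout shown earlier. It rests on assumptions: the input dictionary is assumed to use the field names `id`, `display_name`, `seq_region_name`, `start` and `end` (the names I would expect from the Ensembl REST gene lookup, which should be verified before use), synonyms are assumed to be supplied separately, and `to_map_record` is a hypothetical helper that does not exist in the pipeline.

```python
import json


def to_map_record(lookup, synonyms):
    """Convert one gene lookup result into a gene map record.

    `lookup` is assumed to carry the keys id, display_name,
    seq_region_name, start and end; `synonyms` is a list of strings
    obtained elsewhere (for example from a cross-reference query).
    """
    return {
        "gene_id": lookup["id"],
        "gene_name": lookup["display_name"],
        "gene_synonyms": list(synonyms),
        "gene_chrom": str(lookup["seq_region_name"]),
        "gene_start": int(lookup["start"]),
        "gene_end": int(lookup["end"]),
    }


# Values taken from the AARD example record shown earlier.
aard_lookup = {
    "id": "ENSG00000205002",
    "display_name": "AARD",
    "seq_region_name": "8",
    "start": 116938199,
    "end": 116944487,
}
record = to_map_record(aard_lookup, ["LOC441376", "C8orf85"])
print(json.dumps(record, indent=2))
```

Generating the file inside the pipeline would tie the gene set to whichever Ensembl release the rule is pointed at, instead of a static file from 19.06.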
It's worth noting that there may be other locations where a specific version of the gene mapping has been used. E.g. this is in Miguel's config file:
ensembl { lut = ${input}"/lut/homo_sapiens_core_96_38_genes.json" }
It's located here: gs://genetics-portal-data/lut/homo_sapiens_core_96_38_genes.json
This file is also used in the L2G pipeline: https://github.com/opentargets/genetics-l2g-scoring/blob/master/1_feature_engineering/1_prepare_inputs.py
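One way to spot such pinned versions is to read the release number out of the file name. The sketch below assumes the lookup table keeps the Ensembl core database naming pattern (`homo_sapiens_core_<release>_<assembly>`), as the path above does; `parse_core_release` is a hypothetical helper name.

```python
import re

# Matches e.g. "homo_sapiens_core_96_38": release 96, assembly 38.
CORE_DB_PATTERN = re.compile(r"homo_sapiens_core_(\d+)_(\d+)")


def parse_core_release(path):
    """Return (release, assembly) parsed from a path, or None."""
    match = CORE_DB_PATTERN.search(path)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


lut = "gs://genetics-portal-data/lut/homo_sapiens_core_96_38_genes.json"
print(parse_core_release(lut))  # (96, 38)
```

Files that do not follow this naming, such as 19.06_gene_symbol_synonym_map.json, return None and would still need to be checked by hand.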
A small amount of disease/target evidence fails validation when ingested by the OpenTargets Platform ETL because of invalid target identifiers (39 genes and ~200 evidence).
The full list of all genes and their source can be found in the attached table under issue #1554
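For illustration, a check of this kind can be sketched as below. This is not the Platform ETL's actual validation: `failing_targets` is a hypothetical helper, the evidence rows are toy data that only borrow the `targetFromSourceId` field name, and the set of valid IDs is assumed to come from the current Ensembl gene set.

```python
from collections import Counter


def failing_targets(evidence, valid_ids):
    """Count evidence per target ID that is absent from valid_ids.

    Returns (target ID, evidence count) pairs, most frequent first.
    """
    counts = Counter(
        row["targetFromSourceId"]
        for row in evidence
        if row["targetFromSourceId"] not in valid_ids
    )
    return counts.most_common()


# Toy data: the counts here are NOT the real evidence counts.
evidence = [
    {"targetFromSourceId": "ENSG00000130489"},
    {"targetFromSourceId": "ENSG00000130489"},
    {"targetFromSourceId": "ENSG00000205002"},
]
valid_ids = {"ENSG00000205002"}  # assumed still current, for the example

print(failing_targets(evidence, valid_ids))  # [('ENSG00000130489', 2)]
```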