Investigating the source of invalid gene identifiers in the genetics pipeline

DSuveges commented 3 years ago

A small number of disease/target evidence strings fail validation when ingested by the OpenTargets Platform ETL because of invalid target identifiers (39 genes and ~200 pieces of evidence).

The full list of all genes and their source can be found in the attached table under issue #1554

[x] Investigate why these genes are failing. When were they last in the canonical Ensembl release?
[x] Investigate where the genetics portal pipeline fetches the gene list from when generating L2G scores or in other steps.

DSuveges commented 3 years ago

What happened to these gene identifiers?

The following table summarises the history of the Ensembl gene identifiers that failed the platform ETL check:

targetFromSourceId	evidenceCount	status	version
ENSG00000269881	48	retired	97
ENSG00000130489	45	retired	97
ENSG00000254462	43	retired	99
ENSG00000283932	42	retired	97
ENSG00000278272	36	retired	97
ENSG00000270898	34	retired	97
ENSG00000278500	32	retired	97
ENSG00000261833	23	retired	97
ENSG00000274267	22	retired	97
ENSG00000204683	20	retired	101
ENSG00000285258	20	retired	100
ENSG00000267545	20	retired	97
ENSG00000286169	18	retired	102
ENSG00000263264	17	retired	103
ENSG00000181013	14	retired	98
ENSG00000243444	14	retired	98
ENSG00000241978	13	retired	98
ENSG00000268861	13	retired	103
ENSG00000183729	11	retired	101
ENSG00000260300	8	retired	98
ENSG00000130723	8	retired	103
ENSG00000286094	6	retired	101
ENSG00000183791	5	retired	101
ENSG00000284041	5	retired	97
ENSG00000267697	5	retired	100
ENSG00000213865	4	retired	97
ENSG00000274897	4	retired	101
ENSG00000285441	3	retired	101
ENSG00000260869	2	retired	97
ENSG00000277669	2	retired	97
ENSG00000213029	2	retired	100
ENSG00000281028	2	retired	97
ENSG00000274744	2	retired	101
ENSG00000262621	2	retired	98
ENSG00000278674	2	retired	101
ENSG00000272949	1	retired	97
ENSG00000286261	1	retired	102
ENSG00000277726	1	retired	103
ENSG00000255863	1	retired	97

All failing IDs turn out to have been retired at some point in the past. The earliest release in which any of them was retired is v97, which suggests that the gene set for the last Genetics Portal release was based on Ensembl v96, released in April 2019.
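The reasoning above can be sketched in a few lines of Python. This is a minimal illustration, not part of the pipeline: the function names are hypothetical, only a handful of rows from the table are included, and it assumes the `version` column holds the first release in which the ID was no longer present.

```python
from collections import Counter

# (targetFromSourceId, evidenceCount, retirement release), copied from the table above.
# Only a subset of the 39 rows is listed here to keep the sketch short.
RETIRED = [
    ("ENSG00000269881", 48, 97),
    ("ENSG00000254462", 43, 99),
    ("ENSG00000285258", 20, 100),
    ("ENSG00000204683", 20, 101),
    ("ENSG00000286169", 18, 102),
    ("ENSG00000263264", 17, 103),
    ("ENSG00000181013", 14, 98),
]


def infer_source_release(rows):
    """Return the latest release that could still contain every listed ID.

    Assumption: an ID retired in release N was last present in release N - 1.
    """
    earliest_retirement = min(release for _, _, release in rows)
    return earliest_retirement - 1


def retirements_per_release(rows):
    """Count how many IDs were retired in each release, ordered by release."""
    counts = Counter(release for _, _, release in rows)
    return dict(sorted(counts.items()))


print("Inferred source release:", infer_source_release(RETIRED))
print("Retirements per release:", retirements_per_release(RETIRED))
```

If the gene list had been built from a later release, none of the IDs retired in v97 could have been present in it, which is why the minimum is the informative value here.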

DSuveges commented 3 years ago

Where is the data coming from?

I identified the source of this information in the V2G pipeline, more specifically in the snakemake configuration:

# Gene information
genes: gs://genetics-portal-input/luts/19.06_gene_symbol_synonym_map.json

The creation date of the file (19.06) is consistent with the suspected v96 origin of the gene set. It is a simple JSON file with basic gene information: genomic location and gene names/synonyms.

{
  "gene_id": "ENSG00000205002",
  "gene_name": "AARD",
  "gene_synonyms": [
    "LOC441376",
    "C8orf85"
  ],
  "gene_chrom": "8",
  "gene_start": 116938199,
  "gene_end": 116944487
}

When joining this dataset with the retired gene identifiers, we get a perfect match: all retired IDs can be found in this file.

targetFromSourceId	evidenceCount	status	version	gene_name	gene_chrom	gene_start	gene_end
ENSG00000269881	48	retired	97	AC004754.1	16	249547	269943
ENSG00000130489	45	retired	97	SCO2	22	50523568	50525606
ENSG00000254462	43	retired	99	TMX2-CTNND1	11	57712605	57791586
ENSG00000283932	42	retired	97	AL121722.1	20	22560553	22584261
ENSG00000278272	36	retired	97	HIST1H3C	6	26045411	26045821
ENSG00000270898	34	retired	97	GPR75-ASB3	2	53670293	53860160
ENSG00000278500	32	retired	97	AC009336.2	2	176151085	176173097
ENSG00000261833	23	retired	97	AC104151.1	16	76553417	76819624
ENSG00000274267	22	retired	97	HIST1H3B	6	26031650	26032060
ENSG00000204683	20	retired	101	C10orf113	10	21125763	21146559
ENSG00000285258	20	retired	100	ATXN7	3	63864557	64003462
ENSG00000267545	20	retired	97	AC005779.2	19	45179822	45202444
ENSG00000286169	18	retired	102	AHRR	5	271670	438291
ENSG00000263264	17	retired	103	AC119396.1	19	7348943	7383385
ENSG00000181013	14	retired	98	C17orf47	17	58541587	58544368
ENSG00000243444	14	retired	98	PALM2	9	109640788	109951476
ENSG00000241978	13	retired	98	AKAP2	9	110048598	110172512
ENSG00000268861	13	retired	103	AC008878.3	19	7382834	7472477
ENSG00000183729	11	retired	101	NPBWR1	8	52938431	52941117
ENSG00000260300	8	retired	98	AC009119.2	16	83908132	83951445
ENSG00000130723	8	retired	103	PRRC2B	9	131373636	131500197
ENSG00000286094	6	retired	101	AC026740.3	5	716808	766919
ENSG00000183791	5	retired	101	ELOA3	18	47028202	47030078
ENSG00000284041	5	retired	97	AC073111.3	7	150368790	150396915
ENSG00000267697	5	retired	100	LUZP6	7	135927274	135927450
ENSG00000213865	4	retired	97	C8orf44	8	66667615	66685564
ENSG00000274897	4	retired	101	PANO1	11	797511	799185
ENSG00000285441	3	retired	101	SOD2	6	159679119	159762529
ENSG00000260869	2	retired	97	AC002310.4	16	30534227	30558279
ENSG00000277669	2	retired	97	AC009086.2	16	29663279	29695144
ENSG00000213029	2	retired	100	SPHAR	1	229304857	229305504
ENSG00000281028	2	retired	97	AC104662.2	4	25160663	25277306
ENSG00000274744	2	retired	101	ELOA3D	18	46962768	46964408
ENSG00000262621	2	retired	98	AC025283.2	16	3382113	3397745
ENSG00000278674	2	retired	101	ELOA3B	18	47022287	47023927
ENSG00000272949	1	retired	97	AC093668.2	7	102483344	102543764
ENSG00000286261	1	retired	102	AC022137.3	19	53431984	53461862
ENSG00000277726	1	retired	103	AL109811.3	1	11012662	11030528
ENSG00000255863	1	retired	97	AC073610.2	12	48921963	48939663
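The join behind the table above could look roughly like the sketch below. It is an illustration under assumptions, not the code that was actually run: the function name is hypothetical, the records are passed in as already-parsed dictionaries (the thread does not state whether the file is a JSON array or newline-delimited JSON), and the sample rows are copied from the tables and the example record in this thread.

```python
def join_retired_with_lut(retired_rows, lut_records):
    """Left-join retired IDs onto the gene lookup table by Ensembl gene ID.

    Returns (matched, missing): annotated rows for IDs found in the lookup
    table, and the list of IDs that were not found.
    """
    lut_by_id = {rec["gene_id"]: rec for rec in lut_records}
    matched, missing = [], []
    for row in retired_rows:
        rec = lut_by_id.get(row["targetFromSourceId"])
        if rec is None:
            missing.append(row["targetFromSourceId"])
            continue
        matched.append({
            **row,
            "gene_name": rec["gene_name"],
            "gene_chrom": rec["gene_chrom"],
            "gene_start": rec["gene_start"],
            "gene_end": rec["gene_end"],
        })
    return matched, missing


# Two retired IDs from the first table.
retired_rows = [
    {"targetFromSourceId": "ENSG00000130489", "evidenceCount": 45, "status": "retired", "version": 97},
    {"targetFromSourceId": "ENSG00000285441", "evidenceCount": 3, "status": "retired", "version": 101},
]

# Lookup records in the schema of the example above (synonyms omitted where unknown).
lut_records = [
    {"gene_id": "ENSG00000205002", "gene_name": "AARD", "gene_chrom": "8",
     "gene_start": 116938199, "gene_end": 116944487},
    {"gene_id": "ENSG00000130489", "gene_name": "SCO2", "gene_chrom": "22",
     "gene_start": 50523568, "gene_end": 50525606},
    {"gene_id": "ENSG00000285441", "gene_name": "SOD2", "gene_chrom": "6",
     "gene_start": 159679119, "gene_end": 159762529},
]

matched, missing = join_retired_with_lut(retired_rows, lut_records)
print(len(matched), "matched;", len(missing), "missing")
```

An empty `missing` list corresponds to the "perfect match" reported here: every retired ID is present in the lookup file.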

There is one caveat to this analysis: I cannot tell for sure where else this file is used. However, given the simplicity of the file, I would suggest adding an extra rule to the pipeline to generate gene_symbol_synonym_map.json.

Jeremy37 commented 3 years ago

It's worth noting that there may be other locations where a specific version of the gene mapping has been used. E.g. this is in Miguel's config file:

ensembl { lut = ${input}"/lut/homo_sapiens_core_96_38_genes.json" }

It's located here: gs://genetics-portal-data/lut/homo_sapiens_core_96_38_genes.json

This file is also used in the L2G pipeline: https://github.com/opentargets/genetics-l2g-scoring/blob/master/1_feature_engineering/1_prepare_inputs.py
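To find other places where a gene mapping is pinned to a release, one option is to scan configuration values for Ensembl core database names. The sketch below is only a starting point: the function name is hypothetical, and it assumes file names follow the `homo_sapiens_core_<release>_<assembly>` pattern seen in the path above. Files named differently, such as the date-stamped 19.06 map, are not detected this way.

```python
import re

# Matches names like "homo_sapiens_core_96_38": release 96, assembly GRCh38.
CORE_DB_PATTERN = re.compile(r"homo_sapiens_core_(\d+)_(\d+)")


def pinned_ensembl_releases(paths):
    """Map each path that embeds a core database name to (release, assembly)."""
    found = {}
    for path in paths:
        match = CORE_DB_PATTERN.search(path)
        if match:
            found[path] = (int(match.group(1)), int(match.group(2)))
    return found


# The two paths mentioned in this thread.
config_paths = [
    "gs://genetics-portal-data/lut/homo_sapiens_core_96_38_genes.json",
    "gs://genetics-portal-input/luts/19.06_gene_symbol_synonym_map.json",
]

for path, (release, assembly) in pinned_ensembl_releases(config_paths).items():
    print(f"{path}: Ensembl release {release}, assembly {assembly}")
```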

Investigating the source of invalid gene identifiers in the genetics pipeline #2286