opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Targets with multiple Ensembl IDs #696

Closed MichaelaEBI closed 4 years ago

MichaelaEBI commented 4 years ago

More details to be added for more of the targets - and thinking about action to be taken for the different examples.

Looking at the gene list file, there are 147 targets with more than 1 Ensembl ID, many of which are rRNAs, ncRNAs, snoRNAs, long intergenic non-protein coding RNA (LINCxxx) etc. There are <30 protein coding targets that are the subject of this ticket.
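For anyone wanting to reproduce the count, here is a minimal sketch of the duplicate check. It assumes the gene list can be read as (Ensembl ID, symbol) pairs; the actual layout of the gene list file is not shown in this ticket, and the function name `symbols_with_multiple_ids` is made up for illustration. The sample rows are IDs quoted in this ticket.

```python
from collections import defaultdict


def symbols_with_multiple_ids(rows):
    """Group Ensembl gene IDs by symbol and keep symbols that have more than one ID."""
    ids_by_symbol = defaultdict(set)
    for ensembl_id, symbol in rows:
        ids_by_symbol[symbol].add(ensembl_id)
    return {
        symbol: sorted(ids)
        for symbol, ids in ids_by_symbol.items()
        if len(ids) > 1
    }


# Hypothetical input: a few (Ensembl ID, symbol) pairs taken from this ticket.
rows = [
    ("ENSG00000170653", "ATF7"),
    ("ENSG00000267281", "ATF7"),
    ("ENSG00000112096", "SOD2"),
    ("ENSG00000285441", "SOD2"),
    ("ENSG00000033050", "ABCF2"),  # only one ID in this sample, so not reported
]

duplicates = symbols_with_multiple_ids(rows)
for symbol, ids in sorted(duplicates.items()):
    print(symbol, ids)
```

Filtering the result by biotype (to separate the protein coding cases from the rRNAs, snoRNAs and so on) would need a biotype column, which this sketch does not assume.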

Here are some detailed examples:

  1. ATF7 is in the OT platform with Ensembl IDs ENSG00000170653 and ENSG00000267281. These two 'versions' have 22 and 24 disease associations, respectively. The second Ensembl ID actually refers to the ATF7-NPFF readthrough transcript and it looks like EPMC is using the first ID while Expression Atlas and PhenoDigm are using the second one. GWAS has data for both Ensembl IDs.
  2. SOD2 can be found in the OT platform with ENSG00000285441 and ENSG00000112096 with 240 and 319 disease associations, respectively. The first ID is used by PhenoDigm, the second by EPMC, CRISPR, UniProt and Expression Atlas. GWAS uses both IDs. --> This seems to be an Ensembl issue since Ensembl ID ENSG00000285441 refers to NCBI entry with Acc 6648 and Ensembl ID ENSG00000112096 refers to HGNC:11180 which then refers to the same NCBI entry with Acc 6648.
  3. ABCF2 can be found in the platform with Ensembl IDs ENSG00000285292 and ENSG00000033050 with 18 and 3 disease associations. Again, GWAS uses both Ensembl IDs whereas text mining and expression data use one ID each. Both Ensembl IDs link to HGNC:71.
  4. AHRR can be found in the platform with Ensembl IDs ENSG00000286169 and ENSG00000063438 with 32 and 86 disease associations. GWAS and Expression Atlas use both Ensembl IDs whereas text mining, SLAPEnrich and PheWAS use one ID and PhenoDigm uses the other one. Both Ensembl IDs link to HGNC:346.

This is the complete list: 28 (mostly) non-RNA targets that appear twice in the platform (bold gene names indicate that the gene has been looked at):

  - ABCF2 (both Ensembl IDs linked to HGNC:71, see details above)
  - AHRR (both Ensembl IDs linked to HGNC:346, see details above)
  - ARMCX5-GPRASP2 (readthrough from NCBI & HGNC, HGNC links to same NCBI Acc)
  - ATF7 (one is a readthrough transcript from NCBI, but [NCBI link does not work](https://www.ncbi.nlm.nih.gov/gene?cmd=Retrieve&dopt=Graphics&list_uids=114108587), see details above)
  - ATXN7 (both Ensembl IDs linked to HGNC:10560)
  - CCDC39 (both Ensembl IDs linked to HGNC:25244)
  - DIABLO (both Ensembl IDs linked to HGNC:21528)
  - DUXAP8 (both Ensembl IDs linked to HGNC:32187)
  - GGT1
  - GOLGA8M
  - HSPA14
  - IGF2
  - ITFG2-AS1
  - MATR3
  - PDE11A
  - PINX1
  - POLR2J3
  - POLR2J4
  - PRSS50
  - RMRP
  - SCARNA4
  - SCO2
  - SFTA3
  - SOD2 (see details above)
  - SPDYE17
  - TBCE
  - TMSB15B
  - ZNF883

MichaelaEBI commented 4 years ago

One more observation: for the more detailed examples above, we assign HGNC ID and UniProt accession to one of the two Ensembl IDs only:

  1. ATF7 - HGNC ID and UniProt accessions for one entry only:

    4203  ENSG00000170653 ATF7 ATF7 HGNC:792
    57347 ENSG00000267281 ATF7 ATF7         
                                                    V5 V6
    4203  P17544|A5D6Y4|B2RMP1|B4DQL4|Q13814|Q8IVR8|Q9UD83 63
    57347                                                  90
  2. SOD2 - HGNC ID and UniProt accessions for one entry only:

    5578  ENSG00000112096 SOD2 SOD2 HGNC:11180
    21413 ENSG00000285441 SOD2 SOD2           
                                                                                V5
    5578  P04179|B2R7R1|B3KUK2|B4DL20|B4E3K9|E1P5A9|P78434|Q16792|Q5TCM1|Q96EE6|Q9P2Z3
    21413                                                                             
  3. ABCF2 - UniProt Accessions for both, but inconsistent:

    33085 ENSG00000285292 ABCF2 ABCF2         Q9UG63|O60864|Q75MJ0|Q75MJ1|Q96TE8 48
    47919 ENSG00000033050 ABCF2 ABCF2 HGNC:71                             Q9UG63 10
  4. AHRR - HGNC ID and UniProt accessions for one entry only:

    3609  ENSG00000063438 AHRR AHRR HGNC:346 A9YTQ3|A7MBN5|D6RAZ1|Q9HAZ3|Q9ULI6 216
    21868 ENSG00000286169 AHRR AHRR                                             155

iandunham commented 4 years ago

The four that you highlight (ATF7, SOD2, ABCF2 and AHRR) are all because of an alternate readthrough transcript that has picked up some of the assignments. Could be true of ARMCX5-GPRASP2 too. Is this the case for all the others also? If there are other causes we will need to investigate those, but for readthrough assignments....

The question is whether the assignment is intentional or not. I suspect it isn't. So the action is to go to the data suppliers (or our pipeline) and investigate why this happens (noting that we do the assignment for GWAS and PhenoDigm), and then come up with a solution. One possible solution would be to avoid assigning to readthrough transcript genes automatically if there is a way to identify them - are they tagged in some way in the biotype?

iandunham commented 4 years ago

The UniProt assignments may be right - the readthrough transcript will code for a different protein and UniProt identifier.

But part of the detail may be the way that the OT assignment pipeline for variants works.

iandunham commented 4 years ago

Correction: ENSG00000285441 overlaps ACAT2 exons, but the logic still applies. We need to investigate whether there is a reason for these mappings, whether they are biologically meaningful or an artefact of the mapping system, and then correct them.

deniseOme commented 4 years ago

Readthrough transcripts are a minefield really in the gene annotation world. They do cause havoc when it comes to cross-referencing the gene locus against HGNC, UniProt, Entrez, MGI, etc...

They're manually annotated by HAVANA (now part of Ensembl) and the majority of them will not have a protein sequence that is 100% identical to the transcript sequence. There is an annotation attribute called "overlapping locus" that could be used to filter these out. @iandunham, I don't think readthroughs are tagged in the biotypes but we can email Ensembl Helpdesk to see if this is available elsewhere and what is the easiest way to get this.

Readthrough transcripts will only partially match a high-confidence UniProt ID.

One way to identify readthrough loci is by looking at their name. Is it a combination of two HGNC names? ATF7-NPFF and ARMCX5-GPRASP2 would be easily identified as readthrough.
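A minimal sketch of that name check, assuming we have a set of approved HGNC symbols to compare against. The set `known` below is a tiny hand-picked stand-in (not a real HGNC download) and `looks_like_readthrough` is a hypothetical helper name. Because some gene symbols contain hyphens themselves, the sketch tries every hyphen as the split point.

```python
def looks_like_readthrough(symbol, known_symbols):
    """Return True if `symbol` splits at some hyphen into two known gene symbols."""
    parts = symbol.split("-")
    for i in range(1, len(parts)):
        left = "-".join(parts[:i])
        right = "-".join(parts[i:])
        if left in known_symbols and right in known_symbols:
            return True
    return False


# Stand-in for a real list of approved symbols (assumption for illustration).
known = {"ATF7", "NPFF", "ARMCX5", "GPRASP2", "ITFG2", "SOD2"}

results = {
    name: looks_like_readthrough(name, known)
    for name in ["ATF7-NPFF", "ARMCX5-GPRASP2", "ITFG2-AS1", "SOD2"]
}
for name, flag in results.items():
    print(name, flag)
```

This only catches loci whose own name is a combination of two symbols. It would not flag a second Ensembl ID that carries the plain gene name, as the second ATF7 and SOD2 entries in the platform do, so it would probably need to be combined with another signal such as the "overlapping locus" attribute.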

Note: we can remove ITFG2-AS1 from the list above. It's not protein coding. AS stands for antisense. It's an antisense to ITFG2. Biotype should be lncRNA.

Happy to help look into these cases to see if the remaining ones are readthroughs or not. @MichaelaEBI and @iandunham. Just shout if you need me.

d0choa commented 4 years ago

Duplicate: work is ongoing in #801.