Closed MichaelaEBI closed 4 years ago
One more observation: for the more detailed examples above, we assign HGNC ID and UniProt accession to one of the two Ensembl IDs only:
ATF7 - HGNC ID and UniProt accessions for one entry only:
4203 ENSG00000170653 ATF7 ATF7 HGNC:792 P17544|A5D6Y4|B2RMP1|B4DQL4|Q13814|Q8IVR8|Q9UD83 63
57347 ENSG00000267281 ATF7 ATF7 90
SOD2 - HGNC ID and UniProt accessions for one entry only:
5578 ENSG00000112096 SOD2 SOD2 HGNC:11180 P04179|B2R7R1|B3KUK2|B4DL20|B4E3K9|E1P5A9|P78434|Q16792|Q5TCM1|Q96EE6|Q9P2Z3
21413 ENSG00000285441 SOD2 SOD2
ABCF2 - UniProt Accessions for both, but inconsistent:
33085 ENSG00000285292 ABCF2 ABCF2 Q9UG63|O60864|Q75MJ0|Q75MJ1|Q96TE8 48
47919 ENSG00000033050 ABCF2 ABCF2 HGNC:71 Q9UG63 10
AHRR - HGNC ID and UniProt accessions for one entry only:
3609 ENSG00000063438 AHRR AHRR HGNC:346 A9YTQ3|A7MBN5|D6RAZ1|Q9HAZ3|Q9ULI6 216
21868 ENSG00000286169 AHRR AHRR 155
The four that you highlight (ATF7, SOD2, ABCF2 and AHRR) are all because of an alternate readthrough transcript that has picked up some of the assignments. This could be true of ARMCX5-GPRASP2 too. Is this the case for all the others as well? If there are other causes we will need to investigate those, but for readthrough assignments...
The question is whether the assignment is intentional or not. I suspect it isn't. So the action is to go to the data suppliers (or our pipeline) and investigate why this happens, noting that we do the assignment for GWAS and phenodigm, and then come up with a solution. One possible solution would be to avoid assigning to readthrough transcript genes automatically, if there is a way to identify them. Are they tagged in some way in the biotype?
The UniProt assignments may be right: the readthrough transcript will code for a different protein and so have a different UniProt identifier.
But part of the detail may be the way that the OT assignment pipeline for variants works.
Correction: ENSG00000285441 is overlapping ACAT2 exons, but the logic still applies. We need to investigate whether there is a reason for these mappings, and whether they are biologically meaningful or an artefact of the mapping system, and then correct them.
Readthrough transcripts are a real minefield in the gene annotation world. They cause havoc when it comes to cross-referencing the gene locus against HGNC, UniProt, Entrez, MGI, etc.
They're manually annotated by HAVANA (now part of Ensembl) and the majority of them will not have a protein sequence that is 100% identical to the transcript sequence. There is an annotation attribute called "overlapping locus" that could be used to filter these out. @iandunham, I don't think readthroughs are tagged in the biotypes, but we can email the Ensembl Helpdesk to see if this is available elsewhere and what is the easiest way to get it.
Readthrough transcripts will only partially match a high-confidence UniProt ID.
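As a rough illustration of the "overlapping locus" filtering idea, here is a minimal sketch that collects gene IDs carrying a given tag from GTF lines. It assumes a GENCODE-style GTF in which tags appear in the ninth column as `tag "overlapping_locus";`. Whether the readthrough loci in this ticket actually carry that tag has not been checked, and the function name, the sample lines and the gene IDs (`GENE_A`, `GENE_B`) are made up for the example.

```python
import re

# Matches attribute pairs of the form: key "value";
ATTR = re.compile(r'(\S+) "([^"]*)";')


def gene_ids_with_tag(gtf_lines, wanted_tag):
    """Return the set of gene_id values whose attribute column contains wanted_tag."""
    hits = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        attrs = ATTR.findall(fields[8])
        gene_id = next((v for k, v in attrs if k == "gene_id"), None)
        tags = {v for k, v in attrs if k == "tag"}
        if gene_id and wanted_tag in tags:
            hits.add(gene_id)
    return hits


# Illustrative records only (placeholder IDs, not real annotation).
sample = [
    "##description: example",
    'chr1\tHAVANA\tgene\t1\t100\t.\t+\t.\tgene_id "GENE_A"; gene_name "A"; tag "overlapping_locus";',
    'chr1\tHAVANA\tgene\t200\t300\t.\t+\t.\tgene_id "GENE_B"; gene_name "B";',
]
tagged = gene_ids_with_tag(sample, "overlapping_locus")
```

The resulting set could then be used as an exclusion list before assigning HGNC or UniProt cross-references, if the tag turns out to be reliable.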
One way to identify readthrough loci is by looking at their name. Is it a combination of two HGNC names? ATF7-NPFF and ARMCX5-GPRASP2 would be easily identified as readthrough.
Note: we can remove ITFG2-AS1 from the list above. It's not protein coding. AS stands for antisense. It's an antisense to ITFG2. Biotype should be lncRNA.
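The name-based check described above could be sketched roughly as follows. This is only an illustration: `classify_symbol` and the small `known` set are hypothetical, and a real check would need the full set of approved HGNC symbols. Note also that it only helps where the locus is actually named as a combination (e.g. ATF7-NPFF), not where both entries carry the plain symbol.

```python
import re

# Antisense loci are assumed here to end in "-AS" plus an optional number, e.g. ITFG2-AS1.
ANTISENSE = re.compile(r"-AS\d*$")


def classify_symbol(symbol, known_symbols):
    """Classify a gene name as 'antisense', 'readthrough' or 'other' from the name alone."""
    if ANTISENSE.search(symbol):
        return "antisense"
    # Try every hyphen as a split point: both halves must be known symbols.
    for i, ch in enumerate(symbol):
        if ch == "-" and symbol[:i] in known_symbols and symbol[i + 1:] in known_symbols:
            return "readthrough"
    return "other"


# Tiny illustrative symbol set; in practice this would come from an HGNC download.
known = {"ATF7", "NPFF", "ARMCX5", "GPRASP2", "ITFG2", "SOD2"}

results = {
    name: classify_symbol(name, known)
    for name in ["ATF7-NPFF", "ARMCX5-GPRASP2", "ITFG2-AS1", "SOD2"]
}
```

Requiring both halves to be known symbols avoids flagging ordinary hyphenated names whose parts are not genes in their own right.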
Happy to help look into these cases to see whether the remaining ones are readthroughs or not, @MichaelaEBI and @iandunham. Just shout if you need me.
Duplicate, as work is ongoing in #801.
More details to be added for more of the targets - and thinking about action to be taken for the different examples.
Looking at the gene list file, there are 147 targets with more than 1 Ensembl ID, many of which are rRNAs, ncRNAs, snoRNAs, long intergenic non-protein coding RNA (LINCxxx) etc. There are <30 protein coding targets that are the subject of this ticket.
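For reference, a count like the one above can be reproduced with a small grouping step. The sketch below assumes the gene list can be read as (Ensembl ID, symbol, HGNC ID) rows, which is a guess at the file layout. The ATF7 and SOD2 rows are taken from this ticket, while the `EXAMPLE1` row is a placeholder added to show a symbol with a single ID.

```python
from collections import defaultdict


def symbols_with_multiple_ids(rows):
    """Map each symbol that has more than one Ensembl ID to its sorted list of IDs."""
    by_symbol = defaultdict(set)
    for ensembl_id, symbol, _hgnc_id in rows:
        by_symbol[symbol].add(ensembl_id)
    return {s: sorted(ids) for s, ids in by_symbol.items() if len(ids) > 1}


rows = [
    ("ENSG00000170653", "ATF7", "HGNC:792"),
    ("ENSG00000267281", "ATF7", ""),
    ("ENSG00000112096", "SOD2", "HGNC:11180"),
    ("ENSG00000285441", "SOD2", ""),
    ("ENSG_PLACEHOLDER", "EXAMPLE1", ""),  # hypothetical single-ID entry
]
duplicates = symbols_with_multiple_ids(rows)
```

Filtering the result by biotype afterwards would separate the protein-coding cases from the rRNA, ncRNA and snoRNA entries.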
Here are some detailed examples:
This is the complete list: 28 (mostly) non-RNA targets that appear twice in the platform (bold gene names indicate that the gene has been looked at):
- ABCF2 (both Ensembl IDs linked to HGNC:71, see details above)
- AHRR (both Ensembl IDs linked to HGNC:346, see details above)
- ARMCX5-GPRASP2 (readthrough from NCBI & HGNC, HGNC links to same NCBI Acc)
- ATF7 (one is a readthrough transcript from NCBI, but [NCBI link does not work](https://www.ncbi.nlm.nih.gov/gene?cmd=Retrieve&dopt=Graphics&list_uids=114108587), see details above)
- ATXN7 (both Ensembl IDs linked to HGNC:10560)
- CCDC39 (both Ensembl IDs linked to HGNC:25244)
- DIABLO (both Ensembl IDs linked to HGNC:21528)
- DUXAP8 (both Ensembl IDs linked to HGNC:32187)
- GGT1
- GOLGA8M
- HSPA14
- IGF2
- ITFG2-AS1
- MATR3
- PDE11A
- PINX1
- POLR2J3
- POLR2J4
- PRSS50
- RMRP
- SCARNA4
- SCO2
- SFTA3
- SOD2 (see details above)
- SPDYE17
- TBCE
- TMSB15B
- ZNF883