opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Multiple ensembl IDs with the same approved symbol #3172

Open mkarmona opened 10 months ago

mkarmona commented 10 months ago

For example: HERC2P9 or AKAP17A, among others. Aggregating by approved symbol, 340 approved symbols map to more than one Ensembl ID. Taking AKAP17A as an example: looking it up on the Ensembl website returns two matches with the same approved symbol. Both are human and both are flagged as Ensembl Canonical, but only the one flagged with MANE Select is the right one for certain biotypes, which means one has a canonical transcript.

Here, my question is: is keeping this multiplicity an expected behaviour?

DSuveges commented 10 months ago

Oh, this is very interesting. I would expect symbols to be unique to Ensembl gene identifiers except for a few corner cases: when the gene is mapped to scaffolds (which we exclude from the target index) or to the X/Y chromosomes (the case for AKAP17A, which I think is fine and expected). Interestingly, for HERC2P9 there are actually 4 Ensembl gene IDs for the same symbol: two are mapped to scaffolds (ENSG00000290513 and ENSG00000282417; they are not in OT), but two are mapped to a canonical chromosome (ENSG00000206149 and ENSG00000291082) with slightly different coordinates. In the latter case, I see a point in contacting the Ensembl team to clarify the ambiguity. I'll check if any of the implicated genes are actually protein coding.

We could add logic to disambiguate based on MANE annotation, but I'm not 100% convinced: the expectation that gene symbols are unique is arbitrary, in my opinion. We'll explore.
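A minimal sketch of what MANE-based disambiguation could look like, in plain Python rather than Spark so it runs anywhere. Everything here is an assumption for illustration: the function name `disambiguate_by_mane`, the `has_mane_select` flag, and the made-up symbols and IDs are not part of the target index schema or of any agreed design.

```python
from collections import defaultdict


def disambiguate_by_mane(targets):
    """Pick one gene ID per symbol where a MANE Select flag settles it.

    `targets` is an iterable of (symbol, gene_id, has_mane_select) tuples.
    Returns (resolved, unresolved): `resolved` maps symbol -> gene_id,
    `unresolved` maps symbol -> sorted list of candidate gene IDs.
    """
    groups = defaultdict(list)
    for symbol, gene_id, has_mane in targets:
        groups[symbol].append((gene_id, has_mane))

    resolved, unresolved = {}, {}
    for symbol, members in groups.items():
        if len(members) == 1:
            # Symbol already maps to a single ID: nothing to disambiguate.
            resolved[symbol] = members[0][0]
            continue
        mane_ids = [gene_id for gene_id, has_mane in members if has_mane]
        if len(mane_ids) == 1:
            resolved[symbol] = mane_ids[0]
        else:
            # Zero or several MANE Select candidates: leave it ambiguous.
            unresolved[symbol] = sorted(gene_id for gene_id, _ in members)
    return resolved, unresolved


# Hypothetical symbols, IDs and flags, purely to exercise the function.
EXAMPLE = [
    ("GENE_A", "ID_1", True),
    ("GENE_A", "ID_2", False),
    ("RNA_B", "ID_3", False),
    ("RNA_B", "ID_4", False),
    ("GENE_C", "ID_5", False),
]

resolved, unresolved = disambiguate_by_mane(EXAMPLE)
print(resolved)
print(unresolved)
```

Note that a symbol whose candidates all lack a MANE Select transcript stays unresolved, so a rule like this would only cover part of the ambiguous set.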

mkarmona commented 10 months ago

In the case of AKAP17A, both are included in OT and both have biotype protein_coding [1, 2]. Regarding the MANE annotation: it only works when there is a reference. It does not for certain RNAs like U2 or U3.

mkarmona commented 10 months ago

Also, the problem becomes more evident when different pieces of evidence are annotated, making the overall score fluctuate and mis-rank associated diseases. See: SHOX and SHOX

DSuveges commented 10 months ago

In the case of AKAP17A, both are included in OT and both have biotype protein_coding

Yes, you are right, but given AKAP17A is on the sex chromosomes, it is fine and kind of expected.

Also, the problem becomes more evident when different pieces of evidence are annotated, making the overall score fluctuate and mis-rank associated diseases.

It is hard to say what mis-ranking means in this case. Which evidence belongs to which target ID? It all depends on how the source identified the gene.

DSuveges commented 10 months ago

@mkarmona, apparently there are two symbols with more than 100 target identifiers:

+--------------+--------------------+-----------+
|approvedSymbol|             targets|targetCount|
+--------------+--------------------+-----------+
|         Y_RNA|[{ENSG00000222432...|        756|
|   Metazoa_SRP|[{ENSG00000280502...|        170|
|            U3|[{ENSG00000200538...|         50|
|            U6|[{ENSG00000275068...|         33|
|       SNORA70|[{ENSG00000252133...|         27|
|            U8|[{ENSG00000201398...|         22|
|            U2|[{ENSG00000274062...|         19|
|       5S_rRNA|[{ENSG00000277411...|          9|
|       SNORA72|[{ENSG00000207249...|          8|
|            U4|[{ENSG00000273744...|          7|
|           7SK|[{ENSG00000276626...|          7|
|       SNORA62|[{ENSG00000252443...|          7|
|       SNORA63|[{ENSG00000199473...|          7|
|            U7|[{ENSG00000275504...|          7|
|       SNORA75|[{ENSG00000212593...|          7|
|     5_8S_rRNA|[{ENSG00000283274...|          6|
|      DDX11L16|[{ENSG00000227159...|          4|
|       SNORD39|[{ENSG00000263723...|          4|
|         Vault|[{ENSG00000252485...|          4|
|        CD99P1|[{ENSG00000223773...|          4|
+--------------+--------------------+-----------+

This highlights that we have to be very careful with gene symbols.
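For readers without a Spark session, the aggregation behind a table like the one above can be sketched in plain Python. This is not the query that produced the table: the function name `ambiguous_symbols` is made up, and the sample is a handful of (symbol, ID) pairs quoted in this thread plus BRCA2, which is added only as a contrasting symbol with a single ID.

```python
from collections import defaultdict

# (approvedSymbol, Ensembl gene ID) pairs. HERC2P9 and MKKS IDs are quoted in
# this thread; BRCA2 is not from the thread and is added only for contrast.
# This is a tiny hand-picked sample, not an extract of the target index.
TARGETS = [
    ("HERC2P9", "ENSG00000206149"),
    ("HERC2P9", "ENSG00000291082"),
    ("MKKS", "ENSG00000125863"),
    ("MKKS", "ENSG00000285508"),
    ("MKKS", "ENSG00000285723"),
    ("BRCA2", "ENSG00000139618"),
]


def ambiguous_symbols(targets):
    """Return [(symbol, sorted_ids)] for symbols with more than one ID.

    Results are ordered by descending ID count, then by symbol.
    """
    ids_by_symbol = defaultdict(set)
    for symbol, gene_id in targets:
        ids_by_symbol[symbol].add(gene_id)
    ambiguous = [
        (symbol, sorted(ids))
        for symbol, ids in ids_by_symbol.items()
        if len(ids) > 1
    ]
    return sorted(ambiguous, key=lambda item: (-len(item[1]), item[0]))


for symbol, ids in ambiguous_symbols(TARGETS):
    print(symbol, len(ids), ids)
```

Collecting IDs into a set means a repeated (symbol, ID) row does not count as ambiguity.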

DSuveges commented 10 months ago

So, within the target index, there are 339 ambiguous approved symbols spread across 1807 target identifiers. If I exclude the sex chromosomes, the numbers are slightly better: 284 symbols for 1633 IDs. My assumption was that RNA genes in general have higher uncertainty, so we can expect ambiguity in the assigned gene symbols. When focusing on ambiguous symbols with protein-coding biotype, the picture becomes better: only 13 ambiguous symbols and 27 IDs.

This is the list:

+--------------+---------------+--------------+----------+---------+---------+
|approvedSymbol|id             |biotype       |chromosome|start    |end      |
+--------------+---------------+--------------+----------+---------+---------+
|SPATA13       |ENSG00000182957|protein_coding|13        |23979805 |24307074 |
|SPATA13       |ENSG00000228741|lncRNA        |13        |23979810 |24035027 |
|SFTA3         |ENSG00000257520|protein_coding|14        |36473207 |36521149 |
|SFTA3         |ENSG00000229415|lncRNA        |14        |36473288 |36513829 |
|LINC02203     |ENSG00000280709|protein_coding|15        |21552795 |21653276 |
|LINC02203     |ENSG00000284988|lncRNA        |15        |21552815 |21557161 |
|GOLGA8M       |ENSG00000188626|protein_coding|15        |28698583 |28738384 |
|GOLGA8M       |ENSG00000261480|lncRNA        |15        |28719377 |28738431 |
|NOX5          |ENSG00000290203|protein_coding|15        |68930504 |69062743 |
|NOX5          |ENSG00000255346|protein_coding|15        |69014695 |69062762 |
|SIGLEC5       |ENSG00000268500|protein_coding|19        |51610960 |51630401 |
|SIGLEC5       |ENSG00000105501|lncRNA        |19        |51630101 |51645545 |
|MKKS          |ENSG00000125863|protein_coding|20        |10401009 |10434222 |
|MKKS          |ENSG00000285508|protein_coding|20        |10413520 |10431922 |
|MKKS          |ENSG00000285723|protein_coding|20        |10420546 |10420737 |
|ELFN2         |ENSG00000243902|lncRNA        |22        |37339583 |37427445 |
|ELFN2         |ENSG00000166897|protein_coding|22        |37367960 |37427479 |
|HERC3         |ENSG00000287542|protein_coding|4         |88523810 |88708450 |
|HERC3         |ENSG00000138641|protein_coding|4         |88592434 |88708541 |
|MATR3         |ENSG00000280987|protein_coding|5         |139273752|139331671|
|MATR3         |ENSG00000015479|protein_coding|5         |139293674|139331677|
|POLR2J3       |ENSG00000168255|protein_coding|7         |102537918|102572653|
|POLR2J3       |ENSG00000285437|protein_coding|7         |102562133|102572583|
|KBTBD11-OT1   |ENSG00000283239|protein_coding|8         |1763888  |1958627  |
|KBTBD11-OT1   |ENSG00000253696|lncRNA        |8         |1971397  |1976478  |
|PINX1         |ENSG00000258724|protein_coding|8         |10725399 |10839847 |
|PINX1         |ENSG00000254093|protein_coding|8         |10764961 |10839884 |
+--------------+---------------+--------------+----------+---------+---------+
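As a quick cross-check, the rows of the table above can be tallied in plain Python. The sketch below copies symbol, ID and biotype from the table and counts symbols and IDs, then splits the symbols into those where every ID is protein_coding and those that mix protein_coding with lncRNA. The names `TABLE` and `summarise` are illustrative only.

```python
from collections import defaultdict

# Symbol, Ensembl gene ID and biotype, copied row by row from the table above.
TABLE = """
SPATA13 ENSG00000182957 protein_coding
SPATA13 ENSG00000228741 lncRNA
SFTA3 ENSG00000257520 protein_coding
SFTA3 ENSG00000229415 lncRNA
LINC02203 ENSG00000280709 protein_coding
LINC02203 ENSG00000284988 lncRNA
GOLGA8M ENSG00000188626 protein_coding
GOLGA8M ENSG00000261480 lncRNA
NOX5 ENSG00000290203 protein_coding
NOX5 ENSG00000255346 protein_coding
SIGLEC5 ENSG00000268500 protein_coding
SIGLEC5 ENSG00000105501 lncRNA
MKKS ENSG00000125863 protein_coding
MKKS ENSG00000285508 protein_coding
MKKS ENSG00000285723 protein_coding
ELFN2 ENSG00000243902 lncRNA
ELFN2 ENSG00000166897 protein_coding
HERC3 ENSG00000287542 protein_coding
HERC3 ENSG00000138641 protein_coding
MATR3 ENSG00000280987 protein_coding
MATR3 ENSG00000015479 protein_coding
POLR2J3 ENSG00000168255 protein_coding
POLR2J3 ENSG00000285437 protein_coding
KBTBD11-OT1 ENSG00000283239 protein_coding
KBTBD11-OT1 ENSG00000253696 lncRNA
PINX1 ENSG00000258724 protein_coding
PINX1 ENSG00000254093 protein_coding
"""


def summarise(table):
    """Count symbols and IDs, and split symbols by biotype composition."""
    biotypes_by_symbol = defaultdict(list)
    for line in table.strip().splitlines():
        symbol, _gene_id, biotype = line.split()
        biotypes_by_symbol[symbol].append(biotype)
    coding_only = sorted(
        symbol
        for symbol, biotypes in biotypes_by_symbol.items()
        if set(biotypes) == {"protein_coding"}
    )
    mixed = sorted(s for s in biotypes_by_symbol if s not in coding_only)
    return {
        "symbols": len(biotypes_by_symbol),
        "ids": sum(len(b) for b in biotypes_by_symbol.values()),
        "coding_only": coding_only,
        "mixed": mixed,
    }


summary = summarise(TABLE)
print(summary["symbols"], "symbols,", summary["ids"], "IDs")
print("all protein_coding:", summary["coding_only"])
print("protein_coding + lncRNA:", summary["mixed"])
```

The split may matter for any follow-up: a protein_coding/lncRNA pair sharing a symbol is arguably a different situation from two protein_coding genes sharing one.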

OK. Great. So, what's next? In most cases, although the IDs are different, the genomic coordinates show some overlap. It's a good question what makes them independent genes in the first place, but answering it is beyond our scope, and I'm not sure what Ensembl can do about it. What can we do on our side? What is the impact of this ambiguity? Certainly it introduces ambiguity across the board: not only in the disease/target associations but everywhere targets are involved, depending on whether the target is identified via symbol or Ensembl gene ID. I'm not sure there's a right way to handle them, i.e. a robust, biologically correct way to disambiguate these genes/labels. If they are different genes (which is most likely the case for most of them), should we remove the symbol from one or the other? Based on what? We could also aggregate by symbol, but that solution seems the least correct to me, especially thinking about those RNA genes where a large number of them share the same symbol.

I think we cannot strictly enforce the 1 symbol to 1 gene correspondence.
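The coordinate overlap mentioned above can be checked with a few lines of Python. This sketch assumes the start/end values in the table are 1-based and inclusive (the Ensembl convention) and that both genes of a pair sit on the same chromosome; `overlap_length` is an illustrative helper, and only three pairs from the table are included.

```python
def overlap_length(a, b):
    """Number of shared bases between two (start, end) intervals.

    Coordinates are treated as 1-based and inclusive; returns 0 when the
    intervals do not overlap.
    """
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, end - start + 1)


# (start, end) pairs copied from the table above, one entry per symbol.
PAIRS = {
    "SPATA13": ((23979805, 24307074), (23979810, 24035027)),
    "SIGLEC5": ((51610960, 51630401), (51630101, 51645545)),
    "KBTBD11-OT1": ((1763888, 1958627), (1971397, 1976478)),
}

for symbol, (first, second) in PAIRS.items():
    print(symbol, overlap_length(first, second))
```

On these three pairs, SPATA13 overlaps over tens of kilobases, SIGLEC5 over only a few hundred bases, and the two KBTBD11-OT1 entries do not overlap at all, so overlap alone would not be a uniform criterion for merging.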

mkarmona commented 10 months ago

@DSuveges, brainstorming a bit: how about using the HGNC name for the ones labelled MANE + Canonical, and the Ensembl transcript name for the ones that are only canonical? (This only helps when there is more than one protein.) Preserving the Ensembl transcript name will help in cases like this. The thing is, how do you explicitly expose this idiosyncrasy to the user? It is not easy to reach, nor clear what to select when searching.

Thanks for the unfolding!

DSuveges commented 9 months ago

There's this issue identified by @mjfalaguera:

Guys, how can it be that we’re losing pieces of evidence from 23.02 to 23.09 onwards? Check:

from pyspark.sql import functions as F

# Same disease/target filter applied to both releases (assumes an active `spark` session).
cond = (F.col("diseaseId") == "EFO_0010580") & (F.col("targetId") == "ENSG00000185291")
spark.read.parquet("/Users/mariaf/OT_platform/23.06/evidence/sourceId=europepmc").filter(cond).count()  # 85
spark.read.parquet("/Users/mariaf/OT_platform/23.09/evidence/sourceId=europepmc").filter(cond).count()  # 0

Apparently both the papers and the co-occurrences are in the new release; however, by digging a bit I could identify the root of the problem: there are two genes, ENSG00000185291 and ENSG00000292332, both with the same symbol: IL3RA.

Due to the hierarchical nature of the entity grounding logic, the less relevant gene is picked because it has CD123 listed as a synonym. This is not nice...
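One way to spot this kind of silent re-grounding is to look for target IDs whose evidence count drops to zero between releases and list the other IDs that share their approved symbol. The sketch below is a plain-Python illustration, not the pipeline's logic: `lost_evidence_report` is a hypothetical helper, and the inputs contain only the IDs and counts quoted in this thread.

```python
from collections import defaultdict

# (approvedSymbol, Ensembl gene ID) pairs quoted in this thread.
TARGET_INDEX = [
    ("IL3RA", "ENSG00000185291"),
    ("IL3RA", "ENSG00000292332"),
]

# Evidence counts per target ID for one disease and source, as reported above.
OLD_COUNTS = {"ENSG00000185291": 85}
NEW_COUNTS = {"ENSG00000185291": 0}


def lost_evidence_report(old_counts, new_counts, target_index):
    """Map each target ID that lost all evidence to its same-symbol IDs.

    A target is reported when it had evidence in `old_counts` and has none
    (zero or missing) in `new_counts`. The value is the sorted list of other
    gene IDs sharing its approved symbol, i.e. candidates to inspect.
    """
    ids_by_symbol = defaultdict(set)
    symbol_by_id = {}
    for symbol, gene_id in target_index:
        ids_by_symbol[symbol].add(gene_id)
        symbol_by_id[gene_id] = symbol

    report = {}
    for gene_id, old in old_counts.items():
        if old > 0 and new_counts.get(gene_id, 0) == 0:
            symbol = symbol_by_id.get(gene_id)
            siblings = ids_by_symbol.get(symbol, set()) - {gene_id}
            report[gene_id] = sorted(siblings)
    return report


print(lost_evidence_report(OLD_COUNTS, NEW_COUNTS, TARGET_INDEX))
```

An empty sibling list would suggest the evidence was lost for some other reason than a shared symbol.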