glygener / glygen-issues

Repository for public GlyGen tickets
GNU General Public License v3.0

ID mapper issue #1870

Open katewarner opened 3 weeks ago

katewarner commented 3 weeks ago

I think I worked out the problem with the ID mapper.

It looks like the ID mapper currently works like this: From ID ---> Internal anchor ID (UniProtKB Canonical Ac) ---> To ID: Canonical protein and all mapped isoforms (whether or not they contain the "From ID")

This can produce incorrect mappings because the xref ("From ID") is not always present in the unreviewed isoform entries. For example, the result linked here shows the mapping of GeneCards IDs (using the default example) to UniProt IDs: https://www.glygen.org/mapper-result/4be4d5683d0541c382642f75a9fe75dd

So the GeneCards ID "ENO3" maps to the canonical protein "P13929" and to all the unreviewed isoforms mapped to that canonical protein. In UniProt, however, "ENO3" maps only to "P13929", and this is also the case in our human EBI NT file.

[k.warner1@glygen-vm-dev 2024_06_20]$ grep 'genecards/ENO3' uniprot-proteome-homo-sapiens.nt
<http://purl.uniprot.org/uniprot/P13929> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://purl.uniprot.org/genecards/ENO3> .
<http://purl.uniprot.org/genecards/ENO3> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.uniprot.org/core/Resource> .
<http://purl.uniprot.org/genecards/ENO3> <http://purl.uniprot.org/core/database> <http://purl.uniprot.org/database/GeneCards> .
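
The same check can be scripted rather than done with grep. This is only a minimal sketch: it assumes every relevant statement has IRIs for subject, predicate and object, as in the three lines above, and the function name `accessions_by_xref` is hypothetical and not part of the GlyGen code base.

```python
import re
from collections import defaultdict

# Matches one N-Triples statement whose subject, predicate and object are all IRIs.
TRIPLE = re.compile(r"<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.")

SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"
UNIPROT_PREFIX = "http://purl.uniprot.org/uniprot/"


def accessions_by_xref(lines, xref_prefix):
    """Map each xref ID to the set of UniProtKB accessions that cite it via seeAlso."""
    mapping = defaultdict(set)
    for line in lines:
        match = TRIPLE.match(line.strip())
        if not match:
            continue  # skip statements with literals or blank nodes, and blank lines
        subject, predicate, obj = match.groups()
        if (predicate == SEE_ALSO
                and subject.startswith(UNIPROT_PREFIX)
                and obj.startswith(xref_prefix)):
            xref_id = obj[len(xref_prefix):]
            accession = subject[len(UNIPROT_PREFIX):]
            mapping[xref_id].add(accession)
    return dict(mapping)


# The three lines returned by the grep above.
sample = [
    "<http://purl.uniprot.org/uniprot/P13929> <http://www.w3.org/2000/01/rdf-schema#seeAlso> <http://purl.uniprot.org/genecards/ENO3> .",
    "<http://purl.uniprot.org/genecards/ENO3> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.uniprot.org/core/Resource> .",
    "<http://purl.uniprot.org/genecards/ENO3> <http://purl.uniprot.org/core/database> <http://purl.uniprot.org/database/GeneCards> .",
]

genecards = accessions_by_xref(sample, "http://purl.uniprot.org/genecards/")
print(genecards)
```

On the three sample lines only P13929 is reported for ENO3. To run it over the whole file, pass an open file handle instead of `sample`.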

Some IDs do map to the canonical protein and all the unreviewed isoforms, so the ID mapper should probably work like this: From ID ---> Internal anchor ID (UniProtKB Canonical Ac) ---> To ID: Any protein mapped to the "From ID"
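
To make the difference concrete, here is a rough sketch of the two behaviours. It is not the actual mapper code: the lookup tables, the function names, and the isoform accessions "ISOFORM_A" and "ISOFORM_B" are all placeholders invented for illustration.

```python
# Hypothetical data. "ISOFORM_A" and "ISOFORM_B" stand in for unreviewed
# isoform accessions; they are placeholders, not real UniProtKB accessions.
ISOFORMS = {"P13929": ["ISOFORM_A", "ISOFORM_B"]}   # canonical -> mapped isoforms
XREFS = {("genecards", "ENO3"): {"P13929"}}         # xref -> accessions that carry it
CANONICAL_OF = {
    "P13929": "P13929",
    "ISOFORM_A": "P13929",
    "ISOFORM_B": "P13929",
}


def map_current(db, from_id):
    """Behaviour described above: anchor on the canonical, return every isoform."""
    carriers = XREFS.get((db, from_id), set())
    anchors = sorted({CANONICAL_OF[acc] for acc in carriers})
    results = []
    for anchor in anchors:
        results.append(anchor)
        results.extend(ISOFORMS.get(anchor, []))
    return results


def map_proposed(db, from_id):
    """Proposed behaviour: return only the accessions that carry the xref."""
    return sorted(XREFS.get((db, from_id), set()))


print(map_current("genecards", "ENO3"))
print(map_proposed("genecards", "ENO3"))
```

With this toy data the current behaviour returns the canonical plus both placeholder isoforms, while the proposed behaviour returns only P13929. If an xref were also recorded on an isoform entry, adding that accession to the `XREFS` set would make the proposed version return it too.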

jeet-vora commented 3 weeks ago

@katewarner It seems there is another issue: I was also referring to the GeneCards to AlphaFold DB mapping, which does not seem to be correct.

We can talk more when we meet next.

katewarner commented 3 weeks ago

@jeet-vora Yes, sorry, I forgot to add that bit. The AlphaFold ID problem is that a GeneCards ID will map to the canonical protein and the unreviewed isoforms (as in the example above), but the canonical protein and each of the unreviewed isoforms have their own unique AlphaFold ID.

This should be fixed at the next release because for the 2.7 AlphaFold datasets Robel is only mapping to the canonical protein, whereas in the last release he was mapping to the canonical proteins and the unreviewed proteins.

katewarner commented 5 days ago

@rykahsay Add a drop-down with two options: “UniProtKB canonical only” or “all isoforms”
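
One possible shape for the option behind that drop-down, as a sketch only. The enum, the function name and the placeholder accession "ISOFORM_A" are assumptions for illustration; the real front end and API may look quite different.

```python
from enum import Enum


class IsoformScope(Enum):
    """The two drop-down choices; the labels are taken from the comment above."""
    CANONICAL_ONLY = "UniProtKB canonical only"
    ALL_ISOFORMS = "all isoforms"


def filter_results(accessions, canonical, scope):
    """Keep only canonical accessions, or everything, depending on the chosen scope."""
    if scope is IsoformScope.CANONICAL_ONLY:
        return [acc for acc in accessions if acc in canonical]
    return list(accessions)


# "ISOFORM_A" is a placeholder, not a real UniProtKB accession.
mapped = ["P13929", "ISOFORM_A"]
canonical_accessions = {"P13929"}

print(filter_results(mapped, canonical_accessions, IsoformScope.CANONICAL_ONLY))
print(filter_results(mapped, canonical_accessions, IsoformScope.ALL_ISOFORMS))
```

Looking the scope up from the selected label, e.g. `IsoformScope("all isoforms")`, would keep the UI text and the filter in one place.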