saezlab / pypath

Python module for prior knowledge integration. Builds databases of signaling pathways, enzyme-substrate interactions, complexes, annotations and intercellular communication roles.
http://omnipathdb.org/
GNU General Public License v3.0

Converting UniProt IDs to Entrez IDs #6

Closed ManavalanG closed 7 years ago

ManavalanG commented 7 years ago

How do I convert UniProt IDs to Entrez IDs using mapping.py? It seems Entrez IDs can be converted to UniProt IDs using this tool, but not the other way round. I tried UniProt's ID mapping for this conversion, but a subset of the UniProt ID nodes have no Entrez ID mapped (because of obsolete UniProt IDs, etc.). Is there a lossless way of doing this conversion directly using PyPath?

deeenes commented 7 years ago

Hi Manavalan, pypath uses data downloaded from UniProt to convert the IDs. As the target of the mapping is a UniProt ID most of the time, it does not load the dictionaries for the opposite direction, in order to save memory. However, if necessary, one can change this behaviour:

import pypath
# this sets the mapping table bidirectional
pypath.maps.mapListUniprot[('entrez', 'uniprot')].bi = True
m = pypath.mapping.Mapper()
m.map_name('P00533', 'uniprot', 'entrez')
# returns ['1956']

pypath does some tricks when translating to UniProt: it handles secondary UniProt IDs and often successfully maps obsolete UniProt IDs via GeneSymbols. But when you translate between any other ID types, it simply does a lookup in the tables downloaded from UniProt (or sometimes NCBI, miRBase, etc.).

I hope this helps.

Best, Denes
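
To illustrate what making the table bidirectional means conceptually, here is a minimal sketch in plain Python. It does not use pypath and does not claim to show how pypath implements this internally; `invert_mapping` is a hypothetical helper, and the toy table contains only the two ID pairs mentioned in this thread.

```python
from collections import defaultdict


def invert_mapping(forward):
    """Build the reverse lookup for a one-to-many ID mapping.

    `forward` maps each source ID to a set of target IDs; the result
    maps each target ID back to the set of source IDs pointing at it.
    """
    reverse = defaultdict(set)
    for source_id, target_ids in forward.items():
        for target_id in target_ids:
            reverse[target_id].add(source_id)
    return dict(reverse)


# Toy Entrez -> UniProt table (hypothetical, two pairs from this thread).
entrez_to_uniprot = {
    '1956': {'P00533'},
    '114815': {'Q8WY21'},
}

# Keeping this second dictionary in memory is the extra cost of a
# bidirectional table.
uniprot_to_entrez = invert_mapping(entrez_to_uniprot)
print(sorted(uniprot_to_entrez['P00533']))
```

The reverse dictionary roughly doubles the memory needed for the table, which is why loading it only on request is a sensible default.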

mschubert commented 7 years ago

@deeenes, do you think it makes sense to split the ID mapper to a separate package?

This is useful also when not working with pathways.

deeenes commented 7 years ago

Yes, I completely agree, just as some other submodules of pypath should be split into standalone modules. It is only a matter of when I will have a little time to do it.

ManavalanG commented 7 years ago

@deeenes Thanks for the solution. Some edges were lost during the conversion (419 edges, to be exact), but besides that it worked well.

ManavalanG commented 7 years ago

On second look, I am actually losing 419 edges using pypath ID conversion, compared to 169 edges lost when UniProt's ID mapping was used directly. I haven't tested how much the lost edges overlap between the two cases.
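
The overlap between the two sets of lost edges could be checked with plain set operations. The sketch below is only an illustration: `compare_lost` is a hypothetical helper and the edges are placeholder IDs, not data from the attached files.

```python
def compare_lost(all_edges, kept_by_a, kept_by_b):
    """Compare which edges two conversion methods failed to keep.

    All arguments are sets of (source, target) tuples expressed in the
    original ID type, so that the results of both methods are comparable.
    """
    lost_a = all_edges - kept_by_a
    lost_b = all_edges - kept_by_b
    return {
        'lost_by_both': lost_a & lost_b,
        'lost_only_by_a': lost_a - lost_b,
        'lost_only_by_b': lost_b - lost_a,
    }


# Placeholder edges, purely illustrative.
all_edges = {('A', 'B'), ('A', 'C'), ('B', 'D'), ('C', 'D')}
kept_by_pypath = {('A', 'B')}
kept_by_uniprot = {('A', 'B'), ('A', 'C')}

result = compare_lost(all_edges, kept_by_pypath, kept_by_uniprot)
for label, edges in sorted(result.items()):
    print(label, sorted(edges))
```

If every edge lost by method B is also lost by method A, `lost_only_by_b` comes back empty.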

deeenes commented 7 years ago

Thanks @ManavalanG for pointing this out. Is there a way to find at least one pair of IDs which is mapped correctly by UniProt but not by pypath? This way I could trace back the reason.

ManavalanG commented 7 years ago

@deeenes Glad to help but I have to do it indirectly!! I will be away from the computer that has pypath for a while. So instead, I'm attaching the output files and script used. Hope this helps.

Archive.zip

deeenes commented 7 years ago

Thanks @ManavalanG, I compared the 2 lists, and I found only one Entrez ID which is missing from your translation with pypath but present in the one by UniProt. This ID is 114815. I could translate it from Entrez to UniProt and the other way around, so I don't see why pypath would be unable to translate this particular ID. See the comparison of your lists in the first code block, and the translation of the missing ID in the second:

#!/usr/bin/env python

fnPp = 'entrez_idconv/omnipath_interactions_entrezIDs_using_pypath.tsv'
fnUp = 'entrez_idconv/omnipath_interactions_entrezIDs_direct_mapping.tsv'

with open(fnPp, 'r') as fp:

    lPp = set([tuple(sorted(p.split()))
               for p in fp.read().split('\n') if len(p)])

with open(fnUp, 'r') as fp:

    lUp = set([tuple(sorted(p.split()))
               for p in fp.read().split('\n') if len(p)])

print('pypath mapped pairs: %u\nUniProt mapped pairs: %u\nintersect: %u' % (
    len(lPp), len(lUp), len(lPp & lUp)))

lEpp = set(e for p in lPp for e in p)
lEup = set(e for p in lUp for e in p)

print('pypath mapped IDs: %u\nUniProt mapped IDs: %u\n'
      'intersect: %u\nthe IDs missing from pypath: %s' % (
    len(lEpp), len(lEup), len(lEpp & lEup), ', '.join(lEup - lEpp)))

import pypath
pypath.maps.mapListUniprot[('entrez', 'uniprot')].bi = True
m = pypath.mapping.Mapper()
m.map_name('114815', 'entrez', 'uniprot')
# returns ['Q8WY21']
m.map_name('Q8WY21',  'uniprot', 'entrez')
# returns ['114815']

ManavalanG commented 7 years ago

@deeenes Here are the edges that fail when converted using pypath. ('Q99873', 'A8K171'), ('Q05655', 'B3KTA3'), ('P32780', 'Q9BSD8'), ('Q13976', 'A8K9W7'), ('P05771', 'Q59F12'), ('P29350', 'A8K379'), ('P42574', 'B2RBL9'), ('Q14790', 'V9HWE1'), ('Q05209', 'Q59GM6'), ('Q02763', 'Q59HG2'), ('Q13164', 'B4DW78'), ('P28482', 'B2RAH2'), ('Q15139', 'Q4LE43'), ('P06239', 'B3KP83'), ('O96017', 'Q59FS6'), ('P68400', 'Q9UME6'), ('P19525', 'Q53GA5'), ('P48729', 'A0A024R693'), ('P17612', 'Q59GL5'), ('P42574', 'Q59FS6'), ('Q13554', 'Q547U9'), ('O15530', 'B3KVH4'), ('P68400', 'Q9BSK3'), ('P06241', 'Q53ES7'), ('P29466', 'B3KT21'), ('Q06124', 'Q9HA84'), ('P00747', 'B3KPF0'), ('Q8N2W9', 'Q53EL4'), ('P17612', 'B2RAU8'), ('Q13555', 'Q59GL5'), ('P09619', 'Q59F04'), ('P50281', 'Q53H33'), ('Q06187', 'A8K9W7'), ('P06213', 'Q59GM6'), ('P68400', 'B3KPF0'), ('Q13547', 'Q53GA5'), ('P12931', 'Q59FK4'), ('P06213', 'Q9P084'), ('Q15418', 'Q53EL4'), ('P50750', 'Q53GA5'), ('P06213', 'Q9BRL5'), ('P00734', 'B3KPF0'), ('P43405', 'Q9UFY1'), ('P00533', 'Q9BRL5'), ('P28482', 'B7Z1Q3'), ('Q15139', 'Q9UFY1'), ('P06493', 'V9HWH0'), ('P07858', 'Q59GC5'), ('P12931', 'Q59GM6'), ('Q05397', 'Q59GM6'), ('O14757', 'Q53GA5'), ('Q00535', 'Q9H6U9'), ('P11802', 'Q9P0T0'), ('P08581', 'Q9HA84'), ('P24941', 'B4DLA6'), ('Q96S53', 'V9HWA6'), ('P53667', 'V9HWI5'), ('P10144', 'B3KT21'), ('P24941', 'Q53GA5'), ('P55212', 'V9HWE1'), ('P27361', 'B7Z1Q3'), ('P17252', 'V9HWE1'), ('P17612', 'Q59FG2'), ('Q8TAI7', 'B3KVH4'), ('P63165', 'Q59FX5'), ('Q99538', 'Q96CY7'), ('P06493', 'Q9UME6'), ('P67870', 'Q53GA5'), ('P48729', 'B3KT21'), ('Q00535', 'Q53GA5'), ('P07288', 'B3KPF0'), ('P06213', 'B3KP83'), ('P17252', 'B3KXA2'), ('P07949', 'B3KP83'), ('P08575', 'A8K379'), ('Q13315', 'Q53GA5'), ('P48736', 'V9HW25'), ('P17612', 'A8K270'), ('P42679', 'A8K379'), ('P07949', 'Q4LE43'), ('P19784', 'Q9UME6'), ('Q13315', 'B3KMX5'), ('Q00535', 'Q59F91'), ('Q16539', 'Q53GA5'), ('P00519', 'Q59FK4'), ('P20151', 'B4DDH1'), ('Q16539', 'B2RAH2'), ('P68400', 
'K9JA46'), ('P06493', 'B4DLA6'), ('P12931', 'B3KVH4'), ('P06493', 'B4DGH1'), ('Q13315', 'Q59FS6'), ('O60674', 'Q59GY7'), ('P28482', 'B4DHI4'), ('Q16816', 'A0A024RDL4'), ('P17252', 'A0A024R1D6'), ('O14965', 'Q6MZW8'), ('P45983', 'Q53GA5'), ('P19784', 'A0A024R693'), ('Q99986', 'Q53GA5'), ('P39900', 'Q8IZZ5'), ('Q00535', 'A8K3N4'), ('P49137', 'V9HW43'), ('Q13557', 'V9HWE1'), ('P27361', 'B4DU40'), ('P68400', 'Q6MZW8'), ('Q13535', 'Q53GA5'), ('P68400', 'Q86SX1'), ('P28482', 'B4DU40'), ('Q9H2X6', 'Q53GA5'), ('P09958', 'B2RB33'), ('P48729', 'Q53GA5'), ('P49674', 'B3KT21'), ('Q12913', 'Q59F04'), ('P17612', 'Q6AHX3'), ('P53355', 'B4DHI4'), ('P24941', 'Q59F91'), ('P49841', 'B4DNC7'), ('P06241', 'B3KXJ4'), ('Q8WYL5', 'V9HWI5'), ('Q00987', 'Q53GA5'), ('Q13418', 'B3KVH4'), ('Q05513', 'V9HWD6'), ('O96017', 'Q9BSD8'), ('P17252', 'Q96FS5'), ('P53671', 'V9HWI5'), ('P09958', 'Q59FG2'), ('Q66K89', 'Q53GA5'), ('P06493', 'Q9H6U9'), ('Q04759', 'A0A024R7C1'), ('Q13555', 'Q96FS5'), ('P17252', 'B2R7U3'), ('Q9H4B4', 'Q53GA5'), ('P09958', 'B3KWN9'), ('Q7Z2Y5', 'V9HWI5'), ('P31749', 'Q96FS5'), ('P68400', 'A0A024R693'), ('Q16644', 'V9HW43'), ('P06493', 'Q59F91'), ('P42574', 'B3KVH4'), ('Q15139', 'Q7Z322'), ('P55211', 'V9HWE1'), ('Q9UHI8', 'Q59FG9'), ('P06493', 'B3KMX0'), ('P07949', 'Q59GM6'), ('Q05209', 'Q59FK4'), ('P63279', 'Q59FX5'), ('P12931', 'B2RBL9'), ('P06239', 'Q59GK3'), ('P12931', 'Q547U9'), ('P07948', 'Q4LE43'), ('Q92831', 'A8K171'), ('O60674', 'A8K9W7'), ('P06239', 'B2R8B5'), ('P49841', 'Q53GA5'), ('P17252', 'Q59GL5'), ('P19784', 'Q59HH7'), ('P00533', 'Q4LE43'), ('P42574', 'V9HWE1'), ('P07948', 'A8K379'), ('P12931', 'Q59E85'), ('Q16539', 'Q9P0T0'), ('Q13131', 'A8K0H7'), ('P24941', 'A8K3N4'), ('P06213', 'Q9HA84'), ('Q9UBE8', 'Q659G9'), ('P00533', 'Q59F12'), ('Q15118', 'B3KVH4'), ('Q99683', 'Q15607'), ('P16591', 'Q53HG7'), ('P17252', 'Q59EA4'), ('P06493', 'Q59FK4'), ('P00533', 'Q9HA84'), ('P00533', 'Q9UFY1'), ('Q8TD19', 'B2R8K8'), ('P49841', 'Q8WYR3'), ('P28482', 'Q9P0T0'), ('Q9P286', 
'B0AZM9'), ('P17252', 'Q547U9'), ('Q9UQM7', 'D3DX95'), ('P08631', 'Q4LE43'), ('P42574', 'Q96BA7'), ('Q09013', 'B2RAH5'), ('P17612', 'V9HWE1'), ('P06241', 'Q59EH3'), ('Q14012', 'Q59GJ0'), ('P06241', 'V9HWA5'), ('P12931', 'Q53HG7'), ('Q9UNH5', 'Q53GA5'), ('Q15569', 'Q8NFJ4'), ('P08631', 'Q9UFY1'), ('Q13177', 'V9HWE1'), ('P68400', 'Q53GA5'), ('P17612', 'A8K8W7'), ('P41240', 'A8K379'), ('P07948', 'Q9UFY1'), ('P68400', 'Q59HH7'), ('Q09472', 'Q59FK4'), ('P06493', 'Q53FE8'), ('P00519', 'Q86T74'), ('P06493', 'Q9BSD8'), ('P19784', 'B3KT21'), ('P13497', 'Q59EE7'), ('P08311', 'B3KPF0'), ('P67775', 'B3KVH4'), ('P46108', 'Q59GM6'), ('P68400', 'B3KT21'), ('Q15569', 'V9HWI5'), ('Q9GZR1', 'B0AZM1'), ('P43405', 'Q4LE43'), ('Q09472', 'Q53GA5'), ('P06493', 'A8K3N4'), ('P12931', 'A8K9W7'), ('P63279', 'A8K503'), ('Q14012', 'B4DG68'), ('P67870', 'B3KT21'), ('P23946', 'Q53G95'), ('Q13418', 'B2RAH5'), ('Q05397', 'B2RBL9'), ('O60729', 'Q53GA5'), ('P06493', 'V9HWE1'), ('Q06187', 'Q59GY7'), ('Q86XK2', 'Q53GA5'), ('P42575', 'B3KT21'), ('P78527', 'Q53GA5'), ('P06239', 'Q59EH3'), ('P12931', 'Q59FG2'), ('Q13464', 'Q53SB5'), ('P30307', 'Q53GA5'), ('P12931', 'Q4LE43'), ('P17706', 'Q59F04'), ('P14635', 'V9HWH0'), ('Q9UQM7', 'Q547U9'), ('Q13177', 'A8K341'), ('P31749', 'Q86T74'), ('Q13464', 'V9HWE1'), ('Q92793', 'A0A024R8X1'), ('Q13177', 'A8K5M4'), ('P07949', 'Q9UFY1'), ('Q13464', 'B2RAH5'), ('P28482', 'Q53GA5'), ('O75928', 'V9HWC2'), ('P07384', 'Q53GA5'), ('Q96EB6', 'Q53GA5'), ('Q96S53', 'V9HWI5'), ('P12931', 'Q9UFY1'), ('P18031', 'Q59F04'), ('P00533', 'Q59GK3'), ('P24941', 'B3KMX0'), ('P24941', 'Q9P0T0')

For example, Entrez ID for A8K171 is 8204, but pypath doesn't resolve this. Entrez ID 114815, which you found to be missing from pypath converted edges, was actually mapped from B3KWN9, which pypath doesn't resolve; not Q8WY21. Hope this helps.

Edit1: Just wanted to add that the 169 edges lost when converting with direct UniProt mapping are also lost when converting with pypath. This is expected, but I wanted to mention it in case you were wondering.

deeenes commented 7 years ago

Hi @ManavalanG, thanks for the follow up. In your list I found 107 IDs which pypath could not translate from UniProt to Entrez.

All of these, including the 2 highlighted by you (A8K171 and B3KWN9), are unreviewed proteins with evidence only at transcript level. I think it is not a good idea to include these in a protein-protein interaction network. According to UniProt, B3KWN9 is a cDNA similar to SORCS1, while Q8WY21 is SORCS1 itself and has the Entrez ID 114815.

By default pypath uses only SwissProt IDs for the translation. To change this behaviour, you need to set the UniprotMapping.swissprot attribute to None:

import pypath
pypath.maps.mapListUniprot[('entrez', 'uniprot')].swissprot = None
m = pypath.mapping.Mapper()

After this it could translate all the 107 IDs that were missing above.

Indeed, this is not documented yet, but I will add it to the documentation soon, as it might be useful for others.
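
For translating a whole edge list and seeing which IDs fail, a sketch along these lines might help. `translate_edges` is a hypothetical helper, not part of pypath. It takes any callable with the `map_name(name, source_type, target_type)` call shape shown above, and it assumes that an untranslatable ID yields an empty list, which this thread does not confirm. The stub mapper below stands in for `m.map_name` and only knows two IDs from this thread.

```python
def translate_edges(edges, map_name, source='uniprot', target='entrez'):
    """Translate (a, b) edges with a `map_name`-style callable.

    Returns the translated edges and the set of IDs that could not be
    translated. An edge with an untranslatable endpoint is dropped.
    """
    translated = []
    unmapped = set()
    for a, b in edges:
        ids_a = map_name(a, source, target)
        ids_b = map_name(b, source, target)
        if not ids_a:
            unmapped.add(a)
        if not ids_b:
            unmapped.add(b)
        # One-to-many mappings expand into every combination.
        translated.extend((x, y) for x in ids_a for y in ids_b)
    return translated, unmapped


# Stub standing in for m.map_name; it mimics a table without B3KWN9.
_table = {'P00533': ['1956'], 'Q8WY21': ['114815']}


def stub_map_name(name, source, target):
    return _table.get(name, [])


edges = [('P00533', 'Q8WY21'), ('P00533', 'B3KWN9')]
translated, unmapped = translate_edges(edges, stub_map_name)
print(translated)
print(sorted(unmapped))
```

With a real Mapper one would pass `m.map_name` instead of the stub, after setting the `bi` and `swissprot` attributes as described above.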