Closed ManavalanG closed 7 years ago
Hi Manavalan, pypath uses data downloaded from UniProt to convert the IDs. Since the target of the mapping is usually a UniProt ID, it does not load the dictionaries for the opposite direction, in order to save memory. However, if necessary, this behaviour can be changed:
import pypath
# this sets the mapping table bidirectional
pypath.maps.mapListUniprot[('entrez', 'uniprot')].bi = True
m = pypath.mapping.Mapper()
m.map_name('P00533', 'uniprot', 'entrez')
# returns ['1956']
pypath does some tricks when translating to UniProt: it handles secondary UniProt IDs and often successfully maps obsolete UniProt IDs via GeneSymbols. But when you translate between any other ID types, it simply does a lookup in the tables downloaded from UniProt (or sometimes NCBI, miRBase, etc.).
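To illustrate what a bidirectional table means conceptually, here is a minimal sketch with plain dictionaries. It is not pypath's implementation: `invert_mapping` and the toy table are hypothetical names used only for illustration. It shows why the reverse direction costs extra memory, since a second dictionary has to be built from the first.

```python
from collections import defaultdict


def invert_mapping(forward):
    """Build the reverse lookup table from a one-to-many mapping."""
    reverse = defaultdict(set)
    for source_id, target_ids in forward.items():
        for target_id in target_ids:
            reverse[target_id].add(source_id)
    return dict(reverse)


# Toy Entrez -> UniProt table; both pairs are ones quoted in this thread.
entrez_to_uniprot = {
    '1956': {'P00533'},
    '114815': {'Q8WY21'},
}

# The reverse table (UniProt -> Entrez) exists only after this explicit step.
uniprot_to_entrez = invert_mapping(entrez_to_uniprot)
print(sorted(uniprot_to_entrez['P00533']))
```

Using sets as values keeps the sketch correct when several source IDs share one target, which is common in ID mapping.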
I hope this helps.
Best, Denes
@deeenes, do you think it makes sense to split the ID mapper to a separate package?
This is useful also when not working with pathways.
Yes, I completely agree, just as some other submodules of pypath should be split into standalone modules. It is only a matter of when I will have a little time to do it.
@deeenes Thanks for the solution. Some edges were lost during the conversion (419 edges, to be exact), but besides that, it worked well.
On second look, I am actually losing 419 edges using pypath ID conversion, compared to 169 edges lost when UniProt's ID mapping was used directly. I haven't tested how many of the lost edges overlap between the two cases.
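The overlap between the two sets of lost edges could be checked with set operations. The sketch below uses made-up single-letter IDs, not real data, and assumes the edges can be treated as undirected; all variable names are hypothetical.

```python
def normalize(edges):
    """Treat edges as undirected: sort each pair and drop duplicates."""
    return {tuple(sorted(edge)) for edge in edges}


# Toy input: the full edge list and the edges each method managed to convert,
# all keyed by the original (pre-conversion) IDs.
all_edges = normalize([('A', 'B'), ('B', 'C'), ('C', 'D'), ('D', 'E')])
kept_by_pypath = normalize([('A', 'B')])
kept_by_uniprot = normalize([('B', 'A'), ('B', 'C')])

lost_pypath = all_edges - kept_by_pypath
lost_uniprot = all_edges - kept_by_uniprot

print('lost by both methods:', sorted(lost_pypath & lost_uniprot))
print('lost only by pypath: ', sorted(lost_pypath - lost_uniprot))
print('lost only by UniProt:', sorted(lost_uniprot - lost_pypath))
```

The edges lost by only one method are the interesting ones, because they point at differences between the two mapping sources rather than at IDs that neither can resolve.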
thanks @ManavalanG for pointing this out. is there a way to find at least one pair of IDs which is mapped correctly by UniProt but not pypath? this way I could trace back the reason.
@deeenes Glad to help but I have to do it indirectly!! I will be away from the computer that has pypath for a while. So instead, I'm attaching the output files and script used. Hope this helps.
thanks @ManavalanG, I compared the 2 lists, and I found only one Entrez ID which is missing from your translation with pypath and present in the one by UniProt. This ID is 114815. I could translate it from Entrez to UniProt and the other way around, so I don't see why pypath would be unable to translate this particular ID. See the comparison of your lists in the first code block, and the translation of the missing ID in the second:
#!/usr/bin/env python

fnPp = 'entrez_idconv/omnipath_interactions_entrezIDs_using_pypath.tsv'
fnUp = 'entrez_idconv/omnipath_interactions_entrezIDs_direct_mapping.tsv'

# read each file as a set of ID pairs, sorted within the pair
with open(fnPp, 'r') as fp:
    lPp = set([tuple(sorted(p.split()))
               for p in fp.read().split('\n') if len(p)])

with open(fnUp, 'r') as fp:
    lUp = set([tuple(sorted(p.split()))
               for p in fp.read().split('\n') if len(p)])

print('pypath mapped pairs: %u\nUniProt mapped pairs: %u\nintersect: %u' % (
    len(lPp), len(lUp), len(lPp & lUp)))

# collect the individual IDs occurring in each set of pairs
lEpp = set(e for p in lPp for e in p)
lEup = set(e for p in lUp for e in p)

print('pypath mapped IDs: %u\nUniProt mapped IDs: %u\n'
      'intersect: %u\nthe IDs missing from pypath: %s' % (
          len(lEpp), len(lEup), len(lEpp & lEup), ', '.join(lEup - lEpp)))
import pypath
pypath.maps.mapListUniprot[('entrez', 'uniprot')].bi = True
m = pypath.mapping.Mapper()
m.map_name('114815', 'entrez', 'uniprot')
# returns ['Q8WY21']
m.map_name('Q8WY21', 'uniprot', 'entrez')
# returns ['114815']
@deeenes Here are the edges that fail when converted using pypath.
('Q99873', 'A8K171'), ('Q05655', 'B3KTA3'), ('P32780', 'Q9BSD8'), ('Q13976', 'A8K9W7'), ('P05771', 'Q59F12'), ('P29350', 'A8K379'), ('P42574', 'B2RBL9'), ('Q14790', 'V9HWE1'), ('Q05209', 'Q59GM6'), ('Q02763', 'Q59HG2'), ('Q13164', 'B4DW78'), ('P28482', 'B2RAH2'), ('Q15139', 'Q4LE43'), ('P06239', 'B3KP83'), ('O96017', 'Q59FS6'), ('P68400', 'Q9UME6'), ('P19525', 'Q53GA5'), ('P48729', 'A0A024R693'), ('P17612', 'Q59GL5'), ('P42574', 'Q59FS6'), ('Q13554', 'Q547U9'), ('O15530', 'B3KVH4'), ('P68400', 'Q9BSK3'), ('P06241', 'Q53ES7'), ('P29466', 'B3KT21'), ('Q06124', 'Q9HA84'), ('P00747', 'B3KPF0'), ('Q8N2W9', 'Q53EL4'), ('P17612', 'B2RAU8'), ('Q13555', 'Q59GL5'), ('P09619', 'Q59F04'), ('P50281', 'Q53H33'), ('Q06187', 'A8K9W7'), ('P06213', 'Q59GM6'), ('P68400', 'B3KPF0'), ('Q13547', 'Q53GA5'), ('P12931', 'Q59FK4'), ('P06213', 'Q9P084'), ('Q15418', 'Q53EL4'), ('P50750', 'Q53GA5'), ('P06213', 'Q9BRL5'), ('P00734', 'B3KPF0'), ('P43405', 'Q9UFY1'), ('P00533', 'Q9BRL5'), ('P28482', 'B7Z1Q3'), ('Q15139', 'Q9UFY1'), ('P06493', 'V9HWH0'), ('P07858', 'Q59GC5'), ('P12931', 'Q59GM6'), ('Q05397', 'Q59GM6'), ('O14757', 'Q53GA5'), ('Q00535', 'Q9H6U9'), ('P11802', 'Q9P0T0'), ('P08581', 'Q9HA84'), ('P24941', 'B4DLA6'), ('Q96S53', 'V9HWA6'), ('P53667', 'V9HWI5'), ('P10144', 'B3KT21'), ('P24941', 'Q53GA5'), ('P55212', 'V9HWE1'), ('P27361', 'B7Z1Q3'), ('P17252', 'V9HWE1'), ('P17612', 'Q59FG2'), ('Q8TAI7', 'B3KVH4'), ('P63165', 'Q59FX5'), ('Q99538', 'Q96CY7'), ('P06493', 'Q9UME6'), ('P67870', 'Q53GA5'), ('P48729', 'B3KT21'), ('Q00535', 'Q53GA5'), ('P07288', 'B3KPF0'), ('P06213', 'B3KP83'), ('P17252', 'B3KXA2'), ('P07949', 'B3KP83'), ('P08575', 'A8K379'), ('Q13315', 'Q53GA5'), ('P48736', 'V9HW25'), ('P17612', 'A8K270'), ('P42679', 'A8K379'), ('P07949', 'Q4LE43'), ('P19784', 'Q9UME6'), ('Q13315', 'B3KMX5'), ('Q00535', 'Q59F91'), ('Q16539', 'Q53GA5'), ('P00519', 'Q59FK4'), ('P20151', 'B4DDH1'), ('Q16539', 'B2RAH2'), ('P68400', 'K9JA46'), ('P06493', 'B4DLA6'), ('P12931', 'B3KVH4'), ('P06493', 
'B4DGH1'), ('Q13315', 'Q59FS6'), ('O60674', 'Q59GY7'), ('P28482', 'B4DHI4'), ('Q16816', 'A0A024RDL4'), ('P17252', 'A0A024R1D6'), ('O14965', 'Q6MZW8'), ('P45983', 'Q53GA5'), ('P19784', 'A0A024R693'), ('Q99986', 'Q53GA5'), ('P39900', 'Q8IZZ5'), ('Q00535', 'A8K3N4'), ('P49137', 'V9HW43'), ('Q13557', 'V9HWE1'), ('P27361', 'B4DU40'), ('P68400', 'Q6MZW8'), ('Q13535', 'Q53GA5'), ('P68400', 'Q86SX1'), ('P28482', 'B4DU40'), ('Q9H2X6', 'Q53GA5'), ('P09958', 'B2RB33'), ('P48729', 'Q53GA5'), ('P49674', 'B3KT21'), ('Q12913', 'Q59F04'), ('P17612', 'Q6AHX3'), ('P53355', 'B4DHI4'), ('P24941', 'Q59F91'), ('P49841', 'B4DNC7'), ('P06241', 'B3KXJ4'), ('Q8WYL5', 'V9HWI5'), ('Q00987', 'Q53GA5'), ('Q13418', 'B3KVH4'), ('Q05513', 'V9HWD6'), ('O96017', 'Q9BSD8'), ('P17252', 'Q96FS5'), ('P53671', 'V9HWI5'), ('P09958', 'Q59FG2'), ('Q66K89', 'Q53GA5'), ('P06493', 'Q9H6U9'), ('Q04759', 'A0A024R7C1'), ('Q13555', 'Q96FS5'), ('P17252', 'B2R7U3'), ('Q9H4B4', 'Q53GA5'), ('P09958', 'B3KWN9'), ('Q7Z2Y5', 'V9HWI5'), ('P31749', 'Q96FS5'), ('P68400', 'A0A024R693'), ('Q16644', 'V9HW43'), ('P06493', 'Q59F91'), ('P42574', 'B3KVH4'), ('Q15139', 'Q7Z322'), ('P55211', 'V9HWE1'), ('Q9UHI8', 'Q59FG9'), ('P06493', 'B3KMX0'), ('P07949', 'Q59GM6'), ('Q05209', 'Q59FK4'), ('P63279', 'Q59FX5'), ('P12931', 'B2RBL9'), ('P06239', 'Q59GK3'), ('P12931', 'Q547U9'), ('P07948', 'Q4LE43'), ('Q92831', 'A8K171'), ('O60674', 'A8K9W7'), ('P06239', 'B2R8B5'), ('P49841', 'Q53GA5'), ('P17252', 'Q59GL5'), ('P19784', 'Q59HH7'), ('P00533', 'Q4LE43'), ('P42574', 'V9HWE1'), ('P07948', 'A8K379'), ('P12931', 'Q59E85'), ('Q16539', 'Q9P0T0'), ('Q13131', 'A8K0H7'), ('P24941', 'A8K3N4'), ('P06213', 'Q9HA84'), ('Q9UBE8', 'Q659G9'), ('P00533', 'Q59F12'), ('Q15118', 'B3KVH4'), ('Q99683', 'Q15607'), ('P16591', 'Q53HG7'), ('P17252', 'Q59EA4'), ('P06493', 'Q59FK4'), ('P00533', 'Q9HA84'), ('P00533', 'Q9UFY1'), ('Q8TD19', 'B2R8K8'), ('P49841', 'Q8WYR3'), ('P28482', 'Q9P0T0'), ('Q9P286', 'B0AZM9'), ('P17252', 'Q547U9'), ('Q9UQM7', 'D3DX95'), ('P08631', 
'Q4LE43'), ('P42574', 'Q96BA7'), ('Q09013', 'B2RAH5'), ('P17612', 'V9HWE1'), ('P06241', 'Q59EH3'), ('Q14012', 'Q59GJ0'), ('P06241', 'V9HWA5'), ('P12931', 'Q53HG7'), ('Q9UNH5', 'Q53GA5'), ('Q15569', 'Q8NFJ4'), ('P08631', 'Q9UFY1'), ('Q13177', 'V9HWE1'), ('P68400', 'Q53GA5'), ('P17612', 'A8K8W7'), ('P41240', 'A8K379'), ('P07948', 'Q9UFY1'), ('P68400', 'Q59HH7'), ('Q09472', 'Q59FK4'), ('P06493', 'Q53FE8'), ('P00519', 'Q86T74'), ('P06493', 'Q9BSD8'), ('P19784', 'B3KT21'), ('P13497', 'Q59EE7'), ('P08311', 'B3KPF0'), ('P67775', 'B3KVH4'), ('P46108', 'Q59GM6'), ('P68400', 'B3KT21'), ('Q15569', 'V9HWI5'), ('Q9GZR1', 'B0AZM1'), ('P43405', 'Q4LE43'), ('Q09472', 'Q53GA5'), ('P06493', 'A8K3N4'), ('P12931', 'A8K9W7'), ('P63279', 'A8K503'), ('Q14012', 'B4DG68'), ('P67870', 'B3KT21'), ('P23946', 'Q53G95'), ('Q13418', 'B2RAH5'), ('Q05397', 'B2RBL9'), ('O60729', 'Q53GA5'), ('P06493', 'V9HWE1'), ('Q06187', 'Q59GY7'), ('Q86XK2', 'Q53GA5'), ('P42575', 'B3KT21'), ('P78527', 'Q53GA5'), ('P06239', 'Q59EH3'), ('P12931', 'Q59FG2'), ('Q13464', 'Q53SB5'), ('P30307', 'Q53GA5'), ('P12931', 'Q4LE43'), ('P17706', 'Q59F04'), ('P14635', 'V9HWH0'), ('Q9UQM7', 'Q547U9'), ('Q13177', 'A8K341'), ('P31749', 'Q86T74'), ('Q13464', 'V9HWE1'), ('Q92793', 'A0A024R8X1'), ('Q13177', 'A8K5M4'), ('P07949', 'Q9UFY1'), ('Q13464', 'B2RAH5'), ('P28482', 'Q53GA5'), ('O75928', 'V9HWC2'), ('P07384', 'Q53GA5'), ('Q96EB6', 'Q53GA5'), ('Q96S53', 'V9HWI5'), ('P12931', 'Q9UFY1'), ('P18031', 'Q59F04'), ('P00533', 'Q59GK3'), ('P24941', 'B3KMX0'), ('P24941', 'Q9P0T0')
For example, the Entrez ID for A8K171 is 8204, but pypath doesn't resolve this. Entrez ID 114815, which you found to be missing from the pypath-converted edges, was actually mapped from B3KWN9, which pypath doesn't resolve, not from Q8WY21. Hope this helps.
Edit1: Just wanted to add that 169 edges, which are lost when converted using direct UniProt mapping, are also lost when converted using pypath. This is expected but wanted to mention in case you were wondering.
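One way to narrow a list of failed edges down to the IDs responsible is to count how often each untranslatable ID occurs. The sketch below uses five edges taken from the list above. The `translatable` set is an assumption made purely for illustration (it pretends the first member of each pair can be translated), not a statement about what pypath resolves.

```python
from collections import Counter

# A handful of the failed edges listed above.
failed_edges = [
    ('Q99873', 'A8K171'),
    ('Q92831', 'A8K171'),
    ('P09958', 'B3KWN9'),
    ('P68400', 'Q53GA5'),
    ('P28482', 'Q53GA5'),
]

# ASSUMPTION for illustration only: IDs the mapper is taken to translate.
translatable = {'Q99873', 'Q92831', 'P09958', 'P68400', 'P28482'}

# Count, per untranslatable ID, how many failed edges it takes part in.
unresolved = Counter(
    uid for edge in failed_edges for uid in edge if uid not in translatable
)

for uid, n_edges in unresolved.most_common():
    print(uid, n_edges)
```

In practice the membership test would be replaced by an actual lookup with the mapper, and the most frequent IDs would be the first ones worth checking by hand.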
Hi @ManavalanG, thanks for the follow up. In your list I found 107 IDs which pypath could not translate from UniProt to Entrez.
All of these, including the 2 highlighted by you (A8K171 and B3KWN9), are unreviewed proteins with evidence only at transcript level. I don't think it is a good idea to include these in a protein-protein interaction network. According to UniProt, B3KWN9 is a cDNA similar to SORCS1, while Q8WY21 is SORCS1 itself and has the Entrez ID 114815.
By default, pypath uses only Swiss-Prot IDs for translation. To change this behaviour, you need to set the UniprotMapping.swissprot attribute to None:
import pypath
pypath.maps.mapListUniprot[('entrez', 'uniprot')].swissprot = None
m = pypath.mapping.Mapper()
After this, it could translate all the 107 IDs that were missing above.
Indeed, this is not documented, but I will add it to the documentation soon, as it might be useful for others.
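For converting a whole edge list while keeping track of what gets lost, a small helper along these lines may be useful. This is a sketch, not part of pypath: `translate_edges` and `toy_translate` are hypothetical names, and `translate` stands in for any function that returns a list of target IDs for one source ID (for example a thin wrapper around `m.map_name(uid, 'uniprot', 'entrez')` from the snippets above). The toy edges are made up from IDs mentioned in this thread and are not real interactions.

```python
from itertools import product


def translate_edges(edges, translate):
    """Translate both ends of each edge; report edges that cannot be mapped."""
    translated, lost = set(), []
    for a, b in edges:
        targets_a, targets_b = translate(a), translate(b)
        if not targets_a or not targets_b:
            lost.append((a, b))
            continue
        # One source ID may map to several targets: keep every combination.
        translated.update(product(targets_a, targets_b))
    return translated, lost


# Toy lookup table built from the two ID pairs quoted in this thread.
toy_table = {'P00533': ['1956'], 'Q8WY21': ['114815']}
toy_edges = [('P00533', 'Q8WY21'), ('P00533', 'B3KWN9')]


def toy_translate(uid):
    return toy_table.get(uid, [])


translated, lost = translate_edges(toy_edges, toy_translate)
print('translated:', translated)
print('lost:', lost)
```

Returning the lost edges alongside the translated ones makes it easy to compare different mapper settings on the same input.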
How do I convert UniProt IDs to Entrez IDs using mapping.py? It seems Entrez IDs can be converted to UniProt IDs using this tool, but not the other way round. I tried UniProt's ID mapping for this conversion, but a subset of the UniProt ID nodes have no Entrez ID mapped (because of obsolete UniProt IDs, etc.). Is there a lossless way of doing this conversion directly using pypath?