mengqvist opened this issue 1 year ago
Hi, Yes, it is possible that there are SMILES in EnzymeMap that are not in the compound_to_smiles.json, namely in the following cases:
So your code and observation is correct, but a feature, not a bug :)
Also (just a note, I think you had consistently used version 1 above), I just updated GitHub to version 2 of EnzymeMap and will update Zenodo later today (so make sure to not accidentally mix up files today)
Thanks for the quick response! :) That makes a lot of sense. So essentially these are potential errors in the BRENDA database, which makes them even more interesting.
How should I go about getting a name / identifier for these that would link them to an actual chemical? Any fixed identifier would do. pubchempy resolves some of them, but far from all.
Running the mapping now, will attach to a comment later if you're interested.
Mmh, I would have said Pubchempy, too. You can try different options, e.g. send the SMILES or convert to InChI first (sometimes that helps). You could also try the NCI CACTUS resolver (Chemical Identifier Resolver), https://cactus.nci.nih.gov/chemical/structure (there's an API). It's a bit slow, but you only have to run it once, I guess.
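As a rough illustration, the resolver can be hit over plain HTTP like this (untested sketch; the URL pattern and the 'iupac_name' representation are taken from the CACTUS documentation, the helper name is made up):

import requests
from urllib.parse import quote

def cactus_resolve(smiles, representation="iupac_name"):
    # CACTUS URL pattern: /chemical/structure/<identifier>/<representation>
    url = f"https://cactus.nci.nih.gov/chemical/structure/{quote(smiles, safe='')}/{representation}"
    resp = requests.get(url, timeout=30)
    # the service answers with an error status if it cannot resolve the structure
    return resp.text.strip() if resp.ok else None

print(cactus_resolve("CCO"))  # e.g. 'ethanol'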
For name to SMILES, opsin is also a good option, but I just checked, it doesn't work the other way around.
Yes, happy to see the new outcome
In the new file there are ~8900 SMILES without names. About 2000 of those could be mapped with pubchempy.
Now trying the strategy of converting the SMILES to InChI using RDKit and then mapping with pubchempy. Only have limited experience with RDKit, but it seems to not like the SMILES very much (see below). I do get additional matches with this strategy though.
from rdkit import Chem

# convert a SMILES string to InChI (smiles is a placeholder here for one of the unresolved entries)
smiles = "CCO"
inchi = Chem.MolToInchi(Chem.MolFromSmiles(smiles))
[16:32:08] WARNING: Accepted unusual valence(s): S(3)
[16:32:10] WARNING: Charges were rearranged
[16:32:11] WARNING: Omitted undefined stereo
[16:32:16] WARNING: Omitted undefined stereo
[16:32:03] WARNING: Omitted undefined stereo
[16:32:05] WARNING: Omitted undefined stereo
[16:32:06] WARNING: Charges were rearranged
[16:32:08] WARNING: Charges were rearranged
[16:32:10] WARNING: Charges were rearranged
[16:32:11] WARNING: Omitted undefined stereo
[16:32:16] WARNING: Omitted undefined stereo
[16:32:31] WARNING: Omitted undefined stereo
[16:32:03] WARNING: Omitted undefined stereo
[16:32:05] WARNING: Omitted undefined stereo
[16:32:06] WARNING: Charges were rearranged
[16:32:08] WARNING: Charges were rearranged
[16:32:10] WARNING: Charges were rearranged
Are there 17 warnings for the ~8900 molecules? Then I think that is not too bad. Sometimes stereo information is deleted in the products when it is not defined in the reactants for atoms far away from the reactive center, and I simply remove the stereochemistry flag. So I guess the "Omitted undefined stereo" comes from there. I have never seen "Charges were rearranged"; I have to dig into this to see what it means.
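Roughly, removing the stereochemistry flag amounts to something like this in RDKit (illustration only, not the exact EnzymeMap code):

from rdkit import Chem

mol = Chem.MolFromSmiles("N[C@@H](C)C(=O)O")  # L-alanine
Chem.RemoveStereochemistry(mol)               # strips chiral tags and bond stereo in place
print(Chem.MolToSmiles(mol))                  # achiral alanine SMILES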
Also, did it take that long to resolve via PubchemPy (or were you only writing just now)? If it really takes that long, I can send you a script that resolves ~10k molecules via PubchemPy in a few minutes.
No, that's just a very small subset; lots and lots of warnings, I'd say for most of them. If the stereochemistry flag is removed, that could explain why the SMILES don't match.
Yes, it took that long. About 1.5 s per compound. Would love to get the script! :)
Mmh, maybe we would have needed another round of standardization after removing the flag; I have to look into this.
I'll post the script in a second
Mmh, okay, adapting the script takes a bit longer than I thought (I only have name to SMILES, thought it would be trivial to reverse). I can continue tomorrow. I guess you are using pubchempy at the moment (which is very slow); I would recommend switching to PUG (although this involves some unpleasant XML). You can look into query_pubchem() in helpers_resolve_smiles.py if you want to get started already. The query needs to be changed, plus the desired output, according to https://pubchem.ncbi.nlm.nih.gov/docs/power-user-gateway#section=FAQs
The RDKit SMILES-to-InChI transformation followed by PubChemPy only gave me two more resolved molecules. So sadly not a solution. :(
Will have a look at PUG (how can you go wrong with something called "Power User Gateway"?), but based on what I saw with RDKit my expectations are low.
What are the relative merits of:
Okay, then don't spend time on PubChem PUG, it will give you the same results, only faster. I think I need to fix the underlying issue instead.
1) would be possible but doesn't work for suggested reactions
2) simply not removing/editing them will lead to lots of lost reactions, since BRENDA has a lot of stereochemistry problems due to the ambiguity in resolving names to SMILES. However, there might be a better way to do this; I will need to check why the SMILES are weird, and can possibly fix it.
3) Possible, but then we might have different SMILES for the same name (which we do have anyways, so not a big problem I guess). Will also not work for suggested reactions.
So I need to look into whether the main contributor to the non-standardized SMILES is the suggestion or stereochemistry correction (or something else). If the suggestion is not a problem, we can simply do 1 or 3. In any case, we shouldn’t get SMILES that are not resolvable via pubchempy. I also need to check whether this affects small, large and „exotic“ compounds alike.
Sounds like a plan! Agree that all SMILES should be resolvable. Still hoping that there is some error on my end and that more of the 8.9k are actually resolvable. Attached is a json of the 2.3k that I could get: smiles_to_missing.zip
Thanks! Since I encountered the same problem once: your pubchempy requests really do go through for all compounds and just return an empty list, right? Because depending on how you set up your scripts, PubChem might block your IP address for a few minutes if you send a lot of requests, and if you had just done e.g. a bare try/except you will never see the HTTP error you get; it then simply looks like the compounds don't resolve.
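Something like the sketch below would make the blocking visible instead of silently returning nothing (untested; I believe pubchempy raises PubChemHTTPError in that case):

import time
import pubchempy as pcp

def resolve_name(smiles, retries=3):
    for _ in range(retries):
        try:
            hits = pcp.get_compounds(smiles, namespace='smiles')
            return hits[0].iupac_name if hits else None
        except pcp.PubChemHTTPError as e:
            # being blocked/throttled is not the same as "compound not found"
            print("PubChem HTTP error, backing off:", e)
            time.sleep(60)
    return None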
Nope, definitely did the bad version of what you describe. Sounds like there will be unseen http errors in there.
Haha, I did exactly the same a few years ago :) Then I guess the low number of resolvable compounds is due to this, and we might want to switch to PUG after all (power users don't get blocked…). I nevertheless want to take a look at the InChI warnings; they give me a bad feeling that there is some standardization missing or going wrong (which wouldn't affect any results in the paper, but would haunt me for the sake of correctness).
Okay, so here's the PUG solution. This resolved the 2000 compounds in 15 seconds, yay. I also made it work for InChI and for different output types (e.g. 'iupac' for the name, 'synonyms' for all names, etc.):
import os
from lxml import etree as ET
import requests
import shutil
import urllib.request as request
from contextlib import closing
import time
from collections import defaultdict
from rdkit import Chem


def download_ftp(ftp_link: str, outfile: str):
    """download_ftp.

    Args:
        ftp_link (str): ftp_link
        outfile (str): outfile
    """
    ftp_link = ftp_link.replace('ftp', 'https')
    with closing(request.urlopen(ftp_link)) as r:
        with open(outfile, 'wb') as f:
            shutil.copyfileobj(r, f)


def query_pubchem_name(ids: list,
                       query_type: str = "smiles",
                       output_type: str = "iupac",
                       save_file: str = "pubchem_save.txt",
                       rm_save: bool = True,
                       encoding: str = "utf8",
                       return_single: bool = False) -> dict:
    """query_pubchem.

    Args:
        ids (list): identifiers to resolve
        query_type (str): smiles or inchi
        output_type (str): e.g. iupac or synonyms
        save_file (str): temp file for the downloaded result
        rm_save (bool): If true, delete the saved file afterward
        encoding (str): Encoding to send request, defaults to utf-8
        return_single (bool): If true, only return the top hit per id

    Return:
        dict mapping ids to lists of resolved values
    """
    WAIT_TIME = 10
    DOCHEADER = "<!DOCTYPE PCT-Data PUBLIC \"-//NCBI//NCBI PCTools/EN\" \"NCBI_PCTools.dtd\">"
    URL = "https://pubchem.ncbi.nlm.nih.gov/pug/pug.cgi"
    REQUIRED_HEADINGS = [
        "PCT-Data_input", "PCT-InputData", "PCT-InputData_query", "PCT-Query",
        "PCT-Query_type", "PCT-QueryType", "PCT-QueryType_id-exchange",
        "PCT-QueryIDExchange"
    ]
    QUERY_SUBTREE_NAMES = ["PCT-QueryIDExchange_input", "PCT-QueryUids"]
    OUTPUT_HEADERS = [
        "PCT-QueryIDExchange_operation-type", "PCT-QueryIDExchange_output-type",
        "PCT-QueryIDExchange_output-method", "PCT-QueryIDExchange_compression"
    ]
    OUTPUT_VALUES = ["same", output_type, "file-pair", "none"]

    # Start building the query tree
    root = ET.Element("PCT-Data")
    cur_pos = root
    for new_heading in REQUIRED_HEADINGS:
        cur_pos = ET.SubElement(cur_pos, new_heading)

    # Navigate down to where we add the identifiers
    query_subtree = cur_pos
    for query_subtree_name in QUERY_SUBTREE_NAMES:
        query_subtree = ET.SubElement(query_subtree, query_subtree_name)

    if query_type == "smiles":
        query_root, query_name = "PCT-QueryUids_smiles", "PCT-QueryUids_smiles_E"
        query_subtree = ET.SubElement(query_subtree, query_root)
        for id_ in ids:
            new_id = ET.SubElement(query_subtree, query_name)
            # give this the id text
            try:
                new_id.text = id_
            except ValueError:
                pass
    elif query_type == "inchi":
        query_root, query_name = "PCT-QueryUids_inchi", "PCT-QueryUids_inchi_E"
        query_subtree = ET.SubElement(query_subtree, query_root)
        for id_ in ids:
            new_id = ET.SubElement(query_subtree, query_name)
            # give this the id text
            try:
                new_id.text = id_
            except ValueError:
                pass

    # Go back up to the current position holder
    # Add the output specification
    for output_header, output_value in zip(OUTPUT_HEADERS, OUTPUT_VALUES):
        output_xml = ET.SubElement(cur_pos, output_header)
        output_xml.set("value", output_value)

    out_xml = ET.tostring(root,
                          encoding=encoding,
                          method="xml",
                          xml_declaration=True,
                          doctype=DOCHEADER).decode()

    # Post the request!
    resp = requests.post(URL, data=out_xml.encode('utf-8'))

    # Handle response and build a request to check on status
    resp_tree = ET.fromstring(resp.text)
    waiting_id = resp_tree.xpath("//PCT-Waiting_reqid")
    waiting_id = waiting_id[0].text if waiting_id else None

    STATUS_CHECK_HEADERS = [
        "PCT-Data_input", "PCT-InputData", "PCT-InputData_request",
        "PCT-Request"
    ]
    root = ET.Element("PCT-Data")
    cur_pos = root
    for header in STATUS_CHECK_HEADERS:
        cur_pos = ET.SubElement(cur_pos, header)
    req_id = ET.SubElement(cur_pos, "PCT-Request_reqid")
    req_id.text = waiting_id
    req_type = ET.SubElement(cur_pos, "PCT-Request_type")
    req_type.set("value", "status")
    query_xml = ET.tostring(root,
                            encoding=encoding,
                            method="xml",
                            xml_declaration=True,
                            doctype=DOCHEADER).decode()

    download_link = None
    waiting_time = 0
    # TODO: Add stop timeout condition?
    # Repeatedly query to see if the results are done, then sleep for WAIT_TIME
    # in case they aren't
    while not download_link:
        resp = requests.post(URL, data=query_xml.encode('utf-8'))
        resp_tree = ET.fromstring(resp.text)
        download_link = resp_tree.xpath("//PCT-Download-URL_url")
        download_link = download_link[0].text if download_link else None
        time.sleep(WAIT_TIME)
        waiting_time += WAIT_TIME

    # At conclusion, download the ftp file
    download_ftp(download_link, save_file)

    # Also parse this
    ret_dict = defaultdict(lambda: [])
    with open(save_file, "r") as fp:
        for linenum, line in enumerate(fp):
            line = line.strip()
            split_line = line.split("\t")
            if len(split_line) == 2:
                mol, value = split_line
                mol = mol.strip()
                value = value.strip()
                ret_dict[mol].append(value)

    # If we should only return a single item and not a list
    if return_single:
        ret_dict = {k: v[0] for k, v in ret_dict.items() if len(v) > 0}

    # Remove temp file
    if os.path.exists(save_file) and rm_save:
        os.remove(save_file)

    return ret_dict


# Input SMILES, output iupac
d = query_pubchem_name(['OCC', 'C(O)C', 'CCO', 'C(CCN)C[C@@H](C(=O)O)N'],
                       query_type='smiles', output_type='iupac')
for k in d.keys():
    print(k, d[k])

# Input SMILES, output synonyms
#d = query_pubchem_name(['OCC', 'C(O)C', 'CCO', 'C(CCN)C[C@@H](C(=O)O)N'],
#                       query_type='smiles', output_type='synonyms')
#for k in d.keys():
#    print(k, d[k])

# Input InChI, output iupac
#l = [Chem.MolToInchiKey(Chem.MolFromSmiles(x)) for x in ['OCC', 'C(O)C', 'CCO', 'C(CCN)C[C@@H](C(=O)O)N']]
#d = query_pubchem_name(l, query_type='inchi', output_type='iupac')
#for k in d.keys():
#    print(k, d[k])

import json
with open("smiles_to_missing.json") as f:
    ll = json.load(f)
print(list(ll.keys())[:10])
d = query_pubchem_name(list(ll.keys()), query_type='smiles', output_type='iupac')
for k in d.keys():
    print(k, d[k])
Can you try to query all 8.9k compounds again and report back how many are truly unresolvable (plus e.g. a list)? I will then take a look at where they originate from (suggested, stereochemistry correction, etc.).
Okay I also cannot get more resolved, seems like your queries actually went through (maybe pubchempy improved on this in the last two years!)
So, I have now checked which reactions are affected. From the list of unmapped (but processed) reactions, with duplicates removed, there are about 10k reactions with compounds not in the json file:
So, for nearly all of the missing SMILES strings, there is a good reason why they are missing. That doesn't solve your initial problem though, which is to resolve them. Apart from trying a few different resolvers, I don't have a good answer here. Maybe you also don't want to use "suggested" and "single from multi" entries, since they are putative anyways. Then you would only be left with the 3.9k changed stereochemistry ones, which we could possibly link back to the name with unchanged stereochemistry.
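Roughly, that link-back could look like the sketch below (assuming compound_to_smiles.json maps each name to a single SMILES string; adjust if the orientation or format differs):

import json
from rdkit import Chem

def achiral(smi):
    mol = Chem.MolFromSmiles(smi)
    return Chem.MolToSmiles(mol, isomericSmiles=False) if mol else None

with open("compound_to_smiles.json") as f:
    name_to_smiles = json.load(f)  # assumed: name -> SMILES

# index names by their stereochemistry-stripped canonical SMILES
achiral_to_names = {}
for name, smi in name_to_smiles.items():
    key = achiral(smi)
    if key:
        achiral_to_names.setdefault(key, []).append(name)

def link_back(unresolved_smiles):
    # candidate names whose SMILES only differ in stereochemistry
    return achiral_to_names.get(achiral(unresolved_smiles), [])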
Cool! Really appreciate you looking into this! Agree that 200 out of such a large dataset is a rounding error. Will likely do as you suggest and remove "suggested" and "single from multi" entries. On the topic of resolvers, I'm going to try CirPy in addition to the ones we've tried already; it seems to be a Python wrapper for CACTUS. Will report back whether I can resolve any more.
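For reference, the CIRpy call I have in mind is minimal (it uses the same CACTUS service underneath; 'iupac_name' should be one of the supported representations):

import cirpy

print(cirpy.resolve("CCO", "iupac_name"))  # e.g. 'ethanol'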
Poking around with the SMILES yesterday I found some reactions that did not behave as I would expect. I don't know how often these occur though; perhaps they are rare.
Clearly there will always be some records that get incorrectly parsed and that's just the way things are. Just thought I would share my findings all the same in case they are indicative of something that can be improved/changed.
Thanks! Looks like I messed up when fixing the isomerase bug (that's the -1 ids), thanks for reporting, will try to get this fixed asap. I fixed the issue for the 200 wrong ones today, and will try to investigate what you were reporting.
Regarding reactions with rule_id=-1: in general, that rule_id is assigned if the achiral reactants equal the achiral products. This is necessary to correctly map cis/trans and D/L isomerases, because there is no corresponding rule in the Broadbelt reaction set (which does not account for chirality). So rule_id=-1 doesn't mean that the reaction is wrong per se. Now, regarding redox reactions, I have filtered all EC=1.*.*.* with rule_id=-1 and find 17 unique reactions:
* 8 are the same on reactant and product side because they are the same on both sides in BRENDA. In some of those cases, e.g. "NAD+ + NADH = NADH + NAD+", there obviously is a reaction which we are not detecting. However, these reactions might not be very interesting from a synthesis perspective anyway.
* 5 correspond to cis/trans isomerase reactions and match the respective BRENDA entry (although that sounds like a misclassification to me).
* 4 are the same on reactant and product side although the BRENDA entry is not, because sulfur, H2S and sulfide all get resolved to "H2S" when changing trivial names to SMILES, and another one resolves ferricytochrome c and ferrocytochrome c to the same SMILES.
So for redox reactions (EC=1.*.*.*) I didn't find anything unusual (although I agree they could have been filtered out). Did you mean other reactions? Since you mentioned explicit Hs, I somehow think you are referring to other reactions. It should not matter whether the hydrogens are explicit or implicit, since a reduction/oxidation changes the number of hydrogens, and that is reflected in both versions of SMILES; we just do not track which hydrogen is which.
For isomerase reactions (filtered all EC=5.*.*.* with rule_id=-1) there are 1271 unique reactions. For 1163 of them, the SMILES is not the same on reactant and product side, but differs in stereochemistry, e.g. cis/trans or D/L, so that is correct from my point of view (maybe you did not mean those, or maybe you looked at achiral educts/products?). Many of those were wrong in version 1, because we didn't have rule_id=-1 and then we got rules that changed some atom numbers which did not reflect reality. For the 108 remaining ones, the reactant and product side is the same. For some of them, the BRENDA entry also has the same educts/products. However, most of them actually do undergo some isomerisation in the BRENDA entry, but the names get resolved to the same SMILES. I am not sure how to deal with this in an automated way (apart from removing reactions where the educts equal the products altogether). Since these are not many reactions, it would also be doable to manually correct the SMILES in the next version.
So, I am glad that there is no problem in the mapping itself (no bug introduced when I corrected the isomerase mappings); the issues arise from resolving names to SMILES, or are already in BRENDA. Nevertheless, these should be corrected in the next version.
I will look at the other problems later today.
For the second question: the reaction could not be mapped via rule application. It was therefore assigned suggested products (the source='suggested' flag is set), meaning that the products do not reflect the original reaction text. Suggested products get assigned based on rules that already occur in the same EC class. Here, for example, reaction 319574 (rxn_idx=297591) says that 'S-(methylthiomethyl)cysteine 4-oxide = (methylthio)methane-sulfenic acid + 2-aminoacrylate' (nearly the same reactants apart from a side chain far away from the reactive center); based on these and other examples the new products are suggested. Since there are many examples where only the S-C bond is cleaved, this reaction seems more likely than the one recorded in BRENDA. So I think this one is treated correctly. However, if you do not want suggested products, you can filter for them (via the 'source' column).
For the third question: the mapping is also flagged as 'suggested', meaning that unfortunately the native BRENDA entry could not be mapped. So if you filter out suggested reactions, you would not see this one. I think I have to add a filter that drops suggested reactions if they produce the exact same products. So thanks a lot for bringing this up.
The reason that the reaction could not be mapped is that the reaction is not balanced (the product side is missing a water molecule). I.e., the BRENDA entry is 'D-glucose + kojibiose = 4-O-alpha-kojibiosyl-D-glucose {r}', but should have been 'D-glucose + kojibiose = 4-O-alpha-kojibiosyl-D-glucose + water {r}'. If I manually add the water, it resolves fine to the expected reaction. I will have to check whether other reactions are also affected by missing water molecules; maybe that could be added to the preprocessing.
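Just to illustrate that preprocessing idea (not something that is in the pipeline yet), a quick element-balance check over a reaction SMILES could look like this:

from collections import Counter
from rdkit import Chem

def element_counts(side):
    counts = Counter()
    for smi in side.split('.'):
        mol = Chem.AddHs(Chem.MolFromSmiles(smi))
        counts.update(atom.GetSymbol() for atom in mol.GetAtoms())
    return counts

def is_balanced(rxn_smiles):
    # assumes the 'reactants>>products' form without an agents field
    reactants, products = rxn_smiles.split('>>')
    return element_counts(reactants) == element_counts(products)

print(is_balanced('CCO>>CC=O'))  # False: the product side lost two hydrogens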
In all cases, thank you very much for reporting! These are definitely things that can be improved in the next version, and I will happily add them. From what I have seen, these should not occur frequently, and not impact the overall usability/performance of EnzymeMap a lot, so for anyone using Version 2, they should not be a big problem.
I think you wanted to filter for suggested reactions anyways (and maybe it would also be good to filter everything with reactants=products for now), so most of these should not affect your work hopefully.
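If it helps, something like this should do for that filtering (just a sketch: 'source' and the 'suggested' / 'single from multi' labels are from the discussion above, but the exact column names, in particular the one holding the reaction SMILES, are guesses and may need adjusting):

import pandas as pd

df = pd.read_csv("processed_reactions.csv")

# drop putative entries (exact source labels may differ)
df = df[~df["source"].isin(["suggested", "single from multi"])]

# drop reactions whose reactant side equals the product side
def sides_equal(rxn_smiles):
    parts = str(rxn_smiles).split(">>")
    return len(parts) == 2 and sorted(parts[0].split(".")) == sorted(parts[1].split("."))

df = df[~df["unmapped"].apply(sides_equal)]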
Yes, the redox things I looked at are related to what you mentioned. A few examples below, all with reaction rule -1:
S>>S
NC@@HC(=O)O.S>>NC@@HC(=O)O.S
COC1=C(OC)C(=O)C(C/C=C(\C)CC/C=C(\C)CC/C=C(\C)CC/C=C(\C)CC/C=C(\C)CC/C=C(\C)CC/C=C(\C)CC/C=C(\C)CC/C=C(\C)CCC=C(C)C)=C(C)C1=O.S>>COC1=C(OC)C(=O)C(C/C=C(\C)CC/C=C(\C)CC/C=C(\C)CC/C=C(\C)CC/C=C(\C)CC/C=C(\C)CC/C=C(\C)CC/C=C(\C)CC/C=C(\C)CCC=C(C)C)=C(C)C1=O.S
Many thanks for following up on all these different comments! Seems to me that the impact on the dataset itself is indeed quite small, and that these are edge cases. Suggest we close the issue.
I want to obtain the chemical name for each metabolite in the EnzymeMap dataset. Specifically, I'm interested in the balanced reactions. However, with the files provided it seems that there are about 20% of SMILES for which there is no name mapping. The parsing is based on the assumption that the "non-bond" symbol "." separates the individual substrate and product molecules.
The relevant files here are processed_reactions.csv and compound_to_smiles.json in the data folder. The code below prints out the following messages:
Am I doing something wrong here? Is there a good reason for the missing SMILES-to-name mappings? Or is it perhaps a bug?
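For reference, a minimal version of the check described above (a sketch rather than the original code; it assumes compound_to_smiles.json maps names to SMILES and that the reaction SMILES sit in a column called 'unmapped', which may not be the real column name):

import json
import pandas as pd

with open("data/compound_to_smiles.json") as f:
    name_to_smiles = json.load(f)            # assumed: name -> SMILES
known_smiles = set(name_to_smiles.values())

df = pd.read_csv("data/processed_reactions.csv")

missing = set()
for rxn in df["unmapped"]:                   # assumed column holding 'reactants>>products'
    reactants, products = str(rxn).split(">>")
    for smi in reactants.split(".") + products.split("."):
        if smi not in known_smiles:
            missing.add(smi)

print(f"{len(missing)} SMILES have no name mapping")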