SMILES to name mapping, missing info?

mengqvist commented 1 year ago

I want to obtain the chemical name for each metabolite in the EnzymeMap dataset. Specifically I'm interested in the balanced reactions. However, with the files provided it seems that there are about 20 % of smiles for which there is no name mapping. The parsing is based on the assumption that the "non-bond" symbol "." is used to separate substrates and products.

The relevant files here are processed_reactions.csv and compound_to_smiles.json in the data folder. The code below prints out the following messages:

There are 43903 unique smiles in the reaction dataset. There are 52498 unique smiles in the name to smile mapping file. There are 8009 smiles in the enzymemap dataset that are not in the name to smile mapping file.

Am I doing something wrong here? Is there a good reason for the missing smile to name mappings? Or is it perhaps a bug?

from os.path import join
import pandas as pd
import json

### get all the smiles ###

# load the data
df = pd.read_csv(join('data', 'processed_reactions.csv'), 
                 sep=',')

# now iterate through each row
enzymemap_smiles = set()
for reaction in df.unmapped.values:

    # get the substrates and products
    substrates, products = reaction.split('>>')
    for smiles in [*substrates.split('.'), *products.split('.')]:
        enzymemap_smiles.add(smiles)

print(f'There are {len(enzymemap_smiles)} unique smiles in the reaction dataset.')

### get the names to smiles mapping ###

# obtain smile to name mapping
with open(join('data', 'compound_to_smiles.json'), 'r') as f:
    name_to_smiles = json.loads(f.read())

# invert to get smile to name mapping
smile_to_name = {}
for name, smiles in name_to_smiles.items():
    if smiles == []:
        continue

    for s in smiles:
        if smile_to_name.get(s) is None:
            smile_to_name[s] = []

        smile_to_name[s].append(name)

print(f'There are {len(smile_to_name)} unique smiles in the name to smile mapping file.')

### ensure that all the smiles in the enzymemap dataset have a name ###

# cross-reference with the enzymemap dataset
missing = enzymemap_smiles - set(smile_to_name.keys())

print(f'There are {len(missing)} smiles in the enzymemap dataset that are not in the name to smile mapping file.')

hesther commented 1 year ago

Hi, Yes, it is possible that there are SMILES in EnzymeMap that are not in the compound_to_smiles.json, namely in the following cases:

the reaction was corrected for stereochemistry
the reaction outcome was suggested due to an imbalanced or non-mapped reaction

So your code and observation is correct, but a feature, not a bug :)

Also (just a note, I think you had consistently used version 1 above), I just updated GitHub to version 2 of EnzymeMap and will update Zenodo later today (so make sure to not accidentally mix up files today)

mengqvist commented 1 year ago

Thanks for the quick response! :) That makes a lot of sense. So essentially these are potential errors in the BRENDA database, which makes them even more interesting.

How should I go about getting a name / identifier for these that would link them to an actual chemical? Any fixed identifier would do. pubchempy resolves some of them, but far from all.

mengqvist commented 1 year ago

Running the mapping now, will attach to a comment later if you're interested.

hesther commented 1 year ago

Mmh, I would have said Pubchempy, too. You can try different options, e.g. send the SMILES or convert to INCHI first (sometimes that helps). You could also try the RCI resolver https://cactus.nci.nih.gov/chemical/structure (there's an API), it's a bit slow, but you only have to run it once I guess

hesther commented 1 year ago

For name to SMILES, opsin is also a good option, but I just checked, it doesn't work the other way around.

Yes, happy to see the new outcome

mengqvist commented 1 year ago

In the new file there are ~8900 SMILES without names. About 2000 of those could be mapped with pubchempy.

Now trying the strategy of converting the SMILES to INCHI using Rdkit and then mapping with pubchempy. Only have limited experience with Rdkit, but it seems to not like the SMILES very much (see below). I do get additional matches with this strategy though.

from rdkit import Chem

# convert smiles to inchi
inchi = Chem.MolToInchi(Chem.MolFromSmiles(smiles))

[16:32:08] WARNING: Accepted unusual valence(s): S(3) [16:32:10] WARNING: Charges were rearranged [16:32:11] WARNING: Omitted undefined stereo [16:32:16] WARNING: Omitted undefined stereo [16:32:03] WARNING: Omitted undefined stereo [16:32:05] WARNING: Omitted undefined stereo [16:32:06] WARNING: Charges were rearranged [16:32:08] WARNING: Charges were rearranged [16:32:10] WARNING: Charges were rearranged [16:32:11] WARNING: Omitted undefined stereo [16:32:16] WARNING: Omitted undefined stereo [16:32:31] WARNING: Omitted undefined stereo [16:32:03] WARNING: Omitted undefined stereo [16:32:05] WARNING: Omitted undefined stereo [16:32:06] WARNING: Charges were rearranged [16:32:08] WARNING: Charges were rearranged [16:32:10] WARNING: Charges were rearranged

hesther commented 1 year ago

Are there 17 warning for the ~8900 molecules? Then I think that is not too bad. Sometimes stereo information is deleted in the products when it is not defined in the reactants for atoms far away from the reactive center, and I simply remove the stereochemistry flag. So I guess the "Omitted undefined stereo" comes from there. I have never seen the "Charges were rearranged", have to dig into this to see what it means.

Also, did it take so long to resolve via PubchemPy (or were you just writing only now) - if this takes so long, I can send you a script that will resolve ~10000k molecules via PubchemPy in a few minutes

mengqvist commented 1 year ago

No, that's just a very small subset. Lots and lots of warnings. I'd say for most of them. If the stereochemistry flag is removed that could explain why the SMILES don't match.

Yes, it took that long. About 1.5 s per compound. Would love to get the script! :)

hesther commented 1 year ago

Mmh maybe we would have needed another round of standardization after removing the flag, I have to look into this.

I'll post the script in a second

hesther commented 1 year ago

mmh okay adapting the script takes a bit longer than I though (I only have name to smiles, thought it would be trivial to reverse). I can continue tomorrow - I guess you are using pubchempy at the moment (which is very slow) - I would recommend switching to PUG (although this involves some unpleasant XML), you can look into query_pubchem() in helpers_resolve_smiles.py if you want to get started already. The query needs to be changed + the desired output according to https://pubchem.ncbi.nlm.nih.gov/docs/power-user-gateway#section=FAQs

mengqvist commented 1 year ago

The Rdkit SMILES to INCHI transformation followed by PubChemPy only gave me two more molecules resolved. So sadly not a solution. :(

Will have a look at PUG (how can you go wrong with something called "Power User Gateway"?), but based on what I saw with Rdkit my expectations are low.

What are the relative merits of:

Taking the name from the BRENDA raw text field (I know they are not in the same order as the smiles, but I'm sure this can be resolved somehow).
Not editing the SMILES by removing the stereochemistry flag.
Mapping the SMILES to names before stereochemistry editing and using that name also for the edited SMILE vector?

hesther commented 1 year ago

Okay then don’t spend time on Pubchem PUG, it will give you the same results only faster. I think I need to fix the underlying issue instead.

1) would be possible bit doesn’t work for suggested reactions

2) simply not removing/editing them will create lots of lost reactions, since BRENDA has a lot of stereochemistry problems due to the ambiguity in resolving names to SMILES. However, there might be a better way to do this, I will need to check why the SMILES are weird, and can possibly fix it.

3) Possible, but then we might have different SMILES for the same name (which we do have anyways, so not a big problem I guess). Will also not work for suggested reactions.

So I need to look into whether the main contributor to the non-standardized SMILES is the suggestion or stereochemistry correction (or something else). If the suggestion is not a problem, we can simply do 1 or 3. In any case, we shouldn’t get SMILES that are not resolvable via pubchempy. I also need to check whether this affects small, large and „exotic“ compounds alike.

mengqvist commented 1 year ago

Sounds like a plan! Agree that all smiles should be resolvable. Still hoping that there is some error on my end and that more of the 8.9 K are actually resolvable. Attached is a json of the 2.3 K that I could get. smiles_to_missing.zip

hesther commented 1 year ago

Thanks! Since I encountered the same problem once: your pubchempy requests do really go through for all compounds and just return an empty list, right? Because depending on how you set up your scripts, pubchem might block your IP address for a few minutes if you send a lot of requests and if you had just done eg a try/except you will never see the http error you get and it seems like compounds don’t resolve.

mengqvist commented 1 year ago

Nope, definitely did the bad version of what you describe. Sounds like there will be unseen http errors in there.

hesther commented 1 year ago

Haha I did exactly the same a few years ago :) then I guess the low number of resolvable compounds is due to this, and we might want to switch to PUG after all (power users don’t get blocked…). I nevertheless want to take a look at the inchi warnings, they give me a bad feeling that there is some standardization missing or going wrong (which wouldn’t affect any results in the paper, but would haunt me for the sake of correctness)

hesther commented 1 year ago

Okay so here's the PUG solution. This resolved the 2000 compounds in 15 seconds, yay (I also made it work for inchi, and different output types (e.g. 'iupac' for the name, 'synonyms' for all names, etc):

import os
from lxml import etree as ET
import requests
import shutil
import urllib.request as request
from contextlib import closing
import time
from collections import defaultdict
from rdkit import Chem

def download_ftp(ftp_link: str, outfile: str):
    """download_ftp.

    Args:
        ftp_link (str): ftp_link
        outfile (str): outfile
    """

    ftp_link=ftp_link.replace('ftp','https')
    with closing(request.urlopen(ftp_link)) as r:
        with open(outfile, 'wb') as f:
            shutil.copyfileobj(r, f)

def query_pubchem_name(ids: list,
                   query_type: str = "smiles",
                   output_type: str = "iupac",
                   save_file: str = "pubchem_save.txt",
                   rm_save: bool = True, 
                   encoding : str ="utf8",
                   return_single : bool = False) -> dict:
    """query_pubchem.

    Args:
        ids (list):
        query_type (str): inchi, chebi, or synonym
        save_file (str):
        rm_save (bool): If true, delete the file saved afterward
        encoding (str): Encoding to send request, defaults to utf-8
        return_single (bool): If true, only return the top hit for smiles

    Return:
        dict mapping ids to smiles lists
    """

    WAIT_TIME = 10
    DOCHEADER = "<!DOCTYPE PCT-Data PUBLIC \"-//NCBI//NCBI PCTools/EN\" \"NCBI_PCTools.dtd\">"
    URL = "https://pubchem.ncbi.nlm.nih.gov/pug/pug.cgi"
    REQUIRED_HEADINGS = [
        "PCT-Data_input", "PCT-InputData", "PCT-InputData_query", "PCT-Query",
        "PCT-Query_type", "PCT-QueryType", "PCT-QueryType_id-exchange",
        "PCT-QueryIDExchange"
    ]

    QUERY_SUBTREE_NAMES = ["PCT-QueryIDExchange_input", "PCT-QueryUids"]
    OUTPUT_HEADERS = [
        "PCT-QueryIDExchange_operation-type", "PCT-QueryIDExchange_output-type",
        "PCT-QueryIDExchange_output-method", "PCT-QueryIDExchange_compression"
    ]
    OUTPUT_VALUES = ["same", output_type, "file-pair", "none"]

    # Start building the query tree
    root = ET.Element("PCT-Data")
    cur_pos = root

    for new_heading in REQUIRED_HEADINGS:
        cur_pos = ET.SubElement(cur_pos, new_heading)

    # Navigate down to where we add the inchis
    query_subtree = cur_pos
    for query_subtree_name in QUERY_SUBTREE_NAMES:
        query_subtree = ET.SubElement(query_subtree, query_subtree_name)

    if query_type == "smiles":
        query_root, query_name = "PCT-QueryUids_smiles", "PCT-QueryUids_smiles_E"
        query_subtree = ET.SubElement(query_subtree, query_root)
        for id_ in ids:
            new_id = ET.SubElement(query_subtree, query_name)
            # give this the id text
            try:
                new_id.text = id_
            except ValueError:
                pass
    elif query_type == "inchi":
        query_root, query_name = "PCT-QueryUids_inchi", "PCT-QueryUids_inchi_E"
        query_subtree = ET.SubElement(query_subtree, query_root)
        for id_ in ids:
            new_id = ET.SubElement(query_subtree, query_name)
            # give this the id text
            try:
                new_id.text = id_
            except ValueError:
                pass

    # Go back up to to current position holder
    # Add the output specification
    for output_header, output_value in zip(OUTPUT_HEADERS, OUTPUT_VALUES):
        output_xml = ET.SubElement(cur_pos, output_header)
        output_xml.set("value", output_value)

    out_xml = ET.tostring(root,
                          encoding=encoding,
                          method="xml",
                          xml_declaration=True,
                          doctype=DOCHEADER).decode()

    # Post the request!
    resp = requests.post(URL, data=out_xml.encode('utf-8'))

    # Handle response and build a request to check on status
    resp_tree = ET.fromstring(resp.text)
    waiting_id = resp_tree.xpath("//PCT-Waiting_reqid")
    waiting_id = waiting_id[0].text if waiting_id else None

    STATUS_CHECK_HEADERS = [
        "PCT-Data_input", "PCT-InputData", "PCT-InputData_request",
        "PCT-Request"
    ]

    root = ET.Element("PCT-Data")
    cur_pos = root
    for header in STATUS_CHECK_HEADERS:
        cur_pos = ET.SubElement(cur_pos, header)
    req_id = ET.SubElement(cur_pos, "PCT-Request_reqid")
    req_id.text = waiting_id
    req_type = ET.SubElement(cur_pos, "PCT-Request_type")
    req_type.set("value", "status")
    query_xml = ET.tostring(root,
                            encoding=encoding,
                            method="xml",
                            xml_declaration=True,
                            doctype=DOCHEADER).decode()

    download_link = None
    waiting_time = 0

    # TODO: Add stop timeout condition?
    # Repeatedly query to see if the results are done, then sleep for WAITIME
    # in case they aren't
    while not download_link:
        resp = requests.post(URL, data=query_xml.encode('utf-8'))
        resp_tree = ET.fromstring(resp.text)
        download_link = resp_tree.xpath("//PCT-Download-URL_url")
        download_link = download_link[0].text if download_link else None
        time.sleep(WAIT_TIME)
        waiting_time += WAIT_TIME

    # At conclusion, download the ftp file
    download_ftp(download_link, save_file)

    # Also parse this
    ret_dict = defaultdict(lambda: [])
    with open(save_file, "r") as fp:
        for linenum, line in enumerate(fp):
            line = line.strip()
            split_line = line.split("\t")
            if len(split_line) == 2:
                mol, smiles = split_line
                mol = mol.strip()
                smiles = smiles.strip()
                ret_dict[mol].append(smiles)

    # If we should only return a single item and not a list
    if return_single: 
        ret_dict = {k : v[0] for k,v in ret_dict.items() if len(v) > 0}

    # Remove temp file
    if os.path.exists(save_file) and rm_save:
        os.remove(save_file)

    return ret_dict

# Input SMILES, output iupac
d = query_pubchem_name(['OCC','C(O)C','CCO','C(CCN)C[C@@H](C(=O)O)N'],query_type='smiles',output_type='iupac')
for k in d.keys():
    print(k, d[k])

# Input SMILES, output synonyms
#d = query_pubchem_name(['OCC','C(O)C','CCO','C(CCN)C[C@@H](C(=O)O)N'],query_type='smiles',output_type='synonyms')
#for k in d.keys():
#    print(k, d[k])

# Input Inchi, output iupac
#l = [Chem.MolToInchiKey(Chem.MolFromSmiles(x)) for x in ['OCC','C(O)C','CCO','C(CCN)C[C@@H](C(=O)O)N']]
#d = query_pubchem_name(l,query_type='inchi',output_type='iupac')
#for k in d.keys():
#    print(k, d[k])

import json
with open("smiles_to_missing.json") as f:
    ll=json.load(f)
print(list(ll.keys())[:10])
d = query_pubchem_name(list(ll.keys()),query_type='smiles',output_type='iupac')
for k in d.keys():
    print(k, d[k])

hesther commented 1 year ago

Can you try to query again all 8.9k compounds and report back how many are truly unresolvable (+eg a list)? I will then take a look where they origin from (suggested, stereochemistry correction, etc)

hesther commented 1 year ago

Okay I also cannot get more resolved, seems like your queries actually went through (maybe pubchempy improved on this in the last two years!)

hesther commented 1 year ago

So, I have now checked which reactions are affected. From a list of unmapped reactions (but processed) not containing duplicates, there are about 10k reactions with compounds not in the json file:

4985 are suggested reactions, and thus have products which are likely not in the json file. This is expected and correct
1411 are single step from multi step reactions, which can have reactants or products not in the json file for all intermediate steps. This is also expected and correct
3906 are reactions obtained directly from the entry (or the reversed entry). Out of those: -- 3668 have corrected stereochemistry, and manually checking a few dozen showed that they are ok. So, the stereochemistry checking did not introduce a weird error (like I feared for a hot second), and the SMILES are valid. By the way, I googled a bit about RDKits Inchi functionality and found a lot of people reporting these warnings. Looking manually at SMILES that produced warnings, it seemed like the warnings are caused by e.g. double bonds in rings that don't have a specified stereochemistry, but of course can only be cis since they are in a ring. -- 238 have undergone an unexpected change, and might be the cause of a bug, that I will now hunt down. I guess 200 for a dataset of that size is not a big problem, but for the sake of completeness, I will fix it nevertheless.

So, for nearly all of the missing SMILES strings, there is a good reason why they are missing. That doesn't solve your initial problem though, which is to resolve them. Apart from trying a few different resolvers, I don't have a good answer here. Maybe you also don't want to use "suggested" and "single from multi" entries, since they are putative anyways. Then you would only be left with the 3.9k changed stereochemistry ones, which we could possibly link back to the name with unchanged stereochemistry.

mengqvist commented 1 year ago

Cool! Really appreciate you looking into this! Agree that 200 out of such a large dataset is a rounding error. Will likely do as you suggest and remove "suggested" and "single from multi" enzymes. On the topic of resolvers I'm going to try CirPy in addition to the ones we've tried already. It seems to be a python wrapper for Cactus. Will report back whether I can resolve any more.

Poking around with the smiles yesterday I found some reactions that did not behave as would expect. Don't know how often these occur though. Perhaps these things are rare.

There's an issue with redox reactions with small molecules. I guess since the hydrogens are not explicit in the SMILES. But from what I gather you've removed these by assigning them the reaction id of -1, so it's easy to filter. Also many of the isomerase reactions don't work and have the same SMILES on both sides of the reaction (in the new file).
Some of the SMILES reactions seem to have been transformed from what I would expect based on the BRENDA text. For example "C=CCS@CC@HC(=O)O>>C=C(N)C(=O)O.C=CCSO" (rxn_idx 297711). According to the original reaction there should be ammonia and pyruvate as two of the three products. But here they are combined into one molecule, which looks weird to me.
For the reaction below, there are the same smiles on both sides, even though the text indicates that it should not be so. It's also a reaction I know does occur in reality (the text one), 4-O-alpha-kojibiosyl-D-glucose gets the wrong SMILE. "O=CC@H C@@H C@H C@HCO.O=CC@H C@@H C@H C@HCO>>O=CC@H C@@H C@H C@HCO.O=CC@H C@@H C@H C@HCO" (rxn_idx 222748). The rule id is not -1.

Clearly there will always be some records that get incorrectly parsed and that's just the way things are. Just thought I would share my findings all the same in case they are indicative of something that can be improved/changed.

hesther commented 1 year ago

Thanks! Looks like I messed up when fixing the isomerase bug (thats the -1 ids), thanks for reporting, will try to get this fixed asap. I fixed the issue for the 200 wrong ones today, and will try to investigate what you were reporting

hesther commented 1 year ago

Regarding reactions with rule_id=-1: In general, that rule_id is assigned if the achiral reactants equal the achiral products. In general, this is necessary to correctly map cis/trans and D/L isomerase, because there is no according rule in the Broadbelt reaction set (which does not account for chirality). So, rule_id=-1 doesn't mean that the reaction is wrong per-se. Now, regarding redox-reactions, I have filtered all EC=1...* with rule_id=-1 and find 17 unique reactions:

8 are the same on reactant and product side because they are the same on both sides in BRENDA. In some of those cases, e.g. "NAD+ + NADH = NADH + NAD+", there obviously is a reaction which we are not detecting. However, these reactions might not be very interesting for a synthesis perspective anyways.
5 correspond to cis/trans isomerase reactions, and correspond to the respective BRENDA entry (although that sounds like a misclassification to me)
4 are the same on reactant/product, although the BRENDA entry is not the same, because sulfur and H2S, as well as sulfide all gets resolved to "H2S" when changing trivial names to SMILES, and another one resolves ferricytochrome c and ferrocytochrome c to the same SMILES. So for redox-reactions (EC=1...*) I didn't find anything unusual (although I agree they could have been filtered out). Did you mean other reactions? Since you mentioned explicit Hs, I somehow think that you are referring to other reactions? It should not matter whether the hydrogens are explicit or implicit, since a reduction/oxidation changes the number of hydrogens, and that is reflected in both version of SMILES (explicit/implicit Hs). We just do not track with hydrogen is which.

For isomerase reactions (filtered all EC=5..* with rule_id=-1) there are 1271 unique reactions. For 1163 of them, the SMILES is not the same on reactant and product side, but differs in stereochemistry, e.g. cis/trans or D/L, so that is correct from my point of view (maybe you did not mean those, or maybe you looked at achiral educts/products?). Many of those were wrong in version 1, because we didn't have the rule_id=-1 and then we got rules that changed some atom numbers which did not reflect reality. For the 108 remaining ones, the reactant and product side is the same. For some of them, the BRENDA entry also has the same educts/products. However, most of them actually do undergo some isomerisation in the BRENDA entry, but the names get resolved to the same SMILES. I am not sure how to deal with this in an automated way (apart from removing reactions were the educts equal the products altogether). Since these are not many reactions, it would also be doable to manually correct the SMILES in the next version.

So, I am glad that there is no problem in the mapping itself (so no bug introduced when I corrected the isomerase mappings), but the issues arise from resolving names to SMILES, or are already in BRENDA. Nevertheless, these should be corrected in the next version.

I will look at the other problems later today.

hesther commented 1 year ago

For the second question: The reaction could not be mapped via rule application. It therefore was assigned suggested products (the source='suggested' flag is set), meaning that the products do not reflect the original reaction text. Suggested products get assigned based on rules that are already in the same EC class. Here, for example reaction 319574 (rxn_idx=297591) says that 'S-(methylthiomethyl)cysteine 4-oxide = (methylthio)methane-sulfenic acid + 2-aminoacrylate' (nearly the same reactants apart from a side-chain far away from the reactants) - based on these and other examples the new products are suggested. Since there are many examples where only the S-C bond is cleaved, this reaction seems more likely than the one recorded in BRENDA. So I think this one is treated correctly. However, if you do not want suggested products, you can filter for them (in the 'source' tab).

hesther commented 1 year ago

For the third question. The mapping is also flagged as 'suggested', meaning that unfortunately the native BRENDA entry could not be mapped. So if you filter for suggested reactions, you would not see this one. I think I have to add a filter for suggested reactions if they produce the exact same product to not use them. So thanks a lot for bringing this up.

The reason that the reaction could not be mapped is that there is no balanced reaction (the product side is missing a water molecule). I.e., the BRENDA entry is 'D-glucose + kojibiose = 4-O-alpha-kojibiosyl-D-glucose {r}', but should have been 'D-glucose + kojibiose = 4-O-alpha-kojibiosyl-D-glucose + water {r}'. If I manually add the water, it resolved fine to the expected reaction. I will have to check whether other reactions are also affected by missing water molecules, maybe that could be added to the preprocessing.

hesther commented 1 year ago

In all cases, thank you very much for reporting! These are definitely things that can be improved in the next version, and I will happily add them. From what I have seen, these should not occur frequently, and not impact the overall usability/performance of EnzymeMap a lot, so for anyone using Version 2, they should not be a big problem.

I think you wanted to filter for suggested reactions anyways (and maybe it would also be good to filter everything with reactants=products for now), so most of these should not affect your work hopefully.

mengqvist commented 1 year ago

Regarding reactions with rule_id=-1: In general, that rule_id is assigned if the achiral reactants equal the achiral products. In general, this is necessary to correctly map cis/trans and D/L isomerase, because there is no according rule in the Broadbelt reaction set (which does not account for chirality). So, ruleid=-1 doesn't mean that the reaction is wrong per-se. Now, regarding redox-reactions, I have filtered all EC=1.._.* with rule_id=-1 and find 17 unique reactions:
* 8 are the same on reactant and product side because they are the same on both sides in BRENDA. In some of those cases, e.g. "NAD+ + NADH = NADH + NAD+", there obviously is a reaction which we are not detecting. However, these reactions might not be very interesting for a synthesis perspective anyways.

* 5 correspond to cis/trans isomerase reactions, and correspond to the respective BRENDA entry (although that sounds like a misclassification to me)

* 4 are the same on reactant/product, although the BRENDA entry is not the same, because sulfur and H2S, as well as sulfide all gets resolved to "H2S" when changing trivial names to SMILES, and another one resolves ferricytochrome c and ferrocytochrome c to the same SMILES.
  So for redox-reactions (EC=1._._.*) I didn't find anything unusual (although I agree they could have been filtered out). Did you mean other reactions? Since you mentioned explicit Hs, I somehow think that you are referring to other reactions? It should not matter whether the hydrogens are explicit or implicit, since a reduction/oxidation changes the number of hydrogens, and that is reflected in both version of SMILES (explicit/implicit Hs). We just do not track with hydrogen is which.
For isomerase reactions (filtered all EC=5*.. with rule_id=-1) there are 1271 unique reactions. For 1163 of them, the SMILES is not the same on reactant and product side, but differs in stereochemistry, e.g. cis/trans or D/L, so that is correct from my point of view (maybe you did not mean those, or maybe you looked at achiral educts/products?). Many of those were wrong in version 1, because we didn't have the rule_id=-1 and then we got rules that changed some atom numbers which did not reflect reality. For the 108 remaining ones, the reactant and product side is the same. For some of them, the BRENDA entry also has the same educts/products. However, most of them actually do undergo some isomerisation in the BRENDA entry, but the names get resolved to the same SMILES. I am not sure how to deal with this in an automated way (apart from removing reactions were the educts equal the products altogether). Since these are not many reactions, it would also be doable to manually correct the SMILES in the next version.

So, I am glad that there is no problem in the mapping itself (so no bug introduced when I corrected the isomerase mappings), but the issues arise from resolving names to SMILES, or are already in BRENDA. Nevertheless, these should be corrected in the next version.

I will look at the other problems later today.

Yes the redox things that looked are related to what you mentioned. A few examples below, all with reaction rule -1: S>>S NC@@HC(=O)O.S>>NC@@HC(=O)O.S COC1=C(OC)C(=O)C(C/C=C(\C)CC/C=C(\C)CC/C=C(\C)CC/C=C(\C)CC/C=C(\C)CC/C=C(\C)CC/C=C(\C)CC/C=C(\C)CC/C=C(\C)CCC=C(C)C)=C(C)C1=O.S>>COC1=C(OC)C(=O)C(C/C=C(\C)CC/C=C(\C)CC/C=C(\C)CC/C=C(\C)CC/C=C(\C)CC/C=C(\C)CC/C=C(\C)CC/C=C(\C)CC/C=C(\C)CCC=C(C)C)=C(C)C1=O.S

mengqvist commented 1 year ago

Many thanks for following up on all these different comments! Seems to me that the impact on the dataset itself is indeed quite small, and that these are edge cases. Suggest we close the issue.

hesther / enzymemap

SMILES to name mapping, missing info? #3