hesther / enzymemap

Python package to atom map, correct and suggest enzymatic reactions
MIT License

reaction_string corresponds to multiple ec #9

Closed yufengwhy closed 1 month ago

yufengwhy commented 2 months ago

Is it reasonable that a single reaction_string corresponds to multiple EC numbers in the provided data file cached_enzymemap.p?

Such as:

O.O=[N+]([O-])c1ccc(OP(=O)(O)O)cc1>>O=P(O)(O)O.O=[N+]([O-])c1ccc(O)cc1 {'3.1.3.5', '3.1.3.41', '3.1.1.2', '3.1.8.1', '3.1.3.26', '3.1.3.8', '3.1.3.2', '3.6.1.1', '3.1.4.46', '3.1.3.21', '3.9.1.3', '3.1.3.25', '3.1.3.23', '3.1.6.6', '3.1.4.16', '3.1.3.9', '3.1.3.16', '3.1.3.62', '3.1.3.89', '3.9.1.2', '3.6.1.9', '3.1.3.75', '3.1.3.85', '3.1.3.48', '3.1.3.3', '3.1.3.1', '3.1.3.18', '3.1.3.58', '3.1.6.1', '3.1.3.73'} 30
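
A count like the one above could be reproduced roughly as follows. This is a minimal sketch, assuming a CSV with the columns `unmapped` and `ec_num` (as in enzymemap_v2_brenda2023.csv); the inline table is made-up toy data, and `ecs_per_reaction` is a hypothetical helper, not part of EnzymeMap.

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for the real file: only the two columns used here are included.
# The second reaction is a simplified placeholder, not a real dataset entry.
TOY_CSV = """unmapped,ec_num
O.O=[N+]([O-])c1ccc(OP(=O)(O)O)cc1>>O=P(O)(O)O.O=[N+]([O-])c1ccc(O)cc1,3.1.3.1
O.O=[N+]([O-])c1ccc(OP(=O)(O)O)cc1>>O=P(O)(O)O.O=[N+]([O-])c1ccc(O)cc1,3.1.3.2
O.O=[N+]([O-])c1ccc(OP(=O)(O)O)cc1>>O=P(O)(O)O.O=[N+]([O-])c1ccc(O)cc1,3.1.3.1
CC=O>>CCO,1.1.1.1
"""


def ecs_per_reaction(rows):
    """Map each reaction string to the set of EC numbers it appears with."""
    mapping = defaultdict(set)
    for row in rows:
        mapping[row["unmapped"]].add(row["ec_num"])
    return dict(mapping)


rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
rxn_to_ecs = ecs_per_reaction(rows)

# Keep only reactions that are listed under more than one EC number.
multi = {rxn: ecs for rxn, ecs in rxn_to_ecs.items() if len(ecs) > 1}
for rxn, ecs in multi.items():
    print(rxn, sorted(ecs), len(ecs))
```

Using a set per reaction means that repeated rows (same reaction, same EC, different organism) are counted only once.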

hesther commented 1 month ago

Hi @yufengwhy, we process everything that is in BRENDA. For example, BRENDA reports 591 substrate/product pairs, as well as 64 natural substrates, for EC 3.1.3.5 (see here). One of them is the reaction '4-nitrophenyl phosphate + H2O = 4-nitrophenol + phosphate' that you described above. I have also manually checked a few of the other EC numbers, and they all report that reaction. In general, the EC numbers in EnzymeMap all come directly from BRENDA.

If you look for reactions in BRENDA, always make sure to go to the tab "Enzyme-Ligand Interaction" and then click both "Substrate/Product" and "Natural Substrates" to see all reactions.

yufengwhy commented 1 month ago

Thanks @hesther for the detailed reply. I would like to discuss this: an EC number is assigned to an enzyme, not to a reaction. So when we choose an EC number for an enzyme-reaction pair, we should choose the most suitable EC number of that enzyme to represent the pair, for both the enzyme and the reaction, right?

hesther commented 1 month ago

Hi @yufengwhy, I am not sure whether I understand your question correctly. The EC number is a classification number that depends on the main catalyzed reaction (and sometimes on cofactors, mechanism, and details of the protein). It clusters together several actual proteins from several organisms. So if you want to choose a protein for a reaction, you would want to choose the most promising EC, but also the most promising protein within that EC class. And for many reactions, there will be more than one EC number whose enzymes might be able to catalyze that reaction.

yufengwhy commented 1 month ago

@hesther Exactly: "EC number is a classification number that depends on the main catalyzed reaction". So how should one choose a single EC for an enzyme-reaction pair if the enzyme has more than one EC?

hesther commented 1 month ago

@yufengwhy If you want to treat it as a classification, they are all true. If you want to find the "best" one, you would have to look at the kinetics or yields.

yufengwhy commented 1 month ago

@hesther For an enzyme-reaction pair, the reactants and products of the reaction are fixed, so the reaction type, and therefore the EC, should also be fixed. Shouldn't there be only one true EC, rather than "all true"?

hesther commented 1 month ago

@yufengwhy Are you thinking about a classification model? The classes are not mutually exclusive in the case of EC numbers, so you need to account for that in your loss function. For enzymes, there is no "only one is true".
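
As a minimal sketch of what a loss for non-exclusive classes could look like: encode the EC numbers of a reaction as a multi-hot vector and score each class independently with binary cross-entropy. This is one common choice, not something prescribed by EnzymeMap; `multi_hot`, `bce_loss`, and the predicted probabilities below are hypothetical and only for illustration.

```python
import math


def multi_hot(labels, classes):
    """Encode a set of labels as a 0/1 vector over a fixed class list."""
    return [1.0 if c in labels else 0.0 for c in classes]


def bce_loss(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over independent per-class probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(targets)


classes = ["3.1.3.1", "3.1.3.2", "3.1.3.5", "1.1.1.1"]
# The reaction is listed under three of the four EC numbers.
target = multi_hot({"3.1.3.1", "3.1.3.2", "3.1.3.5"}, classes)

spread = [0.9, 0.8, 0.7, 0.1]        # assigns mass to all true classes
peaked = [0.97, 0.01, 0.01, 0.01]    # behaves like a single-label prediction

print(bce_loss(spread, target))
print(bce_loss(peaked, target))
```

With a softmax plus categorical cross-entropy, the probabilities would be forced to compete; the per-class formulation lets several EC numbers be "true" at once, so the peaked prediction is penalized for missing two valid classes.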

yufengwhy commented 1 month ago

The enzyme-EC relationship can be one-to-many, but for an enzyme-reaction pair, shouldn't the reaction belong to one specific class? To achieve this, I use "the top-three-level EC with the most occurrences" as that class for the reaction, even though reaction-EC is also one-to-many in the dataset. Do you think this approach is correct?
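
The heuristic described here could be sketched as follows. This is only an illustration under assumptions: `top_three_level_ec` is a hypothetical helper, the input is assumed to be the list of `ec_num` values of all dataset rows for one reaction (one entry per row), and ties are resolved by whichever class was seen first.

```python
from collections import Counter


def top_three_level_ec(ec_numbers):
    """Return the most frequent EC class truncated to its first three levels."""
    if not ec_numbers:
        raise ValueError("need at least one EC number")
    truncated = [".".join(ec.split(".")[:3]) for ec in ec_numbers]
    # most_common(1) returns [(class, count)]; ties go to the first class seen.
    return Counter(truncated).most_common(1)[0][0]


# A few of the EC numbers reported for the 4-nitrophenyl phosphate reaction.
ecs = ["3.1.3.5", "3.1.3.41", "3.1.1.2", "3.1.8.1", "3.6.1.1"]
print(top_three_level_ec(ecs))  # prints 3.1.3
```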

hesther commented 1 month ago

No, I think the correct way would be a multi-label classification. If you must do a single-label classification at any cost, your plan is probably OK, but I don't see how that would be useful in a real-world scenario. There are already good algorithms that predict not only the EC number for a reaction, but even proteins.

yufengwhy commented 1 month ago

Let me rephrase the question: a statistical analysis of the enzyme-EC-reaction triplets in EnzymeMap shows that both the enzyme-EC and the reaction-EC relationships are one-to-many. If a model predicts the EC categories of an enzyme and of a reaction, are all of these multiple EC categories correct? (We only consider whether the modeling is reasonable.)

hesther commented 1 month ago

In an enzyme-EC-reaction triplet, what exactly would the "enzyme" be? Sequence? Organism?

yufengwhy commented 1 month ago

In an enzyme-EC-reaction triplet, the enzyme is the Uniprot-ID. Each line of enzymemap_v2_brenda2023.csv provides a Uniprot-ID (protein_refs), an EC number (ec_num), and a reaction text (orig_rxn_text):

rxn_idx,mapped,unmapped,orig_rxn_text,rule,rule_id,source,steps,quality,natural,organism,protein_refs,protein_db,ec_num
1,[CH3:1][CH:2]=[O:3].[H+].[NH2:4]C:5[C:7]1=[CH:8]N:9C@H:37[C@@H:39]3[OH:40])C@@H:41[C@H:43]2[OH:44])[CH:45]=[CH:46][CH2:47]1>>[CH3:1][CH2:2][OH:3].[NH2:4]C:5[c:7]1[cH:8]n+:9C@H:37[C@@H:39]3[OH:40])C@@H:41[C@H:43]2[OH:44])[cH:45][cH:46][cH:47]1,CC=O.NC(=O)C1=CN([C@@H]2OC@HC@H[C@@H]3O)C@@H[C@H]2O)C=CC1.[H+]>>CCO.NC(=O)c1cccn+C@H[C@@H]3O)C@@H[C@H]2O)c1,acetaldehyde + NADH + H+ = ethanol + NAD+ {r},[#6:1]1=[#6:2]-[#7:3]-[#6:4]=[#6:5]-[#6:6]-1.[#6:7]=[#8:8]>>[#6:7]-[#8:8].[#6:1]1:[#6:6]:[#6:5]:[#6:4]:[#7+:3]:[#6:2]:1,0,direct,single,0.9917081260364844,True,Ogataea angusta,['H9ZGN0'],uniprot,1.1.1.1

hesther commented 1 month ago

Then, for a given Uniprot-ID, there is only one correct EC, namely the one recorded in that line.

yufengwhy commented 1 month ago

Yes, but a given Uniprot-ID may appear in multiple lines with multiple ECs, and the same applies to reactions. If the model predicts the EC categories of a Uniprot-ID and of a reaction, are all of these multiple EC categories correct?

hesther commented 1 month ago

Then that comes from BRENDA, and you would have to ask the BRENDA developers about it. Or, you could go through a few examples, look through the literature references given in BRENDA, and cross-check how the Uniprot-IDs are classified.

As I said, having multiple ECs associated to a reaction is very much expected. Having multiple ECs associated to a Uniprot-ID is not so expected (but we take that directly from BRENDA without verification), so I cannot really help with that other than advising you to check a few entries manually. ECs are also not set in stone, and quite often enzymes get reclassified after a few years.
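
Candidates for such a manual check could be collected with a few lines of Python. This is a rough sketch under assumptions: `protein_refs` is taken to be a Python-style list stored as a string (as in the line quoted above), rows with a `protein_db` other than `uniprot` are skipped, `ecs_per_uniprot` is a hypothetical helper, and the `TOY...` accessions are made-up placeholders.

```python
import ast
from collections import defaultdict


def ecs_per_uniprot(rows):
    """Map each UniProt accession to the set of EC numbers it is listed under."""
    mapping = defaultdict(set)
    for row in rows:
        if row["protein_db"] != "uniprot":
            continue  # assumption: only UniProt-referenced rows are of interest
        # protein_refs looks like "['H9ZGN0']"; parse it safely into a list.
        for accession in ast.literal_eval(row["protein_refs"]):
            mapping[accession].add(row["ec_num"])
    return dict(mapping)


# Toy rows; in practice these would come from csv.DictReader over the CSV file.
rows = [
    {"protein_refs": "['H9ZGN0']", "protein_db": "uniprot", "ec_num": "1.1.1.1"},
    {"protein_refs": "['TOY001', 'TOY002']", "protein_db": "uniprot", "ec_num": "3.1.3.1"},
    {"protein_refs": "['TOY001']", "protein_db": "uniprot", "ec_num": "3.1.3.2"},
]

by_accession = ecs_per_uniprot(rows)
# Accessions listed under more than one EC are the ones worth checking by hand.
ambiguous = {acc: ecs for acc, ecs in by_accession.items() if len(ecs) > 1}
for acc, ecs in ambiguous.items():
    print(acc, sorted(ecs))
```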

yufengwhy commented 1 month ago

"having multiple ECs associated to a reaction is very much expected. Having multiple ECs associated to a Uniprot-ID is not so expected"

Would you kindly give some references? I am a beginner and very curious about that. I had the opposite idea before: I thought a Uniprot-ID would be more likely to have multiple ECs ...

hesther commented 1 month ago

I think I already answered that above: most enzymes are promiscuous and process more than one substrate, so finding the same substrate in a different EC is rather common. However, once an enzyme is classified by the Enzyme Commission based on its main reaction, why should it also be assigned to a second class? I can't point you to a specific paper for that, but I think your research or project would benefit from reading a few review articles on enzyme function and classification. Anyway, I wish you all the best for your project, but I am closing the issue here, since I think your initial question was answered and I unfortunately do not have the time or resources to provide detailed insights on topics not directly related to EnzymeMap.