rxn4chemistry / rxnmapper

RXNMapper: Unsupervised attention-guided atom-mapping. Code complementing our Science Advances publication on "Extraction of organic chemistry grammar from unsupervised learning of chemical reactions" (https://advances.sciencemag.org/content/7/15/eabe4166).
http://rxnmapper.ai
MIT License

Error generating rxn_maps due to mismatch in array size #56

Closed starkAhmed43 closed 1 month ago

starkAhmed43 commented 1 month ago

Hello,

I am trying to generate atom mappings for some 30,000 reaction SMILES built from reactions in the Rhea database. I download the mol files of the participating metabolites from ChEBI, generate SMILES with RDKit, and then concatenate them using . and >> to form reaction SMILES.
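The concatenation step described above can be sketched as follows. This is a minimal illustration, not the poster's actual script: the function name `build_reaction_smiles` is hypothetical, and the conversion from mol files to SMILES (done with RDKit in the workflow described) is left out, so the inputs are assumed to be SMILES strings already.

```python
from typing import Iterable


def build_reaction_smiles(reactants: Iterable[str], products: Iterable[str]) -> str:
    """Join molecule SMILES with '.' and separate the two sides with '>>'.

    Hypothetical helper: it only concatenates strings and does not
    validate the chemistry.
    """
    reactant_list = [s for s in reactants if s]  # drop empty entries
    product_list = [s for s in products if s]
    if not reactant_list or not product_list:
        raise ValueError("a reaction needs at least one reactant and one product")
    return ".".join(reactant_list) + ">>" + ".".join(product_list)


# Molecules taken from one of the reactions posted later in this thread.
rxn = build_reaction_smiles(
    ["*N[C@@H](CCC(N)=O)C(=O)[O-]", "O"],
    ["*N[C@@H](CCC(=O)[O-])C(=O)[O-]", "[NH4+]"],
)
print(rxn)
```

Rejecting an empty side early avoids handing strings such as `>>O` to the mapper, where any failure would be harder to interpret.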

Using BatchedMapper, I am able to successfully generate atom maps for 24,600 of them. For the remaining 5,000 or so, when I use RXNMapper to identify the cause of the error, I get a few different error types:

Error 1: index X is out of bounds for axis 0 with size X
Error 2: could not broadcast input array from shape (A,) into shape (B,)
Error 3: The size of tensor a (D) must match the size of tensor b (512) at non-singleton dimension 1

Error 2 accounts for the vast majority of unsuccessful maps.
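A triage of this kind (run each failing reaction on its own, then bucket the exception by message) could look roughly like the sketch below. The labels, `classify_error`, `triage`, and the stand-in `fake_mapper` are all hypothetical names for illustration. The real mapper is not called here; the stand-in only raises a message with the same shape as the errors listed above, and its shape numbers are made up.

```python
import re
from collections import Counter
from typing import Callable, Dict, Iterable, Tuple

# One pattern per message shape listed above; the labels are hypothetical.
ERROR_PATTERNS = [
    ("index_out_of_bounds",
     re.compile(r"index \d+ is out of bounds for axis \d+ with size \d+")),
    ("broadcast_mismatch",
     re.compile(r"could not broadcast input array from shape")),
    ("tensor_size_mismatch",
     re.compile(r"size of tensor a \(\d+\) must match the size of tensor b \(\d+\)")),
]


def classify_error(message: str) -> str:
    """Return the label of the first matching pattern, or 'other'."""
    for label, pattern in ERROR_PATTERNS:
        if pattern.search(message):
            return label
    return "other"


def triage(reactions: Iterable[str],
           map_one: Callable[[str], object]) -> Tuple[Dict[str, str], Counter]:
    """Call map_one on each reaction and record the error class of each failure."""
    failures: Dict[str, str] = {}
    for rxn in reactions:
        try:
            map_one(rxn)
        except Exception as exc:  # broad on purpose: we want every failure mode
            failures[rxn] = classify_error(str(exc))
    return failures, Counter(failures.values())


def fake_mapper(rxn: str) -> str:
    """Stand-in for the real mapper; the shape numbers in the message are invented."""
    if "*" in rxn:
        raise ValueError("could not broadcast input array from shape (12,) into shape (11,)")
    return rxn


failures, counts = triage(["CCO>>CC=O", "*C>>*C"], fake_mapper)
print(counts)
```

In practice `map_one` would wrap a single-reaction call to the real mapper, for example `get_attention_guided_atom_maps` on an `RXNMapper` instance as shown in the project README. That wiring is an assumption here and is not exercised by the sketch.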

I come from a data science background and have no chemistry knowledge, so while I can read the error messages, I do not understand what is causing them.

It would be really great if the authors of RXNMapper could help me out here. Please let me know if you need more details regarding the errors I am getting.

avaucher commented 1 month ago

Hi, I agree that the error messages are sometimes cryptic. Is it possible for you to share a few of these reactions?

My best guess is that the reaction SMILES strings are too long for the current model, but happy to check directly.

starkAhmed43 commented 1 month ago

Hi. Thank you so much for helping me out with this issue. Here are some SMILES examples for the errors I've mentioned above:

Error 1: index X is out of bounds for axis 0 with size X

[1*][C@@H]1O[C@H](CO)[C@@H](OP(=O)([O-])[O-])[C@H]1O.O>>*[C@@H]1O[C@H](CO)[C@@H](O)[C@H]1O.O=P([O-])([O-])O

[1*]C(=O)C(=O)[O-].[H+].NC(=O)C1=CN([C@@H]2O[C@H](COP(=O)([O-])OP(=O)([O-])OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](O)[C@@H]3O)[C@@H](O)[C@H]2O)C=CC1.[NH4+]>>*[C@H]([NH3+])C(=O)[O-].O.NC(=O)c1ccc[n+]([C@@H]2O[C@H](COP(=O)([O-])OP(=O)([O-])OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](O)[C@@H]3O)[C@@H](O)[C@H]2O)c1

[1*]C([2*])=O.[H+].NC(=O)C1=CN([C@@H]2O[C@H](COP(=O)([O-])OP(=O)([O-])OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](O)[C@@H]3O)[C@@H](O)[C@H]2O)C=CC1>>*C(*)O.NC(=O)c1ccc[n+]([C@@H]2O[C@H](COP(=O)([O-])OP(=O)([O-])OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](O)[C@@H]3O)[C@@H](O)[C@H]2O)c1

[NH3+][C@H]1[C@@H](O[C@H]2[C@H](O)[C@H](O)[C@H](O)[C@@H](O)[C@@H]2O)O[C@H](CO)[C@@H](O)[C@@H]1O.[1*]C(=O)OC[C@H](COP(=O)([O-])[O-])OC([2*])=O.[H+]>>*C(=O)OC[C@H](COP(=O)([O-])O[C@@H]1[C@H](O)[C@H](O)[C@@H](O)[C@H](O)[C@H]1O[C@H]1O[C@H](CO)[C@@H](O)[C@H](O)[C@H]1[NH3+])OC(*)=O.O

Error 2: could not broadcast input array from shape (A,) into shape (B,)

*N[C@@H](CCC(N)=O)C(=O)[O-].O>>*N[C@@H](CCC(=O)[O-])C(=O)[O-].[NH4+]

*[C@H]1C[C@H](OP(=O)([O-])[O-])[C@@H](CO)O1.O>>*[C@H]1C[C@H](O)[C@@H](CO)O1.O=P([O-])([O-])O

*N[C@@H](COP(=O)([O-])OCC(C)(C)[C@@H](O)C(=O)NCCC(=O)NCCSC(*)=O)C(*)=O.C[S+](CC[C@H]([NH3+])C(=O)[O-])C[C@H]1O[C@@H](n2cnc3c(N)ncnc32)[C@H](O)[C@@H]1O>>*C(=O)N[C@H]1CCOC1=O.[H+].*N[C@@H](COP(=O)([O-])OCC(C)(C)[C@@H](O)C(=O)NCCC(=O)NCCS)C(*)=O.CSC[C@H]1O[C@@H](n2cnc3c(N)ncnc32)[C@H](O)[C@@H]1O

*C(=O)OC1CCC2(C)C(CCC3C4CCC(*)C4(C)CCC32)C1.O>>*C(=O)[O-].*C1CCC2C3CCC4CC(O)CCC4(C)C3CCC12C.[H+]

Error 3: The size of tensor a (D) must match the size of tensor b (512) at non-singleton dimension 1

CC(=O)N[C@H]1[C@@H](OP(=O)([O-])OP(=O)([O-])OC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(\\C)CC/C=C(\\C)CCC=C(C)C)O[C@H](CO)[C@@H](O[C@@H]2O[C@H](CO)[C@@H](OP(=O)([O-])OC[C@H](O)COP(=O)([O-])OC[C@@H](O)[C@@H](O)[C@@H](O)COP(=O)([O-])OC[C@@H](O)[C@@H](O)[C@@H](O)CO)[C@H](O)[C@H]2NC(C)=O)[C@@H]1O.O=c1ccn([C@@H]2O[C@H](COP(=O)([O-])OP(=O)([O-])O[C@H]3O[C@H](CO)[C@@H](O)[C@H](O)[C@H]3O)[C@@H](O)[C@H]2O)c(=O)[nH]1>>CC(=O)N[C@H]1[C@@H](OP(=O)([O-])OP(=O)([O-])OC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(\\C)CC/C=C(\\C)CCC=C(C)C)O[C@H](CO)[C@@H](O[C@@H]2O[C@H](CO)[C@@H](OP(=O)([O-])OC[C@H](O)COP(=O)([O-])OC[C@@H](O)[C@@H](O)[C@@H](O)COP(=O)([O-])OC[C@@H](O[C@@H]3O[C@H](CO)[C@@H](O)[C@H](O)[C@H]3O)[C@@H](O)[C@@H](O)CO)[C@H](O)[C@H]2NC(C)=O)[C@@H]1O.[H+].O=c1ccn([C@@H]2O[C@H](COP(=O)([O-])OP(=O)([O-])[O-])[C@@H](O)[C@H]2O)c(=O)[nH]1

CC(C)(COP(=O)([O-])OP(=O)([O-])OC[C@H]1O[C@@H](n2cnc3c(N)ncnc32)[C@H](O)[C@@H]1OP(=O)([O-])[O-])[C@@H](O)C(=O)NCCC(=O)NCCSC(=O)C=Cc1ccc(O)cc1.[H+].CC(C)(COP(=O)([O-])OP(=O)([O-])OC[C@H]1O[C@@H](n2cnc3c(N)ncnc32)[C@H](O)[C@@H]1OP(=O)([O-])[O-])[C@@H](O)C(=O)NCCC(=O)NCCSC(=O)CC(=O)[O-].NC(=O)C1=CN([C@@H]2O[C@H](COP(=O)([O-])OP(=O)([O-])OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](OP(=O)([O-])[O-])[C@@H]3O)[C@@H](O)[C@H]2O)C=CC1>>O=C=O.CC(C)(COP(=O)([O-])OP(=O)([O-])OC[C@H]1O[C@@H](n2cnc3c(N)ncnc32)[C@H](O)[C@@H]1OP(=O)([O-])[O-])[C@@H](O)C(=O)NCCC(=O)NCCS.O.O=C(/C=C/c1ccc(O)cc1)c1ccc(O)cc1[O-].NC(=O)c1ccc[n+]([C@@H]2O[C@H](COP(=O)([O-])OP(=O)([O-])OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](OP(=O)([O-])[O-])[C@@H]3O)[C@@H](O)[C@H]2O)c1

Nc1ncnc2c1ncn2[C@@H]1O[C@H](COP(=O)([O-])OP(=O)([O-])OP(=O)([O-])[O-])[C@@H](O)[C@H]1O.CC1=C2[N-][C@H]([C@H](CC(=O)[O-])[C@@]2(C)CCC(=O)[O-])[C@]2(C)N=C(C(C)=C3N=C(C=C4N=C1[C@@H](CCC(=O)[O-])C4(C)C)[C@@H](CCC(=O)[O-])[C@]3(C)CC(N)=O)[C@@H](CCC(=O)[O-])[C@]2(C)CC(N)=O.[Co+2].Cc1cc2c(cc1C)N(C[C@H](O)[C@H](O)[C@H](O)COP(=O)([O-])OP(=O)([O-])OC[C@H]1O[C@@H](n3cnc4c(N)ncnc43)[C@H](O)[C@@H]1O)c1[nH]c(=O)[nH]c(=O)c1N2>>C/C1=C2/[N-][C@H]([C@H](CC(=O)[O-])[C@@]2(C)CCC(=O)[O-])[C@]2(C)N=C(/C(C)=C3\\N=C(/C=C4\\N=C1[C@@H](CCC(=O)[O-])C4(C)C)[C@@H](CCC(=O)[O-])[C@]3(C)CC([NH-])=O)[C@@H](CCC(=O)[O-])[C@]2(C)CC(N)=O.C[C@H]1O[C@@H](n2cnc3c(N)ncnc32)[C@H](O)[C@@H]1O.[Co+3].[H+].Cc1cc2nc3c(=O)[n-]c(=O)nc-3n(C[C@H](O)[C@H](O)[C@H](O)COP(=O)([O-])OP(=O)([O-])OC[C@H]3O[C@@H](n4cnc5c(N)ncnc54)[C@H](O)[C@@H]3O)c2cc1C.O=P([O-])([O-])OP(=O)([O-])OP(=O)([O-])[O-]

CC(=O)N[C@@H]1[C@H](O[C@H]2[C@H](O)[C@@H](NC(C)=O)[C@@H](OP(=O)([O-])OP(=O)([O-])OC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(\\C)CC/C=C(\\C)CCC=C(C)C)O[C@@H]2CO)O[C@H](CO)[C@@H](OP(=O)([O-])OC[C@H](O)COP(=O)([O-])OC[C@H](O)CO)[C@@H]1O.Nc1ccn([C@@H]2O[C@H](COP(=O)([O-])OP(=O)([O-])OC[C@@H](O)[C@@H](O)[C@@H](O)CO)[C@@H](O)[C@H]2O)c(=O)n1>>CC(=O)N[C@H]1[C@@H](OP(=O)([O-])OP(=O)([O-])OC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(/C)CC/C=C(\\C)CC/C=C(\\C)CCC=C(C)C)O[C@H](CO)[C@@H](O[C@@H]2O[C@H](CO)[C@@H](OP(=O)([O-])OC[C@H](O)COP(=O)([O-])OC[C@H](O)COP(=O)([O-])OC[C@@H](O)[C@@H](O)[C@@H](O)CO)[C@H](O)[C@H]2NC(C)=O)[C@@H]1O.Nc1ccn([C@@H]2O[C@H](COP(=O)([O-])[O-])[C@@H](O)[C@H]2O)c(=O)n1.[H+]

avaucher commented 1 month ago

Thanks for the examples!

I started looking into it and will shortly have a fix for Errors 1 and 2. These were caused by the presence of asterisks in the reaction SMILES, which sometimes stand for atom placeholders. After the fix, these reactions should succeed.
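For anyone on a version without the fix, a quick way to see which reactions are affected is to check for the `*` character, which SMILES uses for the wildcard (placeholder) atom, including bracketed forms such as `[1*]`. The sketch below is illustrative only: `has_wildcard` and `split_by_wildcard` are hypothetical helper names, and the second example reaction is made up for contrast.

```python
from typing import Iterable, List, Tuple


def has_wildcard(rxn_smiles: str) -> bool:
    """True if the string contains '*', the SMILES wildcard atom."""
    return "*" in rxn_smiles


def split_by_wildcard(reactions: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Partition reactions into (with wildcard, without wildcard)."""
    with_wildcard: List[str] = []
    without_wildcard: List[str] = []
    for rxn in reactions:
        target = with_wildcard if has_wildcard(rxn) else without_wildcard
        target.append(rxn)
    return with_wildcard, without_wildcard


examples = [
    # Posted in this thread under Error 2.
    "*[C@H]1C[C@H](OP(=O)([O-])[O-])[C@@H](CO)O1.O>>*[C@H]1C[C@H](O)[C@@H](CO)O1.O=P([O-])([O-])O",
    # Made-up reaction without placeholders, for contrast.
    "CC(=O)O.OCC>>CC(=O)OCC.O",
]
flagged, clean = split_by_wildcard(examples)
print(len(flagged), len(clean))
```

A plain substring test is enough because `*` has no other meaning in SMILES.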

Error 3 seems to be caused by reaction SMILES that are too long. I'll improve the error message for these.
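A rough pre-filter for overly long inputs might look like the sketch below. It rests on several assumptions: the regular expression is the SMILES tokenization pattern commonly used for transformer reaction models, which may not match the model's tokenizer exactly; the limit of 512 is taken from the error message in this thread; and `reserved=2` is a guess at the number of special tokens added around the input. Treat the result as an estimate.

```python
import re

# Widely used SMILES tokenization pattern (assumed, not confirmed, to
# approximate the tokenizer used by the model).
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
)

MAX_TOKENS = 512  # value reported in the Error 3 message


def count_tokens(rxn_smiles: str) -> int:
    """Approximate token count; characters the pattern does not know are ignored."""
    return len(SMILES_TOKEN_PATTERN.findall(rxn_smiles))


def probably_too_long(rxn_smiles: str, limit: int = MAX_TOKENS, reserved: int = 2) -> bool:
    """Estimate whether a reaction exceeds the limit.

    `reserved` is an assumed allowance for special tokens added by the model.
    """
    return count_tokens(rxn_smiles) + reserved > limit


print(count_tokens("CC(=O)O>>CCO"))
print(probably_too_long("C" * 600))
```

Reactions flagged this way can be set aside before batching, so that the remaining failures point to other causes.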

avaucher commented 1 month ago

Errors 1 and 2 fixed by #57.

avaucher commented 1 month ago

Error message for 3 improved in #58.

avaucher commented 1 month ago

The new version on PyPI (0.4.0) should cover all the above! Feel free to reopen this issue if needed.