rxn4chemistry / rxnmapper

RXNMapper: Unsupervised attention-guided atom-mapping. Code complementing our Science Advances publication on "Extraction of organic chemistry grammar from unsupervised learning of chemical reactions" (https://advances.sciencemag.org/content/7/15/eabe4166).
http://rxnmapper.ai
MIT License

Error while generating Atom-atom mapping #40

Closed parit closed 1 year ago

parit commented 1 year ago

Hello,

I am trying to generate AAM for the reaction https://www.rhea-db.org/rhea/56485.

smiles="O=O.O=O.O=O.CC1(C)CC[C@@]2(CC[C@]3(C)C(=CC[C@@H]4[C@@]5(C)CC[C@H](O)C(C)(C)[C@@H]5CC[C@@]34C)[C@@H]2C1)C([O-])=O.Cc1cc2Nc3c([nH]c(=O)[nH]c3=O)N(C[C@H](O)[C@H](O)[C@H](O)COP([O-])([O-])=O)c2cc1C.Cc1cc2Nc3c([nH]c(=O)[nH]c3=O)N(C[C@H](O)[C@H](O)[C@H](O)COP([O-])([O-])=O)c2cc1C.Cc1cc2Nc3c([nH]c(=O)[nH]c3=O)N(C[C@H](O)[C@H](O)[C@H](O)COP([O-])([O-])=O)c2cc1C>>CC1(C)CC[C@@]2(CC[C@]3(C)C(=CC[C@@H]4[C@@]5(C)CC[C@H](O)[C@](C)([C@@H]5CC[C@@]34C)C([O-])=O)[C@@H]2C1)C([O-])=O.[H+].[H+].[H+].[H+].O.O.O.O.Cc1cc2nc3c(nc(=O)[n-]c3=O)n(C[C@H](O)[C@H](O)[C@H](O)COP([O-])([O-])=O)c2cc1C.Cc1cc2nc3c(nc(=O)[n-]c3=O)n(C[C@H](O)[C@H](O)[C@H](O)COP([O-])([O-])=O)c2cc1C.Cc1cc2nc3c(nc(=O)[n-]c3=O)n(C[C@H](O)[C@H](O)[C@H](O)COP([O-])([O-])=O)c2cc1C"
from rxnmapper import RXNMapper  # import needed for the snippet to run

print(len(smiles))
mapper = RXNMapper()
tokens = mapper.tokenize_for_model(smiles)
print(len(tokens))
mapped = mapper.get_attention_guided_atom_maps([smiles])
print(mapped)

The tokenized sequence length comes out to be 504, but I still get the error: "Token indices sequence length is longer than the specified maximum sequence length for this model (513 > 512). Running this sequence through the model will result in indexing errors"

Could someone please check why the token count is 513 and not 504?
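For anyone wanting to inspect the token count themselves without loading the model: a minimal sketch of regex-based SMILES tokenization, using the standard pattern from the Molecular Transformer line of work (rxnmapper's internal tokenizer is based on the same family of pattern, but may differ in detail, e.g. added special tokens):

```python
import re

# Standard SMILES tokenization regex (Molecular Transformer-style models).
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+"
    r"|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list:
    """Split a SMILES (or reaction SMILES) string into tokens."""
    return SMILES_REGEX.findall(smiles)

# Small example: acetic acid tokenizes into 7 tokens.
print(tokenize_smiles("CC(=O)O"))       # ['C', 'C', '(', '=', 'O', ')', 'O']
print(len(tokenize_smiles("CC(=O)O")))  # 7
```

Running this on the reaction SMILES above (before and after canonicalization) shows where the extra tokens come from.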

avaucher commented 1 year ago

Hi @parit

The reason is that the reaction SMILES is canonicalized first, which increases the number of tokens to 513.

You will see that doing

mapped = mapper.get_attention_guided_atom_maps([smiles], canonicalize_rxns=False)

will not fail.

In general, however, it is advisable to leave canonicalization enabled.
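To illustrate why canonicalization can change the token count: two SMILES spellings of the same molecule can tokenize to different lengths. A minimal, self-contained sketch using the standard SMILES tokenization regex (the isobutane spellings below are assumptions for illustration; "CC(C)C" is what RDKit typically emits as the canonical form):

```python
import re

# Standard SMILES tokenization regex (Molecular Transformer-style models;
# rxnmapper's tokenizer may differ in detail).
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+"
    r"|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
)

# Two SMILES spellings of isobutane: same molecule, different token counts.
raw = "C(C)(C)C"      # tokenizes to 8 tokens
canonical = "CC(C)C"  # tokenizes to 6 tokens

print(len(SMILES_REGEX.findall(raw)))        # 8
print(len(SMILES_REGEX.findall(canonical)))  # 6
```

Here canonicalization shortens the string, but the effect can go either way; for the Rhea reaction above it pushed the count from 504 to 513, past the model's 512-token limit.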