Can not reproduce the result of Fig 4.

jaewanlee93 commented 1 year ago

Hi all, I’m hoping to get some help reproducing the result of Figure 4(a) in the paper "Extraction of organic chemistry grammar from unsupervised learning of chemical reactions". I want to use rxnmapper in a pipeline, so I first tested its performance by trying to reproduce Figure 4(a). The accuracy I get is lower than I expected, so I have two questions.

Preprocessing of the USPTO data. I downloaded the data from https://ibm.box.com/v/RXNMapperData and used the ‘test_natcomm.json’ file.
That file contains ‘rxn’ and ‘CORRECT MAPPING’ entries. I used the ‘CORRECT MAPPING’ values as ground truth and the ‘rxn’ values as input to the rxnmapper model, then compared the rxnmapper outputs against the ‘CORRECT MAPPING’ values to get the accuracy.

import json

from rxnmapper import RXNMapper
from rxnmapper.smiles_utils import process_reaction_with_product_maps_atoms

with open('./RXNMapperData/Test/test_natcomm.json') as f:
    f_ = json.load(f)

# Assumed setup: the original snippet did not show how rxn_mapper was created
rxn_mapper = RXNMapper()

# f_['CORRECT MAPPING']['24']
mapped_rxn = '[CH3:1][CH2:2][n:3]1[c:4](-[c:13]2[cH:14][cH:15][cH:16][c:17]3[cH:18][cH:19][cH:20][cH:21][c:22]23)[n:5][c:6]([F:7])[c:8]1[Si:9]([CH3:10])([CH3:11])[CH3:12]>>[CH3:1][CH2:2][n:3]1[c:4](-[c:13]2[cH:14][cH:15][cH:16][c:17]3[cH:18][cH:19][cH:20][cH:21][c:22]23)[n:5][c:6]([F:7])[cH:8]1.[CH4:10].[CH4:11].[CH4:12].[SiH4:9]'
# f_['rxn']['24']
rxn = 'CCn1c(-c2cccc3ccccc23)nc(F)c1[Si](C)(C)C>>C.C.C.CCn1cc(F)nc1-c1cccc2ccccc12.[Si]'

# Process ground truth and prediction with the same helper before comparing
gt = process_reaction_with_product_maps_atoms(mapped_rxn, True)
result = rxn_mapper.get_attention_guided_atom_maps([rxn])
pred = process_reaction_with_product_maps_atoms(result[0]['mapped_rxn'], True)

accuracy = gt == pred  # True only on an exact match

In this case accuracy was False. Overall, 248 of the 682 cases were False. (The process_reaction_with_product_maps_atoms function is imported from smiles_utils.py in the rxnmapper directory.) Among the 281 USPTO cases used in Figure 4(a), the accuracy by number of bond changes is:

Number of bond changes: accuracy
1: 78%
2: 88%
3: 74%
4: 78%
5: 58%
6: 87%

So, is there a difference between the way I compared the mappings and the way you did? Or did you do any further preprocessing? If so, could you let me know what it was?

Definition of accuracy. I'm curious how you computed the accuracy. I treated a prediction as identical to the ground truth when the SMILES and atom indices returned by process_reaction_with_product_maps_atoms were the same, and defined accuracy as the ratio of identical cases to total cases. Any help would be appreciated.
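
The definition above can be sketched in plain Python. This is a minimal illustration, not the evaluation code used for the paper: exact_match_accuracy and accuracy_by_bucket are hypothetical helper names, and the inputs are assumed to be whatever process_reaction_with_product_maps_atoms returns for the ground truth and the prediction, with the only requirement being that two such values can be compared with ==.

```python
from collections import defaultdict


def exact_match_accuracy(ground_truths, predictions):
    """Fraction of cases where the prediction equals the ground truth."""
    if len(ground_truths) != len(predictions):
        raise ValueError("ground_truths and predictions must have the same length")
    if not ground_truths:
        raise ValueError("cannot compute accuracy over zero cases")
    hits = sum(gt == pred for gt, pred in zip(ground_truths, predictions))
    return hits / len(ground_truths)


def accuracy_by_bucket(buckets, ground_truths, predictions):
    """Exact-match accuracy per bucket (e.g. number of bond changes)."""
    # Hypothetical grouping helper: collect (ground truth, prediction) pairs per bucket
    grouped = defaultdict(lambda: ([], []))
    for bucket, gt, pred in zip(buckets, ground_truths, predictions):
        grouped[bucket][0].append(gt)
        grouped[bucket][1].append(pred)
    return {b: exact_match_accuracy(g, p) for b, (g, p) in sorted(grouped.items())}
```

Note that a strict equality check like this counts a prediction as wrong whenever it differs from the single stored ground truth, even if the difference is only a relabelling of chemically equivalent atoms.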

pschwllr commented 1 year ago

Some of the reactions have multiple correct mappings (even after canonicalisation), e.g.:

CC(C)(C)N(CC(=O)[O-])C(=O)C1=C(O)C2(CCOCC2)c2c(Cl)cccc2C1=O>>C.C.C.C.O=C(O)CNC(=O)C1=C(O)C2(CCOCC2)c2c(Cl)cccc2C1=O

With 4 equivalent carbon atoms in the product.

We tried to capture that in correct_maps from the json file. Using those, you should be able to reproduce the results from Figure 4 (make sure to load the versions of the modules we had back then - no guarantee for newer versions of RDKit, etc.).
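
One way to account for equivalent mappings is to count a prediction as correct when it matches any of the accepted mappings for that reaction. The sketch below illustrates that idea only; it is not the evaluation code behind Figure 4. The helper names are hypothetical, the layout of correct_maps (a list of accepted mappings per reaction) is an assumption, and reactions with an empty list are excluded from the score purely as a placeholder choice.

```python
def matches_any(prediction, accepted_mappings):
    """True if the prediction equals at least one accepted mapping.

    Returns None when no accepted mapping is available, so the caller
    can decide how to treat such reactions (an assumption of this sketch).
    """
    if not accepted_mappings:
        return None
    return any(prediction == accepted for accepted in accepted_mappings)


def any_match_accuracy(predictions, accepted_per_reaction):
    """Return (accuracy over scored reactions, number of skipped reactions)."""
    outcomes = [
        matches_any(pred, accepted)
        for pred, accepted in zip(predictions, accepted_per_reaction)
    ]
    # Reactions without any accepted mapping are left out of the denominator
    scored = [o for o in outcomes if o is not None]
    if not scored:
        raise ValueError("no reaction has an accepted mapping")
    return sum(scored) / len(scored), len(outcomes) - len(scored)
```

For the comparison to be meaningful, the prediction and every accepted mapping would need to be brought into the same representation first, for example by passing each through the same processing function.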

jaewanlee93 commented 1 year ago

@pschwllr Thanks for the quick answer, but my question remains unresolved.

1) How did you get correct_maps, then? I mean, there must have been some preprocessing to build correct_maps from the reactions in the json file. Also, in the correct_maps from the json file, some lists are empty and the lists differ in length (some have 0, some 1, and some 2 entries). If a list is empty, does that mean there is no ground truth?

2) When you produced the results of Figure 4, did you compare the atom indices (the output of process_reaction_with_product_maps_atoms) against correct_maps?
