rxn4chemistry / rxn-neb

Implementation of NEB retrosynthesis
MIT License
0 stars 0 forks source link

The last commandline can not get a result, Reproduce Failure. #2

Closed queliyong closed 3 months ago

queliyong commented 3 months ago

Hello, following the intructions of README, I finally get a forward and a backward model trained using [rxn-onmt-models] and USPTO FULL dataset, but no results are returned when run the last commandline, as follows: image The command line and models I used are: run-neb-retrosynthesis --product "NS(=O)(=O)c1nn(-c2ccccn2)cc1Br" --forward_model_path "./models/forward.pt" --backward_model_path "./models/backward.pt" --fingerprints_model_path "./rxnfp/models/transformers/bert_ft" --pca_model_filename "./sandbox/pca.pkl" --tree_data_dict_pca_filename "./sandbox/tree_data_dict_pca.pkl" --output_path "./test_retro.json" The models link is : https://figshare.com/articles/dataset/a_forward_and_a_backward_model/25954747 Any help would be appreciated! Thanks a lot.

drugilsberg commented 3 months ago

I believe this is because of the message you see above the last log entry. The model trained does not seem to generate valid SMILES strings. Did you follow all the data preparation steps reported in https://github.com/rxn4chemistry/rxn-onmt-models/tree/main? If the data used to train the model is not standardised according to that procedure, it might very well be that the resulting models will generate unsupported representations (e.g., the second molecule reported in the error O=c1cc[nH]n1 will fail at the kekulization step)

To overcome this without retraining the models you could update the code base adapting the SMILES checks here: https://github.com/rxn4chemistry/rxn-neb/blob/main/src/rxn/neb/utils/smiles.py. In this way you will not force the SMILES to follow the standard used in general by the models trained by rxn-onmt-models.

queliyong commented 3 months ago

Thanks a lot for your patient reply.

  1. I indeed followed all three data preparation steps: rxn-prepare-data, rxn-onmt-preprocess, and rxn-onmt-train. And though it was failed to get any results for several targets, and with the above invalidsmiles exception, I ran successfully for two small target product: "CCS(=O)(=O)OCCBr", "CC(C)CS(=O)(=O)OCCCl". But when testing this target: "Nc1cccc2cnc(Cl)cc12", a new error raised: 1717480406843_EDDCA9A0-4777-464c-91BC-993E30A7BA4A Is this normal when testing some targets with this project?
  2. I agree the SMILES sanitization checks in this function module, and I am confused how to update this SMILES checks and at the same time filter the unsanitized SMILES.
queliyong commented 3 months ago

And I was wondering if it has to do with the very small amount of data of reaction_trees.json in sample-data folder?

queliyong commented 3 months ago

These are some failure examples that raise invalidSMILES or unsanitized SMILES exception: target: 'CC(C)C(N)C(O)COc1ccccc1C1=NNC(=O)CC1' error: rxn.chemutils.exceptions.InvalidSmiles: "CC(C)C1C(COc2ccccc2C(=O)CCC(=O)O1.NN" is not a valid SMILES string. target:'Cc1ccc(C2=NNC(=O)CC2)cc1OCC(O)CNC(C)(C)C' error: rxn.chemutils.exceptions.InvalidSmiles: "CC(C)(C)N.Cc1ccc(C2=NNC(=O)CC2CO2)cc1OCC1CO1" is not a valid SMILES string

drugilsberg commented 3 months ago

And I was wondering if it has to do with the very small amount of data of reaction_trees.json in sample-data folder?

Surely the sample we provide does not allow to run realistic case studies, it's mostly to allow users to prepare their data according to the specification the package expects.

queliyong commented 3 months ago

And I was wondering if it has to do with the very small amount of data of reaction_trees.json in sample-data folder?

Surely the sample we provide does not allow to run realistic case studies, it's mostly to allow users to prepare their data according to the specification the package expects.

Thanks for your prompt reply. And when preparing my own dataset following reaction_trees.json, I was confused about the column "rxnfp_ids", with its disorder number even in one rxn record. Is this column necessary for this dataset? If so (checking the relevant code, probably), how to generate this column for my own dataset? Thanks.

drugilsberg commented 3 months ago

And I was wondering if it has to do with the very small amount of data of reaction_trees.json in sample-data folder?

Surely the sample we provide does not allow to run realistic case studies, it's mostly to allow users to prepare their data according to the specification the package expects.

Thanks for your prompt reply. And when preparing my own dataset following reaction_trees.json, I was confused about the column "rxnfp_ids", with its disorder number even in one rxn record. Is this column necessary for this dataset? If so (checking the relevant code, probably), how to generate this column for my own dataset? Thanks.

This is an identifier that is used to map the specific reaction to related fingerprints. Involving here @federicozipoli that is the main author and can give better guidance on the topic.

federicozipoli commented 3 months ago

Thank you @drugilsberg Correct. rxnfp_ids is list of identifiers (IDs) for the reactions, which correspond to the SMILES strings listed in the rxns array. These rxnfp_ids do not repeat within each record, ensuring that each reaction SMILES string in the rxns array has a unique identifier in the corresponding rxnfp_ids array. Additionally, these identifiers are unique not only within each record but also across different records.

For example, if in the record for patent US1 there are two reactions identified by rxnfp_ids = [0,1] the next record, let's say patent US2 will have records rxnfp_ids = [2,3].

queliyong commented 3 months ago

Thank you @drugilsberg Correct. rxnfp_ids is list of identifiers (IDs) for the reactions, which correspond to the SMILES strings listed in the rxns array. These rxnfp_ids do not repeat within each record, ensuring that each reaction SMILES string in the rxns array has a unique identifier in the corresponding rxnfp_ids array. Additionally, these identifiers are unique not only within each record but also across different records.

For example, if in the record for patent US1 there are two reactions identified by rxnfp_ids = [0,1] the next record, let's say patent US2 will have records rxnfp_ids = [2,3].

Soga!Thanks a lot again for all your patient explaination.