swansonk14 / SyntheMol

Combinatorial antibiotic generation
MIT License

Missing build_block smiles in results from the step of 'Generate molecules with SyntheMol' #19

Closed JieHou-SLU closed 1 month ago

JieHou-SLU commented 1 month ago

Dear developer,

We are getting molecule results from the SyntheMol molecule generation step. However, we noticed that some high-scoring molecules are missing information about the building blocks used to generate them. For instance, we see rows like the following:

Smiles, score, building_block_1_1_id, building_block_1_1_smiles, building_block_1_2_id, building_block_1_2_smiles
XXXXXXX, 0.98, 3000000, XXXXXXXXXX, -1, NaN

Could you please explain what the -1 and the missing SMILES mean in the columns building_block_1_2_id and building_block_1_2_smiles?

Thank you very much

swansonk14 commented 1 month ago

Hi @JieHou-SLU,

Thank you for bringing up this issue! The -1 means that the building block SMILES was not found in the mapping from SMILES to IDs (see here). This usually should not happen when generating molecules with a single reaction as all the building blocks should appear in that mapping, so it seems like something is going wrong. I've only seen this happen when I've made a mistake providing the building blocks and reaction to building blocks mapping to SyntheMol.
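As a minimal sketch of the behavior described above, a dictionary lookup with a sentinel default returns -1 whenever a SMILES is absent from the SMILES-to-ID mapping. The function name, variable names, and example IDs here are hypothetical and are not taken from the SyntheMol source.

```python
# Hypothetical illustration; these names are not SyntheMol's actual code.
MISSING_ID = -1


def lookup_building_block_id(smiles, smiles_to_id):
    """Return the ID for a building block SMILES, or -1 if it is not in the mapping."""
    return smiles_to_id.get(smiles, MISSING_ID)


# Toy mapping with made-up IDs.
smiles_to_id = {"CCO": 3000000, "CC(=O)O": 3000001}

found = lookup_building_block_id("CCO", smiles_to_id)
# A SMILES written with stereochemistry will not match a mapping built without it.
missing = lookup_building_block_id("C[C@H](N)C(=O)O", smiles_to_id)
```

An exact string lookup like this fails on any difference in how the SMILES is written, which is one way a building block could end up with an ID of -1.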

Can you provide a bit more context on your generation run? Would you be able to provide the specific synthemol command that you ran? And to check, are you using the Enamine REAL building blocks and reactions from the repo or are you using any custom building blocks or reactions?

Thanks, Kyle

JieHou-SLU commented 1 month ago

Thank you for the reply. I used the following command for molecule generation:

Compute model scores for building blocks

python SyntheMol/scripts/models/predict.py \
    --data_path paper_data/Data/4_real_space/building_blocks.csv \
    --save_path experiments/models/antibiotic_chemprop/building_blocks.csv \
    --model_path experiments/models/antibiotic_chemprop \
    --model_type chemprop \
    --average_preds

Generate molecules

synthemol \
    --model_path experiments/models/antibiotic_chemprop \
    --model_type chemprop \
    --building_blocks_path experiments/models/antibiotic_chemprop/building_blocks.csv \
    --building_blocks_score_column chemprop_ensemble_preds \
    --building_blocks_id_column Reagent_ID \
    --reaction_to_building_blocks_path paper_data/Data/4_real_space/reaction_to_building_blocks_filtered.pkl \
    --save_dir experiments/models/generations_chemprop \
    --max_reactions 1 \
    --n_rollout 20000 \
    --replicate

In fact, many of the generated molecules have missing IDs/SMILES for their building blocks.

Thank you for your advice,

JieHou-SLU commented 1 month ago

Hi @swansonk14,

While investigating the building blocks to address this issue, we ran into a couple of questions, detailed below:

  1. The provided 'paper_data/Data/4_real_space/building_blocks.csv' contains 138060 building block SMILES, of which 132479 are unique (which matches the number in the paper).
  2. However, when we extract the unique SMILES from 'Data/4_real_space/reaction_to_building_blocks_filtered.pkl' (following the command example provided on GitHub), there are only 91429 unique SMILES.

We are wondering whether this is why the generated molecules have missing IDs for the building blocks in the final results.

If so, should we choose different files for building_blocks_path and reaction_to_building_blocks_path?

synthemol \
    --model_path data/Models/antibiotic_chemprop \
    --model_type chemprop \
    --building_blocks_path data/Models/antibiotic_chemprop/building_blocks.csv \
    --building_blocks_score_column chemprop_ensemble_preds \
    --building_blocks_id_column Reagent_ID \
    --reaction_to_building_blocks_path data/Data/4_real_space/reaction_to_building_blocks_filtered.pkl \
    --save_dir data/Data/6_generations_chemprop \
    --max_reactions 1 \
    --n_rollout 20000 \
    --replicate
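The comparison of unique SMILES described above could be sketched roughly as follows. The internal layout of the pickle is not described in this thread, so the helper below (a hypothetical name) simply walks nested dicts, lists, sets, and tuples and collects every string value, on the assumption that those strings are building block SMILES. The `smiles` column name is taken from the `--building_blocks_smiles_column smiles` option used elsewhere in this thread.

```python
import csv
import pickle


def collect_strings(obj):
    """Recursively gather every string stored as a value in nested containers."""
    if isinstance(obj, str):
        return {obj}
    if isinstance(obj, dict):
        children = obj.values()  # keys (e.g. reaction IDs) are deliberately skipped
    elif isinstance(obj, (list, tuple, set, frozenset)):
        children = obj
    else:
        return set()  # ints and other leaves carry no SMILES
    found = set()
    for child in children:
        found |= collect_strings(child)
    return found


def compare_building_blocks(csv_path, pkl_path, smiles_column="smiles"):
    """Count unique SMILES in the CSV and the mapping, and how many mapping SMILES the CSV lacks."""
    with open(csv_path, newline="") as f:
        csv_smiles = {row[smiles_column] for row in csv.DictReader(f)}
    with open(pkl_path, "rb") as f:
        mapping_smiles = collect_strings(pickle.load(f))
    return {
        "csv_unique": len(csv_smiles),
        "mapping_unique": len(mapping_smiles),
        "mapping_not_in_csv": len(mapping_smiles - csv_smiles),
    }
```

A nonzero `mapping_not_in_csv` would indicate building blocks that can be used during generation but have no entry in the building blocks file.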

Thank you for your help,

swansonk14 commented 1 month ago

Hi @JieHou-SLU,

Thank you for providing the additional information and apologies for the slow response!

You are correct to note that the number of unique building block SMILES in the reaction to building blocks mapping is less than the number in the building blocks file. This is because the building blocks file contains all possible building blocks, while the mapping file has been filtered to only those building blocks that are relevant to the 13 Enamine REAL reactions that we are currently using. In an upcoming new version of SyntheMol (perhaps in the next couple of months), we will introduce additional Enamine REAL reactions that will make use of more of those building blocks.

In terms of the missing SMILES issue, I did some debugging and realized that your intuition is right and there's a problem with the reaction_to_building_blocks_path file. When we originally did this work, we used one set of building blocks that lacked stereochemistry information, and then we later got an updated set of building blocks with stereochemistry. The version of building blocks in the Zenodo data is the old version without stereochemistry, so the reaction to building block mapping should have also been the non-stereochemistry version. However, I accidentally put in the new mapping instead of the old mapping.

In general, I would recommend using the new versions of the building blocks and the mapping for any projects you might be working on, and those new versions are built into SyntheMol here and are the defaults. However, if you want to specifically reproduce our work from our paper, then you'll need to use the old version of the mapping. I don't seem to have the option to upload a pickle file here, but you can recreate the old mapping with the following command.

python SyntheMol/scripts/data/filter_real_reactions_to_building_blocks.py \
    --reaction_to_building_blocks_path data/Data/4_real_space/reaction_to_building_blocks.pkl \
    --save_path data/Data/4_real_space/reaction_to_building_blocks_filtered_reproduce.pkl \
    --building_blocks_path data/Data/4_real_space/building_blocks.csv \
    --building_blocks_id_column Reagent_ID \
    --building_blocks_smiles_column smiles

Then, use this mapping in place of the one from the Zenodo data when running SyntheMol. This should also make SyntheMol faster since there won't be any missing building block SMILES during generation and so SyntheMol can make full use of the pre-computed building block scores.
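To check whether a generation run still contains missing building blocks, one option is to scan the results file for the -1 sentinel. This is only a sketch: the function names are hypothetical, the column naming pattern (`building_block_*_id`) is inferred from the example row earlier in this thread, and the toy rows below use made-up IDs.

```python
import csv
import io


def is_missing_id(value):
    """Return True if a cell holds the -1 sentinel (written as '-1' or '-1.0')."""
    try:
        return float(value) == -1
    except (TypeError, ValueError):
        return False  # empty or non-numeric cells are not counted as -1


def count_missing_building_blocks(rows):
    """Count rows where any building_block_*_id column equals -1."""
    n_rows = n_missing = 0
    for row in rows:
        n_rows += 1
        for column, value in row.items():
            if not isinstance(column, str):
                continue  # skip overflow fields that DictReader stores under None
            if column.startswith("building_block_") and column.endswith("_id"):
                if is_missing_id(value):
                    n_missing += 1
                    break
    return n_missing, n_rows


# Toy data for illustration only; real results would be read from the run's output CSV.
example = io.StringIO(
    "smiles,score,building_block_1_1_id,building_block_1_1_smiles,"
    "building_block_1_2_id,building_block_1_2_smiles\n"
    "CCOC(C)=O,0.98,3000000,CCO,-1,\n"
    "CCNC(C)=O,0.75,3000001,CCN,3000002,CC(=O)O\n"
)
n_missing, n_rows = count_missing_building_blocks(csv.DictReader(example))
```

For a real run, the same function can be given `csv.DictReader(open(path, newline=""))` for the results file; a count of zero would suggest the mapping and building blocks file are consistent.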

Please let me know if this works for you! If so, then I'll upload the updated file to Zenodo for future users.

Best, Kyle

JieHou-SLU commented 1 month ago

Thank you very much for the detailed explanation. Yes, we ran some experiments after aligning the building blocks with the reaction database, and this fully resolved the issue. Now all of the generated molecules have complete IDs. Thank you for your kind support; we really appreciate it.

swansonk14 commented 1 month ago

Okay great, I'm glad that fixed the issue!

JieHou-SLU commented 2 weeks ago

Hi @swansonk14, sorry for another support request.

We have successfully downloaded the latest REAL Space database by following the commands provided on real.doc, with support from Enamine staff. However, we get errors when processing the *.cxsmiles.bz2 files. We would also like to build a customized Enamine database.

Based on the script (scripts/data/map_real_reactions_to_building_blocks.py), the *.cxsmiles.bz2 files should contain the following columns describing the reaction and reagents used for each REAL Space molecule:

['reaction', 'reagent1', 'reagent2', 'reagent3', 'reagent4']

However, the files we downloaded from Enamine contain the following columns:

['smiles', 'id', 'MW', 'HAC', 'sLogP', 'HBA', 'HBD', 'RotBonds', 'FSP3', 'TPSA', 'Type', 'InChIKey']

Did we download the wrong database? Could you please advise on this issue?
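A quick way to see which expected columns a downloaded file lacks is to read only its header line. This is a sketch under assumptions: the function names are hypothetical, and the tab delimiter is an assumption about the file format that is not confirmed in this thread.

```python
import bz2

# Columns the mapping script expects, as listed above.
REQUIRED_COLUMNS = ["reaction", "reagent1", "reagent2", "reagent3", "reagent4"]


def missing_columns(header_line, required=REQUIRED_COLUMNS, delimiter="\t"):
    """Return the required columns that do not appear in a header line."""
    columns = header_line.rstrip("\r\n").split(delimiter)
    return [name for name in required if name not in columns]


def check_real_file(path):
    """Read only the first line of a .cxsmiles.bz2 file and report missing columns."""
    with bz2.open(path, "rt") as f:
        return missing_columns(f.readline())


# Header of the newly downloaded files, as reported above (tab-joined by assumption).
new_header = "\t".join(
    ["smiles", "id", "MW", "HAC", "sLogP", "HBA", "HBD",
     "RotBonds", "FSP3", "TPSA", "Type", "InChIKey"]
)
absent = missing_columns(new_header)
```

With the header reported above, every reaction and reagent column comes back as missing, which is consistent with the errors seen when processing the files.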

Thank you very much, Best, Jie

swansonk14 commented 1 week ago

Hi @JieHou-SLU,

I think you've likely downloaded the correct database but perhaps a different version of it. We used the 2022 q1-2 version of the Enamine REAL space that we downloaded on August 30, 2022. However, I think they have changed the format of their newer releases, so now you can only see the SMILES and associated property information but not the reagents and reactions. Unfortunately, that means that some of our scripts for processing the REAL space may no longer work. You could try asking Enamine if they can provide this older version of the REAL space, but it's possible they no longer plan to release the reagents and reactions. Even so, you should still be able to use the reagents and reactions provided as part of SyntheMol. They might not match the very latest version of the REAL space, but they should still cover about 30 billion REAL molecules.

Best, Kyle