Reproducing the DRD2 case

charlesxu90 commented 1 year ago

Dear @jkwang93,

I'm trying to reproduce the DRD2 case using exactly your input. However, I ran into errors when training the RNN model on the Transformer-generated SMILES.

1. First bug:
```python
Traceback (most recent call last):
  File "3_train_middle_model_dm.py", line 75, in <module>
    train_middle(**arg_dict)
TypeError: train_middle() got an unexpected keyword argument 'save_model_dir'
```

2. Another bug:
```python
Traceback (most recent call last):
  File "3_train_middle_model_dm.py", line 75, in <module>
    train_middle(**arg_dict)
  File "3_train_middle_model_dm.py", line 22, in train_middle
    moldata = MolData(train_data, voc)
  File "/home/xiaopeng/Desktop/Chem_design/ref_works/MCMG_test/MCMG/MCMG_utils/data_structs.py", line 108, in __init__
    self.con = df[['drd2', 'qed', 'sa']]
  File "/home/xiaopeng/Desktop/Chem_design/env/lib/python3.6/site-packages/pandas/core/frame.py", line 2912, in __getitem__
    indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
  File "/home/xiaopeng/Desktop/Chem_design/env/lib/python3.6/site-packages/pandas/core/indexing.py", line 1254, in _get_listlike_indexer
    self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
  File "/home/xiaopeng/Desktop/Chem_design/env/lib/python3.6/site-packages/pandas/core/indexing.py", line 1298, in _validate_read_indexer
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['drd2', 'qed', 'sa'], dtype='object')] are in the [columns]"
```

It seems the dataset is required to contain these columns. Of course I can work around it by adding a conditional check, but I'm not sure the resulting model would be the right one.
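
The KeyError in the second traceback means that none of the three columns selected at line 108 of data_structs.py exist in the training file. As a rough sketch, the file header can be checked before training starts, using only the standard library. The function name and the sample CSV contents are hypothetical; only the column names drd2, qed and sa come from the traceback.

```python
import csv
import io

# Column names taken from the traceback (data_structs.py, line 108).
REQUIRED_COLUMNS = ("drd2", "qed", "sa")


def missing_condition_columns(csv_file):
    """Return the required condition columns absent from the CSV header."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_COLUMNS if name not in present]


# Hypothetical file content: SMILES only, no condition columns.
smiles_only = io.StringIO("smiles\nCCO\nc1ccccc1\n")
print(missing_condition_columns(smiles_only))

# Hypothetical file content with all three condition columns present.
annotated = io.StringIO("smiles,drd2,qed,sa\nCCO,0,0,1\n")
print(missing_condition_columns(annotated))
```

A check like this only reports the problem. Skipping the column selection when the columns are missing would train the model without its condition values, which may be why the result would differ.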

Can I just use REINVENT by MarcusOlivecrona to reproduce the RNN prior and agent training process?

Best regards,

charlesxu90 commented 1 year ago

I used the generated dataset to train the REINVENT prior, and that ran smoothly.

But the problem is that the agent training is still different, so I have to use the MCMG code for it. The results are weird.

It seems the activity is not improving during the agent training steps.

charlesxu90 commented 1 year ago

I read your paper. You evaluate the total number of successful molecules, but the activity of the molecules at each step was not analyzed.

But based on a crude estimate, the success rate should be around 5/16 at each step. That means at least 30% of the molecules should have an activity score greater than 0.5, so even in the extreme case the average activity should be about 0.156. But here, the score stays at around 0.02.
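
The estimate can be checked with exact fractions. This sketch assumes, as the estimate does, that 5 of every 16 sampled molecules score at least 0.5, and additionally assumes that all remaining molecules score 0, which gives a lower bound on the mean activity:

```python
from fractions import Fraction

success_rate = Fraction(5, 16)       # assumed fraction of successful molecules per step
activity_threshold = Fraction(1, 2)  # a success has an activity of at least 0.5

# Worst case: successes sit exactly at the threshold, all others score 0.
min_mean_activity = success_rate * activity_threshold

print(float(success_rate))       # 0.3125
print(float(min_mean_activity))  # 0.15625
```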

jkwang93 commented 1 year ago

Please read the README carefully. For the first error, you are required to enter the location where the file is saved.
You need to follow the steps in order: train the Transformer first and let it generate SMILES, then use these SMILES as the training set for step 3.

jkwang93 commented 1 year ago

Please check your procedure carefully. Even if you skip the complete MCMG process and use REINVENT directly, the score will still increase, only with slower convergence. We have engineered these processes and integrated them into our platform: https://drugflow.com/

raycaohmu commented 1 year ago

For the first bug, you should change the save_model_dir parameter in the train_middle function to save_model. For the second bug, you should use the functions in utils.py to calculate the corresponding drd2, qed, and sa scores, and concat them to the generated test.csv dataframe.
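
The first error arises whenever a dictionary unpacked with ** contains a key that the called function does not declare. Below is a minimal sketch of detecting and fixing such a mismatch. The train_middle here is a stand-in with a hypothetical signature, not the actual function from the repository, and unexpected_keywords is a helper invented for illustration.

```python
import inspect


def train_middle(train_data, save_model="./model.ckpt"):
    """Stand-in with a hypothetical signature, for illustration only."""
    return f"training on {train_data}, saving to {save_model}"


def unexpected_keywords(func, kwargs):
    """List the keys in kwargs that func does not accept as parameters."""
    params = inspect.signature(func).parameters
    # A **kwargs parameter accepts any keyword, so nothing is unexpected.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    return [key for key in kwargs if key not in params]


arg_dict = {"train_data": "generated.csv", "save_model_dir": "./out"}
print(unexpected_keywords(train_middle, arg_dict))  # ['save_model_dir']

# Renaming the key so it matches the declared parameter resolves the mismatch.
arg_dict["save_model"] = arg_dict.pop("save_model_dir")
print(train_middle(**arg_dict))
```

Whether the rename belongs in the function definition or in the argument dictionary depends on the actual code; the only requirement is that both sides use the same name.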

jkwang93 / MCMG

Reproducing the DRD2 case #10