mims-harvard / TDC

Therapeutics Commons (TDC-2): Multimodal Foundation for Therapeutic Science
https://tdcommons.ai
MIT License

How to load an existing model? #144

Closed: spadavec closed this issue 2 years ago

spadavec commented 2 years ago

Sorry for being so dumb, but I see a lot of documentation on how to create and save models, and none on how to load them and use them for new predictions. Is that documented somewhere?

For example, I have the following adme models:

(p36) user@computer:~/TDC/adme_models$ ls
bioavailability_ma_model       cyp1a2_veith_model                    cyp2d6_substrate_carbonmangels_model  half_life_obach_model               pgp_broccatelli_model
caco2_wang_model               cyp2c19_veith_model                   cyp2d6_veith_model                    hia_hou_model                       ppbr_az_model
clearance_hepatocyte_az_model  cyp2c9_substrate_carbonmangels_model  cyp3a4_substrate_carbonmangels_model  hydrationfreeenergy_freesolv_model  solubility_aqsoldb_model
clearance_microsome_az_model   cyp2c9_veith_model                    cyp3a4_veith_model                    lipophilicity_astrazeneca_model     vdss_lombardo_model

(p36) user@computer:~/TDC/adme_models$ cd caco2_wang_model/
(p36) user@computer:~/TDC/adme_models/caco2_wang_model$ ls
config.pkl  model.pt

I'd like to load the caco2_wang_model and then run predictions on new compounds given as SMILES strings. Any pointers would be appreciated!

futianfan commented 2 years ago

I don't think the model is downloaded automatically. Can you please provide the command you used? Thanks!

spadavec commented 2 years ago

Hi @futianfan, I didn't download a model automatically. I generated them all via:

from DeepPurpose import utils, CompoundPred
from tdc.single_pred import ADME
from tdc.utils import retrieve_dataset_names
from tqdm import tqdm

# Names of all ADME datasets available in TDC
adme_datasets = retrieve_dataset_names('ADME')
drug_encoding = 'Morgan'

for dataset_name in tqdm(adme_datasets):
    # SMILES strings (X) and labels (y) in the format DeepPurpose expects
    X, y = ADME(name = dataset_name).get_data(format = 'DeepPurpose')
    # No frac or split_method is passed, so the default split is used
    train, val, test = utils.data_process(X_drug = X,
                                          y = y,
                                          drug_encoding = drug_encoding,
                                          random_seed = 2)
    config = utils.generate_config(drug_encoding = drug_encoding,
                                   train_epoch = 20,
                                   LR = 0.001,
                                   batch_size = 128,
                                   mpnn_hidden_size = 32)
    model = CompoundPred.model_initialize(**config)
    model.train(train, val, test)
    # One directory per dataset, e.g. adme_models/caco2_wang_model
    model.save_model('adme_models/' + dataset_name + '_model')

The models I listed above were all generated with this script. I'd now like to (for example) load the caco2 model I generated, load some new SMILES strings, and make predictions. How would I go about doing that?

Also, is there a way to modify the above code so that the model is trained on all of the data? I don't want to automatically lose ~20-30% of the data to a validation set that I don't need.

kexinhuang12345 commented 2 years ago

Hi, this seems to be a DeepPurpose question rather than a TDC one. TDC has no support for loading a pretrained model. In DeepPurpose (https://github.com/kexinhuang12345/DeepPurpose), you can load a pretrained model via

net = CompoundPred.model_pretrained('./cyp1a2_veith_model')
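To carry the loading call above through to predictions on new compounds, a minimal sketch might look like the following. The helper names `model_dir` and `predict_smiles` are hypothetical. The sketch also assumes that `utils.data_process` accepts `split_method='no_split'` with placeholder labels, and that the loaded model exposes `predict`, as in the DeepPurpose README examples; check both against your installed version.

```python
from pathlib import Path


def model_dir(root, dataset_name):
    """Directory that the training script above passed to save_model (hypothetical helper)."""
    return Path(root) / f"{dataset_name}_model"


def predict_smiles(path_dir, smiles, drug_encoding="Morgan"):
    """Load a saved DeepPurpose model and score a list of SMILES strings (hypothetical helper)."""
    # Imported lazily so the path helper works without DeepPurpose installed.
    from DeepPurpose import utils, CompoundPred

    # The saved directory is expected to contain config.pkl and model.pt.
    net = CompoundPred.model_pretrained(str(path_dir))

    # Assumption: data_process wants a label for every compound, so zeros are
    # passed as placeholders; 'no_split' should return one unsplit DataFrame.
    X_new = utils.data_process(X_drug=smiles,
                               y=[0] * len(smiles),
                               drug_encoding=drug_encoding,
                               split_method="no_split")
    return net.predict(X_new)
```

With the models from this thread, a call such as `predict_smiles(model_dir('adme_models', 'caco2_wang'), ['CCO', 'CC(=O)Oc1ccccc1C(=O)O'])` should return one prediction per SMILES string. The `drug_encoding` must match the one used at training time ('Morgan' in the script above).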

You can also set the train/valid/test fractions by passing frac=[0.9,0.0,0.1] to the data_process function. If you do not want the data split at all, pass split_method = 'no_split' to the data_process function.
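As a rough sketch of the two split options, the helper below builds the split-related arguments and forwards them to `data_process`. The names `split_kwargs` and `process_adme` are hypothetical, and the sketch assumes `'random'` is the split method that honours `frac`. As far as I can tell, `data_process` returns three frames when it splits and a single frame with `'no_split'`, so the training call may need adjusting in the latter case.

```python
def split_kwargs(use_all_data=False, frac=(0.9, 0.0, 0.1)):
    """Build split arguments for utils.data_process (hypothetical helper)."""
    if use_all_data:
        # No split: the whole dataset comes back as one frame.
        return {"split_method": "no_split"}
    if len(frac) != 3 or min(frac) < 0 or abs(sum(frac) - 1.0) > 1e-8:
        raise ValueError("frac must be three non-negative numbers summing to 1")
    # Assumption: 'random' is the split method that reads frac.
    return {"split_method": "random", "frac": list(frac)}


def process_adme(X, y, drug_encoding="Morgan", **split_options):
    """Encode the data with the chosen split (hypothetical helper)."""
    # Imported lazily so split_kwargs can be used without DeepPurpose installed.
    from DeepPurpose import utils

    return utils.data_process(X_drug=X,
                              y=y,
                              drug_encoding=drug_encoding,
                              random_seed=2,
                              **split_kwargs(**split_options))
```

For example, `train, val, test = process_adme(X, y)` keeps 90% of the data for training with an empty validation set, while `data = process_adme(X, y, use_all_data=True)` skips splitting altogether.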