Open zoey731 opened 2 months ago
Hi,
I've updated the README with this detail.
Essentially, you can access the prepared dataset for pretraining here.
However, if you have a different torch-geometric version than the one we used, it may not work. In that case, you can prepare the dataset yourself from the raw SMILES strings. To do so, download and unzip the data from chem data, then navigate to dataset/chembl_filtered/processed/smiles.csv. This file should contain the SMILES strings of all the molecules used in pretraining.
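For anyone inspecting that file before preprocessing, here is a minimal sketch (not part of the repo) of loading the SMILES strings from smiles.csv. The assumed layout, one SMILES per line with no header, is a guess; adjust if your copy differs. The stand-in file below is only for illustration.

```python
# Illustrative sketch: load SMILES strings from a one-column CSV file.
# Assumes one SMILES per line with no header row (an assumption, not
# confirmed by the repo).
import csv

def load_smiles(path):
    """Read SMILES strings from a one-column CSV file into a list."""
    with open(path, newline="") as f:
        return [row[0] for row in csv.reader(f) if row]

# Example with a small, hypothetical stand-in file:
with open("smiles_example.csv", "w") as f:
    f.write("CCO\nc1ccccc1\nCC(=O)O\n")

smiles = load_smiles("smiles_example.csv")
print(len(smiles))  # 3
```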
Hopefully I didn't miss anything. Let me know if you need more help.
Based on dataset/chembl_filtered/processed/smiles.csv, should I use mol/prepare_data_old.py to preprocess the dataset?
Yes. Actually, I have just committed this file as mol/chembl_pretraining_data.csv. If you want to prepare this training set, you can run:
```
python prepare_data_old.py --root <root path> --data_file_path chembl_pretraining_data.csv --smiles_column smiles --vocab_file_path vocab.txt
```
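A quick sanity check before launching the script: confirm the data file actually has the column name passed via --smiles_column. This is an illustrative sketch, not part of the repo; the header name "smiles" comes from the command above, and the tiny stand-in file is hypothetical.

```python
# Illustrative check: does the CSV header contain the expected
# SMILES column name? Useful before a long preprocessing run.
import csv

def has_column(path, column):
    """Return True if the CSV file's header row contains `column`."""
    with open(path, newline="") as f:
        header = next(csv.reader(f), [])
    return column in header

# Hypothetical stand-in for chembl_pretraining_data.csv:
with open("chembl_example.csv", "w") as f:
    f.write("smiles\nCCO\nc1ccccc1\n")

print(has_column("chembl_example.csv", "smiles"))  # True
```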
Thanks for your great work! I noticed your paper mentioned "We use a processed subset containing 456K molecules from the ChEMBL database [24] for pretraining." Could you please release your pretraining data or give detailed instructions on how to obtain it? Thanks!