lvkd84 / GraphFP

Implementation of Fragment-based Pretraining and Finetuning on Molecular Graphs (NeurIPS 2023)

How to obtain the pretraining data? #2

Open zoey731 opened 2 months ago

zoey731 commented 2 months ago

Thanks for your great work! I noticed your paper mentions "We use a processed subset containing 456K molecules from the ChEMBL database [24] for pretraining." Could you please release your pretraining data or give detailed instructions on how to obtain it? Thanks!

lvkd84 commented 2 months ago

Hi, I've updated the README with this detail. Essentially, you can access the prepared dataset for pretraining here. However, if you have a different torch-geometric version than the one we used, it may not load correctly. In that case, you can prepare the dataset yourself from the raw SMILES strings. To do so, download and unzip the data from chem data, then navigate to dataset/chembl_filtered/processed/smiles.csv. This file should contain the SMILES strings of all the molecules used in pretraining. Hopefully I didn't miss anything. Let me know if you need more help.
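As a quick sanity check before preprocessing, you could load that file and drop anything RDKit cannot parse. This is only a minimal sketch, not part of the repo; it assumes pandas and RDKit are installed, and the header handling plus the output file name are assumptions you may need to adjust to your copy of smiles.csv.

```python
# Minimal sketch (not from this repo): sanity-check the extracted SMILES file.
# Assumptions: smiles.csv holds one SMILES per row with no header; pandas and
# RDKit are available. Adjust `header`/`names` if your file differs.
import pandas as pd
from rdkit import Chem

df = pd.read_csv("dataset/chembl_filtered/processed/smiles.csv",
                 header=None, names=["smiles"])

# Keep only entries that RDKit can parse.
valid = df[df["smiles"].apply(lambda s: Chem.MolFromSmiles(s) is not None)]
print(f"{len(valid)} / {len(df)} SMILES parsed successfully")

# Hypothetical output name: a CSV with a "smiles" header for the preparation step.
valid.to_csv("pretraining_smiles.csv", index=False)
```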

wjxts commented 2 months ago

Based on dataset/chembl_filtered/processed/smiles.csv, should I use mol/prepare_data_old.py to preprocess the dataset?

lvkd84 commented 2 months ago

> Based on dataset/chembl_filtered/processed/smiles.csv, should I use mol/prepare_data_old.py to preprocess the dataset?

Yes. Actually, I have just committed this file as mol/chembl_pretraining_data.csv. If you want to prepare the training set yourself, you can run:

python prepare_data_old.py --root <root path> --data_file_path chembl_pretraining_data.csv --smiles_column smiles --vocab_file_path vocab.txt
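Conceptually, this step reads the SMILES column and converts each molecule into a torch-geometric graph (plus the fragment annotations that the repo's script derives from vocab.txt). Below is a rough, standalone sketch of that conversion step only, not the repo's prepare_data_old.py; the atom features are simplified placeholders and the fragment/vocabulary handling is omitted.

```python
# Illustrative sketch only -- NOT the repo's prepare_data_old.py.
# Shows the general SMILES -> torch-geometric graph conversion.
# Assumes rdkit, torch, and torch_geometric are installed; the real script
# uses richer atom/bond features and adds fragment-level information.
import pandas as pd
import torch
from rdkit import Chem
from torch_geometric.data import Data

def smiles_to_graph(smiles: str) -> Data:
    """Convert one SMILES string into a minimal atom/bond graph."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Unparsable SMILES: {smiles}")
    # Node features: atomic number only (placeholder for the real atom features).
    x = torch.tensor([[a.GetAtomicNum()] for a in mol.GetAtoms()], dtype=torch.long)
    # Edges: one pair of directed edges per bond.
    edges = []
    for b in mol.GetBonds():
        i, j = b.GetBeginAtomIdx(), b.GetEndAtomIdx()
        edges += [(i, j), (j, i)]
    edge_index = (torch.tensor(edges, dtype=torch.long).t().contiguous()
                  if edges else torch.empty((2, 0), dtype=torch.long))
    return Data(x=x, edge_index=edge_index)

df = pd.read_csv("chembl_pretraining_data.csv")
graphs = [smiles_to_graph(s) for s in df["smiles"].head(100)]  # small sample
print(graphs[0])
```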