syr-cn / SimSGT

[NeurIPS 2023] "Rethinking Tokenizer and Decoder in Masked Graph Modeling for Molecules"

About preparing DAVIS and KIBA data folds. #7

Closed: gihanpanapitiya closed this issue 8 months ago

gihanpanapitiya commented 8 months ago

Hello,

Can you share more details about how you prepared DAVIS and KIBA datasets?

I downloaded these datasets from https://github.com/chao1224/GraphMVP/tree/main/datasets and preprocessed them as instructed there. I then combined the resulting train.csv and test.csv files to create the full dataset, and used scaffold splitting to split this full dataset into train, valid and test sets. For DAVIS, I trained the model on the transformed affinities, -np.log10(y / 1e9). Is this the approach you used as well?
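For reference, the DAVIS transform mentioned above can be sketched as follows. This assumes the raw labels are Kd values in nM (which the division by 1e9 suggests); the function name `kd_nm_to_pkd` is hypothetical and is not taken from either repository.

```python
import numpy as np

def kd_nm_to_pkd(kd_nm):
    """Convert Kd values given in nM (assumption) to pKd = -log10(Kd in molar)."""
    kd_nm = np.asarray(kd_nm, dtype=float)
    # Dividing by 1e9 converts nM to M; the negative log10 gives pKd.
    return -np.log10(kd_nm / 1e9)

# 10000 nM is 1e-5 M, so its pKd is about 5; 1 nM is 1e-9 M, so its pKd is about 9.
pkd = kd_nm_to_pkd([10000.0, 1.0])
print(pkd)
```

Larger pKd values correspond to stronger binding, so the transform also compresses the very wide range of raw Kd values.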

It would be great if you could add your preprocessed train, valid and test folds to the repository.

syr-cn commented 8 months ago

Thanks for your interest.

Unfortunately I can't find the preprocessed train/valid/test files.

For the two DTA datasets, we randomly split them into train/valid/test sets, following the setting of GraphMVP. The following is from the GraphMVP paper:

Table 5: Results for four molecular property prediction tasks (regression) and two DTA tasks (regression). We report the mean RMSE of 3 seeds with scaffold splitting for molecular property downstream tasks, and mean MSE for 3 seeds with random splitting on DTA tasks. For GraphMVP, we set M = 0.15 and C = 5. The best performance for each task is marked in bold. We omit the std here since they are very small and indistinguishable. For complete results, please check Appendix G.4.
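A seeded random split of the kind described in the caption could look roughly like the sketch below. The 80/10/10 fractions, the seed, and the function name `random_split_indices` are assumptions for illustration; neither the thread nor the quoted caption states the actual ratios, so check the GraphMVP and SimSGT code for the real settings.

```python
import numpy as np

def random_split_indices(n_samples, frac_train=0.8, frac_valid=0.1, seed=0):
    """Return disjoint train/valid/test index arrays from a seeded shuffle.

    The fractions and seed are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)      # shuffled indices 0..n_samples-1
    n_train = int(frac_train * n_samples)
    n_valid = int(frac_valid * n_samples)
    train_idx = perm[:n_train]
    valid_idx = perm[n_train:n_train + n_valid]
    test_idx = perm[n_train + n_valid:]    # whatever remains goes to test
    return train_idx, valid_idx, test_idx

train_idx, valid_idx, test_idx = random_split_indices(1000)
```

Unlike scaffold splitting, this assigns samples to folds without regard to molecular structure, so structurally similar molecules may appear in both train and test.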

We did not perform any preprocessing beyond GraphMVP's preprocess.py. However, we applied normalization to the labels in the tuning stage (see train_dta in SimSGT/regression/tuning_dta.py, line 246).
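As an illustration only, label normalization for a regression task is often done by z-scoring with statistics from the training labels. The sketch below assumes that scheme; the thread does not say which normalization is used, and the helper names are hypothetical, so refer to the line cited above for the actual code.

```python
import numpy as np

def fit_label_scaler(train_labels):
    """Compute mean and std from the training labels only (assumed z-score scheme)."""
    y = np.asarray(train_labels, dtype=float)
    return y.mean(), y.std()

def normalize(labels, mean, std):
    """Map raw labels to zero-mean, unit-variance targets for training."""
    return (np.asarray(labels, dtype=float) - mean) / std

def denormalize(preds, mean, std):
    """Map model outputs back to the original label scale before computing MSE."""
    return np.asarray(preds, dtype=float) * std + mean

train_y = [5.0, 6.5, 7.2, 8.1]            # made-up example labels
mean, std = fit_label_scaler(train_y)
scaled = normalize(train_y, mean, std)
restored = denormalize(scaled, mean, std)
```

If metrics are reported on the original scale, predictions need to be denormalized first; otherwise the MSE values are not comparable with those computed on raw labels.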

gihanpanapitiya commented 8 months ago

Thank you very much for the details! Just for clarification, did you use the same test.csv as prepared by preprocess.py in the GraphMVP repository (https://github.com/chao1224/GraphMVP/blob/main/datasets/dti_datasets/davis/preprocess.py)?

syr-cn commented 8 months ago

Yes. As shown in lines 177-187 of regression/tuning_dta.py, we use the original train.csv and test.csv files produced by GraphMVP's preprocess.py.