How to divide it2_tt_v1.3_completeset into test and train set

wdm2 commented 1 year ago

Thank you for sharing your excellent work. I have downloaded the CrossDocked2020 v1.3 data and would like to know how the full dataset is divided into train and test sets. It seems that "it2_tt_v1.3_completeset_test0.types" and "it2_tt_v1.3_completeset_train0.types" are the same file. I thought it2_tt_v1.3_train[0-2].types was concatenated with it2_tt_v1.3_completeset_train0.types; is that correct?

francoep commented 1 year ago

The completeset types files contain ALL of the data, so the completeset_train and completeset_test files are identical. Two files exist only for compatibility with some lab scripts that assume both a train and a test file are present.
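If you want to confirm this on your own download, a minimal sketch is to compare content hashes of the two files. The helper names `file_digest` and `files_identical` are hypothetical, not part of any lab script; the file names in the trailing comment are the ones mentioned in this thread.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_identical(path_a, path_b):
    """True if both files have byte-for-byte identical content."""
    return file_digest(path_a) == file_digest(path_b)


# Example call (paths are wherever you unpacked the data):
# files_identical("it2_tt_v1.3_completeset_train0.types",
#                 "it2_tt_v1.3_completeset_test0.types")
```

This is a byte-level comparison, so it would report a difference if the two files held the same lines in a different order.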

If you want to split it up, you can use our 3-fold clustered cross-validation splits (instructions here). These are the it2_tt_v1.3_train[0-2].types files and the corresponding test .types files. Each train file contains 2/3 of the data, and the matching test file contains the remaining 1/3.
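As a sanity check on a given fold, here is a rough sketch that reports the train and test fractions and whether any example appears in both files. The function name `fold_summary` is hypothetical, and treating each whole line as one example is an assumption about the .types format.

```python
def fold_summary(train_lines, test_lines):
    """Summarise one train/test fold given the lines of each .types file."""
    # Assumption: one example per non-empty line; identical lines count once.
    train = {line.strip() for line in train_lines if line.strip()}
    test = {line.strip() for line in test_lines if line.strip()}
    total = len(train | test)
    return {
        "train_fraction": len(train) / total if total else 0.0,
        "test_fraction": len(test) / total if total else 0.0,
        "overlap": len(train & test),  # examples present in both files
    }


def summarize_fold_files(train_path, test_path):
    """Same summary, reading the two files from disk."""
    with open(train_path) as train_file, open(test_path) as test_file:
        return fold_summary(train_file, test_file)
```

For a proper split you would expect `overlap` to be 0 and `train_fraction` to be close to 2/3. Running this on the completeset pair instead should give fractions of 1.0 each, with every example overlapping.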

Or you could generate your own.
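For generating your own split, one simple option is to keep all examples that share a receptor directory in the same fold. The sketch below is not the clustered cross-validation procedure used for the provided splits; it is a simpler grouping. It assumes the receptor path is the fourth whitespace-separated field of each line and that its first path component identifies the pocket, so check both against your files. All function names here are hypothetical.

```python
from collections import defaultdict


def group_key(line, receptor_column=3):
    """Group identifier: first path component of the (assumed) receptor field."""
    return line.split()[receptor_column].split("/")[0]


def grouped_folds(lines, n_folds=3, receptor_column=3):
    """Assign whole groups to folds, largest group first, into the smallest fold."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if line:
            groups[group_key(line, receptor_column)].append(line)

    folds = [[] for _ in range(n_folds)]
    # Sort by size (descending), then by name so the result is deterministic.
    for key in sorted(groups, key=lambda k: (-len(groups[k]), k)):
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(groups[key])
    return folds


def train_test_for_fold(folds, test_index):
    """Use one fold as the test set and the remaining folds as the train set."""
    test = list(folds[test_index])
    train = [line for i, fold in enumerate(folds) if i != test_index for line in fold]
    return train, test
```

Grouping by directory only prevents the same pocket from landing in both train and test. It does not account for similarity between different pockets, which is what a clustered split is meant to handle.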

wdm2 commented 1 year ago

Thank you so much for your prompt and detailed reply! How the dataset is split is now clear to me, so I can proceed with my work confidently.

gnina / models

How to divide it2_tt_v1.3_completeset into test and train set #33