microsoft / molecule-generation

Implementation of MoLeR: a generative model of molecular graphs which supports scaffold-constrained generation
MIT License
264 stars 43 forks

Query about data split! #44

Closed mukhergm closed 1 year ago

mukhergm commented 1 year ago

Hello,

I have a query about splitting SMILES data into training, validation, and test sets. Can I split the data randomly, or are there set rules for the split? What is the minimum number of SMILES data points required to train the model?

Thanks

kmaziarz commented 1 year ago

Hi @mukhergm!

Can I split the data randomly, or are there set rules for the split?

Since MoLeR is a generative model, I think a random split would be fine. The validation/test sets are not as crucial here as they could be for e.g. a property prediction model, and their main purpose is early stopping, to ensure we only fit the general data distribution but not the particular sample used as training data. In our experiments, we used the Guacamol dataset, which already comes split into train/validation/test; you can get it here. Our preprocessing script expects that the data is already split and that the files are named accordingly.
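For concreteness, a random split producing the file layout mentioned above might look like the sketch below. This is not part of the MoLeR codebase; the `train.smiles`/`valid.smiles`/`test.smiles` filenames follow the convention described here, and the 90/5/5 ratio is just an illustrative choice.

```python
import random

def random_split(smiles_path: str, seed: int = 0) -> None:
    """Randomly split one SMILES-per-line file into train/valid/test files.

    Illustrative sketch only; ratios and filenames are assumptions.
    """
    with open(smiles_path) as f:
        smiles = [line.strip() for line in f if line.strip()]

    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(smiles)

    n = len(smiles)
    n_train, n_valid = int(0.9 * n), int(0.05 * n)
    splits = {
        "train.smiles": smiles[:n_train],
        "valid.smiles": smiles[n_train:n_train + n_valid],
        "test.smiles": smiles[n_train + n_valid:],
    }
    for filename, subset in splits.items():
        with open(filename, "w") as f:
            f.write("\n".join(subset) + "\n")
```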

What is the minimum number of SMILES data points required to train the model?

This would depend on many factors, e.g. the size of the model, but as a rule of thumb: not many. GNN-based models such as MoLeR tend not to be as data-hungry as e.g. models based on RNNs/Transformers, as many constraints (such as valence) are already built in. The Guacamol training set contains around 1M molecules, and I wouldn't expect the model to benefit from much more than that unless one makes the model itself larger. I would also expect that less than 1M (e.g. 100K) would already be enough to get a reasonable model.

If your dataset is very small, one trick might be to duplicate the SMILES in your {train, valid, test}.smiles files (but do that after splitting to prevent data leakage). During preprocessing, I think a different generation order will be selected for each copy of each molecule, acting as a form of data augmentation. We didn't experiment with that too much, as the 1M molecules in Guacamol were already enough to get a good model, and it seemed that making the dataset larger was no longer helping.
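The duplication trick could be sketched as follows. This is a hypothetical helper, not part of the MoLeR tooling; it repeats each SMILES in-place within an already-split file, which (per the above) is applied only after splitting to avoid leaking the same molecule across sets.

```python
def duplicate_smiles(path: str, k: int = 4) -> None:
    """Repeat each SMILES line k times within one split file.

    Illustrative sketch; k is an arbitrary duplication factor.
    Run separately on train.smiles / valid.smiles / test.smiles
    AFTER splitting, never before.
    """
    with open(path) as f:
        smiles = [line.strip() for line in f if line.strip()]
    with open(path, "w") as f:
        for s in smiles:
            # Each copy may receive a different generation order
            # during preprocessing, acting as data augmentation.
            f.write((s + "\n") * k)
```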