microsoft / molecule-generation

Implementation of MoLeR: a generative model of molecular graphs which supports scaffold-constrained generation
MIT License

memory overflow with large dataset preprocessing #65

Closed: danielkaifeng closed this issue 2 months ago

danielkaifeng commented 1 year ago

Dear author, I am trying to train the model on over 10 million datapoints. Even though I set `--num-processes` to 3 via `molecule_generation preprocess data/merged_lib results/merged_lib_full traces/merged_lib_full --pretrained-model-path xxx_best.pkl --num-processes 3`, the memory keeps growing until it overflows.

Is there any way to reduce memory usage for extremely large datasets? Thanks!

danielkaifeng commented 1 year ago

I guess the memory overflow is largely due to preprocessing with the pretrained model. To work around this, would it be reasonable to preprocess the data without the pretrained model, but still use the pretrained checkpoint to initialize the model for training?

kmaziarz commented 1 year ago

It would be surprising if plugging in the pretrained model checkpoint were to blame here (but maybe that is the case, I'm not sure). If you want to use the checkpoint for training initialization, then the atom metadata (e.g. atom types / motif vocabulary) has to be kept in sync, which is why the checkpoint has to be provided during preprocessing.

Two thoughts:

danielkaifeng commented 1 year ago

The overflow happens in the first stage, while initializing the feature extractors and generating FeaturisedData, before xxx.jsonl.gz is written. I think there are a few approaches that could solve this:

  1. Skip the pretrained model during preprocessing, as you mentioned; this saves some memory.
  2. The overall memory still keeps growing during feature extraction, which I guess is caused by storing the large lists and smiles_datapoints in memory. I will try to split the FeaturisedData into multiple batched *.jsonl.gz files and make the corresponding modifications to the training dataloader (a rough sketch follows this list).
  3. Write the FeaturisedData datapoints to an xxx.h5 file during generation using h5py.
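
Regarding point 2, a minimal sketch of what the sharded output could look like, assuming the featurised datapoints can be serialized as JSON-compatible dicts; `write_sharded_jsonl_gz`, `datapoints`, and `shard_size` are hypothetical names, not part of the library:

```python
import gzip
import json
from itertools import islice

def write_sharded_jsonl_gz(datapoints, out_prefix, shard_size=100_000):
    """Stream an iterable of JSON-serialisable datapoints into numbered
    *.jsonl.gz shards, so that at most `shard_size` featurised samples
    are held in memory at any time."""
    it = iter(datapoints)
    shard_idx = 0
    while True:
        chunk = list(islice(it, shard_size))
        if not chunk:
            break
        path = f"{out_prefix}.{shard_idx:05d}.jsonl.gz"
        with gzip.open(path, "wt") as f:
            for dp in chunk:
                f.write(json.dumps(dp) + "\n")
        shard_idx += 1
```

The training dataloader would then iterate over the shards one by one instead of loading a single monolithic file.
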
kmaziarz commented 1 year ago

> The overall memory still keeps growing during feature extraction, which I guess is caused by storing the large lists and smiles_datapoints in memory.

While the SMILES are indeed all read into memory, the processing then proceeds in an online fashion based on iterables. In principle, the processed samples do not all have to fit in memory, and 10M samples in SMILES form should not take up that much space.

At the point when the code prints out the sizes of the folds and says "beginning featurization", is the memory usage already high? That is the point at which all the SMILES are already in memory. If memory is not high then but continues to grow later, maybe that is because the parallel worker processes featurise samples faster than the main process consumes them, leading to more and more samples being "queued up".
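
If that is indeed the cause, bounding the queues between the workers and the main process would cap the backlog. Below is a generic backpressure sketch, not the library's actual preprocessing code; `featurise`, `featurise_all`, and `train.smi` are placeholders:

```python
import multiprocessing as mp
import threading

def featurise(smiles):
    # Placeholder for the real per-molecule featurisation.
    return {"smiles": smiles}

def worker(task_queue, result_queue):
    for smiles in iter(task_queue.get, None):   # None = shutdown signal
        result_queue.put(featurise(smiles))
    result_queue.put(None)                      # tell the consumer this worker is done

def featurise_all(smiles_iterable, num_processes=3, max_pending=10_000):
    # Bounded queues provide backpressure: once `max_pending` items are
    # waiting, `put` blocks, so fast workers cannot queue up results
    # faster than the (slower) main process consumes them.
    task_queue = mp.Queue(maxsize=max_pending)
    result_queue = mp.Queue(maxsize=max_pending)
    workers = [
        mp.Process(target=worker, args=(task_queue, result_queue))
        for _ in range(num_processes)
    ]
    for w in workers:
        w.start()

    def feed():
        for smiles in smiles_iterable:
            task_queue.put(smiles)              # blocks while the queue is full
        for _ in workers:
            task_queue.put(None)

    threading.Thread(target=feed, daemon=True).start()

    finished = 0
    while finished < len(workers):
        item = result_queue.get()
        if item is None:
            finished += 1
        else:
            yield item
    for w in workers:
        w.join()

if __name__ == "__main__":
    smiles = (line.strip() for line in open("train.smi"))
    for datapoint in featurise_all(smiles):
        pass  # e.g. append to the current .jsonl.gz shard
```

The `molecule_generation` preprocessing already parallelises featurisation; the sketch is only meant to illustrate how a bounded result queue keeps memory flat when producers outpace the consumer.
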