microsoft / molecule-generation

Implementation of MoLeR: a generative model of molecular graphs which supports scaffold-constrained generation

memory overflow with large dataset preprocessing #65

Closed danielkaifeng closed 1 month ago

danielkaifeng commented 11 months ago

Dear author, I am trying to train the model on over 10 million datapoints, and even though I set `--num-processes` to 3 via `molecule_generation preprocess data/merged_lib results/merged_lib_full traces/merged_lib_full --pretrained-model-path xxx_best.pkl --num-processes 3`, the memory keeps growing until it overflows.

Is there any approach to reduce memory usage for extremely large datasets? Thanks!

danielkaifeng commented 11 months ago

I guess the memory overflow is largely due to preprocessing with the pretrained model. To work around this, would it be reasonable to preprocess the data without the pretrained model, but still use the pretrained-model checkpoint to initialize training?

kmaziarz commented 11 months ago

It would be surprising if plugging in the pretrained model checkpoint was to blame here (but maybe that is the case, I'm not sure). If you want to use the checkpoint for training initialization, then the atom metadata (e.g. atom type / motif vocabulary) has to be kept in sync; this is why the checkpoint has to be provided during preprocessing.
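As a toy illustration of the sync issue (not the actual MoLeR metadata format): if preprocessing rebuilt the atom-type vocabulary from a new dataset, the same atom type could end up with a different index than the one the checkpoint's weights were trained with.

```python
# Toy illustration: the same atom type must map to the same index in both the
# preprocessed data and the pretrained checkpoint, otherwise the features no
# longer line up with the pretrained embedding rows.
checkpoint_vocab = {"C": 0, "N": 1, "O": 2, "F": 3}  # vocabulary the checkpoint was trained with
rebuilt_vocab = {"C": 0, "O": 1, "N": 2, "Cl": 3}    # vocabulary rebuilt from a new dataset

print(checkpoint_vocab["O"], rebuilt_vocab["O"])     # 2 vs 1: the indices disagree
```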

Two thoughts:

danielkaifeng commented 11 months ago

It happens at the first stage, when initializing the feature extractors and generating FeaturisedData, before the xxx.jsonl.gz files are written. I think there are a few approaches that could solve the problem:

  1. Skip the pretrained model during preprocessing, as you mentioned; this reduces some of the memory usage.
  2. The overall memory still keeps growing in feature extraction; I guess it might be caused by storing large lists and smiles_datapoints in memory. I will try splitting the FeaturisedData into batched *.jsonl.gz files (see the sketch after this list) and make some modifications to the training dataloader.
  3. Write the FeaturisedData datapoints to an xxx.h5 file during generation using h5py.
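A rough sketch of idea 2, assuming each featurised datapoint can be serialised to a JSON-compatible dict; the helper name, file naming scheme, and shard size are illustrative and not part of the molecule_generation codebase:

```python
import gzip
import json

def write_sharded_jsonl_gz(datapoints, output_prefix, shard_size=100_000):
    """Stream datapoints into numbered *.jsonl.gz shards instead of holding them all in memory."""
    shard_idx, written, handle = 0, 0, None
    try:
        for datapoint in datapoints:  # `datapoints` can be a lazy iterable / generator
            if handle is None:
                handle = gzip.open(f"{output_prefix}.{shard_idx}.jsonl.gz", "wt")
            handle.write(json.dumps(datapoint) + "\n")
            written += 1
            if written >= shard_size:
                handle.close()
                handle, written, shard_idx = None, 0, shard_idx + 1
    finally:
        if handle is not None:
            handle.close()

# Example: featurised datapoints represented here as plain dicts (illustrative only)
write_sharded_jsonl_gz(({"smiles": s} for s in ["CCO", "c1ccccc1"]), "train_fold")
```

Writing to an xxx.h5 file with h5py (idea 3) would follow the same streaming pattern, just with a different container format.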
kmaziarz commented 10 months ago

> The overall memory still keeps growing in feature extraction; I guess it might be caused by storing large lists and smiles_datapoints in memory.

While the SMILES are indeed all read into memory, the processing then proceeds in an online fashion based on iterables. In principle, the processed samples do not all have to fit in memory, and 10M samples in SMILES form should not take up that much space.

At the point when the code prints out the sizes of the folds and says "beginning featurization", is the memory usage already high? At that point all the SMILES are already in memory. If memory is not high then but continues to grow later, it may be because the parallel processes are faster at producing samples than the main process is at consuming them, leading to more and more samples being "queued up".
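To make the "queued up" hypothesis concrete, here is a minimal sketch (unrelated to MoLeR's actual preprocessing code) of how a bounded result queue caps memory: workers block on `put` once the consumer falls behind. `featurise` is a hypothetical stand-in for the real per-sample processing.

```python
import multiprocessing as mp

def featurise(smiles: str) -> dict:
    return {"smiles": smiles, "length": len(smiles)}  # placeholder for real featurisation work

def worker(in_queue: mp.Queue, out_queue: mp.Queue) -> None:
    for smiles in iter(in_queue.get, None):  # None acts as the shutdown sentinel
        out_queue.put(featurise(smiles))     # blocks once out_queue is full, applying backpressure

if __name__ == "__main__":
    in_queue = mp.Queue()
    out_queue = mp.Queue(maxsize=1000)       # bounded: limits how many results can pile up
    processes = [mp.Process(target=worker, args=(in_queue, out_queue)) for _ in range(3)]
    for p in processes:
        p.start()
    smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"] * 5
    for s in smiles_list:
        in_queue.put(s)
    for _ in processes:
        in_queue.put(None)                   # one sentinel per worker
    for _ in smiles_list:
        result = out_queue.get()             # consuming results keeps the queue drained
    for p in processes:
        p.join()
```

With an unbounded queue, fast workers keep adding processed samples faster than the main process removes them, which matches the steadily growing memory you are seeing.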