I guess the memory overflow is largely due to preprocessing with the pretrained model. To get around this, would it be reasonable to preprocess the data without the pretrained model, but still use the pretrained checkpoint to initialize training?
It would be surprising if plugging in the pretrained model checkpoint were to blame here (but maybe that is the case, I'm not sure). If you want to use the checkpoint for training initialization, then the atom metadata (e.g. atom type / motif vocabulary) has to be kept in sync; this is why the checkpoint has to be provided during preprocessing.
Two thoughts:
Preprocessing has a shorter first phase that produces the `*.jsonl.gz` files and then a longer phase that further processes them. Are you able to get through the first phase (i.e. get those files saved)? If so, it could be a good idea to kill the processing and restart from the same directory; it would then notice the files already exist and go straight to the second phase. Separating the phases like this might help prevent e.g. some resources not being freed between one and the other, which could reduce peak memory usage.

It is in the first stage of initializing the feature extractors, while generating `FeaturisedData`, before the `xxx.jsonl.gz` files are generated.
I think there are a few approaches that could solve this:
1. Save the `FeaturisedData` to batched `*.jsonl.gz` files and make some modifications to the training dataloader so it can train from them.
2. Write the `FeaturisedData` datapoints to an `xxx.h5` file during generation via `h5py` (a rough sketch of this is below).

The overall memory still keeps growing during feature extraction; I guess it might be caused by storing a large list and the `smiles_datapoints` in memory.
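For concreteness, here is a minimal sketch of the second idea: streaming datapoints into an HDF5 file with `h5py` instead of accumulating them in memory. The `featurise` function and the fixed-size feature vector are placeholders made up for illustration; the real `FeaturisedData` samples are more structured than a flat array.

```python
import h5py
import numpy as np

def featurise(smiles: str) -> np.ndarray:
    # Hypothetical stand-in for the real per-sample featurisation.
    return np.zeros(128, dtype=np.float32)

def stream_to_h5(smiles_iter, path="featurised.h5", feature_dim=128, chunk=10_000):
    """Append featurised datapoints to an HDF5 file chunk by chunk,
    so at most `chunk` featurised samples are held in memory at once."""
    with h5py.File(path, "w") as f:
        dset = f.create_dataset(
            "features",
            shape=(0, feature_dim),
            maxshape=(None, feature_dim),  # resizable along the sample axis
            chunks=True,
            dtype="float32",
        )
        buffer = []

        def flush():
            start = dset.shape[0]
            dset.resize(start + len(buffer), axis=0)
            dset[start:] = np.stack(buffer)
            buffer.clear()

        for smiles in smiles_iter:
            buffer.append(featurise(smiles))
            if len(buffer) == chunk:
                flush()
        if buffer:  # write out whatever is left at the end
            flush()
```

The training dataloader could then slice the dataset lazily through `h5py`, so the full set of featurised samples never has to be materialized at once.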
While the SMILES are indeed all read into memory, the processing then proceeds in an online fashion based on iterables. I think that, in principle, the processed samples do not all have to fit in memory, and 10M samples in SMILES form should not take up that much space.
At the point when the code prints out the sizes of the folds and says "beginning featurization", is the memory usage already high then? This should be a point when all the SMILES are already in memory. If memory is not high then but continues to grow later, maybe this is because the parallel processes are faster in processing the samples than the main process is in consuming them, leading to more and more samples being "queued up".
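Just to illustrate the "queued up" scenario (this is a sketch, not how the library is actually implemented): a bounded output queue is one way to give a slower consumer backpressure over the worker processes. The `featurise` function, the `max_pending` value, and the toy SMILES list below are made up for the example.

```python
import multiprocessing as mp
import threading

def featurise(smiles):
    # Hypothetical stand-in for the per-sample featurisation work.
    return len(smiles)

def worker(in_q, out_q):
    while True:
        smiles = in_q.get()
        if smiles is None:  # sentinel: no more work for this worker
            break
        # Blocks once out_q is full, so workers cannot run far ahead of a
        # slower consumer and results cannot pile up in memory.
        out_q.put(featurise(smiles))

def run(smiles_list, num_processes=3, max_pending=1_000):
    in_q = mp.Queue(maxsize=max_pending)
    out_q = mp.Queue(maxsize=max_pending)  # bounded => backpressure
    procs = [mp.Process(target=worker, args=(in_q, out_q)) for _ in range(num_processes)]
    for p in procs:
        p.start()

    def feed():
        for smiles in smiles_list:
            in_q.put(smiles)  # blocks while in_q is full
        for _ in range(num_processes):
            in_q.put(None)

    threading.Thread(target=feed, daemon=True).start()

    for _ in range(len(smiles_list)):
        sample = out_q.get()  # consume one featurised sample at a time
        # ...write `sample` to disk here instead of appending it to a list

    for p in procs:
        p.join()

if __name__ == "__main__":
    run(["CCO", "c1ccccc1", "CC(=O)O"] * 1000)
```

If the growth really comes from samples queueing up rather than from the SMILES themselves, capping the number of in-flight samples like this (or lowering `--num-processes` further) should flatten the memory curve.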
Dear author, I am trying to train the model with over 10 million datapoints, and even though I set `--num-processes` to 3 via `molecule_generation preprocess data/merged_lib results/merged_lib_full traces/merged_lib_full --pretrained-model-path xxx_best.pkl --num-processes 3`, the memory keeps growing until it overflows. Is there any approach to reducing memory usage for an extremely large dataset? Thanks!