vsomnath / chemnet_auxilliaries

Contains the code for preparing the ChEMBL25 dataset to be used for training protocol

Memory error when training with image augmentation #1


nathanmlim commented 3 years ago

Hello, I came across your repository when looking into ChemNet. Great resource and code you have here.

When running your code and trying to train on chembl25 with the --use_augment flag, I get the following memory-related error:

INFO:deepchem.models.keras_model:Ending global_step 36000: Average loss 0.000248933
Traceback (most recent call last):
  File "train_chemception_chembl.py", line 187, in <module>
    main()
  File "train_chemception_chembl.py", line 156, in main
    checkpoint_interval=10)
  File "/data/nlim/Github/deepchem/deepchem/models/keras_model.py", line 328, in fit
    callbacks, all_losses)
  File "/data/nlim/Github/deepchem/deepchem/models/keras_model.py", line 401, in fit_generator
    for batch in generator:
  File "/data/nlim/Github/deepchem/deepchem/models/chemnet_models.py", line 331, in default_generator
    n_samples = dataset.X.shape[0] + (
  File "/data/nlim/Github/deepchem/deepchem/data/datasets.py", line 2444, in X
    return np.vstack(Xs)
  File "<__array_function__ internals>", line 6, in vstack
  File "/data/nlim/anaconda3/envs/deepchem/lib/python3.6/site-packages/numpy/core/shape_base.py", line 283, in vstack
    return _nx.concatenate(arrs, 0)
  File "<__array_function__ internals>", line 6, in concatenate
MemoryError: Unable to allocate 225. GiB for an array with shape (1178969, 80, 80, 4) and data type float64

I can train just fine without the --use_augment flag, but I'm trying to replicate the results of the original ChemNet paper, as you have. Were you able to successfully replicate the original ChemNet paper while using image augmentation? If so, what hardware were you using? I currently only have access to some Titan Xs, which have about 12 GB of memory each.
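For reference, the 225 GiB figure in the traceback follows directly from the array shape and float64's 8 bytes per element. A quick sanity check:

```python
# Reproduce the allocation size from the MemoryError:
# shape (1178969, 80, 80, 4), dtype float64 (8 bytes per element).
n_elements = 1178969 * 80 * 80 * 4
n_bytes = n_elements * 8
gib = n_bytes / 2**30
print(f"{gib:.1f} GiB")  # ~224.9 GiB, matching the traceback
```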

vsomnath commented 3 years ago

Hi Nathan,

Thank you for your question. Yes, I also ran into memory issues with augmentation and never really got it running correctly within DeepChem. However, I was able to more or less replicate the results from the ChemNet paper without using augmentation. I was using a 2080 Ti with about 11 GB of memory.

The end goal was simply to see how well this transfer learning protocol works on different datasets. You can find the results here (closer to the bottom).

nathanmlim commented 3 years ago

Thanks for your response! Very interesting results, even without image augmentation. I'm still interested in getting the model working with image augmentation, and may try training on a subset of the chembl25 dataset. I do have some related follow-up questions, though.

From the error message, it appears that the np.array being loaded is stored with float64 dtype. Do you have any pointers as to where in the DeepChem source code I could convert the array to float32, or possibly some other more compact dtype? Additionally, do you know whether it is possible to utilize multiple GPUs for model training/prediction, so that I could make use of a cluster to get around the memory limit?

On another note, when training the model, the logger prints lines like INFO:deepchem.models.keras_model:Ending global_step 2429000: Average loss 0.000125934. Would you happen to know how I could get a progress bar showing the actual epoch it is on? The current DeepChem release no longer has the verbose flag available, and I haven't been able to figure out how to use the WandB tool.

Any insights you can provide would be greatly appreciated!

vsomnath commented 3 years ago

Hi Nathan,

Apologies for not getting back to you on this question. Inside the model, np.float64 is converted to np.float32 (https://github.com/deepchem/deepchem/blob/361207f8694f4e104fa8eb9eb4293de478cff9fa/deepchem/models/keras_model.py#L239). The error you see originates in deepchem/data/datasets.py, so you could try explicitly setting the dtype around this line (https://github.com/deepchem/deepchem/blob/361207f8694f4e104fa8eb9eb4293de478cff9fa/deepchem/data/datasets.py#L2102).
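To make the potential savings concrete, here is a small sketch in plain NumPy (not the actual DeepChem code path) showing that casting image arrays from float64 to float32 halves the memory footprint:

```python
import numpy as np

# A small stand-in batch of 80x80x4 "images" in float64.
batch = np.zeros((100, 80, 80, 4), dtype=np.float64)
batch32 = batch.astype(np.float32)  # explicit downcast

# float32 uses exactly half the bytes of float64
print(batch.nbytes, batch32.nbytes)
```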

I don't think DeepChem has multi-GPU support at this point. I have also fallen behind on the recent DeepChem updates and am not aware of how training progress is displayed now, but @rbharath would know better.

rbharath commented 3 years ago

There are a couple of tricks you can use. One is to do epoch-by-epoch training:

losses_so_far = []
for epoch in range(n_epoch):
  # fit one epoch at a time; fit() returns the average loss
  loss = model.fit(dataset, nb_epoch=1)
  losses_so_far.append(loss)
  ## some plotting code

Or you can use the all_losses argument:

all_losses = []
# all_losses is populated with the average loss over each checkpoint interval
model.fit(dataset, nb_epoch=n_epoch, all_losses=all_losses)
## plot all_losses
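If you want an actual textual progress bar on top of either loop, a minimal hand-rolled sketch (purely hypothetical, independent of DeepChem) could look like this, where epoch_losses stands in for the values collected from the per-epoch fit() calls above:

```python
# Minimal text progress bar for per-epoch losses.
def progress_line(epoch, n_epoch, loss):
    done = int(20 * (epoch + 1) / n_epoch)
    bar = "#" * done + "-" * (20 - done)
    return f"[{bar}] epoch {epoch + 1}/{n_epoch} loss={loss:.6f}"

epoch_losses = [0.0031, 0.0012, 0.0005]  # stand-in values
for i, loss in enumerate(epoch_losses):
    print(progress_line(i, len(epoch_losses), loss))
```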

We don't have multi-GPU support yet, but it is on our roadmap for the coming year.