Closed Avv22 closed 2 years ago
If anyone comes across OOM errors with "newer" versions of TF, there was a memory leak that was introduced in TF shortly after 2.8.2.
The tensorflow-addons module is deprecated and states their latest supported TF is 2.14.
With TF 2.14, I could see the used memory continuing to grow in nvidia-smi until training crashed with OOM.
Switching to TF 2.8.2 fixed this issue for me.
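For anyone hitting the same thing, pinning the last known-good release is straightforward; this assumes a pip-based install (adjust accordingly for conda or Docker):

```shell
# Pin TensorFlow to the last release before the leak was observed
pip install "tensorflow==2.8.2"

# Confirm the installed version
python -c "import tensorflow as tf; print(tf.__version__)"
```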
Hello,
We run the TensorFlow 2.1 implementation on our machine, which has 16 GB of RAM and a 4 GB GPU, as specified in your documentation:
#!/usr/bin/env bash
Then run ./train_python150k.sh as follows:
$ ./train_python150k.sh $DATA_DIR $DESC $CUDA $SEED
We got the following error:
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[320,26350] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:Add] 0%| | 0/337723 [00:20<?, ?it/s]
Edit: I even tried a smaller Python dataset, around 1 GB for both train and test, and got the same error as above. The tensor size is large. The number of trainable parameters is around 5 million.
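For a rough sense of scale, the failing allocation in the error above can be sized by hand; this sketch just multiplies the reported shape by 4 bytes per float32 element (interpreting the dims as batch × vocab is my assumption):

```python
# Size of the tensor from the OOM message: shape [320, 26350], dtype float.
rows, cols = 320, 26350
bytes_per_float32 = 4  # assuming float32, TF's default float dtype

tensor_bytes = rows * cols * bytes_per_float32
tensor_mib = tensor_bytes / (1024 ** 2)

print(f"{tensor_bytes} bytes = {tensor_mib:.1f} MiB")
```

A single tensor of this shape is only ~32 MiB, small next to 4 GB of GPU memory, which suggests the OOM comes from many such buffers (activations, gradients, optimizer state) being alive at once rather than from any one tensor.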
I changed the config.py file as follows (divided all values by 2 for each variable) relative to the original config.py; I am not sure whether this is recommended. I run the training script with the default option, so I changed the default section of config.py:

$ ./train_python150k.sh $DATA_DIR default $CUDA $SEED
Note: the model is still training, so I am not sure what the output will be. It has finished 1 epoch so far, so it seems my issue was the buffer/shuffle size. However, do you think halving the parameters would affect your model's training? If this is not recommended, could you please suggest acceptable reduced parameter values, since your original config.py configuration gives me an OOM error?
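Since the shuffle buffer seems to be the culprit, one way to pick a safer value is to estimate the buffer's memory footprint before training. The buffer size, sequence length, and bytes-per-token below are hypothetical illustration values, not numbers from this repo's config.py:

```python
def shuffle_buffer_mib(buffer_size: int, seq_len: int, bytes_per_token: int = 4) -> float:
    """Rough memory footprint of a tf.data shuffle buffer, in MiB.

    Assumes each buffered example is a dense sequence of seq_len
    4-byte tokens; real examples may be larger (labels, masks, padding).
    """
    return buffer_size * seq_len * bytes_per_token / (1024 ** 2)

# Hypothetical numbers: halving a 100k-example buffer of 512-token sequences
full = shuffle_buffer_mib(100_000, 512)
half = shuffle_buffer_mib(50_000, 512)
print(f"full buffer = {full:.0f} MiB, halved = {half:.0f} MiB")
```

Note that a tf.data shuffle buffer normally lives in host RAM, so shrinking it mainly relieves the 16 GB system-memory side; GPU memory pressure is usually driven by the batch size and model size instead.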