Lin1007 opened this issue 4 years ago
Hi Lin, please try reducing the batch size, that should help.
Thanks for replying, but I don't think that's the main problem. I had already reduced the batch size to 1, which let me train and evaluate on my local computer with 6 GB of GPU memory; however, Colab still crashes with a CUDA memory error even with batch size 1 and 16 GB of GPU memory. I'm wondering if it is due to an update of the TF Object Detection API or something related to the virtual GPU, because it also crashed on AWS.
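For reference, this is how I'm checking which TensorFlow build and GPU the runtime actually exposes (plain TensorFlow calls, nothing specific to this repo), to rule out environment differences between local, Colab and AWS:

```python
import tensorflow as tf

# print the TF build and the GPU the runtime exposes, to rule out
# version / device differences between local, Colab and AWS
print("TensorFlow:", tf.__version__)
print("GPU device:", tf.test.gpu_device_name() or "none visible")
```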
Not sure if this is related, but I also get the same problem: the RAM fills up completely and the Colab runtime crashes, and the logs say CUDA_OUT_OF_MEMORY.
reproducible colab notebook: https://colab.research.google.com/drive/1Q0Aj61riRPOr3EYfvSbA9nZ1v8j7A0QE?usp=sharing
I've tried reducing batch_size from 32 to 16, but the problem persists.
Fixed it by changing the pipeline to:
train_ds.batch(128).map(augment, num_parallel_calls=tf.data.experimental.AUTOTUNE).cache().prefetch(tf.data.experimental.AUTOTUNE)
So the order should be batch -> map -> cache. My bad, I should have read the docs more carefully.
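For completeness, here is roughly what the full pipeline looks like with that ordering. The stand-in dataset and the `preprocess` function below are placeholders for my real input pipeline, so treat this as a sketch rather than the exact code:

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

# toy stand-in data; in my case this comes from the real input pipeline
images = tf.zeros([1024, 64, 64, 3], dtype=tf.uint8)
labels = tf.zeros([1024], dtype=tf.int32)
train_ds = tf.data.Dataset.from_tensor_slices((images, labels))

def preprocess(images, labels):
    # placeholder for the real augmentation / preprocessing
    return tf.cast(images, tf.float32) / 255.0, labels

# batch -> map -> cache -> prefetch, as in the fix above: batching first
# lets the map run once per batch (vectorized) instead of once per example.
# caveat: cache() after a random map replays the same augmented data every
# epoch, so put cache() before the map if you need fresh augmentations.
train_ds = (
    train_ds
    .batch(128)
    .map(preprocess, num_parallel_calls=AUTOTUNE)
    .cache()
    .prefetch(AUTOTUNE)
)
```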
@Lin1007 Were you able to figure it out? I used to use the r1.13 branch and didn't face this issue there, but after switching to master I'm hitting it. It is essentially trying to run an eval after saving the checkpoint and runs out of memory when it tries to evaluate. I'm facing this with the ssd_mobilenet_v1_fpn_coco model.
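Not a proper fix, but one thing that has sometimes helped me when the OOM only shows up once evaluation kicks in after a checkpoint is enabling GPU memory growth so TensorFlow doesn't reserve the whole card up front. This is a hedged sketch using standard TF 2.x / 1.14+ API, not anything specific to this repo:

```python
import tensorflow as tf

# let TensorFlow grow its GPU allocation on demand instead of grabbing the
# whole card at startup; must run before any GPU op initializes the device
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
```

Reducing how many examples the evaluation loop touches (e.g. the `sample_1_of_n_eval_examples` flag of `model_main.py`, if that's how you launch training) may also shrink the evaluation footprint, though it only works around the underlying issue.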
Prerequisites
1. The entire URL of the file you are using
https://github.com/tensorflow/models/tree/master/research/object_detection
2. Describe the bug
I'm training a Faster R-CNN with inception_v2 as the pretrained net on Colab and also on AWS, using Nvidia K80 and P100 GPUs (12 GB of GPU memory). Training runs correctly until about step 1100, and then it reports a CUDA memory error when evaluating the model.
3. Steps to reproduce
4. Expected behavior
Evaluation runs and training continues.
5. Additional context
6. System information