Closed by danijar 5 years ago
Even with the batch size set to 20, this error still occurs.
Did you try a smaller batch size than 20?
Yes, I tried to reduce the batch size to 10 or 5. It didn't work.
Does TF find your GPU and does your GPU have enough memory available (no other TF running)?
Thank you so much for your answer. After setting --params {batch_shape: [1, 50]}, it started training. I ran it for 2 days on a single 2080 Ti and it only reached epoch 9. There is another TF process running on this computer, but I checked the GPU memory and it only uses about 3 percent. Is there a way to make this program run faster?
This is not really specific to this code. Make sure the other program has the growing memory option enabled so that it does not reserve all GPU memory. It may also be that TF is not using the GPU because another program is already occupying it. Either way, I recommend running only one training job at a time. Good luck!
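For reference, here is a minimal sketch of that growing memory option, assuming the TF1-style ConfigProto API used elsewhere in this thread (the session setup below is illustrative, not the repository's actual code):
import tensorflow as tf
# With allow_growth enabled, the process reserves GPU memory on demand
# instead of grabbing almost all of it up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)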
I found that this error can often be avoided without reducing the batch size, by instead disabling TensorFlow's memory optimizations in _create_session() in trainer.py:
from tensorflow.core.protobuf import rewriter_config_pb2
# 'config' is the tf.ConfigProto that _create_session() passes to tf.Session.
off = rewriter_config_pb2.RewriterConfig.OFF
config.graph_options.rewrite_options.memory_optimization = off
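For context, a minimal self-contained sketch of how those lines fit into session creation (the function name and the allow_growth line are illustrative additions, not the repository's exact code):
import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2

def create_session():
    # Turn off the graph rewriter's memory optimization pass, which is what
    # triggers the error discussed above, then create the session with it.
    config = tf.ConfigProto()
    config.graph_options.rewrite_options.memory_optimization = (
        rewriter_config_pb2.RewriterConfig.OFF)
    config.gpu_options.allow_growth = True  # optional: reserve GPU memory lazily
    return tf.Session(config=config)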
This error shows up when there is not enough GPU memory available. Setting --params {batch_shape: [20, 50]} reduces the batch size from 50 to 20.
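As a rough illustration of what the two numbers mean (hedged; the repository's actual input pipeline may differ), batch_shape [20, 50] corresponds to batches of 20 sequences of 50 steps each:
import tensorflow as tf

batch_shape = [20, 50]  # 20 sequences per batch, 50 time steps per sequence
# Hypothetical per-step features; the real dataset in this repository differs.
steps = tf.data.Dataset.from_tensor_slices(tf.zeros([10000, 8]))
sequences = steps.batch(batch_shape[1], drop_remainder=True)    # -> [50, 8]
batches = sequences.batch(batch_shape[0], drop_remainder=True)  # -> [20, 50, 8]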