Closed digiamm closed 6 years ago
@lucadigiammarino This is because your GPU cannot allocate enough memory for the network. You can try reducing the `batch_size` to a small number, but if even that doesn't work, you will need to change the layer parameters.
You can calculate the memory required for your network by following something like this.
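As a rough illustration of that calculation, here is a back-of-the-envelope sketch (my own, not from the linked guide) for a single dense layer, assuming float32 weights (4 bytes each) and that gradients plus Adam's moment buffers roughly triple the footprint:

```python
# Rough memory estimate for one fully connected layer.
# optimizer_factor=3 assumes weights + gradients + Adam moments all live on the GPU.
def layer_memory_mb(n_in, n_out, bytes_per_float=4, optimizer_factor=3):
    params = n_in * n_out + n_out          # weight matrix + bias vector
    return params * bytes_per_float * optimizer_factor / 1024**2

# e.g. a 2458624 x 64 float32 tensor, like the one in the OOM messages
# people post in this thread:
print(f"~{layer_memory_mb(2458624, 64):.0f} MB")
```

Activations scale with the batch size on top of this, which is why shrinking `batch_size` sometimes helps and sometimes the weights alone are already too big.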
Regarding the abrupt stop, did you see a message like `Killed`? If yes, then you can check the log in syslog:
`vi /var/log/syslog`
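A quicker way to check, assuming a Debian/Ubuntu-style syslog (paths vary by distro; on journald-only systems use `journalctl -k` instead):

```shell
# A bare "Killed" with no Python traceback usually means the Linux OOM killer
# terminated the process; the kernel logs it like "Out of memory: Killed process ...".
grep -iE "out of memory|killed process" /var/log/syslog | tail -n 5
```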
I had the same issue. Can you please help me out with this?
@shobhitnpm Can you post the full log instead of the screenshot, along with the output of `nvidia-smi`?
It should be either the case that you are already running some other code on your GPU and it has eaten up your memory, or that your GPU, i.e. the GTX 1050, doesn't have enough memory to run this network even with a batch size of 1.
I think the second case is more probable, because the 1050 only has 3GB or 2GB of memory, so you should probably try a smaller network like `mobilenet`, `resnet50` etc. instead of `resnet101`. Some guidelines on the hardware required for running different versions of faster-rcnn can be found here.
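To get a feel for the size difference between those backbones, here is a quick comparison using the commonly cited ImageNet-classifier parameter counts (approximate figures; actual detector backbones differ somewhat, and activations add a lot more on top of the weights):

```python
# Weight-only memory footprint, assuming float32 (4 bytes per parameter).
PARAM_COUNTS = {
    "mobilenet": 4.2e6,
    "resnet50": 25.6e6,
    "resnet101": 44.5e6,
}

def weight_memory_mb(params, bytes_per_param=4):
    return params * bytes_per_param / 1024**2

for name, n in PARAM_COUNTS.items():
    print(f"{name}: ~{weight_memory_mb(n):.0f} MB of weights")
```

Training multiplies these numbers several times over (gradients, optimizer state, activations), which is why a 2-3GB card struggles with `resnet101`.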
I am trying to run a CNN, however I am getting this error. I have tried reducing the batch size and the number of nodes, however this still doesn't work:

```
ResourceExhaustedError: OOM when allocating tensor with shape[2458624,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node training_2/Adam/mul_23}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
	 [[{{node loss_1/mul}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
```
For the people stuck with this in models other than MNIST:
the reason for this is the high number of parameters (please check your `model.summary()`).
A good method to drastically lower these parameters is to add `subsample=(2, 2)` (careful, it lowers the resolution of your images/data) in all the convolutional layers above the Flatten layer; if `subsample` doesn't work (it is the old Keras 1 name), use `strides=(2, 2)` instead.
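A small sketch of that idea with the Keras 2 API (hypothetical layer sizes and input shape, just for illustration): striding the conv layers shrinks the feature map that reaches `Flatten`, which is usually where the parameter count explodes.

```python
# With strides=(2, 2) each conv layer halves the spatial resolution,
# so the Flatten output (and the following Dense layer) is far smaller
# than with the default strides=(1, 1).
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, 3, strides=(2, 2), activation="relu",
                  input_shape=(128, 128, 3)),
    layers.Conv2D(64, 3, strides=(2, 2), activation="relu"),
    layers.Flatten(),
    layers.Dense(10),
])
model.summary()  # compare the total against the strides=(1, 1) version
```

With these strides the model stays well under a million parameters; with default strides the same stack is an order of magnitude larger because of the Dense layer after `Flatten`.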
Hi guys, I am a beginner with TF and I am trying to run some Atari reinforcement learning training on my laptop with a GeForce GT 650M. I get the following error and I can't figure out what's wrong; I've tried changing my batch size multiple times, but it's the same again. Sometimes it stops with this error and sometimes it just quits after 300 steps without any error message.
This is my code. I am embedding all of it so it will be clear.
Thank you in advance for your help.