jameschartouni opened this issue 6 years ago
Same problem here. It runs out of memory and crashes every time I try to train on a GTX 1080 Ti.
None of the demos work anymore, and I can't train any Matterport Mask R-CNN models. I can still train plain Keras models, so the issue appears to be specific to this repo.
I'm encountering the same problem in a new virtual environment, except that train_shapes now works. The model trains on the CPU; it only breaks in GPU mode.
Same here. I also ran in GPU mode on a GTX 1080 Ti with image size (448, 448, 3).
By the way, do you know of other config parameters that affect the model's memory footprint? For example, changing BACKBONE from "resnet101" to "resnet50", or using a smaller TRAIN_ROIS_PER_IMAGE. I will try it tonight and update the result.

Update: it didn't work, so maybe it's about the CUDA or cuDNN version. My setup: tensorflow-gpu 1.6.0, CUDA 9.0, cuDNN 7.0.5, Ubuntu 16.04, GTX 1080 Ti.
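For reference, this is roughly what I mean by a "smaller" config (attribute names follow the repo's Config class; the exact values are just examples, and in older checkouts the import is from config import Config rather than from mrcnn.config import Config):

```python
# Sketch of a reduced-memory training config; the values here are only examples.
from mrcnn.config import Config  # older checkouts: from config import Config


class LowMemoryConfig(Config):
    NAME = "low_memory"

    # A smaller backbone and fewer sampled ROIs reduce activation memory.
    BACKBONE = "resnet50"          # default is "resnet101"
    TRAIN_ROIS_PER_IMAGE = 100     # default is 200

    # Keep the effective batch size at one image.
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

    # A lower input resolution also helps; IMAGE_MAX_DIM must be divisible by 64.
    IMAGE_MIN_DIM = 448
    IMAGE_MAX_DIM = 448


config = LowMemoryConfig()
config.display()
```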
Have you tried using the NVIDIA GPU Cloud instead of the local GPU?
@stonePJ I haven't tried the cloud yet. I'm not sure why that would make a difference, though.
This problem seems invariant to the parameters I choose. I don't think it's a GPU memory problem, since the train_shapes.ipynb demo also fails and I can still perform inference. Have any of you tried Docker? Do you think this could be a driver or CUDA/cuDNN issue? The odd thing is that I managed a successful run yesterday morning and haven't been able to replicate it, so the problem isn't reproducible 100% of the time. The error when running the script is a seg fault.
I've personally encountered this when running CUDA 9.1 with TensorFlow 1.8. The TensorFlow build offered on PyPI is only known to work with CUDA 9.0: https://www.tensorflow.org/install/install_sources#tested_source_configurations
You can compile TensorFlow from source to work with 9.1 if you'd like, or you can downgrade to 9.0 and use the precompiled TensorFlow.
On Ubuntu 16.04 I had to apt-mark hold cuda-9-0, because each time I ran apt upgrade it would pull in CUDA 9.1.
@rafihayne you are correct. I am able to train now with no memory problems after removing CUDA 8 and installing CUDA 9.0. I am also using cuDNN 7.1 for CUDA 9.0, and then installed tensorflow-gpu==1.5. This should fix it. @jameschartouni let me know if it works for you.
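If it helps, here is the quick sanity check I run after reinstalling, just to confirm the tensorflow-gpu build actually sees the card (standard TF 1.x APIs; it does not verify the exact cuDNN minor version):

```python
# Quick TF 1.x sanity check: is this build CUDA-enabled, and is the GPU visible?
import tensorflow as tf
from tensorflow.python.client import device_lib

print("TensorFlow version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("GPU available:", tf.test.is_gpu_available())

# Lists CPU and GPU devices; the GPU entry includes its memory limit.
for device in device_lib.list_local_devices():
    print(device.name, device.device_type)
```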
I downgraded from CUDA 9.1 to CUDA 9.0 and am still using TensorFlow 1.8. I still got a memory error, but no longer a seg fault, so I reduced TRAIN_ROIS_PER_IMAGE to 100; it can't handle much more than that. Now it trains. It looks like Mask R-CNN is fairly memory constrained on the 1080 Ti. I may need to add an extra card if I want to improve performance, since I could really benefit from turning up the settings that use more memory.
@jameschartouni which cuDNN version did you use? My CUDA is 9.0, and even after reducing TRAIN_ROIS_PER_IMAGE down to 2 I still run out of memory, so I want to match all of your versions and try again.
I'm using cuDNN 7.1
I'm using CUDA 9.0 and cuDNN 7.0.5. When I train the dataset from the shell, it says the source was compiled against version 7.0.3. I also tried training on the CPU, but it's too slow. If I want the process to be faster I need to change my hardware; I only have a GTX 950M with 2 GB.
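Side note for anyone else testing on a small card: a common way to force CPU-only training is to hide the GPU from TensorFlow before it is imported. A minimal sketch:

```python
# Hide the GPU so TensorFlow falls back to the CPU.
# This must run before tensorflow/keras is imported anywhere.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf
print(tf.test.is_gpu_available())  # expected to print False
```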
@jameschartouni Thank you! CUDA 9.0 + cuDNN 7.1 + tensorflow-gpu 1.8 solved the OOM, but a new issue has come up for me, and I guess it's about the TF version: Floating point exception (core dumped). See https://github.com/matterport/Mask_RCNN/issues/513. It's just so sensitive to the versions of CUDA and TensorFlow.
@Zico2017 Take a look at the link I posted earlier. I bet your issue is from using cuDNN 7.1 instead of the supported cuDNN 7.0.
I've been frequently getting this error. Do you think it could be related to the issues discussed here?
2018-05-06 11:41:25.919022: F ./tensorflow/core/util/cuda_launch_config.h:127] Check failed: work_element_count > 0 (0 vs. 0)
Aborted (core dumped)
@jameschartouni Hi! I also got this error before, but when I reinstalled everything it was solved, and I got detection results successfully after training on resnet101. I don't know exactly which part fixed it: CUDA 9.0.176, cuDNN 7.0.5, NVIDIA driver 384.111, and a virtual environment (not created via Anaconda) with tensorflow-gpu==1.5.0.
I can confirm that the driver setup above is the most stable. Thanks!
I was trying a bad config: CUDA 9, cuDNN 7, tensorflow-gpu 1.10, training from ImageNet weights on the COCO dataset. At the stage 4+ ("all layers") phase I was impressed when the job got killed at 150 GB of memory allocated. This was using two V100 cards, so I was looking into multiprocessing and the like, but this seems to be the most appropriate ticket. On machines without that quota I typically saw an OOM kill earlier.
This was with v1.0 of the git repo, with no modifications, using the command python3 coco.py train --model=imagenet --data=/path/to/coco/data
Tried again with a 64 GB memory allocation on a single K80, with tensorflow-gpu 1.5, and got the following output: https://pastebin.com/kGFHEY0K
This looked really interesting to me: https://github.com/tensorflow/models/issues/1817#issuecomment-325988741
Where is the file containing the BATCH_SIZE variable?
config.py or coco.py
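To expand on that: you don't set BATCH_SIZE directly. It is derived in config.py from GPU_COUNT and IMAGES_PER_GPU, roughly like this simplified excerpt:

```python
# Simplified excerpt of how config.py derives the effective batch size.
class Config(object):
    GPU_COUNT = 1        # number of GPUs to train on
    IMAGES_PER_GPU = 2   # images processed on each GPU per step

    def __init__(self):
        # Effective batch size = images per GPU times number of GPUs.
        self.BATCH_SIZE = self.IMAGES_PER_GPU * self.GPU_COUNT
```

So to change the effective batch size, override IMAGES_PER_GPU (and GPU_COUNT) in your config subclass rather than setting BATCH_SIZE itself.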
Has anyone tried this out?
https://github.com/tungld/tensorflow/blob/lms-contrib/tensorflow/contrib/lms/README.md
I am trying to train on my own dataset but get this error message:
2020-02-11 18:38:03.836262: W tensorflow/core/common_runtime/bfc_allocator.cc:271] ****
2020-02-11 18:38:03.836306: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at assign_op.h:117 : Resource exhausted: OOM when allocating tensor with shape[1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
2020-02-11 18:38:03.839397: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 16.94M (17760256 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-02-11 18:38:03.841625: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 16.94M (17760256 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
Please help.
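That log is a plain GPU out-of-memory error. Besides lowering IMAGES_PER_GPU, TRAIN_ROIS_PER_IMAGE, or the image resolution in your config, you can tell TensorFlow 1.x to allocate GPU memory on demand instead of reserving it all up front, which mainly helps when another process is already holding part of the GPU. A rough sketch using standard TF 1.x / Keras APIs (not something specific to this repo):

```python
# Let TensorFlow 1.x grow GPU memory on demand instead of reserving it all at once.
import tensorflow as tf
import keras.backend as K

tf_config = tf.ConfigProto()
tf_config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=tf_config))

# ...then build and train the Mask R-CNN model as usual.
```

Note that this only changes how memory is reserved; if the model genuinely doesn't fit on your card, you still have to shrink the config.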
I am trying to train the Mask R-CNN model with ImageNet weights on a custom dataset on a GTX 1080 Ti (11 GB) under Ubuntu 16.04. Regardless of what image resolution I use, the kernel runs out of memory and crashes whenever I call train(). I am using a batch size of one. Before I restart the kernel, I kill all Python processes to clear the GPU memory. What could be causing the memory issue? The demos run perfectly. I can also run all the inspect_data code on my dataset, and everything looks kosher. Posted below is my config output.