tensorflow / models

Models and examples built with TensorFlow

object detection stuck at CUDA_ERROR_OUT_OF_MEMORY, tried every solution, not working #10122

Open Nomia opened 3 years ago

Nomia commented 3 years ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

1. The entire URL of the file you are using

what is this????

2. Describe the bug

I was following this guide (https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html) to try object detection with TensorFlow GPU on my Windows 10 computer. Everything worked fine until I started training with this command: python model_main_tf2.py --model_dir=models/my_ssd_resnet50_v1_fpn --pipeline_config_path=models/my_ssd_resnet50_v1_fpn/pipeline.config

The console gives me a 'CUDA_ERROR_OUT_OF_MEMORY', followed by a lot of similar errors.

The solutions I tried:

  1. Changed batch_size to 1: no effect, still OOM.
  2. Set set_memory_growth to True: this had some effect, memory usage grew slowly, but it ended in the same error (CUDA_ERROR_OUT_OF_MEMORY, OOM, etc.).
  3. Tried tf.config.experimental.set_virtual_device_configuration(gpu, [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)]): same result.
  ...
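For reference, a minimal sketch of how the memory-growth and memory-limit options are usually applied, assuming TensorFlow 2.x. Either setting has to run before TensorFlow initializes the GPU (before any op or model is created); otherwise TensorFlow raises a RuntimeError. `configure_gpus` is a hypothetical helper name, and the sketch applies one option or the other per GPU, not both.

```python
def configure_gpus(memory_limit_mb=None):
    """Enable memory growth, or a hard per-GPU cap if memory_limit_mb is given.

    Call this before anything else touches the GPU.
    Returns the number of GPUs that were configured.
    """
    import tensorflow as tf  # imported here to keep the sketch self-contained

    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        if memory_limit_mb is None:
            # Allocate GPU memory on demand instead of grabbing it all up front.
            tf.config.experimental.set_memory_growth(gpu, True)
        else:
            # Expose one virtual device capped at memory_limit_mb megabytes.
            tf.config.experimental.set_virtual_device_configuration(
                gpu,
                [tf.config.experimental.VirtualDeviceConfiguration(
                    memory_limit=memory_limit_mb)],
            )
    return len(gpus)
```

It would be called at the very top of the training script, for example `configure_gpus()` or `configure_gpus(memory_limit_mb=4096)`. If it returns 0, TensorFlow does not see a GPU at all, which points to a different problem than OOM.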

Here is a screenshot that might give some important information: image

As you can see in the screenshot, memory usage was only about 50% at that time, yet the program still reported the OOM error, and no other program was competing for GPU memory (checked with nvidia-smi -l).
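The memory figure can also be read from a script rather than a screenshot. A rough sketch, assuming `nvidia-smi` is on the PATH; `parse_gpu_memory` and `query_gpu_memory` are hypothetical helper names, not part of any library.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs (one line per GPU) into dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append({
            "used_mib": used,
            "total_mib": total,
            "used_pct": round(100 * used / total, 1),
        })
    return rows


def query_gpu_memory():
    """Run nvidia-smi once and return parsed memory usage per GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(out)
```

Calling `query_gpu_memory()` right before and during training gives numbers that can be pasted into the issue as text. A single reading only shows usage at that instant, so it may miss a short allocation spike.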

3. Steps to reproduce

Just follow this tutorial exactly: https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html

4. Expected behavior

Train on my data and show the loss and accuracy.

5. Additional context

Some screenshots that might be helpful: image, image (1), image (2)

6. System information

jvishnuvardhan commented 3 years ago

@Nomia Can you please share details of GPU system you are using?

OS Platform and Distribution:
TensorFlow installed from: 
TensorFlow version: 
Python version: 
Installed using virtualenv? pip? conda?: 
Bazel version: 
GCC/Compiler version (if compiling from source): 
CUDA/cuDNN version: 
GPU model and memory: 

Did you check with any other model? Are you noticing same OOM issue with other simple models? Thanks!
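A tiny model along these lines could serve as that check. This is only a sketch assuming TensorFlow 2.x; `run_tiny_model` is a hypothetical helper, and the layer sizes and random data are arbitrary.

```python
def run_tiny_model(steps=5, batch_size=8):
    """Train a very small dense model on random data; return the final loss."""
    import tensorflow as tf  # imported here to keep the sketch self-contained

    n = steps * batch_size
    x = tf.random.uniform((n, 16))  # random inputs, 16 features each
    y = tf.random.uniform((n, 1))   # random regression targets

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(16,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="sgd", loss="mse")
    history = model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0)
    return float(history.history["loss"][-1])
```

If this also fails with CUDA_ERROR_OUT_OF_MEMORY, the problem is likely in the GPU setup rather than in the object detection model or its batch size.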

Nomia commented 3 years ago

> @Nomia Can you please share details of GPU system you are using?

I've already provided that; you can check it under 6. System information above.

> Did you check with any other model? Are you noticing same OOM issue with other simple models? Thanks!

No, I haven't.

jvishnuvardhan commented 3 years ago

@Nomia Can you please try running any one of the examples shown on this page? https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/auto_examples/index.html

mayureshagashe2105 commented 3 years ago

@Nomia Have you tried running this code on Google Colab? If not, please test whether you get the same error there.