Open Nomia opened 3 years ago
@Nomia Can you please share details of GPU system you are using?
OS Platform and Distribution:
TensorFlow installed from:
TensorFlow version:
Python version:
Installed using virtualenv? pip? conda?:
Bazel version:
GCC/Compiler version (if compiling from source):
CUDA/cuDNN version:
GPU model and memory:
Did you check with any other model? Are you noticing same OOM issue with other simple models? Thanks!
@Nomia Can you please share details of GPU system you are using?
For this question: I've already provided that; see "6. System information" above.
Did you check with any other model? Are you noticing same OOM issue with other simple models? Thanks!
For this question: no, I haven't.
@Nomia Can you please try running any one of the examples shown on this page: https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/auto_examples/index.html
@Nomia Have you tried running this code on Google Colab? If not, please test whether you get the same error there.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
1. The entire URL of the file you are using
what is this????
2. Describe the bug
I was following this guide (https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html) to try object detection with TensorFlow GPU on my Windows 10 computer. Everything worked fine until I started training with this command:
python model_main_tf2.py --model_dir=models/my_ssd_resnet50_v1_fpn --pipeline_config_path=models/my_ssd_resnet50_v1_fpn/pipeline.config
The console gives me a 'CUDA_ERROR_OUT_OF_MEMORY', followed by a lot of similar errors:
The solutions I tried:
  1. Changed my batch_size to 1. This had no effect; still OOM.
  2. Set set_memory_growth to true. This had some effect: memory usage increased slowly, but training still ended with the same CUDA_ERROR_OUT_OF_MEMORY, OOM, etc.
  3. Tried tf.config.experimental.set_virtual_device_configuration(gpu, [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)]). I got the same result...
Here is a screenshot that might give some important information:
As you can see in the screenshot, memory usage was only 50% at that time, yet the program still reported the OOM error, and no other program was competing for the memory (nvidia-smi -l).
3. Steps to reproduce
Just follow this tutorial exactly: https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html
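For anyone trying to reproduce this, the two GPU memory settings mentioned in section 2 could be applied roughly as follows. This is only a sketch: the helper name `configure_gpu_memory` and its `memory_limit_mb` parameter are made up for illustration and are not part of the tutorial or of `model_main_tf2.py`. As far as I know, the two settings cannot be combined on the same GPU, and either one has to run before TensorFlow initializes the GPU.

```python
# Sketch only: apply ONE of the two GPU memory settings described in section 2.
# Must run before any TensorFlow op or model touches the GPU.
try:
    import tensorflow as tf
except ImportError:  # allows the sketch to be imported on machines without TensorFlow
    tf = None


def configure_gpu_memory(memory_limit_mb=None):
    """Hypothetical helper; returns the names of the GPUs it configured.

    memory_limit_mb=None -> enable memory growth on every visible GPU
    memory_limit_mb=4096 -> cap every visible GPU at a 4096 MB virtual device
    """
    if tf is None:
        return []
    configured = []
    for gpu in tf.config.list_physical_devices("GPU"):
        if memory_limit_mb is None:
            # Allocate GPU memory on demand instead of reserving it all up front.
            tf.config.experimental.set_memory_growth(gpu, True)
        else:
            # Expose the GPU as one virtual device with a hard memory cap.
            tf.config.experimental.set_virtual_device_configuration(
                gpu,
                [tf.config.experimental.VirtualDeviceConfiguration(
                    memory_limit=memory_limit_mb)],
            )
        configured.append(gpu.name)
    return configured
```

Calling `configure_gpu_memory()` corresponds to attempt 2 and `configure_gpu_memory(memory_limit_mb=4096)` to attempt 3. An empty return value means TensorFlow did not see any GPU, which would be worth reporting as well.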
4. Expected behavior
Train on my data and show the loss and accuracy.
5. Additional context
Some screenshots that might be helpful:
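The GPU memory reading shown in a screenshot could also be captured as text, which is easier to paste into an issue. Below is a minimal sketch that assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output. The function names `parse_memory_csv` and `gpu_memory_usage` are hypothetical.

```python
import subprocess

# nvidia-smi query that prints one "used, total" line (in MiB) per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Parse lines such as '4096, 8192' into (used_mib, total_mib) tuples."""
    readings = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        readings.append((used, total))
    return readings


def gpu_memory_usage():
    """Run nvidia-smi once and return one (used_mib, total_mib) tuple per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory_csv(out)
```

Printing `gpu_memory_usage()` just before and just after the failing training command would show whether the card is really only half full when the OOM error is raised.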
6. System information