tensorflow / models

Models and examples built with TensorFlow

Error out of GPU memory during model training #9345

Open NAEE09 opened 4 years ago

NAEE09 commented 4 years ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/master/research/object_detection/model_main_tf2.py

2. Describe the bug

I want to train the Mask R-CNN Inception ResNet V2 1024x1024 model. My dataset is converted to a .record file, the pipeline config is set up, and the GPU works when training other models. I tried limiting the GPU memory (which also works when training other models), but the error still appears.

Error:

2020-10-06 12:10:44.322216: E tensorflow/stream_executor/cuda/cuda_driver.cc:825] failed to alloc 17179869184 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-10-06 12:10:44.322569: W ./tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 17179869184
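For scale, the size reported in the log can be converted to GiB. This is a minimal sketch in plain Python; the helper name `bytes_to_gib` is made up for illustration and is not part of TensorFlow.

```python
def bytes_to_gib(num_bytes):
    """Convert a byte count to GiB (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


failed_alloc = 17179869184  # size copied from the two log lines above
print(bytes_to_gib(failed_alloc))  # 16.0
```

Both log lines mention host memory ("on host", "pinned host memory"), so the failed request appears to be for 16 GiB of host RAM rather than for memory on the GPU itself.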

3. Steps to reproduce

from ~/models/research

python object_detection/model_main_tf2.py --pipeline_config_path=/home/robotronics/Projects/blm_Mask_RCNN/model_MaskRCNN/mask_rcnn_inception_resnet_v2_1024x1024_coco17_gpu-8/model.config --model_dir=/home/robotronics/Projects/blm_Mask_RCNN/blm/models/model --num_train_steps=5000 --sample_1_of_n_eval_examples=10 --alsologstostderr

4. Expected behavior

Complete training model

5. Additional context

I tried to limit the GPU memory in model_main_tf2.py and model_lib_v2.py:

import tensorflow as tfl

gpus = tfl.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    # Cap the first GPU at 1024 MB by exposing it as one logical device
    tfl.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tfl.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    logical_gpus = tfl.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be configured before GPUs have been initialized
    print(e)
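The comments in the snippet above mention memory growth, which is a different setting from a fixed memory limit. As a hedged sketch of that alternative, the following asks TensorFlow to allocate GPU memory on demand. The function name `enable_memory_growth` and the import guard are assumptions added for illustration. Since the logged error concerns pinned host memory, this setting may not resolve it.

```python
try:
    import tensorflow as tf
except ImportError:  # guard so the sketch can be read and run without TensorFlow
    tf = None


def enable_memory_growth():
    """Enable on-demand GPU allocation; return the number of GPUs configured."""
    if tf is None:
        return 0
    configured = 0
    for gpu in tf.config.list_physical_devices('GPU'):
        try:
            # Must be called before the GPU has been initialized
            tf.config.experimental.set_memory_growth(gpu, True)
            configured += 1
        except (RuntimeError, ValueError) as e:
            print(e)
    return configured


print(enable_memory_growth(), "GPU(s) set to grow memory on demand")
```

Memory growth and a virtual-device memory limit should not be combined on the same GPU, so this would replace the limit above rather than accompany it.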

I also ran the examples from the documentation (https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/auto_examples/plot_object_detection_checkpoint.html), and they work.

6. System information

NAEE09 commented 4 years ago

In addition, as I mentioned in the issue, I ran this example: https://github.com/tensorflow/models/blob/master/research/object_detection/colab_tutorials/eager_few_shot_od_training_tf2_colab.ipynb (not in Colab), and it worked. When I try the same with my own data and the Mask R-CNN Inception ResNet V2 1024x1024 model, the same error appears when I convert the dataset into an iterator. I can load the model configuration and build the model, but the error appears at this step, even though I limit the memory at the beginning of the program.

train_input = inputs.train_input(
    train_config=train_config,
    train_input_config=train_input_config,
    model_config=model_config,
    model=detection_model)

train_input = train_input.repeat()

input_iter = iter(train_input)
features, labels = next(input_iter)

gowthamkpr commented 4 years ago

Please take a look at this issue here and let me know if it helps. Thanks!

google-ml-butler[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

NAEE09 commented 4 years ago

@gowthamkpr I've looked at that issue, but the problem is that I have a different config file and I can't change the parameters they recommend. I solved the problem by reducing the image size to 300x600 and the mask size to 35x35 in the model config, but the results are pretty bad. Any advice? How can I optimize the memory usage as in the issue you mentioned?
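To put the resize in perspective, the two input sizes can be compared by pixel count. This is a rough sketch only: the helper `pixel_ratio` is hypothetical, and the idea that memory use scales roughly with the number of input pixels is an assumption, not something measured in this thread.

```python
def pixel_ratio(full, reduced):
    """Return how many times more pixels `full` has than `reduced`."""
    return (full[0] * full[1]) / (reduced[0] * reduced[1])


full = (1024, 1024)    # input size in the model name
reduced = (300, 600)   # size that avoided the error
print(round(pixel_ratio(full, reduced), 2))  # 5.83
```

The reduced input has about one sixth of the pixels of the 1024x1024 input, which may partly explain the drop in result quality.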
