tensorflow / models

Models and examples built with TensorFlow

Out Of Memory when training on Big Images #1817

Closed pjeambrun closed 6 years ago

pjeambrun commented 7 years ago

Out Of Memory when training on Big Images

System Information

Describe the Problem

I have successfully run the pets tutorial on this Google Compute Engine instance. When I train a Faster R-CNN ResNet-101 on my dataset (VOC format, 47 classes, image size: 1000/2000) with:

python object_detection/train.py --train_dir=data_xxxx --pipeline_config_path=data_xxxx/faster_rcnn_resnet101_pets.config

I get the following error at the beginning of the training:

2017-06-29 17:24:13.193833: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:7) -> (device: 7, name: Tesla K80, pci bus id: 0000:00:0b.0)
2017-06-29 17:24:15.414228: I tensorflow/core/common_runtime/simple_placer.cc:675] Ignoring device specification /device:GPU:0 for node 'prefetch_queue_Dequeue' because the input edge from 'prefetch_queue' is a reference connection and already has a device field set to /device:CPU:0
INFO:tensorflow:Restoring parameters from /home/ubuntu/models/data_xxxx/model.ckpt
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path data_doliprane/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
[1]    4359 killed     python object_detection/train.py --train_dir=data_xxxx 

I managed to avoid the OOM on this dataset by resizing all the images and annotation files (dividing the dimensions by 4).

I didn't modify the config file (only the number of classes and the paths), so the images should be resized to 600×1024 by the resizer in the config, and the OOM should not occur with the big images.
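For reference, the image_resizer block in the stock faster_rcnn_resnet101_pets.config should look roughly like this (quoted from memory, so treat the exact values as an assumption):

```
image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 600
    max_dimension: 1024
  }
}
```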

Is there a way I can train on my images without having to shrink them? Are there some parameters I can tune to avoid this problem?

veonua commented 4 years ago

I believe this bug is related: https://github.com/tensorflow/tensorflow/issues/33516

In dataset_builder.py I changed dataset.map(..., tf.data.experimental.AUTOTUNE) to dataset.map(..., num_parallel_calls), and the memory leak seems to be fixed.
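A rough sketch of that kind of change in a generic tf.data pipeline (the function and parameter names here are illustrative, not the actual dataset_builder.py code):

```python
import tensorflow as tf

def build_dataset(file_pattern, decode_fn, num_parallel_calls=8, batch_size=1):
    """Illustrative input pipeline with bounded map parallelism."""
    dataset = tf.data.Dataset.list_files(file_pattern, shuffle=True)
    dataset = dataset.interleave(
        tf.data.TFRecordDataset,
        cycle_length=4,
        num_parallel_calls=num_parallel_calls)
    # With tf.data.experimental.AUTOTUNE, tf.data may keep many decoded
    # high-resolution images in flight at once; a fixed num_parallel_calls
    # caps how many parallel map calls (and decoded images) exist at a time.
    dataset = dataset.map(decode_fn, num_parallel_calls=num_parallel_calls)
    dataset = dataset.batch(batch_size).prefetch(1)
    return dataset
```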