tensorflow / models

Models and examples built with TensorFlow

Memory Leak Training Faster-RCNN (Resnet 50) #8621

Open patricksansom opened 4 years ago

patricksansom commented 4 years ago

Prerequisites

I am using TF 1.15.2, as the TF2 version of the research/object_detection models is not yet released...

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/master/research/object_detection/model_main.py

Invocation: python /home/ec2-user/SageMaker/models/research/object_detection/model_main.py --pipeline_config_path=/home/ec2-user/SageMaker/fuego-train/faster_rcnn_resnet50.config --model_dir=/home/ec2-user/SageMaker/data/model --num_train_steps=5800 --alsologtostderr

I am training the Faster R-CNN ResNet-50 model from the faster_rcnn_resnet50_coco_2018_01_28.tar.gz checkpoint on a 48-CPU machine with 192GB memory (no GPU).

Model config uploaded (as .txt file)... faster_rcnn_resnet50.config.txt

2. Describe the bug

My training runs never reach a stable memory state. Memory use keeps increasing all the way to 192GB and then the job fails.

I get through 360 steps (batch size 1) in 10 min, then a checkpoint and eval, and then some more steps. Training appears to be progressing OK, e.g. loss is decreasing. However, memory use keeps growing until the run eventually fails as it nears 192GB (runs sometimes fail on allocation or just crash).
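To put a number on the growth described above, memory can be sampled at intervals and a line fitted through the samples. The following is only a minimal sketch and not part of model_main.py: the helper names `rss_bytes` and `growth_per_step` are hypothetical, and `rss_bytes` assumes a Linux host where `/proc/self/statm` is available.

```python
import os


def rss_bytes():
    """Resident set size of the current process in bytes (Linux only)."""
    with open("/proc/self/statm") as f:
        # Second field of statm is the number of resident pages.
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def growth_per_step(samples):
    """Least-squares slope of memory against step.

    `samples` is a list of (step, bytes) pairs. A slope close to zero
    suggests memory has stabilised; a steady positive slope suggests a leak.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_step = sum(step for step, _ in samples) / n
    mean_mem = sum(mem for _, mem in samples) / n
    cov = sum((step - mean_step) * (mem - mean_mem) for step, mem in samples)
    var = sum((step - mean_step) ** 2 for step, _ in samples)
    if var == 0:
        raise ValueError("samples must cover more than one step value")
    return cov / var
```

Calling `rss_bytes()` every N steps and passing the collected pairs to `growth_per_step` gives an approximate bytes-per-step figure that can be compared between models or configs.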

3. Steps to reproduce

Memory leaks consistently on every run.

4. Expected behaviour

Expect to reach a stable memory requirement during training.

5. Additional context

log attached (lots of TF warnings)... train.log

tfrecord dataset is 2GB - please advise if you want me to provide this.

The rcnn resnet50 weights can be downloaded from the object_detection model zoo...

Note: Images are quite large - 1500x2000, e.g. (Box in filename is the single object bounding box within each image)
-rw-rw-r-- 1 ec2-user ec2-user 458461 Jun 2 12:49 69bravo-e-mobo-c2019-08-13T14_21_44_Box_160x1092x313x1218.jpg
-rw-rw-r-- 1 ec2-user ec2-user 460998 Jun 2 12:49 69bravo-e-mobo-c2019-08-13T14_22_44_Box_156x1082x328x1205.jpg
-rw-rw-r-- 1 ec2-user ec2-user 451599 Jun 2 12:49 69bravo-e-mobo-c2019-08-13T14_26_44_Box_175x1044x344x1220.jpg
-rw-rw-r-- 1 ec2-user ec2-user 465679 Jun 2 12:49 69bravo-e-mobo-c2019-08-13T14_27_44_Box_137x1051x382x1245.jpg
...

These images have been prepared as sharded tfrecords:
-rw-rw-r-- 1 ec2-user ec2-user 35 Jun 4 01:40 smoke_label_map.pbtxt
-rw-rw-r-- 1 ec2-user ec2-user 46713957 Jun 4 01:39 smoke_train.record-00000-of-00050
-rw-rw-r-- 1 ec2-user ec2-user 42074074 Jun 4 01:39 smoke_train.record-00001-of-00050
-rw-rw-r-- 1 ec2-user ec2-user 42461823 Jun 4 01:39 smoke_train.record-00002-of-00050
-rw-rw-r-- 1 ec2-user ec2-user 41608315 Jun 4 01:39 smoke_train.record-00003-of-00050
-rw-rw-r-- 1 ec2-user ec2-user 41683135 Jun 4 01:39 smoke_train.record-00004-of-00050
...

6. System information

patricksansom commented 4 years ago

FYI...

In the same environment, training the ssd resnet50 model (checkpoint ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03.tar.gz) on my dataset seems to run fine: memory use stabilises at 70GB with a batch size of 64, as specified in the config file.

Regards, Patrick.

patricksansom commented 4 years ago

faster_rcnn_resnet101_coco_2018_01_28 also has the steadily increasing memory issue running against my dataset.

One thought: Could this be related to large variable image sizes? Should I resize images to a fixed size on input? If so what size would you recommend given I am working with large images?
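For reference when thinking about input sizes, here is a rough sketch of how an aspect-preserving resizer such as the Object Detection API's `keep_aspect_ratio_resizer` maps an image to the size the model actually sees. The defaults `min_dimension=600` and `max_dimension=1024` are an assumption taken from the stock Faster R-CNN sample configs, not from the attached config, and the function below is a simplified stand-in, not the library's implementation.

```python
def resize_to_range(height, width, min_dimension=600, max_dimension=1024):
    """Return (new_height, new_width) after an aspect-preserving resize.

    The short side is scaled to min_dimension unless that would push the
    long side past max_dimension, in which case the long side is scaled
    to max_dimension instead.
    """
    short_side = min(height, width)
    long_side = max(height, width)
    scale = min_dimension / short_side
    if round(long_side * scale) > max_dimension:
        scale = max_dimension / long_side
    return round(height * scale), round(width * scale)


# Under the assumed 600/1024 settings, a 1500x2000 image becomes 600x800.
print(resize_to_range(1500, 2000))
```

Under these assumed settings every 1500x2000 image would already be reduced to the same 600x800 shape before reaching the network, so the values in the `image_resizer` block of the pipeline config are worth checking before resizing the source images themselves.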

Regards, Patrick

mgultekin commented 4 years ago

Hi Patrick, can you give me an idea of how to use the TensorFlow API on SageMaker? I have been looking for days but couldn't find any example of that. I know how to do classification on SageMaker, but I am confused about using the TensorFlow API. I can do the same thing as I do in Google Colab, but I don't understand how to specify the GPU and instance in SageMaker as I do for classification. I would appreciate it if you could help me.