matterport / Mask_RCNN

Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow

Running out of Memory while Training #498

Open jameschartouni opened 6 years ago

jameschartouni commented 6 years ago

I am trying to train the Mask R-CNN model with ImageNet weights on a custom dataset, on a GTX 1080 Ti (11 GB) under Ubuntu 16.04. Regardless of what image resolution I use, the kernel runs out of memory and crashes whenever I call train(). I am using a batch size of one. Before I restart the kernel, I kill all Python processes to clear the GPU memory. What could be causing the memory issue? The demos run perfectly. I can also run all the inspect_data code on my dataset, and everything looks kosher. Posted below is my config output.

Configurations:
BACKBONE                       resnet101
BACKBONE_STRIDES               [4, 8, 16, 32, 64]
BATCH_SIZE                     1
BBOX_STD_DEV                   [0.1 0.1 0.2 0.2]
DETECTION_MAX_INSTANCES        100
DETECTION_MIN_CONFIDENCE       0.7
DETECTION_NMS_THRESHOLD        0.3
GPU_COUNT                      1
GRADIENT_CLIP_NORM             5.0
IMAGES_PER_GPU                 1
IMAGE_MAX_DIM                  512
IMAGE_META_SIZE                20
IMAGE_MIN_DIM                  512
IMAGE_MIN_SCALE                0
IMAGE_RESIZE_MODE              square
IMAGE_SHAPE                    [512 512   3]
LEARNING_MOMENTUM              0.9
LEARNING_RATE                  0.001
LOSS_WEIGHTS                   {'rpn_class_loss': 1.0, 'rpn_bbox_loss': 1.0, 'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.0}
MASK_POOL_SIZE                 14
MASK_SHAPE                     [28, 28]
MAX_GT_INSTANCES               100
MEAN_PIXEL                     [123.7 116.8 103.9]
MINI_MASK_SHAPE                (56, 56)
NAME                           Cathode Imaging
NUM_CLASSES                    8
POOL_SIZE                      7
POST_NMS_ROIS_INFERENCE        1000
POST_NMS_ROIS_TRAINING         2000
ROI_POSITIVE_RATIO             0.33
RPN_ANCHOR_RATIOS              [0.5, 1, 2]
RPN_ANCHOR_SCALES              (8, 16, 32, 64, 128)
RPN_ANCHOR_STRIDE              1
RPN_BBOX_STD_DEV               [0.1 0.1 0.2 0.2]
RPN_NMS_THRESHOLD              0.7
RPN_TRAIN_ANCHORS_PER_IMAGE    256
STEPS_PER_EPOCH                100
TRAIN_BN                       False
TRAIN_ROIS_PER_IMAGE           200
USE_MINI_MASK                  True
USE_RPN_ROIS                   True
VALIDATION_STEPS               5
WEIGHT_DECAY                   0.0001  
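For reference, the table above is what Config.display() prints. Below is a minimal sketch of how such a config is defined against this repo's API; the class name is hypothetical and only a few of the values are shown:

from mrcnn.config import Config   # in older versions of the repo: from config import Config

class CathodeConfig(Config):      # hypothetical name for the custom config
    NAME = "Cathode Imaging"
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1            # BATCH_SIZE = GPU_COUNT * IMAGES_PER_GPU = 1
    NUM_CLASSES = 8               # presumably background + 7 object classes
    IMAGE_MIN_DIM = 512
    IMAGE_MAX_DIM = 512
    RPN_ANCHOR_SCALES = (8, 16, 32, 64, 128)

config = CathodeConfig()
config.display()                  # prints a table like the one above
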
UM-Titan commented 6 years ago

Same problem here. It runs out of memory and crashes every time I try to train on a GTX 1080 Ti.

jameschartouni commented 6 years ago

Currently none of the demos work anymore. I can't train any matterport Mask R-CNN models. I can still train Keras models, so the issue appears to be specific to this repo.

In a new virtual environment I'm encountering the same problem, except that train_shapes now works. The model trains on the CPU; it only breaks in GPU mode.

Zico2017 commented 6 years ago

Same here. I also ran in GPU mode on a GTX 1080 Ti, with an image size of (448, 448, 3).

By the way, do you know of other config parameters that determine the model's memory footprint? E.g. changing BACKBONE from "resnet101" to "resnet50", or using a smaller TRAIN_ROIS_PER_IMAGE. I will try it tonight and update the result. Update: it doesn't work, so maybe it's about the CUDA or cuDNN version. My setup: tensorflow-gpu 1.6.0, CUDA 9.0, cuDNN 7.0.5, Ubuntu 16.04, GTX 1080 Ti.

f1ashine commented 6 years ago

Have you tried using NVIDIA GPU Cloud instead of your local GPU?

UM-Titan commented 6 years ago

@stonePJ I haven't tried the cloud yet. I am not sure why that would make a difference, though.

jameschartouni commented 6 years ago

This problem seems invariant to the parameters I choose. I don't think it's a GPU memory problem, since the train_shapes.ipynb demo also fails. I can also still perform inference. Have any of you tried Docker? Do you think this could be a driver or CUDA/cuDNN issue? The odd thing is that I managed a successful run yesterday morning and haven't been able to replicate it, so the problem can't be reproduced 100% of the time. And the error when running the script is a segfault.

rafihayne commented 6 years ago

I've personally encountered this when running CUDA 9.1 with TensorFlow 1.8. The TensorFlow build offered on PyPI is only known to work with CUDA 9.0: https://www.tensorflow.org/install/install_sources#tested_source_configurations

You can compile TensorFlow from source to work with 9.1 if you'd like, or you can downgrade to 9.0 and use the precompiled TensorFlow.

On Ubuntu 16.04 I had to run "apt-mark hold cuda-9-0", because each time I ran "apt upgrade" I would otherwise be upgraded to CUDA 9.1.
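For a quick sanity check of the CUDA/TensorFlow pairing, a small Python snippet (TF 1.x API) can confirm which TensorFlow build is installed and whether the GPU is visible at all; if the CUDA libraries the wheel was built against can't be loaded, the import usually fails or the GPU simply isn't listed:

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)                    # the PyPI 1.x wheels of this era expect CUDA 9.0
print(tf.test.is_built_with_cuda())      # True for tensorflow-gpu wheels
print(device_lib.list_local_devices())   # the GTX 1080 Ti should appear as /device:GPU:0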

UM-Titan commented 6 years ago

@rafihayne you are correct. I am able to train now with no memory problems after removing CUDA 8 and installing CUDA 9.0. I am also using cuDNN 7.1 for CUDA 9.0. Then install tensorflow-gpu==1.5. This should fix it. @jameschartouni let me know if it works for you.

jameschartouni commented 6 years ago

I downgraded from CUDA 9.1 to CUDA 9.0 and am still using TensorFlow 1.8. I still had a memory error, but no longer a segfault, so I reduced TRAIN_ROIS_PER_IMAGE down to 100; it can't handle anything much higher. Now it trains. It looks like Mask R-CNN is fairly memory-constrained on the 1080 Ti. I may need to add an extra card if I want to improve performance, since I could really benefit from turning up the config settings that use more memory.
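The knobs being reduced in this thread are all plain class attributes on the Config subclass. A rough sketch of the memory-related overrides discussed here, with illustrative values and a hypothetical class name:

from mrcnn.config import Config          # older repo versions: from config import Config

class LowMemConfig(Config):              # hypothetical name
    NAME = "cathode_lowmem"
    IMAGES_PER_GPU = 1
    NUM_CLASSES = 8
    TRAIN_ROIS_PER_IMAGE = 100           # reduced from 200, as described above
    IMAGE_MIN_DIM = 512
    IMAGE_MAX_DIM = 512
    # BACKBONE = "resnet50"              # optionally swap in a smaller backbone than resnet101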

Zico2017 commented 6 years ago

@jameschartouni what cuDNN version did you use? My CUDA is 9.0, and even after reducing TRAIN_ROIS_PER_IMAGE down to 2 I still run out of memory. So I want to match all my versions to yours and try again.

jameschartouni commented 6 years ago

I'm using cuDNN 7.1.

f1ashine commented 6 years ago

I'm using CUDA 9.0 and cuDNN 7.0.5. When I train the dataset from the command line, it says the source was compiled against version 7.0.3. I tried training in CPU mode, but it's too slow; if I want the process to be faster, I need to change my hardware. I only have a GTX 950M with 2 GB.

Zico2017 commented 6 years ago

@jameschartouni Thank you! With CUDA 9.0 + cuDNN 7.1 + tensorflow-gpu 1.8 the OOM is solved, but a new issue arises for me, which I guess is about the TensorFlow version: Floating point exception (core dumped). See https://github.com/matterport/Mask_RCNN/issues/513. It's just so sensitive to the versions of CUDA and TensorFlow.

rafihayne commented 6 years ago

@Zico2017 Take a look at the link I posted earlier. I bet your issue is that you're using cuDNN 7.1 rather than the supported cuDNN 7.0.

jameschartouni commented 6 years ago

I've been frequently getting this error. Do you think it could be related to the issues discussed here?

2018-05-06 11:41:25.919022: F ./tensorflow/core/util/cuda_launch_config.h:127] Check failed: work_element_count > 0 (0 vs. 0)
Aborted (core dumped)

Zico2017 commented 6 years ago

@jameschartouni Hi! I also got this error before, but when I reinstalled everything it was solved, and I got some detection results successfully after training on resnet101. I don't know which change did it: CUDA 9.0.176, cuDNN 7.0.5, NVIDIA driver 384.111, and a virtual environment (not built via Anaconda) with tensorflow-gpu==1.5.0.

jameschartouni commented 6 years ago

I can confirm that the driver setup above is the most stable. Thanks!

samhodge commented 6 years ago

I was trying with a bad config: CUDA 9, cuDNN 7, tensorflow-gpu 1.10, training from ImageNet weights on the COCO dataset.

At the stage that trains layers 4+ and then all layers, I was impressed when the job got killed at 150 GB of allocated memory. This was using two V100 cards, so I was looking into multiprocessing and the like, but this seems like the most appropriate ticket. On machines without that memory quota I typically saw an OOM kill before reaching that point.

This was with v1.0 of the git repo, with no modifications, using the command:

python3 coco.py train --model=imagenet --data=/path/to/coco/data

samhodge commented 6 years ago

I tried again with a 64 GB memory allocation on a single K80 and got the following output: https://pastebin.com/kGFHEY0K

with tensorflow-gpu 1.5

samhodge commented 6 years ago

This looked really interesting to me: https://github.com/tensorflow/models/issues/1817#issuecomment-325988741
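For anyone experimenting along those lines, one session-level option often suggested in TF 1.x OOM threads is to stop TensorFlow from pre-allocating the whole GPU up front. A minimal sketch for the Keras backend this repo uses; note this only changes how memory is allocated, not the model's peak requirement:

import tensorflow as tf
import keras.backend as K

tf_config = tf.ConfigProto()
tf_config.gpu_options.allow_growth = True     # grab GPU memory on demand instead of all at once
K.set_session(tf.Session(config=tf_config))   # set the session before building the Mask R-CNN model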

darvida commented 6 years ago

Where is the file containing the 'BATCH_SIZE' variable?

samhodge commented 6 years ago

config.py or coco.py
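More precisely, BATCH_SIZE is not meant to be set directly: in mrcnn/config.py it is computed in Config.__init__, roughly as sketched below (paraphrased, not a verbatim copy), so the way to change it is to override IMAGES_PER_GPU and/or GPU_COUNT in your own subclass (for example CocoConfig in coco.py).

class Config(object):
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1            # defaults shown here are illustrative

    def __init__(self):
        # The effective batch size is derived from the two attributes above
        self.BATCH_SIZE = self.IMAGES_PER_GPU * self.GPU_COUNT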

samhodge commented 5 years ago

Has anyone tried this out?

https://github.com/tungld/tensorflow/blob/lms-contrib/tensorflow/contrib/lms/README.md

Manishsinghrajput98 commented 4 years ago

I am trying to train on my own dataset, but I get this error message:

2020-02-11 18:38:03.836262: W tensorflow/core/common_runtime/bfc_allocator.cc:271] ****
2020-02-11 18:38:03.836306: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at assign_op.h:117 : Resource exhausted: OOM when allocating tensor with shape[1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
2020-02-11 18:38:03.839397: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 16.94M (17760256 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-02-11 18:38:03.841625: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 16.94M (17760256 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

Please help.