8 k80 GPU configuration Resource exhausted: OOM when allocating tensor with shape[1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

Adblu commented 4 years ago

Hi, what is a good configuration for efficient training? I am using a p2.8xlarge. My dataset contains 7500 training images and 1500 test images at a resolution of 1600x1600.

I set:

GPU_COUNT = 8, 
IMAGES_PER_GPU = 1
NUM_CLASSES = 1 + 1 
STEPS_PER_EPOCH = 100

and I get:

Resource exhausted: OOM when allocating tensor with shape[1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

For some reason, when I open nvidia-smi I can see that it tries to put everything on one card, GPU_0.

What can I do now?

Even though I have set GPU_COUNT to 8, this is what I get:

BACKBONE                       resnet101
BACKBONE_STRIDES               [4, 8, 16, 32, 64]
BATCH_SIZE                     4
BBOX_STD_DEV                   [0.1 0.1 0.2 0.2]
COMPUTE_BACKBONE_SHAPE         None
DETECTION_MAX_INSTANCES        100
DETECTION_MIN_CONFIDENCE       0.9
DETECTION_NMS_THRESHOLD        0.3
FPN_CLASSIF_FC_LAYERS_SIZE     1024
GPU_COUNT                      1
GRADIENT_CLIP_NORM             5.0
IMAGES_PER_GPU                 4
IMAGE_CHANNEL_COUNT            3
IMAGE_MAX_DIM                  1024
IMAGE_META_SIZE                14
IMAGE_MIN_DIM                  800
IMAGE_MIN_SCALE                0
IMAGE_RESIZE_MODE              square
IMAGE_SHAPE                    [1024 1024 3]
LEARNING_MOMENTUM              0.9
LEARNING_RATE                  0.001
LOSS_WEIGHTS                   {'rpn_class_loss': 1.0, 'rpn_bbox_loss': 1.0, 'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.0}
MASK_POOL_SIZE                 14
MASK_SHAPE                     [28, 28]
MAX_GT_INSTANCES               100
MEAN_PIXEL                     [123.7 116.8 103.9]
MINI_MASK_SHAPE                (56, 56)
NAME                           damage
NUM_CLASSES                    2
POOL_SIZE                      7
POST_NMS_ROIS_INFERENCE        1000
POST_NMS_ROIS_TRAINING         2000
PRE_NMS_LIMIT                  6000
ROI_POSITIVE_RATIO             0.33
RPN_ANCHOR_RATIOS              [0.5, 1, 2]
RPN_ANCHOR_SCALES              (32, 64, 128, 256, 512)
RPN_ANCHOR_STRIDE              1
RPN_BBOX_STD_DEV               [0.1 0.1 0.2 0.2]
RPN_NMS_THRESHOLD              0.7
RPN_TRAIN_ANCHORS_PER_IMAGE    256
STEPS_PER_EPOCH                100
TOP_DOWN_PYRAMID_SIZE          256
TRAIN_BN                       False
TRAIN_ROIS_PER_IMAGE           200
USE_MINI_MASK                  True
USE_RPN_ROIS                   True
VALIDATION_STEPS               50
WEIGHT_DECAY                   0.0001

GPU_COUNT is still 1.
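One thing worth checking: the printed config reports GPU_COUNT 1 and IMAGES_PER_GPU 4, and neither matches the values set above, so the config object being displayed may not be the one that carries the overrides. Below is a minimal, self-contained sketch of that check. `BaseConfig`, `DamageConfig`, `TrailingCommaConfig` and `find_config_problems` are hypothetical names written for illustration, not the library's classes; they only mimic the pattern visible in the dump, where BATCH_SIZE equals IMAGES_PER_GPU * GPU_COUNT. The sketch also covers a plain Python pitfall: if the trailing comma in `GPU_COUNT = 8,` is present in the real code, the value becomes the tuple `(8,)` rather than the integer 8.

```python
class BaseConfig:
    """Hypothetical stand-in for a config base class (not the library's own)."""
    NAME = None
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1
    NUM_CLASSES = 1
    STEPS_PER_EPOCH = 1000

    def __init__(self):
        # Derived value, mirroring BATCH_SIZE = IMAGES_PER_GPU * GPU_COUNT in the dump.
        self.BATCH_SIZE = self.IMAGES_PER_GPU * self.GPU_COUNT


class DamageConfig(BaseConfig):
    """Overrides are class attributes on a subclass; this subclass must be the one instantiated."""
    NAME = "damage"
    GPU_COUNT = 8            # no trailing comma
    IMAGES_PER_GPU = 1
    NUM_CLASSES = 1 + 1      # background + 1 class
    STEPS_PER_EPOCH = 100


class TrailingCommaConfig(BaseConfig):
    """Shows the pitfall: the trailing comma turns the value into a tuple."""
    GPU_COUNT = 8,           # evaluates to (8,), not 8


def find_config_problems(config, expected_gpu_count):
    """Return a list of human-readable problems found in a config instance."""
    problems = []
    for key in ("GPU_COUNT", "IMAGES_PER_GPU"):
        value = getattr(config, key)
        if not isinstance(value, int):
            problems.append(f"{key} is {value!r}, expected an int")
    if config.GPU_COUNT != expected_gpu_count:
        problems.append(
            f"GPU_COUNT is {config.GPU_COUNT!r}, expected {expected_gpu_count}"
        )
    return problems


if __name__ == "__main__":
    for cls in (BaseConfig, DamageConfig, TrailingCommaConfig):
        cfg = cls()
        print(cls.__name__, "BATCH_SIZE =", cfg.BATCH_SIZE,
              "problems =", find_config_problems(cfg, expected_gpu_count=8))
```

If the instance passed to the model is the base class, or a subclass whose overrides were never picked up, a check like this flags it before training starts.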

truongtd6285 commented 4 years ago

Maybe this helps: https://github.com/matterport/Mask_RCNN/wiki

burhr2 commented 4 years ago

Hi! Kindly see issue #2312 (https://github.com/matterport/Mask_RCNN/issues/2312), which has a pointer to the same implementation with support for TensorFlow 2 and later. If this solves your problem, kindly close the issue so that others can navigate to other issues more easily.

xxxming730 commented 2 years ago

Hello, I am trying to train with 4 RTX A6000 (48G) cards. When I set gpu_count=4 and Images_per_gpu=8, a memory overflow error occurs: the video memory of the first card is almost full, while the other three only use 15G, 13G, and 13G. With gpu_count=1 and Images_per_gpu=8 there is no problem at all and training completes smoothly. I really don't know what the reason is. Can you help me? Thank you very much!


matterport / Mask_RCNN

8 k80 GPU configuration Resource exhausted: OOM when allocating tensor with shape[1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc #2149