matterport / Mask_RCNN

Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow

Evaluating Various Datasets #240

Open canerozer opened 6 years ago

canerozer commented 6 years ago

TensorFlow Version: 1.5.0 Keras Version: 2.1.2 Python version: 3.5.2 GPU: NVIDIA 1080 GTX

Hello,

I am trying to extract the region-of-interest (ROI) outputs for various datasets with the provided model weights, but after around a thousand iterations I receive some errors. I don't think the issue is caused by the memory limits of my GPU, since this is a forward-pass-only task. I have to restart the evaluation afterwards, and the same thing happens again. There are no problems obtaining the final detection outputs on the same datasets.

This was the draft code that I wrote.

import os
import sys
import random
import math
import numpy as np
import skimage.io
import matplotlib
import matplotlib.pyplot as plt
import argparse
import coco
import utils
import model as modellib
import visualize
import time
import keras.backend as K
import tensorflow as tf

# Test some videos
parser = argparse.ArgumentParser(description='Test some videos.')
parser.add_argument('--test-dataset-dir', metavar='TD', type=str,
                    help='enter the test directory')
parser.add_argument('--image-extension', metavar='CDC', type=str,
                    default=".jpg", help="type the codec of images")

args = parser.parse_args()

config_keras = tf.ConfigProto()
# config_keras.gpu_options.allow_growth = True
config_keras.gpu_options.allocator_type = 'BFC'
K.set_session(tf.Session(config=config_keras))

# Root directory of the project
ROOT_DIR = os.getcwd()

# Directory to save logs and trained model
MODEL_DIR = os.path.join(ROOT_DIR, "logs")

# Local path to trained weights file
COCO_MODEL_PATH = os.path.join(ROOT_DIR, "mask_rcnn_coco.h5")
# Download COCO trained weights from Releases if needed
if not os.path.exists(COCO_MODEL_PATH):
    utils.download_trained_weights(COCO_MODEL_PATH)

# Directory of images to run detection on
IMAGE_DIR = os.path.join(args.test_dataset_dir)

frame_folder_names = os.listdir(IMAGE_DIR)
video_directories = []
video_names = []
for folder_name in frame_folder_names:
    assert os.path.isdir(os.path.join(IMAGE_DIR, folder_name)), (
        "The image directory should only contain folders")
    video_names.append(folder_name)
    video_directories.append(os.path.join(IMAGE_DIR, folder_name))

class InferenceConfig(coco.CocoConfig):
    # Set batch size to 1 since we'll be running inference on
    # one image at a time. Batch size = GPU_COUNT * IMAGES_PER_GPU
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

config = InferenceConfig()
config.display()

# Create model object in inference mode.
model = modellib.MaskRCNN(mode="inference", model_dir=MODEL_DIR, config=config)

# Load weights trained on MS-COCO
model.load_weights(COCO_MODEL_PATH, by_name=True)

# COCO Class names
# Index of the class in the list is its ID. For example, to get ID of
# the teddy bear class, use: class_names.index('teddy bear')
class_names = ['BG', 'person', 'bicycle', 'car', 'motorcycle', 'airplane',
               'bus', 'train', 'truck', 'boat', 'traffic light',
               'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird',
               'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear',
               'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie',
               'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
               'kite', 'baseball bat', 'baseball glove', 'skateboard',
               'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup',
               'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple',
               'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
               'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed',
               'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote',
               'keyboard', 'cell phone', 'microwave', 'oven', 'toaster',
               'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors',
               'teddy bear', 'hair drier', 'toothbrush']

def coco_to_voc_bbox_converter(y1, x1, y2, x2):
    w = x2 - x1
    h = y2 - y1
    return x1, y1, w, h

def to_rgb1(im):
    # Stack a grayscale (H, W) image into an (H, W, 3) RGB image.
    h, w = im.shape
    ret = np.empty((h, w, 3), dtype=np.uint8)
    ret[:, :, 0] = im
    ret[:, :, 1] = im
    ret[:, :, 2] = im
    return ret

# Start processing images for every frame in a particular folder_name.
# When enumerator hits the batch size number, the model will begin detection.
video_counter = 0

# Number of clipped refined anchors to be extracted per frame is limited to 50.
limit = 50
for video_id, video_dir in enumerate(video_directories):
    print("Video in Process: {}/{}".format(video_id+1, len(video_directories)))
    print("Video Name: {}".format(video_dir))

    image_list = []
    image_ids = os.listdir(video_dir)
    image_counter = 0

    # Sort the images in the folder and keep only the ones with the given extension
    sorted_image_ids = sorted(image_ids, key=lambda x: x[:-4])
    sorted_image_ids = list(filter(lambda x: args.image_extension in x,
                                   sorted_image_ids))

    for d, image_id in enumerate(sorted_image_ids):
        print (image_id)
        if(image_id[-4:] == args.image_extension):
            image = skimage.io.imread(os.path.join(video_dir, image_id))
            dims = image.shape

            # If the image is grayscale, convert it to RGB so the model gets 3 channels.
            if len(image.shape) == 2:
                image = to_rgb1(image)

            image_list.append(image)

            # Get the resize window, scale, and padding by using resize_image.
            m_image, window, scale, pad = utils.resize_image(image,
                                                             config.IMAGE_MIN_DIM,
                                                             config.IMAGE_MAX_DIM,
                                                             config.IMAGE_PADDING)

            # Roughly calculate the padding along each axis.
            aver_pad_y = (pad[0][0] + pad[0][1])/2
            aver_pad_x = (pad[1][0] + pad[1][1])/2

        if len(image_list) == config.BATCH_SIZE:
            print("Processed Frame ID: {}/{}".format(d+1,
                                                     len(sorted_image_ids)))

            # Code taken from the iPython file, to retrieve the top anchors.
            pillar = model.keras_model.get_layer("ROI").output
            results = model.run_graph(image_list, [
                ("rpn_class", model.keras_model.get_layer("rpn_class").output),
                ("proposals", model.keras_model.get_layer("ROI").output),
                ("refined_anchors_clipped",
                    model.ancestor(pillar, "ROI/refined_anchors_clipped:0")),
            ])

            r = results["refined_anchors_clipped"][0, :limit]
            scores = ((np.sort(results['rpn_class'][:, :, 1]
                               .flatten()))[::-1])[:limit]

            # Map the boxes from the resized/padded 1024x1024 frame back to dims[0] x dims[1].
            r = (r - np.array((aver_pad_y, aver_pad_x,
                               aver_pad_y, aver_pad_x)))/scale

            # Clears the image list after evaluation
            image_list.clear()

            with open(MODEL_DIR+"/"+video_names[video_id], 'a+') as f:
                for prop_id, proposals in enumerate(r):
                    y1, x1, y2, x2 = proposals
                    x, y, w, h = coco_to_voc_bbox_converter(y1, x1, y2, x2)
                    if x < 0:
                        x = 0
                    if y < 0:
                        y = 0
                    if x + w > dims[1]:
                        w = dims[1] - x
                    if y + h > dims[0]:
                        h = dims[0] - y
                    things_to_write = "{}\t{}\t{}\t{}\t{}\t{}\n".format(
                        prop_id+1, format(x, '.2f'), format(y, '.2f'),
                        format(w, '.2f'), format(h, '.2f'),
                        format(scores[prop_id], '.8f'))
                    f.write(things_to_write)
            print("")

At first, the program begins showing these warnings:


0246.jpg
Processed Frame ID: 246/1490
2018-02-07 23:20:36.850939: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 8.00MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2018-02-07 23:20:36.851319: W tensorflow/stream_executor/cuda/cuda_dnn.cc:2400] 
rpn_class                shape: (1, 261888, 2)        min:    0.00000  max:    1.00000
proposals                shape: (1, 1000, 4)          min:    0.00000  max:    1.00000
refined_anchors_clipped  shape: (1, 6000, 4)          min:    0.00

Then the warnings change to:

0260.jpg
Processed Frame ID: 260/1490
2018-02-07 23:20:50.712188: W tensorflow/stream_executor/cuda/cuda_dnn.cc:2400] 
rpn_class                shape: (1, 261888, 2)        min:    0.00000  max:    1.00000
proposals                shape: (1, 1000, 4)          min:    0.00000  max:    1.00000
refined_anchors_clipped  shape: (1, 6000, 4)          min:    0.00000  max: 1024.00000

And finally, I receive this error.

2018-02-07 23:21:21.264267: I tensorflow/core/common_runtime/bfc_allocator.cc:680] 2 Chunks of size 67108864 totalling 128.00MiB
2018-02-07 23:21:21.264274: I tensorflow/core/common_runtime/bfc_allocator.cc:680] 1 Chunks of size 134217728 totalling 128.00MiB
2018-02-07 23:21:21.264279: I tensorflow/core/common_runtime/bfc_allocator.cc:684] Sum Total of in-use chunks: 6.35GiB
2018-02-07 23:21:21.264288: I tensorflow/core/common_runtime/bfc_allocator.cc:686] Stats: 
Limit:                  6954539418
InUse:                  6819272448
MaxInUse:               6953454080
NumAllocs:                  881418
MaxAllocSize:           2225602560

2018-02-07 23:21:21.264806: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ***************************************************************************************************_
2018-02-07 23:21:21.265215: W tensorflow/core/framework/op_kernel.cc:1198] Resource exhausted: OOM when allocating tensor with shape[1,512,256,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1329, in _run_fn
    status, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,512,256,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[Node: rpn_model/rpn_conv_shared/convolution = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](fpn_p2/BiasAdd, rpn_conv_shared/kernel/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[Node: ROI/strided_slice_20/_31077 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3023_ROI/strided_slice_20", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "xtracting_rpn_results_2.py", line 147, in <module>
    ("refined_anchors_clipped", model.ancestor(pillar, "ROI/refined_anchors_clipped:0")),
  File "/home/user/Desktop/Mask_RCNN/model.py", line 2449, in run_graph
    outputs_np = kf(model_in)
  File "/usr/local/lib/python3.5/dist-packages/keras/backend/tensorflow_backend.py", line 2357, in __call__
    **self.session_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1128, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1344, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1363, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,512,256,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[Node: rpn_model/rpn_conv_shared/convolution = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](fpn_p2/BiasAdd, rpn_conv_shared/kernel/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[Node: ROI/strided_slice_20/_31077 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3023_ROI/strided_slice_20", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Caused by op 'rpn_model/rpn_conv_shared/convolution', defined at:
  File "xtracting_rpn_results_2.py", line 71, in <module>
    model = modellib.MaskRCNN(mode="inference", model_dir=MODEL_DIR, config=config)
  File "/home/user/Desktop/Mask_RCNN/model.py", line 1735, in __init__
    self.keras_model = self.build(mode=mode, config=config)
  File "/home/user/Desktop/Mask_RCNN/model.py", line 1835, in build
    layer_outputs.append(rpn([p]))
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/topology.py", line 603, in __call__
    output = self.call(inputs, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/topology.py", line 2061, in call
    output_tensors, _, _ = self.run_internal_graph(inputs, masks)
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/topology.py", line 2212, in run_internal_graph
    output_tensors = _to_list(layer.call(computed_tensor, **kwargs))
  File "/usr/local/lib/python3.5/dist-packages/keras/layers/convolutional.py", line 164, in call
    dilation_rate=self.dilation_rate)
  File "/usr/local/lib/python3.5/dist-packages/keras/backend/tensorflow_backend.py", line 3195, in conv2d
    data_format=tf_data_format)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 754, in convolution
    return op(input, filter)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 838, in __call__
    return self.conv_op(inp, filter)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 502, in __call__
    return self.call(inp, filter)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/nn_ops.py", line 190, in __call__
    name=self.name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 639, in conv2d
    data_format=data_format, dilations=dilations, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3160, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1625, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,512,256,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[Node: rpn_model/rpn_conv_shared/convolution = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](fpn_p2/BiasAdd, rpn_conv_shared/kernel/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[Node: ROI/strided_slice_20/_31077 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3023_ROI/strided_slice_20", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

In essence, my goal is to evaluate multiple sequences of images rather than a single image. To work around the problem, I have tried setting the allocator type to BFC, following a suggestion, and I evaluate with a batch size of 1. However, I suspect there may be an issue with garbage collection. Does anyone have a suggestion for solving this problem?
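
One direction I am considering (a rough, unverified sketch only, reusing the names defined in the script above; RESET_EVERY is just a placeholder threshold) is to periodically tear down the Keras/TensorFlow session and reload the weights, so that any state accumulated across frames is released:

import keras.backend as K
import tensorflow as tf

RESET_EVERY = 1000  # placeholder: number of frames to process before resetting

def make_model():
    """Rebuild the inference model from scratch and reload the COCO weights."""
    config_keras = tf.ConfigProto()
    config_keras.gpu_options.allow_growth = True  # allocate GPU memory lazily
    K.set_session(tf.Session(config=config_keras))
    model = modellib.MaskRCNN(mode="inference", model_dir=MODEL_DIR, config=config)
    model.load_weights(COCO_MODEL_PATH, by_name=True)
    return model

model = make_model()
for frame_idx, image_id in enumerate(sorted_image_ids):
    if frame_idx > 0 and frame_idx % RESET_EVERY == 0:
        K.clear_session()     # drop the old graph and its GPU allocations
        model = make_model()  # rebuild before continuing
    # ... run model.run_graph(...) on the current frame as before ...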

Best regards,

ygean commented 6 years ago

@dontgetdown I am running into this problem too. Have you found a solution for it?

canerozer commented 6 years ago

Not yet, except for restarting the TensorFlow session after the warnings appear, at around 1400 iterations.


canerozer commented 6 years ago

Today I checked the relevant code again, and I will soon try a different implementation from what I have now. I will also try running the code on different machines, possibly tomorrow.

ygean commented 6 years ago

@dontgetdown Thank you

canerozer commented 6 years ago

I did some research to get more insight into the problem.

The out-of-memory issue occurs when I run the graph with this section:

results = model.run_graph(..., ("refined_anchors_clipped", model.ancestor(pillar, "ROI/refined_anchors_clipped:0")))

Since the problem does not occur during training, I believe the issue is caused by the model.ancestor function.
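
As a test, something like the following (a rough sketch only, reusing model and the layer names above; I have not verified it) would hoist the tensor lookups and the Keras function construction that run_graph repeats on every call out of the per-frame loop, which may or may not affect the memory growth:

import keras.backend as K

# Look up the output tensors a single time.
pillar = model.keras_model.get_layer("ROI").output
fetches = [
    model.keras_model.get_layer("rpn_class").output,
    model.keras_model.get_layer("ROI").output,
    model.ancestor(pillar, "ROI/refined_anchors_clipped:0"),
]

inputs = list(model.keras_model.inputs)
if model.keras_model.uses_learning_phase and not isinstance(K.learning_phase(), int):
    inputs.append(K.learning_phase())
fetch_fn = K.function(inputs, fetches)  # built once, reused for every frame

def run_once(images):
    """Run the fixed fetches on one batch of images, mirroring run_graph's feeding."""
    molded_images, image_metas, _ = model.mold_inputs(images)
    model_in = [molded_images, image_metas]
    if model.keras_model.uses_learning_phase and not isinstance(K.learning_phase(), int):
        model_in.append(0.)  # inference mode
    rpn_class, proposals, refined_anchors_clipped = fetch_fn(model_in)
    return rpn_class, proposals, refined_anchors_clipped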

canerozer commented 6 years ago

One more thing: when I set the batch size to 4 for inference, I cannot get outputs for all 4 images in the batch from the nodes inside the ProposalLayer.

rpn_class                shape: (4, 261888, 2)        min:    0.00000  max:    1.00000  float32
rpn_bbox                 shape: (4, 261888, 4)        min:   -7.51282  max:   25.38355  float32
refined_anchors_clipped  shape: (1, 6000, 4)          min:    0.00000  max:    1.00000  float32
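
For reference, the batch size in this repo is derived as BATCH_SIZE = GPU_COUNT * IMAGES_PER_GPU, so a batch of 4 on one GPU would be configured roughly like this (a sketch; whether every node inside ProposalLayer actually emits per-image outputs at this batch size is exactly what seems to go wrong above):

class BatchInferenceConfig(coco.CocoConfig):
    # Batch size = GPU_COUNT * IMAGES_PER_GPU = 4
    GPU_COUNT = 1
    IMAGES_PER_GPU = 4

config = BatchInferenceConfig()
config.display()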