matterport / Mask_RCNN

Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow

Optimizing inference time #2227

Open HjalteMann opened 4 years ago

HjalteMann commented 4 years ago

I am using the Mask RCNN network to detect a single object class in time-lapse images. I am running the detection on a single RTX2080 GPU, on which I have also done the training. The backbone is ResNet101. I have tried ResNet50 but the detections produced are not very good.

I am experiencing the issue that inference is quite slow. Inference for the first image usually takes more than 4 minutes, and each subsequent image takes around 15 seconds. I need to run detection on a very large number of images (hundreds of thousands), so I would really like to speed this up.

The images I am processing are 6080x3420 pixels and around 5 MB each.

Is there any way to speed up the inference time? I have formulated some more specific questions below.
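A side note on the 4-minute first image: that is almost certainly one-off startup cost (TensorFlow graph construction and CUDA kernel initialization on the first detect() call), so it should be excluded when estimating per-image throughput. A minimal sketch (the helper name is mine):

```python
def mean_time_excluding_warmup(durations):
    """Mean per-image time, skipping the first call.

    The first detect() call includes one-off graph construction and
    CUDA initialization, so it is not representative of steady-state
    throughput. Returns None if there is nothing after the warm-up.
    """
    if len(durations) < 2:
        return None
    return sum(durations[1:]) / (len(durations) - 1)

# Example: a 240 s warm-up followed by ~15 s steady-state images.
timings = [240.0, 15.0, 14.0, 16.0]
print(mean_time_excluding_warmup(timings))  # 15.0
```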

  1. Is the inference time dependent on the size of the image? That is, will inference be faster if I run it on downscaled images? I am asking because, to my knowledge, the images are downscaled in the script itself, right? So I am not sure there would be any gain in downscaling the images beforehand, since that also takes time.

  2. If inference is faster on downscaled images, will I also need to train on images of the same downscaled dimensions, given that the images are downscaled during training anyway?

  3. The inference time is dependent on the number of objects that are detected in an image. Is there any way to speed this up? I have some images that contain 40-60 objects, and the inference can take around a minute for these.

  4. I read somewhere that the detections are done on the CPU. Is that right? I have monitored the CPU usage while running the detections and I see no increase there. When monitoring the GPU I do see an increase, but only short peaks of 15-20% (I assume each time an image is processed) and then back to zero.

  5. For this project, I am actually not using the masks, just the bounding boxes. Is there any way of exploiting this to speed up inference? I would like to still use the Mask RCNN network and not the Faster RCNN, but maybe there is something I could comment out in the code?
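One common lever for the utilization pattern described in question 4 is batching: feeding several images per detect() call keeps the GPU busier. A hedged sketch of Mask R-CNN's batching rule (the attribute names mirror mrcnn.config, but this stand-in class is mine, not the repo's code):

```python
# In Mask R-CNN, BATCH_SIZE = GPU_COUNT * IMAGES_PER_GPU, and
# model.detect() must be given exactly BATCH_SIZE images per call.
class InferenceConfig:
    GPU_COUNT = 1
    IMAGES_PER_GPU = 4  # batch several images per forward pass; limited by GPU memory

    @property
    def BATCH_SIZE(self):
        # Derived exactly as in mrcnn.config.Config
        return self.GPU_COUNT * self.IMAGES_PER_GPU

cfg = InferenceConfig()
print(cfg.BATCH_SIZE)  # 4

# Images would then be fed in chunks of BATCH_SIZE:
# for i in range(0, len(test_ids), cfg.BATCH_SIZE):
#     results = model.detect(images[i:i + cfg.BATCH_SIZE], verbose=0)
```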

The code I use for the detection is given below:

def detect(model, dataset_dir, subset):
    print("Running on {}".format(dataset_dir))

    # Create directory
    if not os.path.exists(RESULTS_DIR):
        os.makedirs(RESULTS_DIR)
    submit_dir = "submit_{:%Y%m%dT%H%M%S}".format(datetime.datetime.now())
    submit_dir = os.path.join(RESULTS_DIR, submit_dir)
    os.makedirs(submit_dir)

    # Read dataset
    dataset = DrDataset()
    dataset.load_dr(dataset_dir, subset)
    dataset.prepare()
    # Loop over images
    submission = []
    boxes = []
    test_dir = os.path.join(dataset_dir, subset)
    test_ids = next(os.walk(test_dir))[2]

    # Processing times
    tot_start = time.time()
    print("Length of test_ids", len(test_ids))
    proc_times = []
    img_ids = []

    for image_id in test_ids:
        start = time.time()
        print(image_id)
        # Load image and run detection
        path_to_image = os.path.join(test_dir, image_id)
        image = skimage.io.imread(path_to_image)
        # Detect objects
        r = model.detect([image], verbose=0)[0]
        # Encode image to RLE. Returns a string of multiple lines
        source_id = image_id
        rle = mask_to_rle(source_id, r["masks"], r["scores"])
        submission.append(rle)

        # Append bounding boxes (y1, x1, y2, x2) to list
        box = utils.extract_bboxes(r["masks"])
        box = ','.join(str(v) for v in box.flatten())
        boxes.append(image_id + ", " + box)

        # Save image with masks

        visualize.display_instances(
            image, r['rois'], r['masks'], r['class_ids'],
            dataset.class_names, r['scores'],
            show_bbox=True, show_mask=False,
            title="Predictions")
        plt.savefig("{}_Prediction.JPG".format(path_to_image))
        plt.close('all')
        end = time.time()
        print("Processing time for the image: ", end-start)
        proc_times.append(end-start)
        img_ids.append(image_id)

    tot_end = time.time()
    print("Total processing time: ", tot_end-tot_start)
    # Divides by len-1 to roughly discount the slow first (warm-up) image
    print("Processing time per image: ", (tot_end-tot_start)/(len(test_ids)-1))

    # Save to csv file
    submission = "ImageId,EncodedPixels\n" + "\n".join(submission)
    file_path = os.path.join(submit_dir, "submit.csv")
    with open(file_path, "w") as f:
        f.write(submission)
    print("Saved to ", submit_dir)

    file_path_proc = os.path.join(submit_dir, "proc_times_per_image.csv")
    with open(file_path_proc, 'w') as f:
        writer = csv.writer(f)
        writer.writerows(zip(proc_times,img_ids))

    boxes = "ImageId,Boxes\n" + "\n".join(boxes)
    file_path = os.path.join(submit_dir, "submit_boxes.csv")
    with open(file_path, "w") as f:
        f.write(boxes)
    print("Saved to ", submit_dir)
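One small simplification relevant to question 5: the result dict from model.detect() already carries the bounding boxes in r["rois"] as (y1, x1, y2, x2) rows, so recomputing them from the masks with utils.extract_bboxes is redundant. A minimal sketch with a made-up result dict and image name:

```python
import numpy as np

# Stand-in for one model.detect() result; Mask R-CNN returns the boxes
# directly in r["rois"], one (y1, x1, y2, x2) row per detected instance.
r = {"rois": np.array([[10, 20, 110, 220],
                       [5, 5, 50, 60]])}

# Same CSV-row format as the script above, without touching the masks.
row = "IMG_0001.JPG, " + ",".join(str(v) for v in r["rois"].flatten())
print(row)  # IMG_0001.JPG, 10,20,110,220,5,5,50,60
```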
suchiz commented 4 years ago

Hi there, I can answer your first two questions with confidence, and will be more general about the others.

  1. Yes, inference time depends on the image size: the bigger the image, the longer the inference. The image is resized in the script unless you have modified it. In config.py, check IMAGE_RESIZE_MODE; depending on what you set there, it will be resized or not.

  2. Yes, it will. The NN always takes the same input size, so your weights for 1024x1024 won't match another size. You will have to train again.

  3. I think you are asking for too much here.

  4. I think it is also processed on the GPU.

  5. I have already seen people asking for this in the issues, and it is possible. But as you already know, Mask R-CNN builds on Faster R-CNN. You could take all the parameters used here for Faster R-CNN and set up a Faster R-CNN somewhere else.
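The "square" resize mentioned in answer 1 can be sketched shape-wise. The real implementation is mrcnn.utils.resize_image; this helper is mine and only covers the downscale case (it ignores IMAGE_MIN_DIM upscaling):

```python
def square_resize_shape(h, w, max_dim=1024):
    """Sketch of IMAGE_RESIZE_MODE = "square": scale so the longer edge
    equals max_dim, keep the aspect ratio, then zero-pad the shorter
    edge so the network input is always max_dim x max_dim."""
    scale = max_dim / max(h, w)
    scaled = (round(h * scale), round(w * scale))   # aspect-preserving resize
    padded = (max_dim, max_dim)                     # after zero padding
    return scaled, padded

# The 6080x3420 images from this thread:
print(square_resize_shape(3420, 6080))  # ((576, 1024), (1024, 1024))
```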

HjalteMann commented 4 years ago

Thanks for your comments, @suchiz. Regarding your answer to question two.

I did some experiments with predicting on downscaled images, and I am actually getting an increased F1 score when I run on downscaled images (all the way down to around 10% of the original size), even though the model was trained on full-resolution images.

If the NN downscales images to 1024x1024 when training, why do I need to train a new model if I want to predict on downscaled images? I don't understand the process here. If I feed a full resolution image to the NN during training, it will resize it to 1024x1024. If I resize the image to 50% of the original size before feeding it to the network during training, it will still resize that image to 1024x1024 - so shouldn't the result be the same?

Regarding IMAGE_RESIZE_MODE, I have the parameters set to: IMAGE_RESIZE_MODE = "square", IMAGE_MIN_DIM = 800, IMAGE_MAX_DIM = 1024.

Does that mean that the images are resized to 800x1024 during both training and prediction? If so, I am still wondering why the F1 increases when resizing the image beforehand, since it is also resized before predicting.

suchiz commented 4 years ago

Sorry for the confusion. By downscaling images, I meant downscaling the input image. Indeed, in your mode it will always resize ANY image to 1024x1024, as the documentation says: the largest edge is downscaled to 1024, the second edge is scaled to keep the original aspect ratio of your image, and the remainder is filled with zeros.

And this step is indeed done in both training and inference (because the NN needs the same input size).

About the F1 score: it sometimes happens that too high an input quality gives worse results because of too much detail. Downscaling "smooths" the image like a Gaussian blur and removes some "noise" (which here means fine detail). I think Mask R-CNN works better when your objects have homogeneous surfaces, and that could be why you get a better score. It is only a possibility, not a certainty; please double-check what I am saying, I could be wrong.

Also, as you may know, resize functions are clearly not perfect, so even if the images are resized to 1024x1024, you won't always get the same result, depending on the interpolation function you choose.
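That last point, that resizing in two steps is not equivalent to resizing once, can be shown with a toy 1-D linear resampler (a pure illustration of interpolation effects, not skimage's or Mask R-CNN's actual code):

```python
def resample_linear(xs, n):
    """Resample a 1-D signal to n samples with linear interpolation."""
    if n == 1:
        return [xs[0]]
    out = []
    for i in range(n):
        t = i * (len(xs) - 1) / (n - 1)   # position in the source signal
        lo = int(t)
        hi = min(lo + 1, len(xs) - 1)
        frac = t - lo
        out.append(xs[lo] * (1 - frac) + xs[hi] * frac)
    return out

signal = [0.0, 10.0, 0.0, 10.0, 0.0]

direct = resample_linear(signal, 3)                        # resize once
two_step = resample_linear(resample_linear(signal, 4), 3)  # resize twice

print(direct)    # [0.0, 0.0, 0.0]
print(two_step)  # intermediate resampling has smeared the peaks
```

So downscaling an image before feeding it to the network really is a different operation from letting the network's own resize do all the work, which is consistent with the F1 differences observed above.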