matterport / Mask_RCNN

Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow

Live Mask RCNN #225

Open matiqul opened 6 years ago

matiqul commented 6 years ago

Is it possible to do live object detection with Mask RCNN using a single GPU?

s-bayer commented 6 years ago

I'm not sure what you define as live. Live in the sense of taking a photo and doing inference on it without too long of a delay should not be much of a problem.

The original paper states that it takes ~200ms to perform inference on a single frame on one GPU. You should be able to improve this number a bit, e.g. by lowering the floating point precision, reducing the image size, using a smaller feature extractor, ...

Even if you do all this and some other optimisations, live inference on a video at e.g. 30 fps is probably not possible.
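
For illustration, a minimal sketch of what some of those knobs look like in this repo's config conventions, assuming the current repo layout with samples/coco on sys.path; the values are illustrative, not tuned, and lowering floating-point precision is not covered by the Config class:

import coco  # samples/coco/coco.py from this repo, assumed to be on sys.path

class FastInferenceConfig(coco.CocoConfig):
    # Run one image at a time on one GPU.
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1
    # Smaller input images mean less work per frame (must stay divisible by 64).
    IMAGE_MIN_DIM = 512
    IMAGE_MAX_DIM = 512
    # Lighter feature extractor than the default resnet101.
    BACKBONE = "resnet50"

config = FastInferenceConfig()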

sberryman commented 6 years ago

As a side note, inference is only part of the total time it will take. I recently broke the project into 3 steps (pre-processing, inference, and post-processing) using Python's multiprocessing.Process and multiprocessing.JoinableQueue, as I have 7+ million frames to analyze. Without using shared memory (which would definitely speed it up), I'm able to get my 3 GPUs to hover around 70-80% utilization, resulting in about 15-18 fps.

Edit: The current bottleneck is pickling/unpickling across the queue between the inference and post-processing steps. Using shared memory should allow me to hit 100% utilization on the GPUs.

Source images are stored as JPGs on disk at a resolution of 3264×1836 and are resized down to the default IMAGE_MIN_DIM=800 and IMAGE_MAX_DIM=1024. This is in no way an optimized way of doing it; it's more of a hack to speed things up without an enormous amount of work.
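
A rough sketch of a three-stage pipeline along those lines; the load_model() helper, the frame paths and the queue sizes are illustrative assumptions, not the actual code from this project:

import glob
import multiprocessing as mp

def preprocess(paths, frame_q):
    import skimage.io
    for path in paths:
        frame_q.put((path, skimage.io.imread(path)))   # JPEG decode is CPU-bound
    frame_q.put(None)                                   # sentinel: no more frames

def inference(frame_q, result_q):
    model = load_model()   # hypothetical helper that builds modellib.MaskRCNN in inference mode
    while True:
        item = frame_q.get()
        if item is None:
            frame_q.task_done()
            break
        path, image = item
        result_q.put((path, model.detect([image], verbose=0)[0]))
        frame_q.task_done()

def postprocess(result_q):
    while True:
        path, r = result_q.get()
        # ... write r["class_ids"], r["rois"], r["masks"] somewhere useful
        result_q.task_done()

if __name__ == "__main__":
    paths = sorted(glob.glob("frames/*.jpg"))           # hypothetical frame location
    frame_q = mp.JoinableQueue(maxsize=32)
    result_q = mp.JoinableQueue(maxsize=32)
    mp.Process(target=preprocess, args=(paths, frame_q)).start()
    mp.Process(target=inference, args=(frame_q, result_q)).start()
    mp.Process(target=postprocess, args=(result_q,), daemon=True).start()
    frame_q.join()    # wait until every frame has been through inference
    result_q.join()   # ... and every result has been post-processed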

Here is a sample showing the message rate (message=frame in this case):

[screenshot: message (frame) rate, 2018-02-03]

matiqul commented 6 years ago

Thanks for your kind information. That means about 5 fps using a single GPU. But when I run the program on the GPU, it needs 8-10 seconds of detection time per frame. Can you tell me what the problem is?

MPForte commented 6 years ago

Have you managed to do real-time (or near real-time) detection using a single GPU? I have the same problem.

matiqul commented 6 years ago

Please install the GPU version of TensorFlow.

MPForte commented 6 years ago

Yes, I have it, but it still has a delay of around 2 s (as GPU I have an NVIDIA GTX 1080 Ti).

waleedka commented 6 years ago

@sberryman That's an interesting project. Even on multiple GPUs, 7M frames would probably take a week or so to process. If this is a public project, or if you end up writing about it publicly, I would love to learn more about it.

sberryman commented 6 years ago

@waleedka I have finished running it on the 7M frames, with very interesting results. I probably needed to run it at a higher resolution, as it misclassified a lot of data. Basically, I wouldn't expect to see a traffic light, bear, parking meter, etc. My data is from two 4K security cameras pointing at a very busy pedestrian boardwalk.

Class count: https://gist.github.com/sberryman/eef3a873e0e9976162226162d9a7c713

I'm working with New Media Rights to open up the video dataset to the public. I can keep you updated on the release if you would like. Side note, I have roughly 32M more frames of data I could analyze without capturing any more video.

ghost commented 6 years ago

I've hooked the Mask RCNN up to a webcam through OpenCV and I can get around 3 FPS (512 window) on a 1080 Ti. However, GPU utilization is only around 20%. I don't think raw processing power is the issue. There is some other bottleneck in the program that makes it impossible to get a decent frame rate. I tried using threaded frames for the OpenCV portion and it didn't make any significant difference in FPS, so I have concluded the bottleneck is somewhere in the RCNN.

MPForte commented 6 years ago

Same! I use cv2 to acquire the videos from a da Vinci Surgical System! My GPU utilization is 27%! I have HD resolution (1080i59.94), so I suppose it's better to run this algorithm only with compressed images! But I still have to try it!

ghost commented 6 years ago

To achieve a reasonable FPS, I think frames need to be pre-processed and processed in parallel. In my current implementation, I believe OpenCV takes a frame and feeds it to the network, and the network doesn't process another frame until the first one is completely processed.

To get it to run faster and utilize the GPU, the feed from the camera needs to be processed like a factory line, maybe with each layer as a station, instead of the whole line waiting and working on one image.

Unfortunately, I'm rather new to programming and I have no idea how to implement this. If anyone has a solution, I would love to see it!
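
One simple variant of this idea, sketched below under the assumption that `model` is an already-built MaskRCNN inference instance: a background thread keeps only the most recent webcam frame while the main loop runs detection on whatever frame is newest. This decouples capture from inference, though it does not pipeline the network itself:

import threading
import cv2

latest_frame = None
lock = threading.Lock()

def grab_frames(src=0):
    # Keep overwriting latest_frame so the detector never waits on the camera.
    global latest_frame
    cap = cv2.VideoCapture(src)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        with lock:
            latest_frame = frame

threading.Thread(target=grab_frames, daemon=True).start()

while True:
    with lock:
        frame = None if latest_frame is None else latest_frame.copy()
    if frame is None:
        continue
    # Mask R-CNN expects RGB; OpenCV delivers BGR.
    results = model.detect([cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)], verbose=0)
    # ... draw results[0]["rois"] / ["masks"] on `frame` and display it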

waleedka commented 6 years ago

@sberryman Yes, a bear on a busy intersection is ... unbearable. Haha, sorry couldn't resist :) But, there seems to be only one instance of it in the whole set, which is not bad. Even with higher resolution, you'd still get errors, but hopefully fewer. You might want to see the same detection done in many frames before accepting it.

waleedka commented 6 years ago

@ironjedi We merged an update a few days ago that improves performance a bit in the inference stage. It would show a big improvement especially if you're running the inference on multiple images at the same time (i.e. BATCH_SIZE > 1).

ghost commented 6 years ago

Great! I'll give it a try when I get back to my GPU!

I'll try to work out the code. Any idea how to feed an OpenCV webcam stream as multiple images? Since the images are still coming in at different times, is it still possible to treat it as a batch? Maybe it would be possible to batch the stream and then feed it in. Might cause some time delay but at least the FPS would be higher.
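
A hedged sketch of one way to batch a stream, assuming `model` was built with a config where IMAGES_PER_GPU > 1; model.detect() expects exactly BATCH_SIZE (= GPU_COUNT × IMAGES_PER_GPU) images per call, and the output lags the stream by up to BATCH_SIZE frames:

import cv2

BATCH_SIZE = 4                 # must match the inference config
cap = cv2.VideoCapture(0)
batch = []

while True:
    ok, frame = cap.read()
    if not ok:
        break
    batch.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if len(batch) == BATCH_SIZE:
        results = model.detect(batch, verbose=0)   # one result dict per frame
        batch = []
        # ... handle results; they arrive up to BATCH_SIZE frames behind the stream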

ghost commented 6 years ago

@sberryman Do you think it'd be possible to get 15+ FPS running 2x 1080 Ti in a beefed-up laptop with some optimisations?

I need at least 15 FPS for a prototype. I just got real-time Mask RCNN running, but it's way too slow (1-2 FPS), so I'm wondering where to go from here.

sberryman commented 6 years ago

@roman3k Based on my inference testing of the code, I would say no. I haven't tested it since @waleedka made performance improvements ~13 days ago.

I was testing on 4K images though, maybe with smaller images and batching it could be possible? But then it wouldn't really be real-time.

oak-tree commented 6 years ago

@sberryman Did you run it on 4K images, or did you downsample/resize the images beforehand?

sberryman commented 6 years ago

@oak-tree I almost answered without looking at the code. So it looks like images were downsampled before inference. The config file is showing

IMAGE_MIN_DIM = 800
IMAGE_MAX_DIM = 1024

oak-tree commented 6 years ago

@sberryman, thanks for the quick reply. I asked because it surprised me; I was almost sure that 4K images would most likely cause a GPU memory issue.

ghost commented 6 years ago

@sberryman 15 FPS would be viable. What rig do you have? I'm building for an HMD, so I'm looking at server-side processing. I was hoping a laptop would work, but I could use a bigger box if needed. Thanks!

sberryman commented 6 years ago

I was able to get roughly 5-6 fps from a single GTX 1080 Ti. So I had two 1080 Tis (11 GB) in a custom-built tower and another computer with a standard GTX 1080 (8 GB), along with a bunch of threads to load and decode the JPGs and mold the images, so that wasn't the bottleneck. To get to 15 FPS, I had to utilize 3 GPUs.

XinchengTan commented 6 years ago

How should we modify the code to make the model use the GPU? I have a single GPU, so I assume there's no need to change GPU_COUNT? Thank you!

cherryxiongyw commented 6 years ago

@sberryman Hello, I want to do the same thing as you to make inference faster, so I use multiprocessing.Process like this:

p = multiprocessing.Process(target=model.detect, args=([image],))
p.start()
p.join()

but the code can't run any further: no error, no output. What can I do about this problem?

sberryman commented 6 years ago

@cherryxiongyw I found it was much easier to limit memory usage for each inference process and run multiple processes. Multiprocessing in Python is not the easiest (based on my skill level), and limiting memory is very easy. My bottleneck was mostly JPEG decoding and image resizing, which are CPU-intensive and easily sped up with multiple processes.

Example of a per-process GPU memory limit. This isn't specific to Mask RCNN; it's a simple TensorFlow config option. The example below will use 20% of your GPU memory per process.

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

# Cap this process at 20% of GPU memory so several inference
# processes can share one GPU.
tf_config = tf.ConfigProto()
tf_config.gpu_options.per_process_gpu_memory_fraction = 0.2
set_session(tf.Session(config=tf_config))

AshleyRoth commented 6 years ago

@sberryman Hello there! Can you explain to me how to run demo.ipynb on a video (.avi) or via a webcam, with a GPU, please?

Right now I have it installed, I run it with the jupyter notebook command, and I'm testing a few images. It works on images with the CPU.

I'm working on Ubuntu 16.04.

Weixing-Zhang commented 6 years ago

I ran into exactly the same problem. The inference process is quite slow on my GTX 1080 Ti (less than 1 fps on 600 by 600 pixel frames). GPU utilization stays at 0% with 4 processes running.

I tried to run the Mask R-CNN inference from this repository using multiprocessing and a queue. It only got slightly better performance. GPU utilization is much higher when executing VGG16 inference using the same multiprocessing strategy.

Any help would be appreciated.

hienpham2tiki commented 6 years ago

I was able to run detection on a full HD image in about 300 ms with just a GTX 1060 6GB. Note that I used TensorFlow Serving and called it over gRPC.
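
For reference, a minimal sketch of what a gRPC client for a served Mask R-CNN could look like. The model name, signature name and input tensor name below are assumptions; they depend entirely on how the SavedModel was exported, which is not shown in this thread:

import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "mask_rcnn"                   # assumed model name
request.model_spec.signature_name = "serving_default"   # assumed signature

image = np.zeros((1080, 1920, 3), dtype=np.uint8)       # placeholder frame
request.inputs["input_image"].CopyFrom(                 # assumed input tensor name
    tf.make_tensor_proto(image, dtype=tf.uint8))

response = stub.Predict(request, timeout=10.0)           # blocking RPC call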

Weixing-Zhang commented 6 years ago

@hienduyph Thanks! I will dig into TensorFlow Serving and hope it can save me. It took almost 8 hours to process 5k images on my second GPU, a GTX 980 (around 0 percent utilization).

Weixing-Zhang commented 6 years ago

Just want to share an update for anyone who, like me, ran into similar issues. I made the mistake of not using the GPU.

Set the GPU as the device rather than the default cpu:0. It's good to find out that the CPU is not that slow at inference; in my case it manages about 0.2 fps, so you can run inference on a fair number of pictures on traditional HPCs in minutes.

# GPU for training.
DEVICE = "/gpu:%s"%(self.gpu_id)  # /cpu:0 or /gpu:0

I'm still trying to figure out how to max out the performance of the GTX 1080 Ti (around 60% utilization). sberryman mentioned that a shared-memory solution should let him reach 100% utilization on his GTX 1080 Ti.
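
For context, the DEVICE snippet above follows this repo's notebooks; a minimal sketch of how it is typically used, assuming the current samples/ layout and that the COCO weights file has already been downloaded as in demo.ipynb:

import os
import tensorflow as tf
import coco                        # samples/coco/coco.py, assumed on sys.path
import mrcnn.model as modellib

class InferenceConfig(coco.CocoConfig):
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

MODEL_DIR = os.path.join(os.path.abspath("."), "logs")
COCO_MODEL_PATH = "mask_rcnn_coco.h5"   # assumed to be downloaded already

DEVICE = "/gpu:0"                        # "/cpu:0" forces CPU inference
with tf.device(DEVICE):
    model = modellib.MaskRCNN(mode="inference", model_dir=MODEL_DIR,
                              config=InferenceConfig())
model.load_weights(COCO_MODEL_PATH, by_name=True)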

rusuvalentin commented 6 years ago

Hi everyone,

I am trying to run Mask R-CNN in real time using a client-server service: send 5 fps from a live video and receive the masks using the COCO demo. Can I get some suggestions on how to do it? I was thinking about not closing the model session until execution is finished; this way I would have an initial time delay, but after that everything should work fine. Still, I am using only 1 GPU, and that may be a lack of computational power. Can someone give me some advice on what I should do?

Thank you very much!
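
A minimal sketch of that "load once, serve many frames" idea: build and warm up the model a single time at server start-up, then reuse it for every incoming frame. build_inference_model(), receive_frame() and send_masks() are hypothetical placeholders for the actual model setup and transport (sockets, HTTP, ...), which are not specified here:

import numpy as np

model = build_inference_model()   # hypothetical: wraps modellib.MaskRCNN(mode="inference", ...)
# Warm-up call: pay the graph/session start-up cost once, before clients connect.
model.detect([np.zeros((512, 512, 3), dtype=np.uint8)], verbose=0)

while True:
    frame = receive_frame()                        # hypothetical: blocks until a client sends a frame
    r = model.detect([frame], verbose=0)[0]
    send_masks(r["masks"], r["class_ids"])         # hypothetical: return the results to the client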

JeffLee06 commented 5 years ago

I am experiencing the same. With a 1080 Ti I was able to get to 5 fps. Recently I've been experimenting with TensorRT on this model, but some layers were not supported; any workaround is much appreciated!

ankmathur96 commented 5 years ago

@JeffLee06 how is your progress using TensorRT for this?

CCGY commented 5 years ago

@JeffLee06 Yes, how is the progress using TensorRT?

MathiasKahlen commented 5 years ago

@Weixing-Zhang How do you do this? My GPU is near 100% utilization while training but only 0-10% in inference mode. I don't see DEVICE anywhere in the config file.

yerzhik commented 5 years ago

@MathiasKahlen Did you install tensorflow-gpu instead of just tensorflow?

MathiasKahlen commented 5 years ago

@yerzhik Yes, I installed tensorflow-gpu.

yerzhik commented 5 years ago

@MathiasKahlen Then, while running inference, how much memory does nvidia-smi show in use compared to before starting the program? It should be several GB. If that memory is in use and your program is in the list, then the time between each image's actual GPU run is being spent processing data in the rest of your program.

Weixing-Zhang commented 5 years ago

@MathiasKahlen Sorry for replying late, and for the misleading mention of a "config file"; I haven't checked my personal GitHub for a while.

You don't need a separate config file, but it's better to specify which devices you intend TensorFlow to run on. Did you try running some example code for the inference process on some sample data? It's likely that your code was not utilizing the GPU at all while in inference mode.

MathiasKahlen commented 5 years ago

@Weixing-Zhang No worries, I'm happy that you replied :)

I think @yerzhik is right, since it shows a memory usage of 9 GB. I just wonder why the volatile GPU util is only around 20-25% all the time. Maybe it's because I'm reading from a video file one frame at a time. I don't need to analyze all 60 frames per second, so I set it to only run object detection every 60th frame. However, the more frames I skip between detections, the longer it also takes, so I guess I have to work out the way I read the video file with OpenCV? I tried setting the position between each detection, but it creates a lot of errors. I also created a post on Stack Overflow about this.

I also thought about multithreading my application so that one thread prepares the images and the other reads them from a queue?

yerzhik commented 5 years ago

@MathiasKahlen Do you care about the exact rate? As you mentioned on SO, do you need to read 1 frame per second, or just the latest available one?

MathiasKahlen commented 5 years ago

@yerzhik 1 frame per x time interval. I just have to grab frames at the same time interval, since I need to know how much time each frame represents; I am tracking the time for which each object is present. I haven't tried setting the position by msec, maybe that would work better?

I just realized this thread is about live detection and that I'm reading from a video file. I hope this is still relevant to the original question :)

yerzhik commented 5 years ago

@MathiasKahlen Yes, a separate thread in which you calculate the accumulated time difference between grabbed camera frames would work. Then you send the frame to the RCNN once you've got the frame closest to your "x time interval". Not sure about the precision, though.

MathiasKahlen commented 5 years ago

A little update:

I found this thread on SO. Before, I was using capture.read() to run through all the frames in the video; however, read():

Grabs, decodes and returns the next video frame.

So using grab() to run through the irrelevant frames and retrieve() only once every x frames increased the volatile util to around 40% all the time, and the processing is a lot faster. The thread also suggests this together with multithreading, which is what I'm going to try next.
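
Sketched below, the grab()/retrieve() pattern just described: grab() advances the stream cheaply (no decode), and only every Nth frame is actually decoded and sent to the detector. `model` is assumed to be a MaskRCNN inference instance; N and the input path are illustrative:

import cv2

N = 60                                   # analyze one frame out of every N
cap = cv2.VideoCapture("input.avi")      # assumed input file
frame_idx = 0

while cap.grab():                        # advance without decoding
    frame_idx += 1
    if frame_idx % N != 0:
        continue
    ok, frame = cap.retrieve()           # decode only the frames we keep
    if not ok:
        break
    results = model.detect([cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)], verbose=0)
    # timestamp of this frame ~= frame_idx / cap.get(cv2.CAP_PROP_FPS)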

Sneyler commented 4 years ago

Hi, this might look like a silly question, but I'm currently trying to find out how to get an FPS counter (I'm new, so be kind) or the execution time, and I can't find them anywhere. I tried to use the Mask RCNN project on my GPU and it worked. Can anyone help me? I've been searching the internet for days but couldn't find anything.
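
There is no built-in FPS counter in this repo; the usual approach is simply to time the model.detect() call. A minimal sketch, assuming `model` and `image` are already set up as in demo.ipynb:

import time

t0 = time.time()
results = model.detect([image], verbose=0)
elapsed = time.time() - t0
print("detection took %.3f s  (%.2f fps)" % (elapsed, 1.0 / elapsed))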