Closed: nvnnghia closed this issue 5 years ago
Hi @nvnnghia,
This node was chosen because it is easy to split there and because the GPU and CPU parts then take about the same processing time. Of course, the model can be split at any node.
I tried splitting Mask R-CNN at the 'Gather' node because it was easy to split there, but the processing time got worse.
Thanks for your response. I still wonder why the speed improves so much. If we just split the model into two parts, ideally the speed could be at most 2 times faster, but in fact it is much more than 2 times faster. Can you explain this?
Hi, @nvnnghia,
First, although that part is logically CPU work, it runs slowly on the GPU: the execution speed of tf.where depends on clock frequency (Hz), so it is faster on the CPU, which has the higher clock. This change alone improves the TX2 from 9 FPS to 19 FPS (ssd_mobilenet_v1_coco_2018_01_28, 640x480 image size, without visualization).
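Pinning an op like tf.where to the CPU is just a matter of device placement. A minimal sketch (using the TF 2.x eager API for brevity; the thread predates it, but the placement principle is identical in TF 1.x graphs):

```python
import tensorflow as tf

scores = tf.constant([0.9, 0.1, 0.7])

# tf.where is largely serial index gathering, so it benefits from the
# CPU's higher clock frequency rather than the GPU's parallelism.
with tf.device('/CPU:0'):
    keep = tf.where(scores > 0.5)  # indices of boxes above threshold

print(keep.numpy().tolist())  # [[0], [2]]
```

In a TF 1.x frozen graph the same effect is achieved by rebuilding the post-processing sub-graph inside a `with tf.device('/cpu:0'):` block when importing the graph def.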
Next, I separated the execution of the model into a GPU thread and a CPU thread, and moved the other Python processing (drawing etc.) into the main thread. As a result, the TX2 improves to 31.2 FPS.
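The three-thread arrangement can be sketched without TensorFlow at all: a GPU thread feeds results through a queue to a CPU thread, which feeds the main thread. The stage functions below are placeholders for the two session runs and the drawing code (names and structure are illustrative, not from the original repository):

```python
import queue
import threading

def run_pipeline(frames, gpu_stage, cpu_stage, main_stage):
    """Run three stages concurrently: while the CPU thread post-processes
    frame N, the GPU thread is already working on frame N+1."""
    q_gpu_cpu = queue.Queue(maxsize=2)   # GPU thread -> CPU thread
    q_cpu_main = queue.Queue(maxsize=2)  # CPU thread -> main thread

    def gpu_worker():
        for f in frames:
            q_gpu_cpu.put(gpu_stage(f))   # e.g. sess.run of the GPU half
        q_gpu_cpu.put(None)               # sentinel: no more frames

    def cpu_worker():
        while (x := q_gpu_cpu.get()) is not None:
            q_cpu_main.put(cpu_stage(x))  # e.g. sess.run of the CPU half
        q_cpu_main.put(None)

    threading.Thread(target=gpu_worker, daemon=True).start()
    threading.Thread(target=cpu_worker, daemon=True).start()

    results = []
    while (y := q_cpu_main.get()) is not None:
        results.append(main_stage(y))     # drawing / visualization
    return results
```

With dummy stages, `run_pipeline([0, 1, 2], lambda f: f * 2, lambda x: x + 1, str)` returns `['1', '3', '5']`. The bounded queues keep the threads from running arbitrarily far ahead of each other.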
Combining these tunings, the overall speed improvement is about 3 times.
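The arithmetic behind the gain: a sequential pipeline runs at 1/(t_gpu + t_cpu) FPS, while overlapping the two halves in separate threads approaches 1/max(t_gpu, t_cpu) FPS. The per-stage latencies below are hypothetical, back-solved from the reported FPS figures:

```python
# Hypothetical stage latencies (seconds per frame), chosen so the
# sequential rate matches the reported 19 FPS after the tf.where fix.
t_gpu = 0.032   # GPU half of the split model
t_cpu = 0.0206  # CPU half (post-processing)

sequential_fps = 1 / (t_gpu + t_cpu)   # finish one frame before the next
pipelined_fps = 1 / max(t_gpu, t_cpu)  # stages overlap across frames

print(f"sequential: {sequential_fps:.1f} FPS")  # roughly 19 FPS
print(f"pipelined:  {pipelined_fps:.1f} FPS")   # roughly 31 FPS
```

So the 3x total comes from two multiplicative effects: moving tf.where to the CPU (9 to 19 FPS) and then overlapping the GPU and CPU halves (19 to roughly 31 FPS).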
Thank you very much for the detailed explanation.
Why don't you split the model in the middle of the graph? For example, if we have 20 convolution layers, we could break them into two halves of 10 layers each. What would happen if we did that?