Closed: jdonovanCS closed this issue 3 years ago
Hi, @jdonovanCS,
Thank you for your interest in RELLIS-3D. We haven't done a comprehensive test of the inference speed, but we are also interested in it, so I ran a very rough test on a small GPU (NVIDIA 1050 Ti). The rough results are below; I hope they help.
HRNet (input size: 256x256): mean 0.0464 sec, std 0.0039 sec
HRNet (input size: 512x512): mean 0.0527 sec, std 0.0119 sec
HRNet (input size: 256x512): mean 0.0487 sec, std 0.0383 sec
GSCNN (input size: 512x512): mean 0.5907 sec, std 0.3287 sec
GSCNN (input size: 256x256): mean 0.1619 sec, std 0.1132 sec
These results were obtained by running inference in Python on an NVIDIA 1050 Ti GPU; you can run our code to do a similar test. GSCNN uses a larger backbone and outputs full-size semantic labels, which might be why it is slower than HRNet. Meanwhile, HRNet only outputs 1/4-size labels, if my memory is correct. The inference speed can be improved by using C++ or TorchScript.
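In case it's useful, here is a minimal sketch of the kind of rough timing loop described above, assuming a PyTorch model. The dummy convolution, iteration count, and input size are placeholders (not the repository's actual benchmark code); you would swap in the HRNet or GSCNN model built from our configs and checkpoints:

```python
import time
import numpy as np
import torch

# Stand-in for the segmentation network; replace with the HRNet or GSCNN model
# built from the RELLIS-3D configs and checkpoints. The dummy conv is only here
# so the timing loop runs end to end.
model = torch.nn.Conv2d(3, 19, kernel_size=3, padding=1)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

times = []
with torch.no_grad():
    for _ in range(100):
        x = torch.randn(1, 3, 512, 512, device=device)  # dummy input at the tested size
        if device == "cuda":
            torch.cuda.synchronize()  # make sure queued GPU work finishes before timing
        start = time.time()
        _ = model(x)
        if device == "cuda":
            torch.cuda.synchronize()  # wait for the forward pass to complete
        times.append(time.time() - start)

print("used time mean:", np.mean(times), "sec")
print("used time std:", np.std(times), "sec")
```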
I hope this helps you!
That definitely helps! Thank you!
I had one other question. It looks like the output for both HRNet and GSCNN are semantic segmentation predictions. Your video looks like a full scene segmentation for every pixel in the image. Am I misinterpreting something, or are you doing something to obtain that full-scene segmentation for each pixel?
Hi, @jdonovanCS
The two models predict semantic labels for the input images. GSCNN outputs full-size semantic labels, while HRNet outputs 1/4-size labels. When we create the videos, we upsample the semantic labels to the same size as the images.
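For illustration, one way to bring a 1/4-resolution label map back to the image size is nearest-neighbor interpolation, so class IDs stay discrete instead of being blended. This is just a sketch under assumed array shapes, not the exact code used to render the videos:

```python
import numpy as np
import cv2

# Assume `labels` is the 1/4-resolution class-ID map predicted by the network,
# shape (H/4, W/4) with integer class IDs, and (H, W) is the image size.
H, W = 1200, 1920  # example image size used only for this sketch
labels = np.random.randint(0, 19, size=(H // 4, W // 4), dtype=np.uint8)

# Nearest-neighbor resizing keeps every output pixel a valid class ID.
full_size = cv2.resize(labels, (W, H), interpolation=cv2.INTER_NEAREST)
print(full_size.shape)  # (1200, 1920)
```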
Awesome! Thanks!
I'm not sure I'd classify this as an issue so much as a question, but I was wondering what speed you were able to achieve for the segmentation. Was this something that you tried running in real-time?
Either way, I think the dataset may be useful for me in the near future, but if the speed you achieved with GSCNN was also fairly decent, then that may be of use too.