UnaNancyOwen opened this issue 4 years ago
The challenge is the Huffman decoding. Before anything can be done in parallel or on the GPU, the MJPEG must first be Huffman decoded. This process is serial and single-threaded; the math can't be changed.
As an example, the NVIDIA APIs do this first step on the CPU and later steps on the GPU. Unfortunately, since Huffman decoding is an inherently serial process, it will always be computationally expensive and high latency. With a single JPEG frame, there is no opportunity for multithreading or GPU offload. :-(
However, there is a rarely used JPEG feature called "restart markers". It was originally designed for error resilience: the decoder can resynchronize at each marker, so an error is isolated to one small section while the rest of the image survives intact.
These markers can be used in another way: to split the compressed data into chunks, each of which can be decompressed by a separate thread or GPU compute unit.
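A rough sketch of the idea (not a full decoder: `decode_interval` is a hypothetical stand-in for a per-interval Huffman decoder, and a real implementation would also need to hand every worker the Huffman/quantization tables and frame header):

```cpp
#include <cstddef>
#include <cstdint>
#include <thread>
#include <utility>
#include <vector>

using Span = std::pair<const std::uint8_t*, std::size_t>;

// Hypothetical: Huffman-decode the MCUs of one restart interval.
void decode_interval(Span interval) { (void)interval; }

// Split the entropy-coded scan at RSTn markers (0xFFD0..0xFFD7). Inside
// scan data, 0xFF is always followed by 0x00 stuffing unless it starts a
// marker, so this byte test is safe within the scan.
std::vector<Span> split_at_restart_markers(const std::uint8_t* scan, std::size_t n) {
    std::vector<Span> chunks;
    std::size_t start = 0;
    for (std::size_t i = 0; i + 1 < n; ++i) {
        if (scan[i] == 0xFF && scan[i + 1] >= 0xD0 && scan[i + 1] <= 0xD7) {
            chunks.push_back({scan + start, i - start});
            start = i + 2;  // skip the two-byte marker
        }
    }
    chunks.push_back({scan + start, n - start});
    return chunks;
}

// One worker per interval; a real decoder would use a thread pool.
void decode_scan_parallel(const std::uint8_t* scan, std::size_t n) {
    std::vector<std::thread> workers;
    for (Span s : split_at_restart_markers(scan, n))
        workers.emplace_back(decode_interval, s);
    for (std::thread& t : workers) t.join();
}
```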
If the Azure Kinect camera inserts many restart markers into its JPEG frames, then existing decoders on the CPU or GPU can be used to decompress them in parallel.
Caution: the Azure Kinect camera creates 4:2:2 JPEG frames. 😕 Some GPU decoders (like NVIDIA's NVDEC) do not support that subsampling; they only support 4:2:0 JPEG frames. However, some other decoders (like NVIDIA's nvJPEG) do support 4:2:2.
Caution: libjpeg-turbo (used by many tools, including OpenCV) suffers a performance decrease with restart markers. Its implementation always runs on the CPU and is not multithreaded, so it is optimized for single-threaded use. In their docs they write:
The optimized Huffman decoder in libjpeg-turbo does not handle restart markers in a way that makes the rest of the libjpeg infrastructure happy, so it is necessary to use the slow Huffman decoder when decompressing a JPEG image that has restart markers.
Does the Azure Kinect already insert many restart markers into its JPEG frames? I would check myself, but I have yet to find a tool that will let me inspect for them.
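For what it's worth, here is a quick way to inspect for them. It's a rough heuristic, not a robust parser (it doesn't walk segment lengths, so a stray 0xFF inside an APPn payload could false-positive), but it reports the DRI restart interval and counts RSTn markers:

```cpp
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>

int main(int argc, char** argv) {
    if (argc != 2) { std::fprintf(stderr, "usage: %s file.jpg\n", argv[0]); return 1; }
    std::ifstream f(argv[1], std::ios::binary);
    std::vector<std::uint8_t> buf((std::istreambuf_iterator<char>(f)),
                                  std::istreambuf_iterator<char>());
    unsigned interval = 0, rst = 0;
    for (std::size_t i = 0; i + 1 < buf.size(); ++i) {
        if (buf[i] != 0xFF) continue;
        const std::uint8_t m = buf[i + 1];
        if (m == 0xDD && i + 5 < buf.size())      // DRI: FF DD 00 04 <hi> <lo>
            interval = (buf[i + 4] << 8) | buf[i + 5];
        else if (m >= 0xD0 && m <= 0xD7)          // RST0..RST7
            ++rst;
    }
    std::printf("restart interval: %u MCUs, RSTn markers found: %u\n", interval, rst);
    return 0;
}
```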
Putting restart markers in the JPEG data stream is the first step.
After that, I'm not so excited about the SDK adding such a decompression API in isolation. It is more interesting if the decoded image lands in GPU memory, so it can be manipulated with tools like OpenCV's cv::UMat or cv::cuda::GpuMat, and only pulled down to the CPU after all manipulation is done.
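To make that concrete, here is a sketch of the flow using nvJPEG (which handles the camera's 4:2:2) plus OpenCV's CUDA module. It assumes the JPEG bytes come from `k4a_image_get_buffer()` on a color capture; all error checking is omitted:

```cpp
#include <cuda_runtime.h>
#include <nvjpeg.h>
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudawarping.hpp>

cv::Mat decode_on_gpu(const unsigned char* jpeg, size_t size) {
    nvjpegHandle_t handle;   nvjpegCreateSimple(&handle);
    nvjpegJpegState_t state; nvjpegJpegStateCreate(handle, &state);

    int nComp; nvjpegChromaSubsampling_t sub;   // expect NVJPEG_CSS_422 here
    int w[NVJPEG_MAX_COMPONENT], h[NVJPEG_MAX_COMPONENT];
    nvjpegGetImageInfo(handle, jpeg, size, &nComp, &sub, w, h);

    nvjpegImage_t img{};                        // interleaved BGR, one plane
    img.pitch[0] = static_cast<size_t>(w[0]) * 3;
    cudaMalloc(reinterpret_cast<void**>(&img.channel[0]), img.pitch[0] * h[0]);
    nvjpegDecode(handle, state, jpeg, size, NVJPEG_OUTPUT_BGRI, &img, nullptr);

    // Wrap the device buffer (no copy) and keep working on the GPU...
    cv::cuda::GpuMat gpu(h[0], w[0], CV_8UC3, img.channel[0], img.pitch[0]);
    cv::cuda::GpuMat half;
    cv::cuda::resize(gpu, half, cv::Size(), 0.5, 0.5);

    cv::Mat out;                                // ...and download once at the end
    half.download(out);
    cudaFree(img.channel[0]);
    nvjpegJpegStateDestroy(state); nvjpegDestroy(handle);
    return out;
}
```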
This functionality has to happen for the Azure Kinect to be viable in the IoT space. Many applications require higher resolution to be viable (and therefore hit an even greater CPU bottleneck), and GPU/CUDA acceleration is already available in environments such as the NVIDIA Jetson family (which has nvJPEG available for the AK's 4:2:2). Is there any progress on this?
Would there be a current implementation, without added restart markers, that could at least take some of the strain off the CPU? On an NVIDIA Jetson Nano I'm currently getting 3-4 fps at 3072p BGRA and a solid 15 fps using MJPG, so even getting to ~9-10 fps would be usable.
This is a suggestion.
Is your feature request related to a problem? Please describe.
Currently, the Azure Kinect Sensor SDK uses TurboJPEG to decode the JPEG (Motion JPEG) stream retrieved from the color camera. TurboJPEG is highly optimized and provides high performance on the CPU. I like TurboJPEG. :) However, it consumes a lot of CPU resources in return, and the CPUs common on the market today feel a bit underpowered for decoding the very beautiful, high-resolution images from the Azure Kinect. The same applies when decoding footage from multiple color cameras, since the Azure Kinect supports multiple connections to one PC.
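For reference, this CPU path looks roughly like the following with the TurboJPEG API (a sketch, not the SDK's actual code):

```cpp
#include <turbojpeg.h>
#include <vector>

// Decode one MJPEG color frame to BGRA entirely on the CPU.
std::vector<unsigned char> decode_cpu(const unsigned char* jpeg, unsigned long size) {
    tjhandle tj = tjInitDecompress();
    int w = 0, h = 0, subsamp = 0, colorspace = 0;
    tjDecompressHeader3(tj, jpeg, size, &w, &h, &subsamp, &colorspace);
    // subsamp should report TJSAMP_422 for Azure Kinect color frames.
    std::vector<unsigned char> bgra(static_cast<size_t>(w) * h * 4);
    tjDecompress2(tj, jpeg, size, bgra.data(), w, 0 /*pitch*/, h,
                  TJPF_BGRA, TJFLAG_FASTDCT);
    tjDestroy(tj);
    return bgra;
}
```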
Describe the solution you'd like
I think the CPU resources left for the user application are limited when JPEG is decoded on the CPU, so I suggest adding a GPU-accelerated JPEG decoding option to the Azure Kinect Sensor SDK. (For example, there is the NVIDIA Video Codec SDK, which runs on NVIDIA GPUs.)
NOTE: This would be optional and would not impose any hardware constraints on users who want to run on the CPU alone.
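To illustrate, an opt-in design could look something like this. It is purely a hypothetical sketch; neither `k4a_color_decode_backend_t` nor `k4a_device_set_color_decode_backend` exists in the SDK today:

```cpp
#include <k4a/k4a.h>

// Hypothetical names, invented here to illustrate an opt-in, CPU-default design.
typedef enum {
    K4A_COLOR_DECODE_BACKEND_CPU = 0,  // TurboJPEG, today's behavior (default)
    K4A_COLOR_DECODE_BACKEND_GPU,      // e.g. nvJPEG / Video Codec SDK, if present
} k4a_color_decode_backend_t;

// Opt in to GPU decoding; would return K4A_RESULT_FAILED when no capable GPU
// exists, so CPU-only users are unaffected.
k4a_result_t k4a_device_set_color_decode_backend(k4a_device_t device,
                                                 k4a_color_decode_backend_t backend);
```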
What do you think? Please consider it. Thanks!