microsoft / Azure-Kinect-Sensor-SDK

A cross platform (Linux and Windows) user mode SDK to read data from your Azure Kinect device.
https://Azure.com/Kinect
MIT License

change MJPG color camera output of 4:2:2 subsampling -- instead output 4:2:0 #1299

Open · diablodale opened 4 years ago

diablodale commented 4 years ago

Use of 4:2:2 subsampling in the MJPG color stream limits the use of hardware acceleration such as NVidia's. (M)JPG decompressors sometimes support hardware acceleration only for 4:4:4 and 4:2:0; they fall back to slower software for 4:2:2.

Given that our eyes don't see the difference, that AI often uses only the luma, and that the popular NV12 datastream is already 4:2:0, I recommend the firmware be changed to send only 4:2:0 MJPG datastreams, or that a new K4A_IMAGE_FORMAT_COLOR_MJPG420 value be added to k4a_image_format_t to support both.

Related to https://github.com/microsoft/Azure-Kinect-Sensor-SDK/issues/1279
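To make the second option concrete, here is a rough sketch of where such a value could live. This is hypothetical: K4A_IMAGE_FORMAT_COLOR_MJPG420 does not exist in k4atypes.h today, and the existing values shown are abbreviated.

```cpp
// Sketch only -- not part of the SDK. One way the proposal could surface in
// k4atypes.h: an additional k4a_image_format_t value selecting a 4:2:0 MJPG
// stream, leaving the existing 4:2:2 K4A_IMAGE_FORMAT_COLOR_MJPG untouched.
typedef enum
{
    K4A_IMAGE_FORMAT_COLOR_MJPG = 0,   // existing: MJPG, 4:2:2 chroma subsampling (per UVC)
    K4A_IMAGE_FORMAT_COLOR_NV12,       // existing: NV12, 4:2:0
    K4A_IMAGE_FORMAT_COLOR_YUY2,       // existing: YUY2, 4:2:2
    K4A_IMAGE_FORMAT_COLOR_BGRA32,     // existing: SDK-side conversion
    /* ... remaining existing values elided ... */
    K4A_IMAGE_FORMAT_COLOR_MJPG420,    // proposed: MJPG, 4:2:0 chroma subsampling
} k4a_image_format_t;
```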

Benefits

  1. Natural 4:2:0 support on hardware decompressors.
  2. No need for extra memory and CPU/GPU time to convert from 4:2:2 to 4:2:0 formats. Many tools, e.g. the Intel Media SDK, prefer to work with NV12 (a 4:2:0 subsampling layout).
  3. Perhaps less USB3 bandwidth on the color stream, if 4:2:0 benefits the macroblocks and compresses more. Color is already laggy and bandwidth heavy (see other open issues); this is an opportunity to improve.

Hardware vendors docs

NVidia Jetson docs -- no 4:2:2 https://docs.nvidia.com/jetson/l4t/index.html#page/Tegra%2520Linux%2520Driver%2520Package%2520Development%2520Guide%2Fsoftware_features_tx2_tx2i.html%23wwpID0EADDUHA

NVidia Video Codec SDK "** 4:2:2 is not natively supported on HW" https://developer.nvidia.com/nvidia-video-codec-sdk

NVidia Forum - NVDec doesn't support 4:2:2 https://forums.developer.nvidia.com/t/does-nvencode-support-42-chroma-subsampling/55565

I have not yet found clarity on AMD's UVD and Intel QuickSync. AMD doesn't state whether it supports 4:2:2. Intel lists support for 4:2:2 but doesn't distinguish whether that support is native hardware or a software fallback.

Caveat

It might be that the MJPG camera colorstream is non-standard JPG, i.e. a restricted fixed JPEG with an omitted Huffman table, YCbCr, 4:2:2, basic Huffman encoding, etc. https://www.loc.gov/preservation/digital/formats/fdd/fdd000063.shtml

If this non-standard datastream is what the color camera is outputting, then the scenario is a bit more complex to consider and discuss. That de facto non-standard JPG datastream type can only use 4:2:2, and the repercussions for UVC support if the datastream changes to 4:2:0 are unknown. https://en.wikipedia.org/wiki/Motion_JPEG#Disadvantages

It is also unknown which Azure Kinect developers might already be running fixed 4:2:2 pipelines. For comparison, Windows broke webcams in 2016 when it changed MJPEG support https://social.msdn.microsoft.com/Forums/windowsdesktop/en-US/9d6a8704-764f-46df-a41c-8e9d84f7f0f3/mjpg-encoded-media-type-is-not-available-for-usbuvc-webcameras-after-windows-10-version-1607-os?forum=mediafoundationdevelopment

UnaNancyOwen commented 4 years ago

This is an important idea for enabling hardware-accelerated decoding (related: #1293). I think this suggestion is worth considering and very important. Please consider it, Azure Kinect team. Best regards,

piquan commented 4 years ago

I'm not involved in this project; I just stumbled on the thread while doing something else. But I may have relevant information.

To answer your caveat first: the USB Video Class standard specifies that MJPEG uses 4:2:2. See https://www.usb.org/sites/default/files/USB_Video_Class_1_5.zip , document "USB_Video_Payload_MJPEG_1.5.pdf", page 9: "The following structure is required for the image data: […] Subsampling - 422". (To answer your other question, it does require baseline sequential DCT, but it does optionally allow a Huffman table to be included.)

As for accelerated decoding:

The NVDECODE library (formerly NVCUVID) only outputs 4:2:0 surfaces for JPEG inputs, as noted in your first link, but it can read JPEG encoded in monochrome, 4:2:0, 4:2:2, or 4:4:4.

NVDECODE works in two modes: CUVID and CUDA. I'm not sure, but I gather that CUVID uses GPU features that are specifically designed for temporally coded streams like MPEG (and the others listed in your second link). CUDA uses the broader parts of the GPU: the same hardware used by OpenGL, Folding@Home, etc. The NVDECODE library will use whichever mode is needed. For formats listed in your second link, it will use CUVID hardware, but for JPEG it uses CUDA only. As your first link notes, it will still only output 4:2:0 (specifically NV12).

I've been decoding 4:2:2 MJPEG (from a non-Kinect webcam) using my Nvidia card's acceleration for a couple of weeks.

I haven't yet tested on my Jetson, but here's what my RTX 2060 supports through Nvidia's Video Codec SDK v10.0.26 (codec, input subsampling, min - max resolution, max size, output texture formats):

MPEG1    420   8 bpp:  48x 16 -  4080x 4080    15MP NV12
MPEG2    420   8 bpp:  48x 16 -  4080x 4080    15MP NV12
MPEG4    420   8 bpp:  48x 16 -  2032x 2032     2MP NV12
VC1      420   8 bpp:  48x 16 -  2032x 2032     2MP NV12
H264     420   8 bpp:  48x 16 -  4096x 4096    16MP NV12
JPEG     Mono  8 bpp:  64x 64 - 32768x16384 16384MP NV12
JPEG     420   8 bpp:  64x 64 - 32768x16384 16384MP NV12
JPEG     422   8 bpp:  64x 64 - 32768x16384 16384MP NV12
JPEG     444   8 bpp:  64x 64 - 32768x16384 16384MP NV12
HEVC     420   8 bpp: 144x144 -  8192x 8192    64MP NV12
HEVC     420  10 bpp: 144x144 -  8192x 8192    64MP NV12 P016
HEVC     420  12 bpp: 144x144 -  8192x 8192    64MP NV12 P016
HEVC     444   8 bpp: 144x144 -  8192x 8192    64MP YUV444
HEVC     444  10 bpp: 144x144 -  8192x 8192    64MP YUV444_16Bit
HEVC     444  12 bpp: 144x144 -  8192x 8192    64MP YUV444_16Bit
VP8      420   8 bpp:  48x 16 -  4096x 4096    16MP NV12
VP9      420   8 bpp: 128x128 -  8192x 8192    64MP NV12
VP9      420  10 bpp: 128x128 -  8192x 8192    64MP NV12 P016
VP9      420  12 bpp: 128x128 -  8192x 8192    64MP NV12 P016

You'll note that JPEG can work with much larger pictures than the MPEG-like formats; I suspect that's because it's using the entire video card memory through CUDA instead of being limited to the memory that CUVID can access.
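If anyone wants to reproduce a table like this on their own GPU, the numbers come from querying decoder capabilities; a minimal sketch using the Video Codec SDK's cuvidGetDecoderCaps is below (one codec/chroma/bit-depth combination shown; loop over them for the full table, error handling omitted).

```cpp
// Sketch: query NVDEC capabilities for one codec/chroma/bit-depth combination.
// Build against cuda.h + nvcuvid.h, link the CUDA driver API and nvcuvid.
// Error checks omitted for brevity.
#include <cstdio>
#include <cuda.h>
#include <nvcuvid.h>

int main()
{
    cuInit(0);
    CUdevice dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    CUVIDDECODECAPS caps = {};
    caps.eCodecType      = cudaVideoCodec_JPEG;        // loop over codecs for a full probe
    caps.eChromaFormat   = cudaVideoChromaFormat_422;  // and over chroma formats
    caps.nBitDepthMinus8 = 0;                          // 8 bpp
    cuvidGetDecoderCaps(&caps);

    printf("JPEG 4:2:2 8bpp supported=%d min=%ux%u max=%ux%u maxMBs=%u outputMask=0x%x\n",
           caps.bIsSupported, caps.nMinWidth, caps.nMinHeight,
           caps.nMaxWidth, caps.nMaxHeight, caps.nMaxMBCount, caps.nOutputFormatMask);

    cuCtxDestroy(ctx);
    return 0;
}
```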

Disclaimer: I work for Nvidia, but not in the GPU department. My notes here are just based on what I've found while working on a personal project. I speak for myself, not for Nvidia. Take what I say with the same salt as you would any other hobbyist's response.

diablodale commented 4 years ago

Thanks for sharing, @piquan; I follow what you wrote. Reading your notes and referring again to docs like https://docs.nvidia.com/video-technologies/video-codec-sdk/nvdec-video-decoder-api-prog-guide/index.html and https://docs.nvidia.com/cuda/nvjpeg/ leads me to think that... please do correct/teach me 😉...

  1. Every GPU is different. NVidia hardware (both CUDA cores and the NVDEC hardware engine) has varying support for decoding: unsupported, software-only, software + GPU hardware, etc. For example, the doc writes "Starting with CUDA 11.0, hardware accelerated JPEG decode is available on GA100."
  2. It is doubtful that the full decoding process runs in GPU hardware for Kinect scenarios. My thinking is based on general knowledge of MJPEG decoding plus the nvJPEG doc's mention of "host and device phases of the decode process", with the host phase described elsewhere in that doc as the Huffman decoding. Huffman decoding is inherently serial because of the algorithm itself.
  3. NVJpeg seems to be using CUDA, with mentions like "...of the CUDA toolkit that was used to build nvJPEG library" and "If NULL is provided, then the default CUDA runtime cudaMalloc() and cudaFree() functions will be used.". I don't see any reference to NVDEC in the NVJpeg doc. This aligns with what you wrote: "but for JPEG it uses CUDA only".
  4. The NVJpeg library can run everything on a single isolated GPU thread, but that single thread is decoding one whole image alone. Since a single isolated GPU thread is relatively slow, it takes 100+ MJPEG images batched together before the slowness and overhead of the isolated-GPU-thread approach finally surpasses the speed of doing it on the CPU. That's why the usual/default approach is to decode Huffman on the CPU and then, if the specific hardware supports acceleration of the second phase, pass the rest of the work to the GPU; otherwise continue on the CPU.
  5. For Kinect, I don't want to batch/stream 100 frames; that means my frames are probably delayed (given 30 fps). A specific frame's decode time (latency) is likely somewhere between 1/30 sec and 100/30 sec.
  6. Most libraries (NVDECODE, NVJpeg, Intel Media SDK, etc.) hide details for easier usage. They instead declare "I can read xyz format and output abc format" and then internally juggle what runs where, given hardware specifics and hints like the params to nvjpegCreate().

Hardware acceleration is not consistent, either across vendors or within a single vendor's hardware. I encourage the use of frame formats which can be implemented with GPU hardware acceleration and/or CPU vectorization/acceleration. I often see 4:2:2 lacking acceleration while 4:2:0 is widely accelerated. The rub is this de facto 4:2:2 MJPEG UVC thing. After what you wrote, I lean towards keeping 4:2:2 just for the reason "to not break things". Which I dislike, as that often means legacy baggage: "Can't switch to 4:2:0 to gain acceleration because that will break the 4:2:2 users... therefore no acceleration". 🤦‍♀️ If the Kinect Azure's firmware can accommodate it, I now prefer an additional datastream, K4A_IMAGE_FORMAT_COLOR_MJPG420.

This month, I will add NVJpeg to a project and I'll share relevant info that surfaces.
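For anyone following along, the single-image decode path is roughly this shape as I read the nvJPEG docs (untested sketch; file loading, error handling, and stream management trimmed):

```cpp
// Sketch: decode one JPEG with nvJPEG into interleaved RGB on the GPU.
// Assumes the JPEG bytes are already in `jpg`/`jpgSize`. Error checks omitted.
#include <cuda_runtime.h>
#include <nvjpeg.h>

void decodeOne(const unsigned char* jpg, size_t jpgSize)
{
    nvjpegHandle_t handle;
    nvjpegJpegState_t state;
    nvjpegCreateSimple(&handle);            // default backend selection
    nvjpegJpegStateCreate(handle, &state);

    // Query dimensions and chroma subsampling (e.g. NVJPEG_CSS_422 for the Kinect stream).
    int nComponents;
    nvjpegChromaSubsampling_t subsampling;
    int widths[NVJPEG_MAX_COMPONENT], heights[NVJPEG_MAX_COMPONENT];
    nvjpegGetImageInfo(handle, jpg, jpgSize, &nComponents, &subsampling, widths, heights);

    // One interleaved RGB surface in device memory.
    nvjpegImage_t out = {};
    out.pitch[0] = widths[0] * 3;
    cudaMalloc(reinterpret_cast<void**>(&out.channel[0]), out.pitch[0] * heights[0]);

    nvjpegDecode(handle, state, jpg, jpgSize, NVJPEG_OUTPUT_RGBI, &out, /*stream=*/0);
    cudaStreamSynchronize(0);

    // ... use or download out.channel[0] here ...
    cudaFree(out.channel[0]);
    nvjpegJpegStateDestroy(state);
    nvjpegDestroy(handle);
}
```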

piquan commented 3 years ago

That’s interesting! I agree with your conclusions; this seems to be gated on the idea that the UVC standards require 4:2:2, and (not knowing anything about this project) the firmware should conform to the standards without explicit instructions to the contrary. As you point out, some use cases may benefit from 4:2:0, but presumably that would need special dispensation. That said, I don't think that a real-time datastream is likely to benefit from GPU acceleration: the decode is so much faster than your camera that the inefficiencies of blocking on an empty pipeline eat up any benefits.

As to your premises, let me start by thanking you for the research! I hadn’t been aware of the nvJPEG library before; I’d only used the NVDECODE library previously.

As for your particular points, it looks like you’re pretty reasonably informed now. I’ll see if I can fill in some gaps, but this is just to help with your own curiosity; I don’t think it has a significant bearing on the disposition of this issue.

The most significant thing to remember is the difference between latency and throughput. A lot of people use hardware decode to train neural nets, perform offline transcoding (like Netflix archiving a DVD), perform facial recognition on an incoming stream of thousands of users' posts, or other tasks that benefit from high throughput but don't care much about latency of a few hundred milliseconds. When a task (JPEG decoding) consists of many subtasks, and those subtasks may have different scheduling requirements such as serialized (Huffman decoding), large-block parallelism (iDCT decoding), and small-block parallelism (YUV→RGB transformation), then pipeline management becomes very important. That's when the distinction between latency and throughput comes into sharp focus. From an AI trainer's point of view, being able to pipeline 500 frames can have real benefits. But if your source (Kinect camera) is already rate-limited to 30 fps, then that pipeline doesn't mean much; you can easily shove one image in the pipe, wait for it to come out 2ms later, and not worry about keeping the pipeline full.

For the full hardware acceleration treatment, you’d need silicon on the chip that’s dedicated to a particular format. That can take up a decent chunk of the chip. You can see some of the evolution of that at https://en.wikipedia.org/wiki/Nvidia_PureVideo, although (as with any encyclopedia) expect there to be some gaps in there. Note that the NVJPG cores in GA100 aren’t part of PureVideo; see also https://en.wikipedia.org/wiki/Ampere_(microarchitecture) for that.

So when somebody talks about whether or not JPEG decoding is hardware accelerated, there’s an ambiguity. Do you mean that every stage is hardware accelerated, or some stages are hardware accelerated? That’s left pretty ambiguous in the docs you’ve cited, and I don’t have anything at the moment to bolster them. I will say that architectures prior to GA100 did have some degree of acceleration support for JPEG, but I infer that was all through CUDA. It may have been doing the Huffman stage on the CPU, the YUV→RGB stage on the GPU, and I have no idea where the iDCT would have been done.

How does this go when it gets down to the brass tacks? Well, I did some timing. On a Ryzen 7 2700X, with libjpeg-turbo (a common CPU-based JPEG library), I got about 1ms latency on a CPU-only decode if I threw one CPU at it. Using nvJPEG on my RTX 2070, I had about 2ms latency, and that would take a full CPU and 8% of the GPU. My timing code used really simple calls into these libraries. I didn’t try to do any sort of parallelism; every operation was run to completion (and downloaded to the CPU) before a new one started. The sample code for nvJPEG can run the same inputs at about 1ms latency, using 90% of the CPU and 10% of the GPU.
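The CPU-only measurement amounts to timing a single decode, roughly like the following (illustrative sketch using libjpeg-turbo's TurboJPEG API, not the exact test code; file loading and error handling trimmed):

```cpp
// Sketch: time a single CPU-only decode with libjpeg-turbo (TurboJPEG API).
// Assumes the JPEG bytes are in `jpg`/`jpgSize`. Error checks omitted.
#include <chrono>
#include <cstdio>
#include <vector>
#include <turbojpeg.h>

void timeOne(const unsigned char* jpg, unsigned long jpgSize)
{
    tjhandle tj = tjInitDecompress();

    int width, height, subsamp, colorspace;
    tjDecompressHeader3(tj, jpg, jpgSize, &width, &height, &subsamp, &colorspace);

    std::vector<unsigned char> rgb(static_cast<size_t>(width) * height * 3);

    auto t0 = std::chrono::steady_clock::now();
    tjDecompress2(tj, jpg, jpgSize, rgb.data(), width, /*pitch=*/0, height,
                  TJPF_RGB, TJFLAG_FASTDCT);
    auto t1 = std::chrono::steady_clock::now();

    printf("decode took %.2f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count());
    tjDestroy(tj);
}
```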

Based on these measurements, I think it’s pretty clear that the nvJPEG library, even on hardware prior to GA100, does partially use the GPU. On hardware prior to GA100, it seems to still use the CPU for much of the work; I assume that’s changed after GA100, but didn’t test that.

I suspect that, if you’re not pipelining JPEG decodes with GPU-intensive tasks, then there’s no real point in using GPU-accelerated JPEG decoding. At least, not before GA100. Even then, I strongly suspect that with a 30 fps input, your input pipeline stalls will end up making any GPU gains evaporate; you spend so much time blocking for a mostly-empty pipeline to complete that you just don’t get anything out of it.

One last note, to clarify something I said earlier: I talked about CUDA as the general-purpose GPU hardware, to contrast it with PureVideo (aka CUVID). The library to use it is also named CUDA, and both the NVDECODE and nvJPEG libraries do expect you to use the CUDA library for things like synchronization, memory management, and such. I think that may have thrown you off in point 3.

diablodale commented 3 years ago

Thank you very much -- generous writing. There is good learning in this for strategy, and some of the meta-discussion surfaced a feature I can see being valuable for my customers 🙌👍

tim-depthkit commented 3 years ago

I wasn't aware that there is JPEG support built into NVIDIA's hardware either; that's very exciting, because it means there is a possibility to leverage hardware-accelerated MFTs to do the work.

I wrote a quick test to determine what hardware MFTs are available on my system that support MJPEG decoding:

NVIDIA MJPEG Video Decoder MFT
Intel® Hardware M-JPEG Decoder MFT
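A quick test like that can be done with MFTEnumEx, roughly as follows (sketch only; error handling trimmed, link mfplat.lib, mfuuid.lib, and ole32.lib):

```cpp
// Sketch: list hardware MFTs that accept MJPEG video input.
// Error checks omitted for brevity.
#include <windows.h>
#include <cstdio>
#include <cwchar>
#include <mfapi.h>
#include <mftransform.h>

int main()
{
    CoInitializeEx(nullptr, COINIT_MULTITHREADED);
    MFStartup(MF_VERSION);

    MFT_REGISTER_TYPE_INFO inputType = { MFMediaType_Video, MFVideoFormat_MJPG };
    IMFActivate** activates = nullptr;
    UINT32 count = 0;
    MFTEnumEx(MFT_CATEGORY_VIDEO_DECODER,
              MFT_ENUM_FLAG_HARDWARE | MFT_ENUM_FLAG_SORTANDFILTER,
              &inputType, /*pOutputType=*/nullptr, &activates, &count);

    for (UINT32 i = 0; i < count; ++i)
    {
        LPWSTR name = nullptr;
        UINT32 len = 0;
        activates[i]->GetAllocatedString(MFT_FRIENDLY_NAME_Attribute, &name, &len);
        wprintf(L"%s\n", name);        // e.g. the two decoders listed above
        CoTaskMemFree(name);
        activates[i]->Release();
    }
    CoTaskMemFree(activates);

    MFShutdown();
    CoUninitialize();
    return 0;
}
```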

Since these decoders are exposed to MediaFoundation, it should be possible to configure the IMFSourceReader used by the Azure Kinect SDK to use them.

This can be done by setting the MF_READWRITE_ENABLE_HARDWARE_TRANSFORMS attribute to true on the IMFSourceReader, and additionally creating a D3D device manager and setting it as the MF_SOURCE_READER_D3D_MANAGER attribute.
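In code, that configuration looks roughly like this (sketch only, not a patch to the SDK; mediaSource and d3dDevice are assumed to already exist, with the D3D11 device created with D3D11_CREATE_DEVICE_VIDEO_SUPPORT; error handling trimmed):

```cpp
// Sketch: create an IMFSourceReader that is allowed to pick hardware MFTs
// and decode into D3D11 surfaces. `mediaSource` and `d3dDevice` are assumed
// to exist already; error checks omitted. Link mfplat.lib, mfreadwrite.lib, mfuuid.lib.
#include <mfapi.h>
#include <mfidl.h>
#include <mfreadwrite.h>
#include <d3d11.h>

IMFSourceReader* CreateHardwareReader(IMFMediaSource* mediaSource, ID3D11Device* d3dDevice)
{
    // Wrap the D3D11 device in a DXGI device manager so MFTs can share it.
    UINT resetToken = 0;
    IMFDXGIDeviceManager* deviceManager = nullptr;
    MFCreateDXGIDeviceManager(&resetToken, &deviceManager);
    deviceManager->ResetDevice(d3dDevice, resetToken);

    IMFAttributes* attrs = nullptr;
    MFCreateAttributes(&attrs, 2);
    attrs->SetUINT32(MF_READWRITE_ENABLE_HARDWARE_TRANSFORMS, TRUE);
    attrs->SetUnknown(MF_SOURCE_READER_D3D_MANAGER, deviceManager);

    IMFSourceReader* reader = nullptr;
    MFCreateSourceReaderFromMediaSource(mediaSource, attrs, &reader);

    attrs->Release();
    deviceManager->Release();
    return reader;
}
```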

For the reasons outlined above, these decoders may not provide a fully hardware-accelerated decode. However, the main benefit is that it opens the possibility of getting D3D textures directly from the decoded IMFMediaSample, meaning that if the goal is to get these frames to the GPU for display or accelerated video encoding, they'll already be available there.

This would be about as hardware-agnostic as we could get, since it is up to (GPU) device vendors to provide support for hardware decoding via this interface; in theory, if other vendors support JPEG acceleration and expose it via Media Foundation in the same way, it would Just Work. It's possible that the newer Ampere GPUs with support for NVJPEG may expose this as a hardware MFT as well.

diablodale commented 3 years ago

?! The SDK doesn't today use Windows hardware acceleration?! I never reviewed that part of the codebase to check. 🤦‍♀️ I agree there is great benefit to keeping the output on the GPU. (Personally, I want it in OpenCL or OpenCV UMats.)

In my app (dp.kinect3), I have used the vendor SDKs from Intel and Nvidia to decode JPGs and pull the results back to CPU memory. Even with the (likely) copy back to CPU memory, there is a benefit to using hardware acceleration -- particularly with big frames like 4K. The Intel QuickSync decoder is impressive -- even on a 7-year-old laptop. The Nvidia is slightly less performant as it does the work with compute code; only the RTX 30xx series has a JPEG decoder in hardware. Intel had it in hardware years ago.

Does Microsoft -or- someone else have an SDK update to use hw MFTs?

piquan commented 3 years ago

The specifics vary from one use case to the next, but for many Kinect-related purposes, I’m not surprised that hardware accelerated JPEG decoding isn’t the default.

In general, most low-rate (30 FPS) sources won't benefit much from hardware JPEG acceleration. It's great if you've got a bunch of images that you're using to train a neural net, so you can feed them in a fast stream, or if you're going to use the output directly on the GPU as a texture or something. But otherwise, you end up spending a lot of time transferring data back and forth over PCIe, in the setup overhead to activate a new GPU program, etc.

Last I checked, it looked like the Kinect SDK did most of its work on the CPU. With hardware JPEG decoding, you end up doing the first stage (which is inherently serial within a single image) on the CPU, then transferring that to the GPU, waiting for it to decode, transferring the RGB data back, and then manipulating it on the CPU (such as matching it to the depth map data to get a pixel cloud or something).

If your pipeline were slow enough that you were spending more than 1/30s per frame, and structured your program so it would start processing one frame before the previous frame had finished, then you might see a benefit by offloading some of the CPU work to the GPU. Similarly, if you were doing something that used the decoded JPEG on the GPU, then you might have a benefit, but all the stuff like transforming the image space is currently on the CPU anyway.

That may have changed with Ampere GPUs, which can do more of the JPEG work (specifically the first stage) in specialized hardware. I haven’t done any work with them.

But I don’t think that using the GPU acceleration is an automatic win.

nb: This is my own opinion, not that of my employer. It’s also based only on my own personal programs; other use cases may vary.

diablodale commented 3 years ago

Happy new year @piquan. I just last week finished my nvJPEG work. In general, I had a good experience. Using Task Manager (not the best tool, but...) I can see copies happening. My old machine (Kepler) does more copy work than my new one (Turing). There was an architectural change between these models in how Nvidia GPUs manage memory copies (the "Page Migration Engine"); perhaps this is involved.

When I use the Intel QuickSync engine, I see no copies on newer machines. This could easily be a fault in the perf counters. Or it could be because the newest Intel GPUs only work in memory shared with the CPU, so when they write the decoded frame, that frame is written directly to (pinned and shared) CPU memory. At worst, it's a CPU-memory-to-CPU-memory copy.

Agreed, the use case is important to consider. My DLL code provides data to a master EXE that I don't control, which polls me single-threaded at a fixed rate. I have to balance latency (providing old frames) against the processing time between polls by that master EXE. I can't know how much time the master EXE needs, and its compute time changes based on its customers' desires. So when that poll comes, I need to serve my data ASAP so that the remaining time within the 30 fps budget is available for the master EXE's computation. Yet I also must not provide frames that are substantially old, because no one likes laggy frames. And I can only pipeline so much since the Kinect is 30 fps. haha. 🤹‍♀️

I've seen in my testing that, given a computer can easily have three compute devices (CPU, Intel integrated GPU, Nvidia discrete GPU), the specific "best" option is not consistent. The size of the frame, the output format, and the specific compute devices involved make material differences. For example, small Kinect frames output in a YUV format are fast on the CPU because there are few bytes and no colorspace transforms. But if I want 4K ARGB frames, then my use case sees benefits with GPUs even with their potential copies.

[This is more for Microsoft readers] I can imagine a flag or something in the Kinect SDK that allows us to choose software or a specific hardware decoder. For example, my test machines have Intel and Nvidia GPUs in the same computer. Usually, the Nvidia is more powerful. However, with Intel's QuickSync and its shared memory, the Intel does the decoding job in less end-to-end time if my destination is CPU memory. So choosing the "first" or "most powerful" GPU in a list is not reliable.
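Purely to illustrate the idea, something like the following could work (entirely hypothetical names; nothing like this exists in the SDK today):

```cpp
// Hypothetical sketch only -- none of these names exist in the Azure Kinect SDK.
// The idea: let the caller pick which engine decodes the MJPG color stream.
typedef enum
{
    K4A_COLOR_DECODE_ENGINE_AUTO = 0,  // SDK decides
    K4A_COLOR_DECODE_ENGINE_SOFTWARE,  // force CPU decode
    K4A_COLOR_DECODE_ENGINE_HARDWARE,  // first hardware decoder found
    K4A_COLOR_DECODE_ENGINE_INDEX,     // a specific adapter chosen by index
} k4a_color_decode_engine_t;           // hypothetical type

// Hypothetical call, made before k4a_device_start_cameras():
// k4a_device_set_color_decode_engine(device, K4A_COLOR_DECODE_ENGINE_INDEX, /*adapter=*/1);
```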

tim-depthkit commented 3 years ago

Does Microsoft -or- someone else have an SDK update to use hw MFTs?

@diablodale It should be pretty simple to hack in support for HW MFTs, but would likely require a bit of work to design the right API to manage enabling it, and the optional ability to get GPU textures instead of CPU buffer frames.

I can imagine a flag or something in the Kinect SDK that allows us to choose SW or the specific HW.

That would be awesome; unfortunately, I haven't been able to find a way to do this using an IMFSourceReader at all. The interface is a bit of a black box: beyond telling it that you want to use a HW MFT, it builds its own pipeline, chooses the decoder MFT internally, and seems to choose the first MFT in the list.

I would also love to hear from Microsoft on if HW MFTs were considered or not in the initial API design phase, and what their thoughts are now on integrating something like this.

diablodale commented 3 years ago

Wooo doggy, we have delightfully strayed from the OP. I read a lot about "registering an MFT locally", which gives merit/priority to that locally registered MFT during topology auto-selection. The API docs have some info, and here's a thread where someone forces the Intel QuickSync MFT https://community.intel.com/t5/Media-Intel-oneAPI-Video/How-to-use-Media-Foundation-plugins-provided-by-IntelMediaSDK/td-p/928082
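Local registration is roughly this shape (sketch only; the caller must supply the real CLSID of the vendor MFT, which isn't reproduced here, and the enum flags may need adjusting for an async/hardware MFT; error handling trimmed):

```cpp
// Sketch: register a specific decoder MFT process-locally so that topology
// resolution prefers it. The CLSID of the actual vendor MFT must be supplied
// by the caller. Error checks omitted.
#include <mfapi.h>
#include <mftransform.h>

void PreferLocalMjpegDecoder(REFCLSID clsidDecoder)
{
    MFT_REGISTER_TYPE_INFO input  = { MFMediaType_Video, MFVideoFormat_MJPG };
    MFT_REGISTER_TYPE_INFO output = { MFMediaType_Video, MFVideoFormat_NV12 };

    // Local registration only affects this process and needs no admin rights.
    MFTRegisterLocalByCLSID(clsidDecoder,
                            MFT_CATEGORY_VIDEO_DECODER,
                            L"Preferred MJPEG decoder",
                            MFT_ENUM_FLAG_SYNCMFT,   // adjust for async/hardware MFTs
                            1, &input,
                            1, &output);
}
```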