microsoft / Azure-Kinect-Sensor-SDK

A cross platform (Linux and Windows) user mode SDK to read data from your Azure Kinect device.
https://Azure.com/Kinect
MIT License

Corrupted JPEG Stream #1354

Open jmachowinski opened 4 years ago

jmachowinski commented 4 years ago

Describe the bug

We are experiencing a rather strange bug with the color stream of the Kinect. On an AMD EPYC 7302 machine, the color stream is corrupted. This is reproducible using k4aviewer and the ROS wrapper.

I patched in a better debug message in case the MJPEG decoding fails and it gave me this:

DecodeMJPEGtoBGRA32(). MJPEG decode failed, dropping image: Not a JPEG file: starts with 0xfe 0x77
DecodeMJPEGtoBGRA32(). MJPEG decode failed, dropping image: Not a JPEG file: starts with 0xf8 0x54
DecodeMJPEGtoBGRA32(). MJPEG decode failed, dropping image: Corrupt JPEG data: premature end of data segment
DecodeMJPEGtoBGRA32(). MJPEG decode failed, dropping image: Corrupt JPEG data: 1283 extraneous bytes before marker 0xd3
DecodeMJPEGtoBGRA32(). MJPEG decode failed, dropping image: Corrupt JPEG data: 161 extraneous bytes before marker 0xd4
DecodeMJPEGtoBGRA32(). MJPEG decode failed, dropping image: Not a JPEG file: starts with 0xfb 0xbf
DecodeMJPEGtoBGRA32(). MJPEG decode failed, dropping image: Corrupt JPEG data: 9005 extraneous bytes before marker 0xd6
DecodeMJPEGtoBGRA32(). MJPEG decode failed, dropping image: Corrupt JPEG data: found marker 0xd5 instead of RST3
DecodeMJPEGtoBGRA32(). MJPEG decode failed, dropping image: Corrupt JPEG data: 841 extraneous bytes before marker 0xd6
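A minimal sketch of why the decoder rejects these frames, not the SDK's own code: a JPEG stream begins with the SOI marker 0xFF 0xD8 and ends with the EOI marker 0xFF 0xD9, so a frame that starts with bytes such as 0xfe 0x77 cannot be a valid image. The helper name `looks_like_jpeg` is hypothetical, and the check is only a cheap pre-filter, since a frame can pass it and still fail to decode.

```python
SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def looks_like_jpeg(frame: bytes) -> bool:
    """Cheap sanity check: the frame starts with SOI and ends with EOI.

    This does not prove the frame decodes. It only rejects frames that
    cannot be complete JPEG images. Assumption: the device does not pad
    frames after EOI; if it does, relax the EOI check.
    """
    return len(frame) >= 4 and frame.startswith(SOI) and frame.endswith(EOI)


if __name__ == "__main__":
    plausible = SOI + b"\x00" * 16 + EOI
    corrupted = b"\xfe\x77" + b"\x00" * 16  # leading bytes from the first log line above
    print(looks_like_jpeg(plausible), looks_like_jpeg(corrupted))
```

Counting how many frames fail such a pre-check, compared with how many fail inside the decoder, could help separate frames with lost headers from frames with damaged payloads.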

This then leads to the following error, as there is no color image available for the depth measurement:

replace_sample(). capturesync_drop, releasing capture early due to full queue TS: 57300211 type:Depth

Note: usbfs_memory_mb is set to 1024.

To Reproduce

On an AMD EPYC 7302:

  1. Start k4aviewer
  2. Enable color camera
  3. Disable all other streams
  4. Choose any resolution / fps (720p / 5 FPS also fails)
  5. Start
  6. The log shows a bunch of "MJPEG decode failed, dropping image" messages
  7. After a few seconds a popup 'Camera failure : timed out!' appears

Desktop

Additional context

Note that on our normal desktop machines everything works fine.

We also tried using an external PCI-E USB host controller (DeLock 90492), but this did not work either. The controller also gave us a bunch of weird kernel warnings, so it might be a driver issue with this card.

Is there any PCI-E USB host controller with at least two ports that is known to work under Linux?

Grabbing only the depth stream seems to work. Out of curiosity, is there some sort of CRC on the depth data, to detect corruption?

jmachowinski commented 3 years ago

We debugged this issue further. It seems the Azure Kinect is extremely sensitive to the USB host controller used. So far we tested:

Asmedia 3142

Renesas/NEC - µPD720202

Renesas/NEC - µPD720201

Fresco Logic FL1100

RoseFlunder commented 3 years ago

Yes, it definitely is. The choice of USB cable also matters. I had cables labeled as USB 3.0 that failed after some minutes. The cable that comes with the Kinect itself works best, but it is a bit short.

Have you already seen this page? https://docs.microsoft.com/en-us/azure/kinect-dk/troubleshooting

"For the Azure Kinect DK on Windows, Intel, Texas Instruments (TI), and Renesas are the only host controllers that are supported. The Azure Kinect SDK on Windows platforms relies on a unified container ID, and it must span USB 2.0 and 3.0 devices so that the SDK can find the depth, color, and audio devices that are physically located on the same device. On Linux, more host controllers may be supported as that platform relies less on the container ID and more on device serial numbers."

qm13 commented 3 years ago

Yes, only select USB controllers are supported, per the text quoted by @RoseFlunder.

Also make sure you are using Sensor SDK 1.4.1 as we made some changes to improve resilience to dropped color frames. See issue #1194.

jmachowinski commented 3 years ago

I was aware of the troubleshooting page. But let's just say the information regarding Linux is sparse...

We are using the latest firmware and SDK 1.4.1.

As for our problem, we are currently testing the Fresco Logic FL1100 with only one camera per host controller. This ran stably for 3 hours with one camera yesterday. We are now trying multiple cards with one camera each.

One thing I noticed while testing is that only the color stream is unstable. The depth and IMU streams seem to be stable all the time.

I also suspect a bug in the ROS node, as it does not detect a broken color stream.

RoseFlunder commented 3 years ago

I haven't measured the uptime of our PCs running the Kinects, but I think we kept them running for 24 h without an error on Ubuntu, with the default cable, connected to USB ports on a mainboard with an Intel chipset. We don't use the ROS node, but we don't do anything different from it when getting captures: we just call "getCapture" with an infinite timeout. We also use only one Kinect per PC; each PC sends its data to a central PC for processing.

qm13 commented 3 years ago

@jmachowinski I have passed your ROS node issue to the Microsoft ROS team.

ooeygui commented 3 years ago

@jmachowinski When you say the ROS node fails after x amount of time, can you post the failure? Does it crash or start throwing errors?

jmachowinski commented 3 years ago

@ooeygui The node does not crash, but it stops publishing any data. The log is full of

[2020-09-18 08:52:14.087] [error] [t=8620] /w/1/s/extern/Azure-Kinect-Sensor-SDK/src/capturesync/capturesync.c (142): replace_sample(). capturesync_drop, releasing capture early due to full queue TS:3874968522 type:Depth
[2020-09-18 08:52:14.119] [error] [t=8620] /w/1/s/extern/Azure-Kinect-Sensor-SDK/src/capturesync/capturesync.c (142): replace_sample(). capturesync_drop, releasing capture early due to full queue TS:3875001866 type:Depth

As the color stream broke down completely.
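As a rough illustration of what these log entries show, the sketch below extracts the TS and type fields from capturesync_drop messages and computes the spacing between consecutive drops. The function names are hypothetical, and the unit of TS is an assumption: the two entries above are 33344 apart, which would match one frame period at 30 fps if TS is in microseconds, suggesting that every depth frame is being dropped.

```python
import re

# Matches the capturesync_drop message format shown in the log above.
DROP_RE = re.compile(r"capturesync_drop.*?TS:\s*(\d+)\s+type:(\w+)")


def parse_drops(log_text: str) -> list[tuple[int, str]]:
    """Return (timestamp, stream type) for every capturesync_drop entry."""
    return [(int(ts), kind) for ts, kind in DROP_RE.findall(log_text)]


def drop_intervals(drops: list[tuple[int, str]]) -> list[int]:
    """Differences between consecutive drop timestamps, in TS units."""
    stamps = [ts for ts, _ in drops]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    sample = (
        "replace_sample(). capturesync_drop, releasing capture early due to "
        "full queue TS:3874968522 type:Depth\n"
        "replace_sample(). capturesync_drop, releasing capture early due to "
        "full queue TS:3875001866 type:Depth\n"
    )
    print(drop_intervals(parse_drops(sample)))
```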

I already looked into the code, and my guess (only a guess, not verified yet) is that if k4a::device::get_capture() is called with an infinite timeout, the function never returns once the color stream has broken down.

I just looked at the code again, and now I am pretty sure this is the case. With an infinite timeout, we wait for a push to the capture queue. As the color stream broke down, this never happens and the ROS node just stalls. I think a good workaround would be to use a timeout of double the frame period, and to shut down the node if we fail to receive X consecutive captures.
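The proposed workaround can be sketched as follows. This is an illustration under stated assumptions, not code from the SDK or the ROS node: `get_capture` is a hypothetical callable standing in for a capture call with a finite timeout that returns `None` when it times out, and the default of 5 consecutive misses is an arbitrary placeholder for "X".

```python
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


class StreamStalled(RuntimeError):
    """Raised when too many consecutive capture attempts time out."""


def captures_with_watchdog(
    get_capture: Callable[[float], Optional[T]],  # hypothetical: returns None on timeout
    fps: float,
    max_consecutive_timeouts: int = 5,  # the "X" from the comment above; arbitrary value
) -> Iterator[T]:
    """Yield captures, raising StreamStalled after repeated timeouts."""
    timeout_s = 2.0 / fps  # double the frame period, as proposed above
    misses = 0
    while True:
        capture = get_capture(timeout_s)
        if capture is None:
            misses += 1
            if misses >= max_consecutive_timeouts:
                raise StreamStalled(f"{misses} consecutive capture timeouts")
            continue
        misses = 0  # any successful capture resets the counter
        yield capture
```

A caller would catch `StreamStalled` and shut the node down or restart the device, instead of blocking forever on an empty capture queue.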

jmachowinski commented 3 years ago

As for the long-running tests: 2 of 3 cameras stopped sending color images after some hours. I'll patch in some additional debug messages to determine the exact cause of this.

I also noticed that this is CPU load dependent. Without subscribing to the topics of the node, the color stream did not drop out (I verified in the code that the capture from the camera actually happens in this case). This behavior puzzles me, as I run the tests on a 64-core machine, and every thread basically has its own CPU.

ooeygui commented 3 years ago

@jmachowinski Thank you for the investigation! I need to profile the code to see where the slowdown is occurring. We do know that there are some problems with the OpenCV version being used with Melodic, and there are some inefficiencies in the Kinect ROS node. Additionally, since it is using ROS pub/sub without a quality of service monitor, the publishing rate itself may be too fast for the network stack, so we may need to throttle upstream of it.

I'll post findings when I am able to carve off time to work on this ROS node.

jmachowinski commented 3 years ago

To answer my own question:

Out of curiosity, is there some sort of CRC on the depth data, to detect corruption?

The depth data is transmitted in USB bulk mode (bulk transfers are error-checked and retried by the USB protocol), so it should always be complete.

I debugged this further and I am pretty sure we are dealing with a firmware issue here. My findings so far: the color stream is a USB isochronous transfer, while depth and IMU are transmitted as USB bulk transfers.

The isochronous transfer is started in the libuvc part and keeps on going. Using Wireshark, one can see that the USB transfers are sent out to the device. At the start, the color device correctly fills in the payload of the transfer, and everything works. At some point, the color device just stops doing this. Also note that no error is transmitted to the USB host. If one adds debug output to the function at https://github.com/microsoft/libuvc/blob/5fc483d596c63f1bcd36be35d512468c0b75c5f3/src/stream.c#L601 this behavior is visible.

Any ideas ?

jmachowinski commented 3 years ago

As it might be useful to someone: we have now found a 4x FL1100 PCIe card from Basler (2000036233). This one works flawlessly under Linux if only the depth and IMU streams are acquired.

We performed a one day test with 5 cameras, without encountering any problems.

jmachowinski commented 3 years ago

@qm13 Can you reproduce my findings? If not, can I do anything to help speed this up? Provide test cases, etc.

wes-b commented 3 years ago

Do you have a workaround to this problem using the above USB Host controllers? A workaround meaning K4AViewer runs without stopping the stream? DecodeMJPEGtoBGRA32 errors are to be expected when USB is congested.

Renesas/NEC - µPD720201 Fresco Logic FL1100

I am curious if this issue repro's with firmware 1.6.102075014? Can you try this older firmware? We did make a minor change in UVC that may not have shown up in our testing.

I am also curious how your setup runs with other PC's. As you have determined, we don't have a lot of data on USB Host controllers on Ubuntu. It would be great if you could test with a mother board that has TI/Intel on it.

If that doesn't work then, you will need to dig into LibUvc more unfortunately, as we can't repro this.

The isochronous transfer is started in the libuvc part and keeps on going. Using Wireshark, one can see that the USB transfers are sent out to the device.

If you can see the ISOCH transfers in Wireshark, then I wonder if the firmware is fine and the bug is in LibUvc. At the completion of each packet you should be able to see LibUvc calling libusb_submit_transfer; that has to keep happening for new packets to be received from the driver.

jmachowinski commented 3 years ago

Do you have a workaround to this problem using the above USB Host controllers? A workaround meaning K4AViewer runs without stopping the stream? DecodeMJPEGtoBGRA32 errors are to be expected when USB is congested.

No. Using the FL1100 controller made this more stable, but after around 40 minutes the color stream drops out. We use a setup where each camera has its own USB host controller, so the load per bus is negligible (~1.8 MB/sec for depth and around 3.8 MB/sec for color). The dropouts in the color stream are therefore already strange.

I am curious if this issue repro's with firmware 1.6.102075014? Can you try this older firmware? We did make a minor change in UVC that may not have shown up in our testing.

We'll perform tests on Monday and will report back.

I am also curious how your setup runs with other PC's. As you have determined, we don't have a lot of data on USB Host controllers on Ubuntu. It would be great if you could test with a mother board that has TI/Intel on it.

We performed tests with a 10th gen Intel USB controller (Q470), and it ran stably for more than a day. We also have a production system using a 7th gen Intel USB controller (Q170) that has been running stably with 2 cameras for weeks now.

As for the TI-based controllers, are you referring to the TUSB7340 host controller? This one seems to be out of production, and we can't buy TI-based PCIe cards anywhere.

For our current project we are bound to the AMD platform, and sadly there are no Intel-based PCIe extension cards that we are aware of.

If you can see the ISOCH transfers in Wireshark, then I wonder if the firmware is fine and the bug is in LibUvc. At the completion of each packet you should be able to see LibUvc calling libusb_submit_transfer; that has to keep happening for new packets to be received from the driver.

I added debug code in case any error shows up, especially in the error path where the transfer would NOT be resubmitted. I also added debug code that writes a message to cout for every 10000 resubmitted transfers.

The behavior I get from this is that there are no errors, and I get a continuous stream of messages showing that the transfers get resubmitted. At some point in time, I only get transfers back where pkt->actual_length is zero. This is consistent with the data shown in Wireshark. This led me to suspect that it might be the firmware.
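The symptom described above, transfers that keep completing but carry no payload, can be detected with a small counter. This is only a sketch under assumptions: the class name is hypothetical, the caller is expected to pass in the actual_length of every packet of each completed transfer, and the threshold of 100 empty transfers is an arbitrary illustrative value. A threshold is needed because occasional empty transfers may be normal between frames.

```python
class ZeroLengthStallDetector:
    """Flags a stream as stalled after many consecutive empty transfers."""

    def __init__(self, threshold: int = 100):  # arbitrary illustrative threshold
        self.threshold = threshold
        self.empty_run = 0  # consecutive transfers without any payload

    def on_transfer(self, packet_lengths: list[int]) -> bool:
        """Feed the actual_length of each packet of one completed transfer.

        Returns True once `threshold` consecutive transfers carried no data.
        """
        if any(length > 0 for length in packet_lengths):
            self.empty_run = 0
        else:
            self.empty_run += 1
        return self.empty_run >= self.threshold


if __name__ == "__main__":
    detector = ZeroLengthStallDetector(threshold=3)
    for lengths in ([1024, 0], [0, 0], [0, 0], [0, 0]):
        print(detector.on_transfer(lengths))
```

Logging the time of the first flagged transfer alongside any packet-level status codes could help pin down what happened immediately before the payload stopped.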

I must also say that I have only limited knowledge of USB internals. If I understood it correctly, the bus is completely host controlled. Any endpoint may only submit data after it was granted a 'timeslot' by the host. This is done by sending an IN packet to the endpoint. My wild guess here would be that the frequency of the polling is unstable, and that this somehow upsets the device.

wes-b commented 3 years ago

Your results are in line with what we have seen on Windows, where we have seen a wider variety of controllers. The unfortunate fact (perhaps fortunate, depending on how you look at it) is that using an unsupported host controller typically fails miserably right away. In this case the controller seems to run really well until the error hits.

jmachowinski commented 3 years ago

I am curious if this issue repro's with firmware 1.6.102075014? Can you try this older firmware? We did make a minor change in UVC that may not have shown up in our testing.

We performed the tests with the older firmware. The behavior is the same as with the newer firmware.

Is there a particular reason why the color image is transmitted in isochronous mode? The bulk transfer seems to be stable, and would likely solve this issue.

jmachowinski commented 3 years ago

I started a second test yesterday, just letting a webcam viewer run. It turns out that using the in-kernel UVC driver (e.g. opening /dev/video0) seems stable. The viewer has now been running for ~12 hours and the video stream is still working.

jmachowinski commented 3 years ago

Any update on this issue ?

SimoSR commented 3 years ago

We are experiencing similar behaviour in the following configuration:

In this configuration:

The problem is that the RGB stream is unstable: the FPS drops and the image is sometimes corrupted (some part of the RGB image is displayed shifted with respect to its correct position).

Does anyone know if the unstable behaviour of the RGB stream is caused exclusively by a USB host controller that is not officially supported by the Azure Kinect DK, as stated under host controller issues at https://docs.microsoft.com/it-it/azure/kinect-dk/troubleshooting ?

@jmachowinski Which type of host controller did you use in the tests that revealed the instability of the RGB stream?

jmachowinski commented 3 years ago

We tested almost every USB controller that you can buy as an extension card at the moment. The color stream fails with all of them. But as I said above, I don't believe this is an issue of the USB Controllers, but rather an issue with the Azure.

jmachowinski commented 3 years ago

Happy new year ! Any news on this issue ?

qm13 commented 3 years ago

@jmachowinski We have been investigating this issue using an ASMedia 3142 card and can see a bunch of device-level issues in the USB traces. However, whilst these issues result in dropped frames, they should not result in k4aviewer stopping. We would like to get a little more coverage on this and ask you to perform the following test.

Repeat your test using the built-in webcam application instead of k4aviewer. Does the webcam application fail after a short time?

jmachowinski commented 3 years ago

@qm13 Are you referring to webcam by Gerd Knorr? If yes, shall I use a special configuration or just the default one?

For my last test I used guvcview

jmachowinski commented 3 years ago

I just repeated the test using guvcview. The camera fails after 5-10 minutes.

Kernel: 5.4.0-65-generic
RGB camera firmware: 1.6.110
Resolution: 4K

jmachowinski commented 3 years ago

I also repeated the test a second time using guvcview, with the resolution reduced to 720P. This one has now been running for 3 hours without problems.

jmachowinski commented 3 years ago

The 720P test ran over the weekend (~60 Hours in total) and is still fine.

qm13 commented 3 years ago

@jmachowinski this seems to point to low level issues with some USB controllers under high bandwidth utilization. Some apps do better at handling the issues than others. How does the k4aviewer run with 720P?

jmachowinski commented 3 years ago

@qm13 I still don't think the USB controller is the issue here. The bandwidth usage is actually very low: usbtop shows a maximum of 8 MB/sec if the 4K stream is active, plus ~1.7 MB/sec for depth and ~1.7 MB/sec for IMU and microphone. I can copy data at 500 MB/sec from an SSD all day long without issues.
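To put these figures in proportion, here is a back-of-the-envelope calculation. It assumes a USB 3.0 link (5 Gbit/s signalling with 8b/10b line coding, so 500 MB/s raw) and ignores protocol overhead and any bandwidth the host reserves for isochronous endpoints, so it is only a rough bound. The function name is hypothetical.

```python
def bus_utilisation(stream_rates_mb_s: dict[str, float], link_capacity_mb_s: float) -> float:
    """Fraction of the link's raw capacity used by the given streams."""
    return sum(stream_rates_mb_s.values()) / link_capacity_mb_s


# Assumption: USB 3.0, 5 Gbit/s signalling, 8b/10b line coding -> 500 MB/s raw.
USB3_RAW_MB_S = 5e9 * (8 / 10) / 8 / 1e6

# Rates as reported by usbtop in the comment above.
REPORTED_RATES_MB_S = {"color (4K)": 8.0, "depth": 1.7, "imu + microphone": 1.7}

if __name__ == "__main__":
    share = bus_utilisation(REPORTED_RATES_MB_S, USB3_RAW_MB_S)
    print(f"{share:.1%} of raw link capacity")
```

Under these assumptions the three streams together use only a few percent of the raw link capacity, which is consistent with the argument that plain bandwidth exhaustion is unlikely to be the cause.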

Also note that we are using a Basler USB card (https://www.baslerweb.com/de/produkte/vision-komponenten/pc-karten/usb-3-0-interface-card-pcie-fresco-fl1100-4hc-x4-4ports/). This one works flawlessly under Linux for capturing from 12 MP Basler cameras.

Also, if the USB controller had an issue, I would expect all data streams on the bus to stall or break. This is not what we are seeing. Only the color stream breaks down; IMU, microphone and depth are fine.

I'll try to perform some tests with k4aviewer next week.

qm13 commented 3 years ago

@jmachowinski The dev investigating this issue has not been able to repro it. He does not have access to a Linux box and investigated the issue on his Windows box. I am working on getting him a Linux box. In the meantime, would you be able to attempt to repro the issue on Windows?

jmachowinski commented 3 years ago

@qm13 Sorry for the delay, it has been a busy week... We don't have Windows licenses for our preproduction and developer machines, therefore I don't see a way to run a test on Windows.

I still want to perform a Linux test and check the timing of the iso transfer packets, but I have not had time yet. I'll report back when I have results.

jmachowinski commented 3 years ago

StreamBreaking.zip

I finally found some time to look into this again. Above is a USB capture from the point where the stream breaks. Interestingly enough, we can see in the dump that a protocol error occurred. I'm pretty sure libusb never reports an error. I'll add debug output to verify this.

Second interesting finding: I installed an FL1100 USB card in my developer machine (Ryzen 3700X), and there the error takes ages to occur. On the EPYC machine, it happens almost instantly.

jmachowinski commented 3 years ago

I added some debug output in libuvc and libusb. I see some errors at the packet level, and none at the transfer level. Judging from the code in libuvc (stream.c:633), errors at the packet level seem to be nominal.

To debug the cause of the error, I also added debug output in libusb. I see 'a lot' (~every second) of EXDEV status codes for some packets. While the stream is still working, I also infrequently see EPROTO.

In the last 4 tests I ran, the stream always broke directly after an EPROTO return for a single packet.
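A small sketch of how such per-packet status codes could be tallied. It assumes the added debug output yields kernel-style negative errno values (0 meaning success), which the thread does not confirm; the function name is hypothetical. Python's standard `errno` module is used only to map numbers back to symbolic names such as EXDEV and EPROTO.

```python
import errno
from collections import Counter
from typing import Iterable, Optional


def summarize_statuses(statuses: Iterable[int]) -> tuple[Counter, Optional[str]]:
    """Count non-zero packet statuses by name and report the last one seen.

    Assumption: statuses are kernel-style negative errno values, 0 = OK.
    Unknown codes are kept as their decimal string.
    """
    names = [errno.errorcode.get(-status, str(status)) for status in statuses if status != 0]
    return Counter(names), (names[-1] if names else None)


if __name__ == "__main__":
    # Made-up sequence for illustration, not data from the thread.
    observed = [0, -errno.EXDEV, 0, 0, -errno.EXDEV, 0, -errno.EPROTO]
    counts, last = summarize_statuses(observed)
    print(dict(counts), last)
```

If the last error before the payload stops is consistently EPROTO across runs, that would support the observation above with numbers rather than by eye.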

Description of the error codes: https://www.kernel.org/doc/html/latest/driver-api/usb/error-codes.html

kvuong2711 commented 3 years ago

Hello! Is there any progress made on this issue? I have the same issue with the Azure Kinect on Jetson Xavier NX while running k4arecorder to record rgb, depth, and imu, where the color stream would break silently and the recorded mkv file got corrupted.

crystal-butler commented 3 years ago

I am running the Sensor SDK on a Jetson Xavier NX like @kvuong2711 and see what might be a similar problem to what is being discussed in this issue thread. My Azure Kinect has had a firmware update, and I'm building from source so I can customize the Sensor SDK. I generally capture data with k4arecorder with IMU off and default parameters otherwise (or sometimes at a lower --rate of 15 fps).

I've installed the SDK around 7 times on Xaviers and have noticed a loose pattern of failure related to the RGB stream. Usually k4aviewer will display live and recorded mkv files just fine when running the Microsoft-supplied k4a libraries. However, when I build the SDK I almost always begin having trouble with RGB playback and live view in k4aviewer. I've tested the recording files on Linux laptop and desktop machines with the same SDK customizations and they give the same error: ../src/record/internal/matroska_read.cpp (1754): convert_block_to_image(). Failed to decompress jpeg image to BGRA format.

I made some changes to the code so that k4aviewer continues to play when the error is thrown, in k4aconvertingimagesource.h and include/k4arecord/playback.hpp. But if the recording is made/played back using the same customized k4arecorder and k4aviewer code built for Linux and with the same Kinect attached to the laptop or desktop, this problem doesn't occur. I've also had the problem when building the SDK without modifications.

On the Xavier I'm working with at the moment, the issue got better when I unplugged a USB drive that was attached to it; the problem improved further when I moved the Kinect's USB plug to a different port and went away entirely when I moved it again. I'm not sure whether the choice of port or number of other USB ports in use actually matters, or if replugging the device is what made the difference.

kvuong2711 commented 3 years ago

@crystal-butler First, I want to confirm that I also built my SDK from source with a customized k4arecorder, but I never had any issue with a corrupted RGB stream on either a good Linux laptop or a desktop server, only on the Jetson. On the Jetson, I do notice the RGB stream failure happening much more frequently (usually right after starting k4arecorder) if I have more devices plugged in (e.g., mouse + keyboard at the same time). The problem is usually gone, and I'm almost always able to record for a while, if I unplug everything else and leave only the Azure Kinect cable connected. But after a few minutes, the RGB stream randomly writes out garbage data (corrupted color files). I don't know the exact cause of this and I don't know if this helps, but since I switched to an SSD instead of a microSD card (read/write speed is about 10x faster), at least the problem of randomly corrupted data doesn't occur anymore.

crystal-butler commented 3 years ago

To follow up, I've got SSDs installed in all the Jetsons I work with, but I run the Jetson OS from an SD card and use the SSD for storage and swap. k4arecorder and k4aviewer run from the SD card but read & write to the SSD.

yonghochang77 commented 2 years ago

I use a 10 m USB 3.0 extension cable, and the error below occurs every time: "replace_sample(). capturesync_drop, releasing capture early due to full queue TS: 5033533 type:Depth". After I upgrade the firmware, the first run of the program displays fine. But if the program dies while debugging, the firmware seems to keep some leftover state: when I start the camera again, it is not initialized the way it is right after a firmware update. I would like a way to reset the camera to that freshly updated state. Can someone create a firmware factory-reset function for me?

ulricheck commented 2 years ago

Is there any progress on identifying the problems and providing solutions for using Azure Kinect on AMD Epyc platforms?

We recently purchased a Gigabyte G482-Z51 server (Epyc 7352). On Linux and Windows, capturing breaks soon after starting the stream with the error "corrupted jpeg stream" (NFOV_UNBINNED; it happens no matter which color resolution or framerate we choose; Linux and Windows, BGRA and MJPEG settings tested). We used Renesas µPD720201 based USB controllers (1 camera per chip). Even a single camera can't capture for more than a couple of seconds before the error occurs.

Before that, we tested the USB 3.0 cards on a Supermicro SYS-7049A-T (Xeon Silver 4216, Intel C621), and capturing with up to 4 cameras worked perfectly. In all systems we used Nvidia Quadro RTX6000/A6000 GPUs; the Linux systems run Ubuntu 18.04 and the Windows systems the latest Windows 10. We used Azure driver 1.4.1, and the last 3 Azure Kinect camera firmware releases were also tested (Linux).

Is there any chance for a fix which allows using Azure Kinect Devices on AMD Epyc based Hosts?

jmachowinski commented 2 years ago

The only solution that worked for us is to disable the RGB stream altogether. The problem with this is that you lose the synchronization feature, which is crucial in a multi-sensor setup.

I looked a lot into this problem on the software side, but could not find any obvious problems. https://github.com/libuvc/libuvc/issues/191 might be related. It seems that the ISO transfer frame is oversized. But this only puts additional load on the USB bus. It does not explain why it's not working on EPYC systems.

As we ultimately dumped the EPYC systems because of this, I don't have much time to look further into it.

Tips for further debugging: I found it extremely helpful to modify the test example from libuvc in the SDK. If one activates the full debug level and adds some additional debug output in stream.c on bad iso transfers, this problem is instantly visible. As soon as you see 'a lot' of "bad packet (isochronous transfer); status: %d" you are hitting the issue. This is also a good test for bad USB cables...
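To make "a lot" measurable, the debug output can be scanned for that message. The sketch below is an illustration with a hypothetical function name; it assumes the debug output was saved as text and that each message follows the format string quoted above with a decimal status value.

```python
import re
from collections import Counter

# Matches the libuvc message quoted above; %d is assumed to print a decimal integer.
BAD_PACKET_RE = re.compile(r"bad packet \(isochronous transfer\); status: (-?\d+)")


def count_bad_packets(log_text: str) -> Counter:
    """Count bad isochronous packets in the log, grouped by status value."""
    return Counter(int(status) for status in BAD_PACKET_RE.findall(log_text))


if __name__ == "__main__":
    # Made-up log excerpt for illustration; the status values are not from the thread.
    sample = (
        "bad packet (isochronous transfer); status: 1\n"
        "some other debug line\n"
        "bad packet (isochronous transfer); status: 1\n"
        "bad packet (isochronous transfer); status: 5\n"
    )
    print(count_bad_packets(sample))
```

Comparing the counts per minute between a known-good machine and the failing one, or between two cables, gives a simple number to track across test runs.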

Out of curiosity, is there a way to get paid support for this sensor? Last time we tried, the links on the website led to dead ends, and the MS hotline could not help us.

ulricheck commented 2 years ago

Thanks for your feedback. We'll work with Intel platforms in the meantime; however, it would be really appreciated if the MS team could provide some information about the compatibility of the Azure Kinect sensor with the various (multi-camera) platforms (CPU/chipset/USB controller) that were tested or approved.

Disabling RGB is not an option, as we depend on both streams in our application. Since the issue occurs on both platforms (Windows/Linux), I would expect the problem to be in the USB communication layer rather than in the actual driver, but I'm not an expert on hardware, and USB specifically.

fbabon commented 1 year ago

Still an issue with RGB using ASMedia 3142 :'(