luxonis / depthai-experiments

Experimental projects we've done with DepthAI.

RuntimeError: Communication Exception running gen2-triangulation #210

Status: Open · opened by madgrizzle 3 years ago

madgrizzle commented 3 years ago

Finally getting back around to trying out the face detection with stereo demo and I find that it runs fine for about 5 seconds but then crashes with the following error:

```
Traceback (most recent call last):
  File "main.py", line 207, in <module>
    frame = queues[i*4].get().getCvFrame()
RuntimeError: Communication exception - possible device error/misconfiguration. Original message 'Couldn't read data from stream: 'mono_left' (X_LINK_ERROR)'
```

It only crashes when someone steps in front of the camera and face detection starts working.

Luxonis-Brandon commented 3 years ago

Thanks for the report and sorry for the trouble. Not immediately sure. But CC: @Erol444 on this.

madgrizzle commented 3 years ago

If it helps, it failed today with a:

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

...and I just had another failure where it was the landmarks_left stream. I'm guessing something is crashing in the pipeline and it's just erroring out on whatever stream it's trying to retrieve.

Erol444 commented 3 years ago

Hello @madgrizzle, I have just pushed a small edit to the latest master branch of depthai-experiments. Could you try using that together with the latest develop branch of depthai-python? Check out the develop branch and run python3 examples/install_requirements.py to get the latest develop version of our library. Using that seems to fix the issue. Thanks for reporting!

madgrizzle commented 3 years ago

Awesome, will try tonight! Thanks for looking into it.

madgrizzle commented 3 years ago

It works great! Thanks for the fix!

madgrizzle commented 3 years ago

Well, it turns out that it just seems to work 'better', but it still crashes. I'm not entirely sure why I wasn't getting these messages before, but I recently updated depthai-python. This occurs right before the crash:

```
[14442C10411DC2D200] [470.453] [system] [critical] Fatal error. Please report to developers. Log: 'PoolBase' '66'
[14442C10411DC2D200] [267.191] [system] [critical] Fatal error. Please report to developers. Log: 'PoolBase' '66'
Traceback (most recent call last):
  File "main.py", line 214, in <module>
    frame = queues[i*4].get().getCvFrame()
RuntimeError: Communication exception - possible device error/misconfiguration. Original message 'Couldn't read data from stream: 'mono_left' (X_LINK_ERROR)'
```

VanDavv commented 3 years ago

@madgrizzle I'm not immediately sure why this issue occurs, will check with the team too. Could you specify which depthai version you're using?

VanDavv commented 3 years ago

@madgrizzle it appears to be a bug in the depthai itself and @themarpe managed to fix that right away (kudos!). I'll circle back once the experiment is updated with a library version containing the fix, so you'll be able to test again.

Thanks for reporting the issue!

madgrizzle commented 3 years ago

Just curious.. is the "bug in depthai itself" a bug in the FW or a bug in depthai-python (or something else)?

themarpe commented 3 years ago

@madgrizzle it was a bug in the FW regarding how messages are shared across the two available cores. More specifically, a cache coherency issue.

madgrizzle commented 3 years ago

I was hoping 2.11.0 would fix it for me (based on what I saw in the commits and announcement), but I get the same error message (X_LINK_ERROR). I've seen @Luxonis-Brandon demo gen2-triangulation on Twitter just recently, so I suspect my problem may be specific to me. I can try a different machine to see if it's an issue with my specific OAK-D or just the computer I'm using.

themarpe commented 3 years ago

@madgrizzle does it happen immediately or after some time running?

madgrizzle commented 3 years ago

@themarpe, not immediately, and the time it takes varies. At the moment, within 30 seconds or so. A couple of days ago (I had built your develop branch hoping the fix was in) it took a few minutes to crash. The OAK-D is powered by the power supply and plugged into an Intel NUC i7 via a USB-3 cable. I access the NUC via ssh/X server, but I wouldn't expect that to cause the issue.

madgrizzle commented 3 years ago

@themarpe, I spun up Ubuntu 20.04 on an RPi 4 and got everything installed and running, with the same result after a few seconds. It did manage to spit out more information, though:

```
terminate called after throwing an instance of 'dai::XLinkReadError'
  what():  Couldn't read data from stream: 'mono_left' (X_LINK_ERROR)
Stack trace (most recent call last) in thread 35627:
#6    Object "[0xffffffffffffffff]", at 0xffffffffffffffff, in
#5    Object "/lib/aarch64-linux-gnu/libstdc++.so.6", at 0xffff7bda6573, in __cxa_throw
#4    Object "/lib/aarch64-linux-gnu/libstdc++.so.6", at 0xffff7bda627f, in std::terminate()
#3    Object "/lib/aarch64-linux-gnu/libstdc++.so.6", at 0xffff7bda621b, in
#2    Object "/lib/aarch64-linux-gnu/libstdc++.so.6", at 0xffff7bda88cb, in __gnu_cxx::__verbose_terminate_handler()
#1    Object "/lib/aarch64-linux-gnu/libc.so.6", at 0xffff8c66fd67, in abort
#0    Object "/lib/aarch64-linux-gnu/libc.so.6", at 0xffff8c683138, in gsignal
Aborted (Signal sent by tkill() 35550 1000)
Aborted (core dumped)
```

Also, on another run, I got this in the stream:

```
[14442C10411DC2D200] [21.932] [system] [critical] Fatal error. Please report to developers. Log: 'class' '374'
```

If this seems like a device issue, we can close this and I'll look for help on the discord.

themarpe commented 3 years ago

@madgrizzle I'm investigating more. There was some additional issue I saw the last time, but haven't had time yet to debug. Will keep you posted

madgrizzle commented 3 years ago

@themarpe I did some testing and got frustrating results. Not sure any of this is useful, but figured more information is better than none.

I first incrementally enabled various parts of the pipeline (by commenting out the sections I didn't want enabled) and it seemed to run well (>5 minutes) until I enabled the very last part, which retrieves the landmarks: https://github.com/luxonis/depthai-experiments/blob/b3f72a2a1dfc27f3c55d987482141db6815c0f0c/gen2-triangulation/main.py#L231. Upon enabling that, it started to crash after a few seconds.

So I thought maybe it was just related to that particular model, so I adapted the program to use the facial-landmarks-35-adas model and it ran for a really, really long time. This seemed at the time to prove my hypothesis. To make sure, I switched back to the original model to verify it still crashed (which it did) and then switched back to the new adas one to verify it didn't crash... but then it did. That's the frustrating part.

I tried wiping the blob cache and installing the newest version of the blob converter, and it didn't seem to help.

themarpe commented 3 years ago

@madgrizzle thanks very much for extensive testing.

Current state from my side is that some sort of memory corruption happens, which seems to be accelerated when faces are being detected and/or unsupported config errors are printed (odd scenes where something is detected strangely cause this, afaik). The longest runs on my end were stationary scenes, while the shortest were dynamic movement, with both empty scenes and scenes with faces.

Regarding your observation - if you've removed that line, the actual on-device processing should not differ (the queue is non-blocking, so the messages are going to continue being produced). An issue in parsing could still happen, but in that case you'd be left with a host-side error.
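(For illustration, a non-blocking host-side output queue looks roughly like this - a minimal single-stream sketch, not the gen2-triangulation pipeline itself:)

```python
import cv2
import depthai as dai

# Minimal sketch: a single mono stream read through a non-blocking queue.
pipeline = dai.Pipeline()
mono = pipeline.createMonoCamera()
mono.setBoardSocket(dai.CameraBoardSocket.LEFT)
xout = pipeline.createXLinkOut()
xout.setStreamName("mono_left")
mono.out.link(xout.input)

with dai.Device(pipeline) as device:
    # blocking=False: the device keeps producing frames whether or not the host reads
    # them; old messages are dropped when the queue fills, so skipping a host-side
    # get() does not change what runs on the device.
    q = device.getOutputQueue(name="mono_left", maxSize=4, blocking=False)
    while True:
        msg = q.tryGet()  # returns None when nothing is waiting
        if msg is not None:
            cv2.imshow("mono_left", msg.getCvFrame())
        if cv2.waitKey(1) == ord('q'):
            break
```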

Regarding the different model, do you think it helped with overall stability, in terms of timing and how soon it crashed?

Anyway, I suspect an ImageManip issue, but I'm not 100% sure yet. I have a rewrite planned for ImageManip, which will hopefully address this as well, as it's quite an elusive and not deterministically reproducible bug (in terms of execution).

madgrizzle commented 3 years ago

@themarpe, you wrote:

"Regarding your observation - if you've removed that line, the actual on device processing should differ (the queue is non-blocking, the messages are going to continue being produced)."

Did you mean that the processing SHOULDN'T differ? I'm having a hard time reconciling what you wrote in the sentence with what you wrote in the parentheses.

> Regarding the different model, do you think it helped with overall stability, in terms of timing and how soon it crashed?

I thought it solved it because it wouldn't crash the first time I ran the other model. But the second time and thereafter it crashed as much as the original model. That's the weird part.

> Anyway, I suspect an ImageManip issue, but I'm not 100% sure yet.

I know very little about the internal workings, but I thought it was either the fact that two different 'pipelines' were being run (left camera and right camera) and some issue came up from that, or it was ImageManip, considering that face detection and landmark recognition in general have been pretty solid in other examples. I used both with gen1 stuff and they seemed solid.

I will try to figure out how to do the cropping of the image host side, thereby eliminating the ImageManip node, and see if it fixes the problem.
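Something along these lines is what I have in mind (a minimal sketch assuming normalized dai.ImgDetection boxes from the detection network; the 48x48 target size and names are illustrative, not the actual code):

```python
import cv2

# Sketch only: crop the face ROI on the host instead of with an on-device ImageManip node.
# `frame` is the full mono frame (numpy array); `det` is a dai.ImgDetection with
# normalized [0..1] coordinates from the detection network.
def crop_face(frame, det, target_size=(48, 48)):
    h, w = frame.shape[:2]
    x1 = max(0, int(det.xmin * w))
    y1 = max(0, int(det.ymin * h))
    x2 = min(w, int(det.xmax * w))
    y2 = min(h, int(det.ymax * h))
    if x2 <= x1 or y2 <= y1:
        return None  # skip degenerate boxes instead of feeding them to the landmark network
    face = frame[y1:y2, x1:x2]
    return cv2.resize(face, target_size)
```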

themarpe commented 3 years ago

@madgrizzle Yes, thanks for catching the mistake - processing should not differ (edited my post as well).

> I will try to figure out how to do the cropping of the image host side, thereby eliminating the ImageManip node, and see if it fixes the problem.

Thanks, that'd be a great data point to have. Let me know how it goes. Also feel free to fork & push your changes there in case we need to sync further down the line on some common tests, etc.

madgrizzle commented 3 years ago

@themarpe Perhaps this should be an issue 'somewhere else' (I'll ask on Discord), but it appears that the ImageManip config errors occur when the resize width is less than half the height. I was testing it at lunch and started to slowly move my hand in front of my face, and the 'box' (that gets drawn on the screen) started to narrow. When it got to around half the height, the errors started to occur. I had to get back to work, but I'll do some more testing tonight and see if I can catch those events.

madgrizzle commented 3 years ago

@themarpe I've got things running much, much better (it still crashes, though) by not processing any face detections where the aspect ratio is less than 70% (width to height or height to width). That seems to have eliminated most of the ImageManipConfig errors. It runs for several minutes now and is much more stable. So I tend to agree ImageManip is the likely culprit, but the fact that it happens while the scene is dynamic (I find the same to be the case) is odd. When it crashed as I was moving forward to get closer to the keyboard, I got this message:

2 Object "[", at 0, in nil

1 Object "/lib/x86_64-linux-gnu/libc.so.6", at 0x7ff2f143120f, in

0 Object "/home/john/depthvenv/lib/python3.8/site-packages/depthai.cpython-38-x86_64-linux-gnu.so", at 0x7ff2cf60b0b5, in backward::SignalHandling::sig_handler(int, siginfo_t, void)

Segmentation fault (Address not mapped to object [(nil)]) Segmentation fault (core dumped)

I understand ImageManip is being rewritten so I'll hold off on any more investigating/testing for the time being.
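For reference, the aspect-ratio gate is roughly this (a minimal sketch with hypothetical names, not the exact code I'm running):

```python
# Sketch only: skip face detections whose bounding box is too skewed, since crops with
# width much smaller than height (or vice versa) seemed to trigger the invalid
# ImageManipConfig errors.
MIN_ASPECT = 0.7  # min(width, height) / max(width, height) must be at least 70%

def aspect_ok(det):
    w = det.xmax - det.xmin
    h = det.ymax - det.ymin
    if w <= 0 or h <= 0:
        return False
    return min(w, h) / max(w, h) >= MIN_ASPECT

# usage: detections = [d for d in detections if aspect_ok(d)]
```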

/em fingers_crossed

SalsabilDaraghmi commented 3 years ago

I solved this problem by modifying this line of code: `with dai.Device(p.getOpenVINOVersion()) as device:` to this: `with dai.Device(p.getOpenVINOVersion(), usb2Mode=True) as device:` - I just added `usb2Mode=True`. The program is working fine now without stopping. You can understand the problem that occurs from this link: https://github.com/luxonis/depthai-python/issues/318#issuecomment-884219556

madgrizzle commented 3 years ago

The version in the repo still crashes for me if I make the change you described. I then tried a version of mine that has a few extra tweaks to the ImageManip script to help reduce invalid configs (which seemed to help it run better), and it crashed as well. So I then removed all the video frame grabbing to lower the bandwidth, and it still crashed under USB-3. But when I forced it to usb2Mode, it ran stable... but after stopping it and doing some other testing, it crashed. I honestly don't know what's going on with it now. Sometimes it just seems to stop detecting without even crashing (hanging, though still spewing out the "INFO" messages).

So I keep wondering if it's a hardware problem. The more I work with it and try different things, the worse it gets. I give up, let it rest, and come back to it days later, and it works better initially but then eventually craps out again and again.

madgrizzle commented 3 years ago

Turns out the latter problem, where it seems to hang without crashing, was caused by the optimizations I made. When you don't retrieve the frames (left, right, cropped, etc.), the host program's loop runs too fast; it empties the queues of configs and landmarks and only rarely gets both left and right landmarks in a single loop iteration. I had to slow down the loop by adding more delay to the cv2.waitKey() call so that it isn't emptying out the queues.
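Roughly the change (a sketch; the 50 ms delay is an illustrative value, and the real loop does the tryGet() calls on the landmark/config queues where the comment indicates):

```python
import cv2
import numpy as np

# Sketch only: slow the host loop down so the landmark/config queues aren't drained one
# message at a time. The original loop used cv2.waitKey(1).
cv2.namedWindow("preview")
blank = np.zeros((300, 300, 3), dtype=np.uint8)
while True:
    # ... tryGet() on the left/right landmark queues, triangulate when both are present ...
    cv2.imshow("preview", blank)
    if cv2.waitKey(50) == ord('q'):  # larger delay -> fewer empty iterations
        break
cv2.destroyAllWindows()
```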

themarpe commented 3 years ago

Cross posting for visibility - https://github.com/luxonis/depthai-python/issues/408#issuecomment-961157245

The main issue, as far as I've dug into it, is that memory corruption is happening in a non-deterministic way. It looks like something might be wrong with the hardware, but it's just that the bug is "random" and hard to pin down to a specific cause.

Will keep you posted once I discover more information about it.

SzabolcsGergely commented 3 years ago

@madgrizzle could you try installing the following library? It contains some fixes for Script node related memory allocation:

```
python3 -m pip install -U --prefer-binary --user --extra-index-url https://artifacts.luxonis.com/artifactory/luxonis-python-snapshot-local "depthai==2.11.1.1.dev+7498afce36f34dae5482c2bbf42db317e3e53dd5"
```

Then modify the script by adding the following (as @alex-luxonis suggested): `image_manip_script.setProcessor(dai.ProcessorType.LEON_CSS)`
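Roughly where that call goes (a sketch of a Script node setup assuming the variable name from the suggestion above; the script body here is just a placeholder, not a patch of main.py):

```python
import depthai as dai

pipeline = dai.Pipeline()

# The Script node that turns detections into ImageManip crop configs in the experiment.
# The variable name follows the suggestion above; the on-device script body is a placeholder.
image_manip_script = pipeline.createScript()
image_manip_script.setScript("""
while True:
    node.warn("placeholder script body")
""")
# Run the script on the LEON_CSS core instead of the default LEON_MSS core.
image_manip_script.setProcessor(dai.ProcessorType.LEON_CSS)
```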

madgrizzle commented 3 years ago

Knock on wood, but this so far seems pretty stable. I've let it run for about 8 hours so far without issue on the RPI. I'll let it run overnight and then switch to the x86 host to see how it works there.

madgrizzle commented 3 years ago

No problems running overnight. I'm now trying out the script from https://github.com/luxonis/depthai-python/issues/408#issuecomment-961157245, which I modified to include facial detection/recognition. Previously it would run for a few seconds and crash, but it's been running for 15 minutes now with no issue.

alex-luxonis commented 3 years ago

@madgrizzle Thanks for the tests! Was .setProcessor(dai.ProcessorType.LEON_CSS) added in the Script nodes as well? If yes, could you also try removing it and see if it's still stable? That would help us in investigating any remaining issues when running on the default LEON_MSS core.

madgrizzle commented 3 years ago

I had made both changes. When I go home at lunch, I'll undo the setProcessor change and see what the effect is.

madgrizzle commented 3 years ago

@alex-luxonis switching the code to use LEON_MSS crashed within about 1 minute. I switched back to LEON_CSS and it's running stable. I tried MSS again to make certain and it crashed in about 10 seconds. Back to CSS and stable. This is all still running on the RPi host (I haven't moved the cable back to the x86 machine).

madgrizzle commented 3 years ago

@alex-luxonis A couple of things. First, it appears that for the gen2-triangulation script I did not change it to LEON_CSS (I had only applied the updated depthai). I went ahead and set it to LEON_MSS and will let it run for a while to be absolutely certain.

Second, the script from https://github.com/luxonis/depthai-python/issues/408#issuecomment-961157245 with my face detection/recognition additions did crash using LEON_MSS rather quickly (seconds to minutes) and ran for 12+ hours last night using LEON_CSS without crashing. However, when playing around at lunchtime (i.e., switching back and forth between MSS and CSS), it did crash with LEON_CSS after about a 15-minute run. That could potentially be related to something else (like ImageManip, or maybe just the process of switching the code, or something).

It's a definite improvement running the new depthai + the Script node on the LEON_CSS processor.

themarpe commented 3 years ago

@madgrizzle thanks for the extensive testing - the findings support my theory that moving to CSS only mitigates/suppresses the issue but isn't the cause of it. We'll look into the underlying cause to resolve it properly; in the meantime the mitigation seems to be quite effective.

madgrizzle commented 3 years ago

@themarpe seems so.. good enough for my use-case (giving my robot some vision).

madgrizzle commented 3 years ago

Ack.. accidentally closed.

themarpe commented 2 years ago

@madgrizzle in the latest develop there is a patch that should reduce the issues even further. MSS might work as expected now. Feel free to test your use case and see if it improved :)

madgrizzle commented 2 years ago

It ran gen2-triangulation pretty solidly (I stopped it after 10 minutes), but then I ran my yoloSpatialCalculator + face recognition program, which often hangs on 2.13.3 after a few minutes, and this version did the same. It's never consistent about when either occurs... sometimes a couple of minutes, sometimes a couple of hours, but sometimes just a few seconds.

I do still notice lots of ImageManip unsupported config errors. Any progress on addressing that? (I can deal with it, nevertheless.)

themarpe commented 2 years ago

Thanks for testing - is your yolo + face recognition code available somewhere openly? It might be a good benchmark to test against, just so that the error is more easily observed.

Regarding ImageManip, we have a branch for it, but there are a couple of extra things to fix up and retest, so it might take a bit longer. I'll also address the issue with unsupported config errors :)

madgrizzle commented 2 years ago

It's currently part of a ROS node, but it won't be hard to strip it out and make it run standalone. I should have it on my repo tonight and will post a link.

madgrizzle commented 2 years ago

This is my repo with the code I'm using. https://github.com/madgrizzle/visiontest

It ran for 12 hours today with no one in the room. When I walked in, it said it detected an elephant (great for my ego), and a few minutes later it locked up while my head was down looking at my phone. One key I've discovered to making it lock up is putting my hands to my face (like rubbing my eyes). It doesn't happen all the time, but so often that it has to be more than a coincidence.

themarpe commented 2 years ago

Thanks, will try to recreate:)

madgrizzle commented 2 years ago

I'm currently running 2.13.3.0.dev+61eb5c1617623c628fab1cc09d123073518124b0 with great success. It ran for days using LEON_CSS until I intentionally stopped it. I'll try out LEON_MSS as well. I've added a few things to the pipeline since the code in the visiontest repo and moved it back into my ROS node because it appears stable enough for me (/em crosses fingers).