madgrizzle opened 3 years ago
Thanks for the report and sorry for the trouble. Not immediately sure. But CC: @Erol444 on this.
If it helps, it failed today with a:
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
...and I just had another failure, this time on the landmarks_left stream. I'm guessing something is crashing in the pipeline and it's just erroring out on whatever stream it's trying to retrieve.
Hello @madgrizzle, I have just pushed a small edit to the latest master branch of depthai-experiments. Could you try using that together with the latest develop branch of depthai-python? Check out the develop branch and run python3 examples/install_requirements.py to get the latest develop version of our library. Using that seems to fix the issue. Thanks for reporting!
Awesome, will try tonight! Thanks for looking into it.
It works great! Thanks for the fix!
Well, turns out that it just seems to work 'better', but it still crashes. I'm not entirely sure why I wasn't getting these messages before, but I recently did an update to depthai-python. This occurs right before the crash:
[14442C10411DC2D200] [470.453] [system] [critical] Fatal error. Please report to developers. Log: 'PoolBase' '66'
[14442C10411DC2D200] [267.191] [system] [critical] Fatal error. Please report to developers. Log: 'PoolBase' '66'
Traceback (most recent call last):
File "main.py", line 214, in
@madgrizzle I'm not immediately sure why this issue occurs, will check with the team too. Could you specify which depthai version you're using?
@madgrizzle it appears to be a bug in the depthai itself and @themarpe managed to fix that right away (kudos!). I'll circle back once the experiment is updated with a library version containing the fix, so you'll be able to test again.
Thanks for reporting the issue!
Just curious.. is the "bug in depthai itself" a bug in the FW or a bug in depthai-python (or something else)?
@madgrizzle it was a bug in the FW regarding how messages are shared across the two available cores. More specifically, a cache coherency issue.
I was hoping 2.11.0 would fix it for me (based upon what I saw in the commits and announcement), but I get the same error message (X_LINK_ERROR). I've seen @Luxonis-Brandon demo the gen2-triangulation on twitter just recently, so I suspect that my problem may be specific to me. I can try a different machine to see if maybe it's an issue with my specific OAK-D or just the computer I'm using.
@madgrizzle does it happen immediately or after some time running?
@themarpe, not immediately, and the time it takes varies. At the moment, within 30 seconds or so. A couple of days ago (I had built your develop branch hoping the fix was in) it took a few minutes to crash. The OAK-D is powered by the power supply and plugged into an Intel NUC i7 via a USB-3 cable. I access the NUC via ssh/xserver, but I wouldn't expect that to cause the issue.
@themarpe, I spun up Ubuntu 20.04 on an RPi 4 and got everything installed and running, with the same result after a few seconds. It did manage to spit out more information, though:
terminate called after throwing an instance of 'dai::XLinkReadError'
what(): Couldn't read data from stream: 'mono_left' (X_LINK_ERROR)
Stack trace (most recent call last) in thread 35627:
Aborted (Signal sent by tkill() 35550 1000)
Aborted (core dumped)
Also, on another run, I got this in the stream:
[14442C10411DC2D200] [21.932] [system] [critical] Fatal error. Please report to developers. Log: 'class' '374'
If this seems like a device issue, we can close this and I'll look for help on the discord.
@madgrizzle I'm investigating more. There was some additional issue I saw the last time, but haven't had time yet to debug. Will keep you posted
@themarpe I did some testing and got frustrating results. Not sure any of this is useful, but figured more information is better than none.
I first incrementally enabled various parts of the pipeline (by commenting out the sections I didn't want enabled) and it seemed to run well (>5 minutes) until I enabled the very last part, which retrieves the landmarks: https://github.com/luxonis/depthai-experiments/blob/b3f72a2a1dfc27f3c55d987482141db6815c0f0c/gen2-triangulation/main.py#L231. Upon enabling that, it started to crash after a few seconds.
So I thought maybe it was just related to that particular model, so I adapted the program to use the facial-landmarks-35-adas model, and it ran for a really, really long time. This seemed at the time to prove my hypothesis. To make sure, I switched back to the original model to verify it still crashed (which it did) and then switched back to the new adas one to verify it didn't crash... but then it did. That's the frustrating part.
I tried wiping the blob cache and installing the newest version of the blob converter, and it didn't seem to help.
@madgrizzle thanks very much for extensive testing.
The current state from my side is that some sort of memory corruption happens, which seems to be accelerated when faces are being detected and/or when the unsupported config message is printed (odd scenes where something is detected incorrectly seem to cause this, AFAIK). The longest runs on my end were with stationary scenes, while the shortest were with dynamic movement, in both empty scenes and scenes with faces.
Regarding your observation - if you've removed that line, the actual on-device processing should not differ (the queue is non-blocking, so the messages are going to continue being produced). An issue in parsing could still happen, but in that case you'd be left with a host-side error.
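For reference, a rough sketch of the non-blocking host-side retrieval (the queue name and sizes here are just illustrative, and pipeline is assumed to be built as in the example):

import depthai as dai

# Non-blocking queue: tryGet() returns None when nothing is queued yet,
# so the device keeps producing messages regardless of how fast the host loop runs.
with dai.Device(pipeline) as device:
    q_landmarks = device.getOutputQueue(name="landmarks_left", maxSize=4, blocking=False)
    while True:
        msg = q_landmarks.tryGet()
        if msg is None:
            continue  # nothing new yet, keep looping
        # ...parse the NNData here...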
Regarding the different model, do you think it helped with overall stability, in terms of timing and how soon it crashed?
Anyway, I suspect an ImageManip issue, but I'm not 100% sure yet. I have a rewrite planned for ImageManip, which will hopefully address this as well, as it's quite an elusive and not deterministically reproducible bug (in terms of execution).
@themarpe, you wrote:
"Regarding your observation - if you've removed that line, the actual on device processing should differ (the queue is non-blocking, the messages are going to continue being produced)."
Did you mean that the processing SHOULDN'T differ? I'm having a hard time reconciling what you wrote in the sentence with what you wrote in the parentheses.
Regarding the different model, do you think it helped with overall stability, in terms of timing and how soon it crashed?
I thought it solved it because it wouldn't crash the first time I ran the other model. But the second time and thereafter it crashed as much as the original model. That's the weird part.
Anyway, I suspect an ImageManip issue, but I'm not 100% sure yet.
I know very little about the internal workings, but I thought it was either the fact that two different 'pipelines' were being run (left camera and right camera) and some issue came up from that, or it was ImageManip, considering that face detection and landmark recognition in general have been pretty solid in other examples. I used both with gen1 stuff and they seemed solid.
I will try to figure out how to do the cropping of the image host side, thereby eliminating the ImageManip node, and see if it fixes the problem.
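Roughly what I have in mind is something like this (just a sketch with OpenCV; the bbox field names and the 48x48 landmark input size are assumptions on my part):

import cv2

def crop_face(frame, det, out_size=(48, 48)):
    # det coordinates are assumed normalized to [0, 1]
    h, w = frame.shape[:2]
    x1 = max(0, int(det.xmin * w))
    y1 = max(0, int(det.ymin * h))
    x2 = min(w, int(det.xmax * w))
    y2 = min(h, int(det.ymax * h))
    if x2 <= x1 or y2 <= y1:
        return None  # degenerate box, skip it
    face = frame[y1:y2, x1:x2]
    # resize on the host instead of using an ImageManip node on the device
    return cv2.resize(face, out_size)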
@madgrizzle Yes, thanks for catching the mistake - processing should not differ (edited my post as well).
I will try to figure out how to do the cropping of the image host side, thereby eliminating the ImageManip node, and see if it fixes the problem.
Thanks, that'd be a great data point to have. Let me know how it goes. Also feel free to fork & push your changes there in case we need to sync further down the line on some common tests, etc.
@themarpe Perhaps this should be an issue 'somewhere else' (I'll ask on discord), but it appears that the ImageManip config errors occur when the resize width is less than half the height. I was testing it at lunch and started to slowly move my hand in front of my face, and the 'box' (that gets drawn on the screen) started to narrow. When it got to around half the height, the errors started to occur. I had to get back to work, but I'll do some more testing tonight and see if I can catch those events.
@themarpe I've got things running much, much better (still crashes though) by not processing any face detections where the aspect ratio is less than 70% (width to height or height to width). That seems to have eliminated most of the ImageManipConfig errors. It runs for several minutes now and is much more stable. So I tend to agree ImageManip is the likely culprit, but the fact that it happens while the scene is dynamic (I find the same to be the case) is odd. When it crashed as I was moving forward to get closer to the keyboard, I got this message:
2 Object "[", at 0, in nil
1 Object "/lib/x86_64-linux-gnu/libc.so.6", at 0x7ff2f143120f, in
0 Object "/home/john/depthvenv/lib/python3.8/site-packages/depthai.cpython-38-x86_64-linux-gnu.so", at 0x7ff2cf60b0b5, in backward::SignalHandling::sig_handler(int, siginfo_t, void)
Segmentation fault (Address not mapped to object [(nil)]) Segmentation fault (core dumped)
I understand ImageManip is being rewritten so I'll hold off on any more investigating/testing for the time being.
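For reference, the aspect-ratio filter I described above is roughly this (just a sketch; the 0.7 threshold is what I used, and the bbox fields are assumed to be normalized coordinates):

def aspect_ratio_ok(det, min_ratio=0.7):
    # Skip detections that are much narrower than they are tall (or vice versa),
    # since those seem to produce the invalid ImageManipConfig resize values.
    w = det.xmax - det.xmin
    h = det.ymax - det.ymin
    if w <= 0 or h <= 0:
        return False
    return min(w, h) / max(w, h) >= min_ratio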
/em fingers_crossed
I solved this problem by modifying this line of code
with dai.Device(p.getOpenVINOVersion()) as device:
to this
with dai.Device(p.getOpenVINOVersion(), usb2Mode=True) as device:
I just added usb2Mode=True.
The program is working fine now without stopping
You can understand the problem that occurs from this link:
https://github.com/luxonis/depthai-python/issues/318#issuecomment-884219556
The version on the repo still crashes for me if I make the change you described. I then tried a version that I made that has a few extra tweaks to the ImageManip script to help reduce invalid configs, which seemed to help it run better, but it crashed as well. So I then removed all the video frame grabbing to lower the bandwidth and it still crashed under USB-3. But when I forced it to usb2Mode, it ran stable... but after stopping it and doing some other testing, it crashed. I honestly don't know what's going on with it now. Sometimes it just seems to stop detecting without even crashing (hanging, though still spewing out the "INFO" messages).
So I keep wondering if it's a hardware problem. The more I work with it and try different things, the worse it gets. I give up, let it rest, and come back to it days later, and it works better initially but then eventually craps out again and again.
Turns out the latter problem, where it seems to hang without crashing, was caused by the optimizations I made. When you don't retrieve the frames (left, right, cropped, etc.), the host program's loop runs too fast: it empties the queues of config and landmarks and only rarely gets both left and right landmarks in a single iteration of the loop. I had to slow the loop down by adding more delay to cv2.waitKey() so that it's not emptying out the queues.
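In other words, the fix was just pacing the loop, roughly like this (the delay values are illustrative):

import cv2

while True:
    # ...tryGet() config/landmark messages from the queues here...
    # A longer waitKey delay (e.g. 30 ms instead of 1 ms) keeps the host loop
    # from draining the queues faster than the device can fill them, so the
    # left and right landmarks show up together in the same iteration.
    if cv2.waitKey(30) == ord('q'):
        break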
Cross posting for visibility - https://github.com/luxonis/depthai-python/issues/408#issuecomment-961157245
The main issue, as far as I've dug into it, is that memory corruption is happening in a non-deterministic way, which makes it look like something might be wrong with the hardware, but it's really just that the bug is "random" and hard to pin down to a specific cause.
Will keep you posted after I discover more information about it
@madgrizzle could you try installing the following library, it contains some fixes for Script node related memory allocation:
python3 -m pip install -U --prefer-binary --user --extra-index-url https://artifacts.luxonis.com/artifactory/luxonis-python-snapshot-local "depthai==2.11.1.1.dev+7498afce36f34dae5482c2bbf42db317e3e53dd5"
Then modify the script by adding the following (as @alex-luxonis suggested):
image_manip_script.setProcessor(dai.ProcessorType.LEON_CSS)
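In context it would look something like this (a sketch, assuming the Script node is created as in the example):

import depthai as dai

pipeline = dai.Pipeline()
image_manip_script = pipeline.create(dai.node.Script)
# Run the Script node on the LEON_CSS core instead of the default LEON_MSS
image_manip_script.setProcessor(dai.ProcessorType.LEON_CSS)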
Knock on wood, but this so far seems pretty stable. I've let it run for about 8 hours so far without issue on the RPI. I'll let it run overnight and then switch to the x86 host to see how it works there.
No problem running overnight. I'm now trying out the script from https://github.com/luxonis/depthai-python/issues/408#issuecomment-961157245, which I modified to include facial detection/recognition. Previously it would run for a few seconds and crash, but it's been running for 15 minutes now with no issue.
@madgrizzle Thanks for the tests!
Was .setProcessor(dai.ProcessorType.LEON_CSS) added as well in the Script nodes? If yes, could you also try removing it and see if it's still stable? That would help us investigate any remaining issues when running on the default LEON_MSS core.
I had done both changes. When I go home at lunch, I'll undo the setProcessor change and see what the effect is.
@alex-luxonis switching the code to use LEON_MSS crashed within about 1 minute. Switched back to LEON_CSS and running stable. Tried MSS again to make certain and it crashed in about 10 seconds. Back to CSS and stable. This is all still running on an RPI host (haven't moved the cable back to the x86)
@alex-luxonis A couple of things... First, it appears that for the gen2-triangulation script I did not change it to LEON_CSS (only updated depthai). I went ahead and set it to LEON_MSS and will let it run for a while to be absolutely certain.
Second, the script from https://github.com/luxonis/depthai-python/issues/408#issuecomment-961157245 with my face detection/recognition additions, however, did crash using LEON_MSS rather quickly (seconds to minutes) and ran for 12+ hours last night using LEON_CSS without crashing. However, when playing around at lunch time (i.e., switching back and forth between MSS and CSS), it did crash with LEON_CSS after about a 15-minute run. That could potentially be related to something else (like ImageManip, or maybe just the process of switching the code, or something).
Definite improvement running the new depthai + the script node in LEON_CSS processor.
@madgrizzle thanks for the extensive testing - the findings support my theory that moving to CSS only mitigates/suppresses the issue rather than addressing its underlying cause. We'll look into the underlying cause to resolve it properly; in the meantime the mitigation seems to be quite effective.
@themarpe seems so.. good enough for my use-case (giving my robot some vision).
Ack.. accidentally closed.
@madgrizzle the latest develop has a patch that should reduce the issues even further. MSS might work as expected now. Feel free to test whether it improves things in your use case :)
It ran gen2-triangulation pretty solidly (I stopped it after 10 minutes), but then I ran it with my yoloSpatialCalculator + face recognition program, which often hangs on 2.13.3 after a few minutes, and this version did the same. It's never consistent on when either occurs... sometimes a couple of minutes, sometimes after a couple of hours, but sometimes just a few seconds.
I do still notice lots of ImageManip unsupported config errors. Any progress on addressing that (I can deal with it, nevertheless)?
Thanks for testing - is your yolo + face recognition available somewhere openly? Might be a good benchmark to test against, just so that error is more easily observed.
Regarding ImageManip, we have a branch for it, but there are a couple of extra things to fixup and retest, so might take a bit more. I'll also address the issue with unsupported config errors :)
It's currently part of a ROS node, but it won't be hard to strip it out and make it run standalone. I should have it on my repo tonight and will post a link.
This is my repo with the code I'm using. https://github.com/madgrizzle/visiontest
It ran 12 hours today with no one in the room. When I walked in, it said it detected an elephant (great for my ego), and a few minutes later it locked up while my head was down looking at my phone. One key I've discovered to making it lock up is putting my hands to my face (like rubbing my eyes). It doesn't happen all the time, but so often that it has to be more than coincidence.
Thanks, will try to recreate:)
I'm currently running 2.13.3.0.dev+61eb5c1617623c628fab1cc09d123073518124b0 with great success. It ran for days using LEON_CSS until I intentionally stopped it. I'll try out LEON_MSS as well. I've added a few things to the pipeline since the code in the visiontest repo and moved it back into my ROS node because it appears stable enough for me (/em crosses fingers).
Finally getting back around to trying out the face detection with stereo demo and I find that it runs fine for about 5 seconds but then crashes with the following error:
Traceback (most recent call last):
File "main.py", line 207, in
frame = queues[i*4].get().getCvFrame()
RuntimeError: Communication exception - possible device error/misconfiguration. Original message 'Couldn't read data from stream: 'mono_left' (X_LINK_ERROR)'
It only crashes when someone steps in front of the camera and face detection starts working.