Closed Ulfzerk closed 2 years ago
Does this happen right away or after a period of time while it's running?
After some time, like 8k iterations with detections as I can remember.
I have posted a similar issue (https://github.com/luxonis/depthai-experiments/issues/210). It runs for a while, then dies with a similar error. I was thinking it was ImageManip related, but your code doesn't use it, so I'm not sure anymore.
While running this example demo with DEPTHAI_LEVEL = DEBUG
[14442C1091D82CD700] [471.530] [system] [info] Memory Usage - DDR: 74.00 / 359.07 MiB, CMX: 2.34 / 2.50 MiB, LeonOS Heap: 46.90 / 78.63 MiB, LeonRT Heap: 5.28 / 23.84 MiB
[14442C1091D82CD700] [471.530] [system] [info] Temperatures - Average: 88.75 °C, CSS: 89.16 °C, MSS 88.44 °C, UPA: 90.24 °C, DSS: 87.17 °C
[14442C1091D82CD700] [471.530] [system] [info] Cpu Usage - LeonOS 52.70%, LeonRT: 31.19%
[14442C1091D82CD700] [472.532] [system] [info] Memory Usage - DDR: 74.00 / 359.07 MiB, CMX: 2.34 / 2.50 MiB, LeonOS Heap: 46.90 / 78.63 MiB, LeonRT Heap: 5.28 / 23.84 MiB
[14442C1091D82CD700] [472.532] [system] [info] Temperatures - Average: 88.66 °C, CSS: 88.80 °C, MSS 88.80 °C, UPA: 90.60 °C, DSS: 86.43 °C
[14442C1091D82CD700] [472.532] [system] [info] Cpu Usage - LeonOS 52.81%, LeonRT: 31.61%
[2021-11-03 12:32:56.426] [debug] Log thread exception caught: Couldn't read data from stream: '__log' (X_LINK_ERROR)
[2021-11-03 12:32:56.429] [debug] Timesync thread exception caught: Couldn't read data from stream: '__timesync' (X_LINK_ERROR)
[2021-11-03 12:32:56.447] [debug] Device about to be closed...
[2021-11-03 12:32:56.675] [debug] Watchdog thread exception caught: Couldn't write data to stream: '__watchdog' (X_LINK_ERROR)
[2021-11-03 12:32:58.352] [debug] XLinkResetRemote of linkId: (0)
[2021-11-03 12:32:58.356] [debug] Device closed, 1908
Traceback (most recent call last):
File "yolo_detection_test_sp.py", line 130, in <module>
inPreview = previewQueue.get()
RuntimeError: Communication exception - possible device error/misconfiguration. Original message 'Couldn't read data from stream: 'rgb' (X_LINK_ERROR)'
I will try usb2Mode=True and a higher delay with cv2.waitKey(..)
Update: it didn't help.
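As a host-side stopgap while the device-side crash is investigated, the script can catch the RuntimeError and reopen the device instead of exiting. This is only a sketch and does not fix the underlying problem. The names `run_with_reconnect`, `open_device` and `process` are hypothetical; with depthai, `open_device` would presumably wrap the creation of the device and its queues, and `process` would hold the existing `previewQueue.get()` loop.

```python
import time


def run_with_reconnect(open_device, process, max_restarts=3, backoff_s=1.0, sleep=time.sleep):
    """Run process(device) and reopen the device after an X_LINK_ERROR.

    open_device and process are hypothetical callables supplied by the caller.
    Returns the number of restarts that were needed.
    """
    restarts = 0
    while True:
        device = open_device()
        try:
            process(device)
            return restarts
        except RuntimeError as err:
            # The tracebacks in this thread surface X_LINK_ERROR as a RuntimeError.
            # Anything else, or too many restarts, is re-raised unchanged.
            if "X_LINK_ERROR" not in str(err) or restarts >= max_restarts:
                raise
            restarts += 1
            sleep(backoff_s * restarts)  # wait a little longer after each failure
        finally:
            # Close the handle if the object offers a close() method.
            close = getattr(device, "close", None)
            if callable(close):
                close()
```

Logging the restart count over a long run also gives a rough measure of how often the crash occurs on a given host.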
We are working on what could be the same underlying issue. Not 100% sure though. @themarpe is on it.
I appreciate it very much. If I may ask, is it a hardware or a software problem? Is it very complicated? How long is the bugfix expected to take? Is there anything I can do to help?
@BlonskiP I ran it on mine (straight copy from the repo because I didn't have much time to try your version) and after an hour it was still running. Temps got up to ~58C. I don't think it's the ImageManip issue; maybe it is heat related if you get into the 80C range.
@madgrizzle Thank you very much. I will try to get my hands on a Raspberry Pi fan as soon as I can anyway ;)
Maybe a low-profile fan on the heatsink on the other side where the cameras are since it seems that part is getting really hot.
So in terms of the heat of the DepthAI module - 85C is not a problem. The DepthAI SoM can run indefinitely at 105C die temperature. That said, I'm not sure if the Pi temperature could be an issue.
Thoughts on this one @themarpe ?
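To compare a run against the 105C figure, the temperature lines from a DEPTHAI_LEVEL=debug log can be parsed on the host. The sketch below assumes the line format shown in the logs in this thread (including the missing colon after MSS); the function names are made up for illustration.

```python
import re

# Matches fields such as "Average: 88.75 °C" or "MSS 88.44 °C".
# The colon is optional because the log prints MSS without one.
# The literal text "\u00b0" is also accepted in case the log was escaped on export.
TEMP_FIELD = re.compile(r"(Average|CSS|MSS|UPA|DSS):?\s+([\d.]+)\s*(?:°|\\u00b0)C")


def parse_temperatures(line):
    """Return {sensor: degrees_celsius} for a 'Temperatures -' log line, else {}."""
    if "Temperatures -" not in line:
        return {}
    return {name: float(value) for name, value in TEMP_FIELD.findall(line)}


def max_temperature(log_lines):
    """Return the highest reading of any sensor across all lines, or None."""
    readings = [t for line in log_lines for t in parse_temperatures(line).values()]
    return max(readings) if readings else None
```

Running `max_temperature(open("debug.log"))` on a saved log (the file name is an assumption) gives the peak reading to compare against 105.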
Hi @BlonskiP and @madgrizzle, I tried reproducing the issue yesterday on an x86-64 host but didn't succeed.
I am a bit behind on this issue (and the one @madgrizzle brought up), but I think the same issue happens on the device side: a memory corruption, which causes the device to crash. The problem is that it's non-deterministic and rarely occurs at the same place, let alone at the same time, so it's really challenging to catch the source of the issue.
The initial guess was ImageManip, as there is a lot of complexity there which could cause such a bug, but it's not common between these two issues.
I'm still wrapping up ImageManip improvements and I'll attack these instability issues next week.
Regarding the HW vs SW question: the one I observed while testing @madgrizzle's issue was SW, but in this case I'm not sure whether the host has any influence (can you reproduce on x86-64, @BlonskiP?). That said, I still lean towards this being the same SW bug.
@themarpe I ran the yolo script as well on an x86-host (the same one I use for the gen2-triangulation) and had no problems... it ran for four hours without a glitch. I had spun up a RPI4 when testing gen2-triangulation issues so I started the yolo script on it during my lunch break and will check it tonight.
It ran for ~6 hours with no problem on the RPI4.
@themarpe I added some additional code (from the gen2-triangulation) for face detection and let it run overnight. When I got up, it had crashed with the rgb stream error. There were no face detections going on (no one in the room.. except maybe a ghost?) so the ImageManip crop shouldn't have been called. I'll try to run the original script for a long time as well and see if it crashes. I do think there's a memory corruption problem going on as you suspect.. which is unfortunate, as that's one of the hardest types of bugs to find (and I'm assuming on the closed-source side of things as well).
Finally, when I took this script and added cropping, face rotation, and face reidentification to the pipeline as well, it crashed within about 20 seconds. Seems when the pipeline gets busy, the crashing happens quicker.
After 12 hours of running the original script on an RPI, I got this:
It looks like this fix has increased stability, but this error still occurs :(
Hi @BlonskiP Latest develop includes some additional stability improvements - feel free to test those out.
I can confirm I get this error on the following environment:
host: Ubuntu 20.04.3 LTS with AMD CPU
camera1: Oak1
camera2: Oak-D Lite
connection: usb c cable (provided with oak-d lite)
Testing with gen2-face-recognition
master branch as of today.
python3 main.py --name someone
camera1: fails after a few seconds of recognition. Sometimes shows saving... but not always.
camera2: fails almost immediately, but with the same error.
python3 main.py --name frog
Creating pipeline...
Creating Color Camera...
Creating Face Detection Neural Network...
Creating Head pose estimation NN
Creating face recognition ImageManip/NN
[14442C10D12853D000] [8.516] [NeuralNetwork(10)] [warning] Network compiled for 4 shaves, maximum available 13, compiling for 6 shaves likely will yield in better performance
[14442C10D12853D000] [8.763] [NeuralNetwork(10)] [warning] The issued warnings are orientative, based on optimal settings for a single network, if multiple networks are running in parallel the optimal settings may vary
Saving face...
Saving face...
Saving face...
Saving face...
[14442C10D12853D000] [17.464] [system] [critical] Fatal error. Please report to developers. Log: 'class' '374'
Traceback (most recent call last):
File "main.py", line 254, in <module>
frameIn = frameQ.tryGet()
RuntimeError: Communication exception - possible device error/misconfiguration. Original message 'Couldn't read data from stream: 'frame' (X_LINK_ERROR)'
Similar error when just running the main as opposed to training with --name
@hipitihop can you try using the latest develop library? The posted issue looks similar to the one that we've recently made some fixes for. Which library version are you currently using? (Run as DEPTHAI_LEVEL=debug python3 main.py --name frog)
@themarpe Library: Depthai version installed: 2.14.1.0.dev+27fa4519f289498e84768ab5229a1a45efb7e4df
My current setup is as follows:
~/development/depthai/ - master, git rev-parse HEAD: 08756f77e885b58421d6d4678782720d3b9f638d
~/development/depthai/depthai-experiments/ - master, git rev-parse HEAD: 7382e8be7308e3aab537842dfa17a49f532d03b5
Debug logs attached:
debug-log-oak-d-lite.txt debug-log-oak1.txt
Updated: let me know if this is an incorrect folder structure. I run the install requirements from the top level but run the experiment from within the depthai-experiments/gen2-facial-recognition
also, given this setup, tell me which branch you want me to test with.
Updated: I now see that I'm using the main demo repo depthai as opposed to this repo depthai-python. My bad. Not sure what the difference is. Apologies for my muddling.
@hipitihop
The face recognition experiment still has some issues on latest depthai library.
Can you install the one specified in the file alongside it, gen2-face-recognition/requirements.txt (version 2.10)?
I think that should work better.
We're looking into this bug in the meantime.
@themarpe
Indeed with the Oak1 this does not crash. It does not seem to do any saving, but this might just need me to clear previous data to start fresh for a given name --name frog
As for the Oak-D Lite: with DEPTHAI_LEVEL=debug python3 main.py --name frog it complains about not finding the camera but continues. However, it never displays the window, though it is happy to continue reporting temp, cpu, mem each second.
[2022-01-19 09:39:32.620] [debug] Python bindings - version: 2.10.0.0 from 2021-08-24 18:49:37 +0300 build: 2021-08-24 17:52:17 +0000
[2022-01-19 09:39:32.620] [debug] Library information - version: 2.10.0, commit: 57bb84ad209825f181744f2308b8ac6f52a37604 from 2021-08-24 18:49:14 +0300, build: 2021-08-24 17:43:07 +0000
[2022-01-19 09:39:32.623] [debug] Initialize - finished
Creating pipeline...
Creating Color Camera...
Creating Face Detection Neural Network...
Creating Head pose estimation NN
Creating face recognition ImageManip/NN
[2022-01-19 09:39:32.687] [debug] Resources - Archive 'depthai-bootloader-fwp-0.0.12.tar.xz' open: 1ms, archive read: 62ms
[2022-01-19 09:39:33.056] [debug] Resources - Archive 'depthai-device-fwp-7131affa2c01ecd34506e9c3dd8ea9198ed874f1.tar.xz' open: 1ms, archive read: 431ms
[2022-01-19 09:39:33.074] [debug] Device - OpenVINO version: 2021.2
[2022-01-19 09:39:33.080] [debug] Patching OpenVINO FW version from 2021.4 to 2021.2
[18443010A1D10A1300] [11.280] [system] [info] Memory Usage - DDR: 0.12 / 358.55 MiB, CMX: 2.09 / 2.50 MiB, LeonOS Heap: 6.26 / 77.56 MiB, LeonRT Heap: 2.83 / 23.94 MiB
[18443010A1D10A1300] [11.280] [system] [info] Temperatures - Average: 37.71 °C, CSS: 39.35 °C, MSS 36.77 °C, UPA: 37.94 °C, DSS: 36.77 °C
[18443010A1D10A1300] [11.280] [system] [info] Cpu Usage - LeonOS 7.40%, LeonRT: 2.06%
....
[18443010A1D10A1300] [11.722] [system] [error] Attempted to start Color camera - NOT detected!
[18443010A1D10A1300] [11.418] [system] [info] ImageManip internal buffer size '203904'B, shave buffer size '20480'B
[18443010A1D10A1300] [11.418] [system] [info] SIPP (Signal Image Processing Pipeline) internal buffer size '156672'B
[18443010A1D10A1300] [11.418] [system] [info] NeuralNetwork allocated resources: shaves: [0-12] cmx slices: [0-12]
[18443010A1D10A1300] [11.418] [system] [info] ColorCamera allocated resources: no shaves; cmx slices: [13-15]
[18443010A1D10A1300] [11.418] [system] [info] ImageManip allocated resources: shaves: [15-15] no cmx slices.
[18443010A1D10A1300] [11.432] [NeuralNetwork(10)] [info] Needed resources: shaves: 4, ddr: 1605632
[18443010A1D10A1300] [11.432] [NeuralNetwork(10)] [warning] Network compiled for 4 shaves, maximum available 13, compiling for 6 shaves likely will yield in better performance
[18443010A1D10A1300] [11.722] [system] [error] Attempted to start Color camera - NOT detected!
[18443010A1D10A1300] [11.475] [DetectionNetwork(3)] [info] Needed resources: shaves: 6, ddr: 2728832
[18443010A1D10A1300] [11.707] [NeuralNetwork(7)] [info] Needed resources: shaves: 6, ddr: 21632
[18443010A1D10A1300] [11.721] [NeuralNetwork(10)] [warning] The issued warnings are orientative, based on optimal settings for a single network, if multiple networks are running in parallel the optimal settings may vary
[18443010A1D10A1300] [11.721] [NeuralNetwork(10)] [info] Inference thread count: 2, number of shaves allocated per thread: 4, number of Neural Compute Engines (NCE) allocated per thread: 1
[18443010A1D10A1300] [11.722] [DetectionNetwork(3)] [info] Inference thread count: 2, number of shaves allocated per thread: 6, number of Neural Compute Engines (NCE) allocated per thread: 1
[18443010A1D10A1300] [11.723] [NeuralNetwork(7)] [info] Inference thread count: 2, number of shaves allocated per thread: 6, number of Neural Compute Engines (NCE) allocated per thread: 1
[18443010A1D10A1300] [12.281] [system] [info] Memory Usage - DDR: 143.71 / 358.55 MiB, CMX: 2.47 / 2.50 MiB, LeonOS Heap: 16.87 / 77.56 MiB, LeonRT Heap: 7.29 / 23.94 MiB
[18443010A1D10A1300] [12.281] [system] [info] Temperatures - Average: 38.94 °C, CSS: 40.28 °C, MSS 38.65 °C, UPA: 38.65 °C, DSS: 38.18 °C
[18443010A1D10A1300] [12.281] [system] [info] Cpu Usage - LeonOS 13.06%, LeonRT: 59.15%
[18443010A1D10A1300] [13.282] [system] [info] Memory Usage - DDR: 143.71 / 358.55 MiB, CMX: 2.47 / 2.50 MiB, LeonOS Heap: 16.87 / 77.56 MiB, LeonRT Heap: 7.29 / 23.94 MiB
Hello @hipitihop, I believe that is a different issue - OAK-D-Lite uses camera sensors that weren't compatible with the firmware before ~2.11. So OAK-D-Lite using depthai 2.10 on any pipeline will error out with the same issue - camera not found.
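A script can guard against this mismatch by checking the installed library version before opening an OAK-D-Lite. The sketch below is an assumption-laden illustration: the 2.11 threshold is taken from the "~2.11" estimate above rather than from release notes, and the helper names are hypothetical.

```python
def parse_version(text):
    """Turn '2.10.0.0' or '2.14.1.0.dev+27fa45' into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as 'dev+<commit>'
        parts.append(int(piece))
    return tuple(parts)


# Assumption: approximate threshold quoted in this thread ("before ~2.11").
MIN_OAK_D_LITE_VERSION = (2, 11)


def supports_oak_d_lite(depthai_version):
    """True if the version string is at or above the assumed minimum."""
    return parse_version(depthai_version) >= MIN_OAK_D_LITE_VERSION
```

Comparing integer tuples rather than raw strings matters here, because as strings "2.9" sorts after "2.10".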
Hello. I tried latest release 2.15.1.0, but the crash still happens. Is there an upcoming fix for this issue?
@BlonskiP we've observed that the CM4 suffers from a thermal issue on the USB hub chip. Can you share more details of your unit? CC: @Luxonis-David
@jasonm189
Can you share more details, minimum reproducible example script and the log of the run with DEPTHAI_LEVEL=debug enabled?
@themarpe I used this example as it is. https://github.com/luxonis/depthai-experiments/tree/master/gen2-face-recognition
@jasonm189 which Luxonis camera/product are you running the examples on? Is it OAK-D-CM4-PoE or some other camera, e.g. OAK-D-Lite?
OAK-D. The issue happens only with that example; from what I've read, it's a known issue with the script node. Do you have a list where known issues can be tracked?
@Erol444 on the above if you have anything like tracking list of issues or you can help with the example.
@jasonm189 there was a sporadic error before we changed the script node's CPU:
script.setProcessor(dai.ProcessorType.LEON_CSS)
After that change, it hasn't crashed anymore. Are you using the latest version of depthai-experiments?
Yes, it still crashes after 30+ mins.
@jasonm189 - Sorry about the trouble. And actually given that your setup seems to be the only remaining crashing case here could you make a new issue so we can have all the details of the setup in one place? And then tag @Erol444 and me in it (and this issue)?
In my environment, the tiny_yolo v4 sample works with depthai library version 2.14, but from 2.15 onward it doesn't. On 2.18, it fails with the error below:
[184430102152AC1200] [1.7] [281.387] [system] [warning] ColorCamera IMX214: capping FPS for selected resolution to 35
Traceback (most recent call last):
File "tiny_yolo.py", line 151, in <module>
inRgb = qRgb.get()
RuntimeError: Communication exception - possible device error/misconfiguration. Original message 'Couldn't read data from stream: 'rgb' (X_LINK_ERROR)'
@kazyam53 Do you mind opening a separate issue and also describing which device (I assume OAK-D Lite) and which host you are using? Thanks!
@themarpe OK, I opened a new issue: https://github.com/luxonis/depthai-python/issues/691
Addressed by https://github.com/luxonis/depthai-core/pull/616
Reran gen2-face-detection in experiments overnight; it ran for 7h without issues.
Hello, I have an issue with running tiny-yolo-v4 with SpatialDetection. I'm using copy+pasted demo from: https://docs.luxonis.com/projects/api/en/latest/samples/SpatialDetection/spatial_tiny_yolo/
With tiny yolo blob: https://artifacts.luxonis.com/artifactory/luxonis-depthai-data-local/network/tiny-yolo-v4_openvino_2021.2_6shave.blob
My only change was adding print(...) with iteration counter, fps and average chip temperature.
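The kind of per-iteration print described here could look roughly like the sketch below. It is not the reporter's code: `FpsCounter` and `status_line` are hypothetical names, and the temperature value is assumed to be read from the device by the caller and passed in as a plain number.

```python
import time


class FpsCounter:
    """Counts iterations and reports the average rate since creation."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so the counter can be tested without waiting
        self._start = clock()
        self.iterations = 0

    def tick(self):
        """Call once per processed frame."""
        self.iterations += 1

    def fps(self):
        elapsed = self._clock() - self._start
        return self.iterations / elapsed if elapsed > 0 else 0.0


def status_line(counter, avg_temp_c):
    """Format the iteration count, average fps and chip temperature for printing."""
    return f"iter={counter.iterations} fps={counter.fps():.1f} avg_temp={avg_temp_c:.1f}C"
```

Calling `counter.tick()` once per loop iteration and printing `status_line(counter, temp)` every so often keeps the overhead on the host negligible.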
My device is: OAK-D-CM4
Device URL: https://shop.luxonis.com/products/depthai-rpi-compute-module-4-edition
depthai version: 2.11.1
Installed by: python3 -m pip install git+https://github.com/luxonis/depthai-python.git@caf537b without using venv.
Python: python3 --version reports Python 3.7.3
Raspi system information
Error messages:
File "yolo_detection_test_sp.py", line 151, in <module>
boundingBoxMapping = xoutBoundingBoxDepthMappingQueue.get()
RuntimeError: Communication exception - possible device error/misconfiguration. Original message 'Couldn't read data from stream: 'boundingBoxDepthMapping' (X_LINK_ERROR)'
or on custom code:
RuntimeError: Communication exception - possible device error/misconfiguration. Original message 'Couldn't read data from stream: 'RGB' (X_LINK_ERROR)'
Temperatures
Average chip temperature: 85C
Raspi temperature: 77C
Code