luxonis / depthai-ros

Official ROS Driver for DepthAI Sensors.
MIT License

[BUG] ROS Noetic PoE connectivity issues #352

Closed (onurbingol closed 10 months ago)

onurbingol commented 11 months ago

Describe the bug

I am using the luxonis/depthai-ros:v2.7.4-noetic Docker image with an OAK-D Pro PoE camera. The camera is plugged into one of the PoE++ ports of a Ubiquiti USW Pro 24 PoE switch. After some time, the ROS driver stops publishing to the rgb and stereo topics. This can happen as soon as 2 hours or as late as 12 hours after launching the ROS driver. I was never able to get an image from the ROS topics after leaving the camera on overnight.

While the ROS driver is running, I can get images from the topics without any issues. When the issue occurs, I can still see the ROS topics, but I cannot get any data from them.

I also tried rolling back to the luxonis/depthai-ros:v2.7.2-noetic Docker image, but that didn't change anything.

I also tried changing the switch to a Ubiquiti USW Lite 8 PoE, using one of its PoE+ ports. No change observed.

I've tried the example Python scripts and they seem to work fine. I can't see any log entries about the disconnection in the Docker container logs, so I am not sure what is happening.

Minimal Reproducible Example

What I do is essentially run the following command:

/bin/bash -c "roslaunch depthai_ros_driver rgbd_pcl.launch params_file:=/configs/camera.yaml --wait"

configs/camera.yaml is the one provided here

Expected behavior

It should publish to the ROS topics continuously.

Pipeline Graph

I can provide this, but since I haven't customized the configuration file, it would be the default one, if I understand correctly. I was unable to save the pipeline via the rosservice call /oak/save_pipeline command; it just didn't create the /tmp/pipeline.json file in the container.
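For reference, the save attempt looked roughly like this (a sketch; I am assuming the service takes an empty request):

rosservice call /oak/save_pipeline
ls -l /tmp/pipeline.json   # the expected output file; it was never created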

Attach system log

onurbingol commented 11 months ago

Something interesting happened while I was collecting the logs. I am not sure if this was the cause of the disconnect issue, but now I cannot get any data from the ROS topics at all.

When I run a query like rostopic list /oak to see the topics the luxonis/depthai-ros:v2.7.4-noetic container is publishing, the ROS driver crashes with an error like this:

[194430109145801300] [a.b.c.d] [20.576] [system] [info] Temperatures - Average: 39.58C, CSS: 40.74C, MSS 39.35C, UPA: 39.58C, DSS: 38.65C
[194430109145801300] [a.b.c.d] [20.577] [system] [info] Cpu Usage - LeonOS 99.93%, LeonRT: 17.49%
[194430109145801300] [a.b.c.d] [20.613] [system] [warning] PRINT:LeonCss: W: Isp cfg mising: 0 
[194430109145801300] [a.b.c.d] [20.657] [system] [warning] PRINT:LeonCss: W: Isp cfg mising: 0 
W: Isp cfg mising: 0 
[2023-07-19 13:38:50.909] [debug] Device about to be closed...
[2023-07-19 13:38:50.945] [debug] XLinkResetRemote of linkId: (0)
[2023-07-19 13:38:50.945] [debug] DataOutputQueue (imu_imu) closed
[2023-07-19 13:38:50.945] [debug] DataOutputQueue (rgb_isp) closed
[2023-07-19 13:38:50.945] [debug] Timesync thread exception caught: Couldn't read data from stream: '__timesync' (X_LINK_ERROR)
[2023-07-19 13:38:50.945] [debug] DataOutputQueue (nn_nn) closed
[2023-07-19 13:38:50.945] [debug] Log thread exception caught: Couldn't read data from stream: '__log' (X_LINK_ERROR)
[2023-07-19 13:38:50.945] [debug] DataOutputQueue (stereo_stereo) closed
[2023-07-19 13:38:53.537] [debug] Device closed, 2627
[2023-07-19 13:38:53.537] [debug] DataInputQueue (right_control) closed
[2023-07-19 13:38:53.538] [debug] DataInputQueue (left_control) closed
[2023-07-19 13:38:53.538] [debug] DataInputQueue (rgb_control) closed
[ INFO] [1689773933.983717531]: Bond broken, exiting
================================================================================
REQUIRED process [oak-4] has died!
process has finished cleanly
log file: /root/.ros/log/xyz_1689357109/oak-4*.log
Initiating shutdown!
================================================================================
[depth_image_to_rgb_pointcloud-5] killing on exit
[oak-4] killing on exit
[oak_nodelet_manager-3] killing on exit
[oak_state_publisher-2] killing on exit
shutting down processing monitor...
... shutting down processing monitor complete

The full log of the session is attached.

abc-xyz-depthai-ros-5d96cb6bd6-ddb56_xyz-depthai-ros.log

onurbingol commented 11 months ago

Hello again,

I was able to capture the issues in the debug log. The first one is the error I mentioned above, and the second one is:

[system] [critical] Fatal error. Please report to developers. Log: 'Fatal error on MSS CPU: trap: 09, address: 80220C2C' '0'

Looks like there is a very similar bug report in #293; the only difference is that I am running the roslaunch depthai_ros_driver rgbd_pcl.launch params_file:=/configs/camera.yaml command, where camera.yaml is exactly the same as the one provided in the depthai_ros_driver/configs directory.

One thing I should also mention is that the OAK-D Pro PoE camera is running (and will be running) on a busy but isolated network with several other PoE cameras and some other network devices. I have one OAK-D Pro camera on that network at this time. The switch is rated to power many PoE cameras, so I don't think power is the issue here.

onurbingol commented 11 months ago

I'd like to share another log stream. In this case, the OAK-D Pro camera failed after running for approximately 4 hours.

2023-07-20T11:13:54.529909198Z [19443010A133731300] [x.y.z.w] [28642.322] [system] [warning] PRINT:LeonCss: memcpy_dma: needed 2 dst reads
[19443010A133731300] [x.y.z.w] [28642.366] [system] [warning] PRINT:LeonCss: W: Isp cfg mising: 0 
[19443010A133731300] [x.y.z.w] [28642.762] [system] [warning] PRINT:LeonCss: W: Isp cfg mising: 0 
2023-07-20T11:13:55.009727658Z [19443010A133731300] [x.y.z.w] [28642.824] [system] [info] Memory Usage - DDR: 201.46 / 340.34 MiB, CMX: 2.47 / 2.50 MiB, LeonOS Heap: 67.45 / 77.23 MiB, LeonRT Heap: 6.15 / 41.14 MiB
[19443010A133731300] [x.y.z.w] [28642.824] [system] [info] Temperatures - Average: 66.70C, CSS: 67.88C, MSS 66.24C, UPA: 66.65C, DSS: 66.04C
[19443010A133731300] [x.y.z.w] [28642.824] [system] [info] Cpu Usage - LeonOS 99.97%, LeonRT: 13.21%
[19443010A133731300] [x.y.z.w] [28642.829] [system] [warning] PRINT:LeonCss: W: Isp cfg mising: 0 
[19443010A133731300] [x.y.z.w] [28642.895] [system] [warning] PRINT:LeonCss: W: Isp cfg mising: 0 
[19443010A133731300] [x.y.z.w] [28643.258] [system] [warning] PRINT:LeonCss: W: Isp cfg mising: 0 
[19443010A133731300] [x.y.z.w] [28643.578] [system] [warning] PRINT:LeonCss: W: 4 - 0 - 0x869eadc0
W: Isp cfg mising: 0 
[19443010A133731300] [x.y.z.w] [28643.589] [system] [warning] PRINT:LeonMss: Unexpected trap ( 9) at address 0x80220c2c
2023-07-20T11:13:55.692920479Z Bad trap, 0x9
PSR: 0xf3400ec6 PC: 0x80220c30 NPC: 0x80220c34 TPC: 0x80220c2c
G1:  0x00000000 G2: 0x0000ff04 G3:  0x0b010048 G4:  0x808fe5b4
2023-07-20T11:13:55.692934345Z G5:  0x00000078 G6: 0x00001a0d G7:  0x0000001e I0:  0x00000000
I1:  0x00000004 I2: 0x00000003 I3:  0x001a3803 I4:  0x8068d848
I5:  0x00000000 I6: 0x8066d378 I7:  0x80220d7c Y:   0x00000000

2023-07-20T11:13:55.692971045Z L0:  0x808d2848 L1: 0x8007c004 L2:  0x80221728 L3:  0x00000002
L4:  0x00000004 L5: 0x00000004 L6:  0x808d27f8 L7:  0x808d2818
I0:  0x7820c000 I1: 0x00000960 I2:  0x8066d46c I3:  0x001a3804
I4:  0x8066f288 I5: 0x00000000 I6:  0x8066d408 I7:  0x802692b8
SRA: 0x80555000 SA0:0x00000438 SA1: 0x00000000 SA2: 0x00000000
SA3: 0xcb9afa68 SA4:0x78209978 SA5: 0x805fbf30
Unexpected trap ( 9) at address 0x80220c2c
Bad trap, 0x9
PSR: 0xf3400ec6 PC: 0x80220c30 NPC: 0x80220c34 TPC: 0x80220c2c
G1:  0x00000000 G2: 0x0000ff04 G3:  0x0b010048 G4:  0x808fe5b4
G5:  0x00000078 G6: 0x00001a0d G7:  0x0000001e I0:  0x00000000
2023-07-20T11:13:55.693034595Z I1:  0x00000004 I2: 0x00000003 I3:  0x001a3803 I4:  0x8068d848
I5:  0x00000000 I6: 0x8066d378 I7:  0x80220d7c Y:   0x00000000

2023-07-20T11:13:55.693048742Z L0:  0x808d2848 L1: 0x8007c004 L2:  0x80221728 L3:  0x00000002
L4:  0x00000004 L5: 0x00000004 L6:  0x808d27f8 L7:  0x808d2818
2023-07-20T11:13:55.693057309Z I0:  0x7820c000 I1: 0x00000960 I2:  0x8066d46c I3:  0x001a3804
I4:  0x8066f288 I5: 0x00000000 I6:  0x8066d408 I7:  0x802692b8
SRA: 0x80555000 SA0:0x00000438 SA1: 0x00000000 SA2: 0x00000000
SA3: 0xcb9afa68 SA4:0x78209978 SA5: 0x805fbf30
[19443010A133731300] [x.y.z.w] [28647.404] [system] [critical] Fatal error. Please report to developers. Log: 'Fatal error on MSS CPU: trap: 09, address: 80220C2C' '0'
2023-07-20T11:13:59.502021720Z [19443010A133731300] [x.y.z.w] [28647.406] [system] [warning] PRINT:LeonCss: Fatal error on MSS CPU: trap: 09, address: 80220C2C
2023-07-20T11:14:12.163145269Z [2023-07-20 11:14:12.163] [warning] Monitor thread (device: 19443010A133731300 [x.y.z.w]) - ping was missed, closing the device connection
[2023-07-20 11:14:14.163] [debug] XLinkResetRemoteTimeout returned: X_LINK_TIMEOUT
[2023-07-20 11:14:14.163] [debug] XLinkResetRemote of linkId: (0)
[2023-07-20 11:14:14.163] [debug] DataOutputQueue (imu_imu) closed
[2023-07-20 11:14:14.163] [debug] DataOutputQueue (rgb_isp) closed
[2023-07-20 11:14:14.163] [debug] DataOutputQueue (stereo_stereo) closed
2023-07-20T11:14:14.163435910Z [2023-07-20 11:14:14.163] [debug] Timesync thread exception caught: Couldn't read data from stream: '__timesync' (X_LINK_ERROR)
[2023-07-20 11:14:14.163] [debug] Log thread exception caught: Couldn't read data from stream: '__log' (X_LINK_ERROR)
2023-07-20T11:14:14.163446570Z [2023-07-20 11:14:14.163] [debug] DataOutputQueue (nn_nn) closed

onurbingol commented 11 months ago

I'd like to share some updates after testing the OAK-D Pro PoE cameras for 4 days. As suggested in the documentation, I set rgb_i_fps from the default value (60) to a more reasonable value (10) before starting the tests.

I have two OAK-D Pro PoE cameras, and I've tested rgb_i_fps: 10 on both. I also set left_i_fps: 5 and right_i_fps: 5 during the tests. These settings reduced the CPU load of one camera to the 49-51% range and the other to 73-76%. I am not sure why the same settings affect the two cameras differently; interestingly, the camera with the 73-76% CPU load was in an air-conditioned room the whole time.

The good news is that both cameras are running fine with the reduced FPS, and they haven't failed with the CPU error described previously. This also makes sense, as running at 100% CPU load never looks good. I haven't tested whether I can still get 60 FPS data from ROS, or what the actual difference is between setting rgb_i_fps to 60 versus 10. After 4 days of testing, I can say that it works for me, but I am not sure whether it would fail after running continuously for a week or a month.
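For reference, the FPS overrides in camera.yaml look roughly like this (a sketch; the /oak namespace and all other keys follow the stock config file):

/oak:
  rgb_i_fps: 10    # default is 60
  left_i_fps: 5
  right_i_fps: 5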

onurbingol commented 11 months ago

I restarted the ROS driver with a new configuration today (it looked stable with the default config and reduced FPS), and it has started failing again. The only log entries related to the error are the following; they appear when I subscribe to the /oak/stereo/image_raw topic:

[2023-07-25 18:04:38.179] [debug] Timesync thread exception caught: Couldn't read data from stream: '__timesync' (X_LINK_ERROR)
[2023-07-25 18:04:38.179] [debug] Log thread exception caught: Couldn't read data from stream: '__log' (X_LINK_ERROR)

The configuration file changes are:

stereo_i_depth_filter_size: 3
stereo_i_bilateral_sigma: 0.2
stereo_i_enable_spatial_filter: true
stereo_i_enable_threshold_filter: true
stereo_i_subpixel: true
stereo_i_subpixel_fractional_bits: 3
stereo_i_spatial_filter_iterations: 2
stereo_i_threshold_filter_max_range: 1500
stereo_i_threshold_filter_min_range: 750

As I mentioned, I have two cameras. I was able to run this config on one of them, but the other one fails.

Additional observations:

  1. If the ROS driver runs without any camera or driver failure, I can subscribe to RGB topics, but subscribing to stereo topics causes a crash with the error described above.
  2. From time to time, I see the following error. It has happened a couple of times; after a couple of restarts it goes away.
[19443010A133731300] [a.b.c.d] [19.688] [system] [info] Memory Usage - DDR: 201.41 / 340.34 MiB, CMX: 2.47 / 2.50 MiB, LeonOS Heap: 67.44 / 77.23 MiB, LeonRT Heap: 6.15 / 41.14 MiB
[19443010A133731300] [a.b.c.d] [19.688] [system] [info] Temperatures - Average: 39.11C, CSS: 39.81C, MSS 38.65C, UPA: 39.81C, DSS: 38.18C
[19443010A133731300] [a.b.c.d] [19.688] [system] [info] Cpu Usage - LeonOS 49.85%, LeonRT: 8.54%
[FATAL] [1690330188.265909769]: ASSERTION FAILED
2023-07-26T00:09:48.266444453Z  file = /opt/ros/noetic/include/ros/publisher.h
    line = 113
2023-07-26T00:09:48.266452769Z  cond = false
2023-07-26T00:09:48.266455665Z  message = 
2023-07-26T00:09:48.266459081Z [FATAL] [1690330188.265963140]: Call to publish() on an invalid Publisher (topic [/oak/imu/data])
[FATAL] [1690330188.265977898]: 
2023-07-26T00:09:48.266464882Z 
[oak_nodelet_manager-3] process has died [pid 54, exit code -5, cmd /opt/ros/noetic/lib/nodelet/nodelet manager /nodelet_manager/load_nodelet:=oak/nodelet_manager/load_nodelet /nodelet_manager/unload_nodelet:=oak/nodelet_manager/unload_nodelet /nodelet_manager/list:=oak/nodelet_manager/list __name:=oak_nodelet_manager __log:=/root/.ros/log/xyz_1689357109/oak_nodelet_manager-3.log].
log file: /root/.ros/log/xyz_1689357109/oak_nodelet_manager-3*.log
[ INFO] [1690330191.331442155]: Bond broken, exiting
Serafadam commented 11 months ago

Hi, thanks for the detailed information. Can you try running stereo in low bandwidth mode and see if that changes things?
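Something along these lines in the params file should enable it (a sketch; please double-check the parameter names against the docs for your driver version):

/oak:
  stereo_i_low_bandwidth: true
  stereo_i_low_bandwidth_quality: 50   # JPEG quality, percent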

onurbingol commented 11 months ago

@Serafadam I just received a programming board by mail and reflashed the firmware of both cameras as a last resort. I am testing them now and also playing with the stereo configuration. They haven't failed during the last hour, but these errors seem to occur randomly. I'll update this ticket when I have more information.

If it still fails with the new firmware, I'll test the low bandwidth mode for stereo only.

onurbingol commented 11 months ago

After a couple of restarts with different config values, mainly changing the spatial filter parameters and the bilateral sigma value, the crash loop started again. At that point, I turned on low bandwidth mode at 50%, but it didn't solve the issue. After waiting ~1 hour, the camera started responding again.

When it is in the crash loop (i.e., continuously crashing no matter what), switching to the default config or any config variation still fails. I am wondering if there is a way to run a hard reset before starting the ROS node, to make sure the camera config is cleared from the device. I am sure it is designed to delete the config on restart, but it would be great to test whether a hard reset on the device would recover it from the crash loop.

Serafadam commented 11 months ago

This is a long shot, but could you try checking whether setting the parameter imu_i_sync_method: COPY helps with the issue?

onurbingol commented 11 months ago

I am not using the IMU (at least not for the current application) and I could even turn it off if possible (edit: I see there is a camera_i_enable_imu config variable for this purpose).

In any case, I'll add the parameter you mentioned and leave it running over the weekend to see if it helps.
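The relevant lines in my params file will look roughly like this (a sketch; the enable flag is left commented out while I test COPY first):

/oak:
  imu_i_sync_method: COPY
  # camera_i_enable_imu: false   # alternative: disable the IMU entirely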

onurbingol commented 11 months ago

@Serafadam unfortunately, imu_i_sync_method: COPY didn't work.

I set the config variable on Friday and restarted the camera with a variety of configs (as discussed previously, the config variables I am mainly focused on are the ones starting with stereo_). Last Friday and this Monday, everything was working fine. I tested a variety of config options, reverted back, and tested again a couple of hours later. No issues at all.

I restarted the container today (Thursday) and after that, subscribing to any image topic started crashing the ROS node. Both cameras were working fine before I restarted the containers.

In summary, restarting the containers after running the cameras continuously for a couple of days (e.g. 4 or more) ends in a camera/ROS node crash loop. The loop continues for hours, and usually the cameras start working again (i.e. the crash loop ends) very early in the morning. The CPU load stayed at around the 50% level for both cameras during the 4-day testing period.

Previously, during the crash loop, I was able to connect and get images from the cameras when I used the DepthAI Python SDK directly. That was never an issue, but the ROS node always fails during the loop.

Serafadam commented 11 months ago

I'll try to set up a testing scenario next week, but could you additionally try running those nodes with the debugger enabled?

onurbingol commented 11 months ago

@Serafadam I am not sure how to enable the debugger without editing the files in the container (I am guessing you are talking about gdb, not setting DEPTHAI_DEBUG=1). Maybe I am wrong, but I usually see a debug roslaunch argument in some packages which triggers gdb when debug:=1 (this can be done using launch-prefix).

It looks like I failed to mention something important in my bug report. In my setup, I have a separate roscore instance running in a different Docker container, and I have to set ROS_MASTER_URI and ROS_IP when running the luxonis/depthai-ros:v2.7.4-noetic Docker image. All the containers use host networking, but I still have to set the IP and URI variables correctly. This is a working setup with my other GigE cameras; I have been using it for ~2 years and haven't had any issues so far.
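For context, the driver container is started roughly like this (a sketch; addresses and paths are placeholders):

# separate roscore container + host networking, so master URI and IP must be set explicitly
docker run --rm --network host \
  -e ROS_MASTER_URI=http://<roscore-host-ip>:11311 \
  -e ROS_IP=<this-host-ip> \
  -v /path/to/configs:/configs \
  luxonis/depthai-ros:v2.7.4-noetic \
  /bin/bash -c "roslaunch depthai_ros_driver rgbd_pcl.launch params_file:=/configs/camera.yaml --wait"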

Considering this information, let me share the depthai-ros logs again:

[ INFO] [1691467550.249111635]: Camera with MXID: 194430109145801300 and Name: x.y.z.w connected!
2023-08-08T04:05:50.249173653Z [ INFO] [1691467550.249143466]: PoE camera detected. Consider enabling low bandwidth for specific image topics (see readme).
[ INFO] [1691467550.275514959]: Device type: OAK-D-PRO-POE-FF
2023-08-08T04:05:50.284073854Z [ INFO] [1691467550.284023358]: Pipeline type: RGBD
[ INFO] [1691467550.341467682]: NN Family: mobilenet
[ INFO] [1691467550.375168549]: NN input size: 300 x 300. Resizing input image in case of different dimensions.
[ INFO] [1691467550.931222870]: Finished setting up pipeline.
2023-08-08T04:05:52.818230142Z [ INFO] [1691467552.818148006]: Camera ready!
[depth_image_to_rgb_pointcloud-4] process has finished cleanly
log file: /root/.ros/log/abc_1691432562/depth_image_to_rgb_pointcloud-4*.log
[ INFO] [1691467565.906746115]: Bond broken, exiting

The last line on the log seemed very interesting and here are some issues that I found about "broken bond":

I am not sure if the "broken bond" I see in the OAK-D logs is the same nodelet issue mentioned in those links. Since the broken-bond issue happens whenever I subscribe to the topic, it feels like it might be related; it is still a guess, for sure. One other thing to mention: the separate roscore instance works fine with my other ROS Docker containers, but I don't think any of them uses nodelets.

My observation is that there are two main issues mentioned in this ticket:

  1. CPU throttling, which I believe I solved by reducing the FPS as discussed in the docs
  2. Broken bond, which sounds like a DNS/name-resolution issue

Do these make any sense to you?

Edit: I also tested luxonis/depthai-ros:v2.7.5-noetic, but observed no change.

Serafadam commented 11 months ago

onurbingol commented 10 months ago

Looks like it might be an issue with nodelets and ROS1 after all.

When the ROS_IP and ROS_MASTER_URI values are set correctly while running the Docker container, everything seems to work fine with the roscore started by the rgbd_pcl.launch file.

The separate-roscore issue seems to be unrelated to this repo. Note to self: maybe it is time to start using ROS2 :)

Thanks for the help!

Serafadam commented 10 months ago

No problem, glad to help 🙂