luxonis / depthai-ros

Official ROS Driver for DepthAI Sensors.
MIT License
239 stars 173 forks source link

[BUG] Watchdog Humble can't start OAK-D Pro camera #432

Closed michaelnguyen11 closed 8 months ago

michaelnguyen11 commented 8 months ago

Describe the bug I ran camera.launch.py with the use_rviz:=true option and just let it run for around 10 minutes without touching it. The Watchdog (diagCB) caught the sys_logger error and restarted the camera. As the code shows, the restart function should stop and start 'the camera again. But in my case, the watchdog can't start the camera again.

Minimal Reproducible Example Run camera.launch.py in depthai_ros_driver with the OAK-D Pro camera in around 10–15 minutes, leaving it to run alone without touching it.

Expected behavior The Watchdog (diagCB) should restart the camera when error happens.

Screenshots Screenshot from 2023-10-12 13-00-48

Attach system log

camera.launch.py system logs :

[component_container-2] [INFO] [1697088818.017916878] [oak]: Device type: OAK-D-PRO-FF
[component_container-2] [INFO] [1697088818.020849370] [oak]: Pipeline type: RGBD
[component_container-2] [INFO] [1697088818.269628676] [oak]: Finished setting up pipeline.
[component_container-2] [INFO] [1697088819.684924607] [oak]: Published URDF
[component_container-2] [INFO] [1697088819.687891904] [oak]: Camera ready!
[INFO] [launch_ros.actions.load_composable_nodes]: Loaded node '/oak' in container '/oak_container'
[component_container-2] [INFO] [1697088819.702943274] [oak_state_publisher]: got segment oak
[component_container-2] [INFO] [1697088819.702973357] [oak_state_publisher]: got segment oak-d-base-frame
[component_container-2] [INFO] [1697088819.702979473] [oak_state_publisher]: got segment oak_imu_frame
[component_container-2] [INFO] [1697088819.702984629] [oak_state_publisher]: got segment oak_model_origin
[rviz2-1] [INFO] [1697088822.084673445] [rviz2]: Stereo is NOT SUPPORTED
[component_container-2] [ERROR] [1697089397.011889680] [oak]: Camera diagnostics error: No Data
[component_container-2] [ERROR] [1697089397.012008975] [oak]: Restarting camera
[component_container-2] [INFO] [1697089397.012047079] [oak]: Stopping camera.
[ERROR] [component_container-2]: process has died [pid 879, exit code -11, cmd '/opt/ros/humble/lib/rclcpp_components/component_container --ros-args --log-level info --ros-args -r __node:=oak_container -r __ns:=/'].

/diagnostics topic logs:

➜  / ros2 topic echo /diagnostics
....
---
header:
  stamp:
    sec: 1697089396
    nanosec: 10902956
  frame_id: ''
status:
- level: "\0"
  name: 'oak: sys_logger'
  message: System Information
  hardware_id: oak_1844301021BE850E00_OAK-D-PRO-FF
  values:
  - key: System Information
    value: "System Information: \n  Leon CSS CPU Usage: 68.163\n  Leon MSS CPU Usage: 12.887\n Ddr Memory Usage: 164.341\n Ddr Memory Total: 337..."
---
header:
  stamp:
    sec: 1697089397
    nanosec: 11158748
  frame_id: ''
status:
- level: "\x02"
  name: 'oak: sys_logger'
  message: No Data
  hardware_id: oak_1844301021BE850E00_OAK-D-PRO-FF
  values: []

Additional context I observed the same issue with our custom camera based on the OAK-FFC-4P when running the camera for around 10–15 minutes.

michaelnguyen11 commented 8 months ago

Attached syslogs with debug logs enable : export DEPTH_DEBUG=1

watchdog_issue_with_debug_logs.txt

Attached camera config yaml file : camera_config.txt

michaelnguyen11 commented 8 months ago

With camera configuration :

stereo:
      i_publish_left_rect: true
      i_publish_right_rect: true
      i_publish_synced_rect_pair: false

I found that stop() of Watchdog stuck at syncTimer->reset() when try close stereo queue. Hence the watchdog can't complete the stop() to switch to start() the camera again.

[component_container-2] [INFO] [1697096923.857812067] [oak]: Stopping camera.
[component_container-2] [INFO] [1697096923.857891218] [oak]: Node Name close Queue rgb
[component_container-2] [2023-10-12 07:48:43.890] [depthai] [debug] DataOutputQueue (rgb_isp) closed
[component_container-2] [2023-10-12 07:48:43.920] [depthai] [debug] DataInputQueue (rgb_control) closed
[component_container-2] [INFO] [1697096923.921041351] [oak]: Node Name close Queue stereo
[component_container-2] [INFO] [1697096923.921100444] [oak]: [Hiep] Stopping left.
[component_container-2] [2023-10-12 07:48:43.960] [depthai] [debug] DataOutputQueue (left_mono) closed
[component_container-2] [2023-10-12 07:48:43.961] [depthai] [debug] DataInputQueue (left_control) closed
[component_container-2] [INFO] [1697096923.961591764] [oak]: [Hiep] Stopping right.
[component_container-2] [2023-10-12 07:48:44.027] [depthai] [debug] DataOutputQueue (right_mono) closed
[component_container-2] [2023-10-12 07:48:44.045] [depthai] [debug] DataInputQueue (right_control) closed
[component_container-2] [INFO] [1697096924.045518059] [oak]: [Hiep] Stopping stereoQ.
[component_container-2] [2023-10-12 07:48:44.060] [depthai] [debug] DataOutputQueue (stereo_stereo) closed
[component_container-2] [INFO] [1697096924.060586095] [oak]: [Hiep] Stopping left syncTimer.

Workaround : reset the rclcpp::TimerBase::SharedPtr syncTimer pointer when it's initialized only. (inside stereo.cpp file).

--- a/depthai-ros/depthai_ros_driver/src/dai_nodes/stereo.cpp
+++ b/depthai-ros/depthai_ros_driver/src/dai_nodes/stereo.cpp
@@ -332,12 +332,14 @@ void Stereo::closeQueues() {
     if(ph->getParam<bool>("i_publish_topic")) {
         stereoQ->close();
     }
-    if(ph->getParam<bool>("i_publish_left_rect") || ph->getParam<bool>("i_publish_synced_rect_pair")) {
-        syncTimer->reset();
+    if(ph->getParam<bool>("i_publish_left_rect")) {
+        if (ph->getParam<bool>("i_publish_synced_rect_pair"))
+            syncTimer->reset();
         leftRectQ->close();
     }
-    if(ph->getParam<bool>("i_publish_right_rect") || ph->getParam<bool>("i_publish_synced_rect_pair")) {
-        syncTimer->reset();
+    if(ph->getParam<bool>("i_publish_right_rect")) {
+        if (ph->getParam<bool>("i_publish_synced_rect_pair"))
+            syncTimer->reset();
         rightRectQ->close();
     }

Then the issue is gone:

[component_container-2] [ERROR] [1697100499.066037746] [oak]: Camera diagnostics error: No Data
[component_container-2] [ERROR] [1697100499.066213891] [oak]: Restarting camera
[component_container-2] [INFO] [1697100499.066299101] [oak]: Stopping camera.
[component_container-2] [INFO] [1697100499.066380097] [oak]: Node Name close Queue rgb
[component_container-2] [INFO] [1697100499.141800867] [oak]: Node Name close Queue stereo
[component_container-2] [INFO] [1697100499.141912322] [oak]: [Hiep] Stopping left.
[component_container-2] [INFO] [1697100499.185807208] [oak]: [Hiep] Stopping right.
[component_container-2] [INFO] [1697100499.257864673] [oak]: [Hiep] Stopping stereoQ.
[component_container-2] [INFO] [1697100499.293451897] [oak]: [Hiep] Stopping leftRectQ.
[component_container-2] [INFO] [1697100499.370740181] [oak]: [Hiep] Stopping rightRectQ.
[component_container-2] [INFO] [1697100499.371812074] [oak]: Node Name close Queue imu
[component_container-2] [INFO] [1697100499.373055422] [oak]: Node Name close Queue sys_logger
[component_container-2] [INFO] [1697100501.682835897] [oak]: set cam running to false.
[component_container-2] [INFO] [1697100501.682884645] [oak]: Starting camera.
...
[component_container-2] [INFO] [1697100505.943126608] [oak]: Camera ready!

Normally, the shared_ptr doesn't need initialization to be reset. But I don't know why with this rclcpp: TimerBase::SharedPtr syncTimer shared pointer, it stucks at reset() when the shared pointer is not initialized.

michaelnguyen11 commented 8 months ago

The diagnostic_msgs show the status level error with an empty message every 10-15 minutes. Therefore, the Watchdog catches and restarts the camera every 10-15 minutes, even when I leave the camera to run alone without touching it.

I think this annoying issue should be fixed in the future.

Serafadam commented 8 months ago

Hi, thanks for the report, bugfix related to timer reset is included on this branch, can you test and verify it also works for you? Regarding camera restarts, does this happen on FFC device, or OAK-D-PRO as well?

michaelnguyen11 commented 8 months ago

Hi @Serafadam,

With the new change, I confirm that the Watchdog can restart the device when it caught diagnostic_msgs with status level = ERROR.

[component_container-2] [INFO] [1697260526.170789994] [oak]: Device type: OAK-D-PRO-FF
[component_container-2] [INFO] [1697260526.174930563] [oak]: Pipeline type: RGBD
[component_container-2] [INFO] [1697260526.439236006] [oak]: Finished setting up pipeline.
[component_container-2] [INFO] [1697260528.102532038] [oak]: Published URDF
[component_container-2] [INFO] [1697260528.104386311] [oak]: Camera ready!
...
[component_container-2] [ERROR] [1697261116.308186974] [oak]: Camera diagnostics error: No Data
[component_container-2] [ERROR] [1697261116.308232156] [oak]: Restarting camera
[component_container-2] [INFO] [1697261116.308244264] [oak]: Stopping camera.
[component_container-2] [INFO] [1697261118.720689267] [oak]: Starting camera.
[component_container-2] [INFO] [1697261118.720831163] [oak]: No ip/mxid specified, connecting to the next available device.
[component_container-2] [INFO] [1697261121.230660273] [oak]: Camera with MXID: 1844301021BE850E00 and Name: 1.1 connected!
...
[component_container-2] [INFO] [1697261122.894858264] [oak]: Camera ready!

Regarding the code changes, the leftRectQ->close(); and rightRectQ->close(); are called twice, in 2 case :

Regarding camera restarts issue, it happens both on FFC device and OAK-D-Pro device when the device is running around 10-15 minutes. I investigated that issue and found the root cause is relate to /diagnostics topic.

By default, the /diagnostics's diagnostic_updater update each 1 second (line 00557) to get System Information to put to loggerQ, then produceDiagnostics callback function check the queue, if the logData is not available in 1 second, it escalates diagnostic_msgs status level to ERROR and Watchdog catch that status to restart the camera.

I think the 1 second here is quite large time to read System Information, but in some moment, the /diagnostics topic can't read the System Information, hence the camera is restarted.

For me, to avoid this issue, instead of use loggerQ->tryGet, I used loggerQ->get with timeout is 5 seconds (quite magic number actually), to ensure that the camera is only restarted when it's not available in 5 seconds.

Serafadam commented 8 months ago

Hi @michaelnguyen11, the code has been updated to reflect those changes, you can refer to the newest branches

michaelnguyen11 commented 8 months ago

I @Serafadam ,

I've left the camera running overnight and don't see the watchdog restart the camera. I believe that issue is fixed. Thank you !