Closed michaelnguyen11 closed 8 months ago
Attached syslogs with debug logs enable : export DEPTH_DEBUG=1
watchdog_issue_with_debug_logs.txt
Attached camera config yaml file : camera_config.txt
With camera configuration :
stereo:
i_publish_left_rect: true
i_publish_right_rect: true
i_publish_synced_rect_pair: false
I found that stop()
of Watchdog stuck at syncTimer->reset()
when try close stereo queue. Hence the watchdog can't complete the stop()
to switch to start()
the camera again.
[component_container-2] [INFO] [1697096923.857812067] [oak]: Stopping camera.
[component_container-2] [INFO] [1697096923.857891218] [oak]: Node Name close Queue rgb
[component_container-2] [2023-10-12 07:48:43.890] [depthai] [debug] DataOutputQueue (rgb_isp) closed
[component_container-2] [2023-10-12 07:48:43.920] [depthai] [debug] DataInputQueue (rgb_control) closed
[component_container-2] [INFO] [1697096923.921041351] [oak]: Node Name close Queue stereo
[component_container-2] [INFO] [1697096923.921100444] [oak]: [Hiep] Stopping left.
[component_container-2] [2023-10-12 07:48:43.960] [depthai] [debug] DataOutputQueue (left_mono) closed
[component_container-2] [2023-10-12 07:48:43.961] [depthai] [debug] DataInputQueue (left_control) closed
[component_container-2] [INFO] [1697096923.961591764] [oak]: [Hiep] Stopping right.
[component_container-2] [2023-10-12 07:48:44.027] [depthai] [debug] DataOutputQueue (right_mono) closed
[component_container-2] [2023-10-12 07:48:44.045] [depthai] [debug] DataInputQueue (right_control) closed
[component_container-2] [INFO] [1697096924.045518059] [oak]: [Hiep] Stopping stereoQ.
[component_container-2] [2023-10-12 07:48:44.060] [depthai] [debug] DataOutputQueue (stereo_stereo) closed
[component_container-2] [INFO] [1697096924.060586095] [oak]: [Hiep] Stopping left syncTimer.
Workaround : reset the rclcpp::TimerBase::SharedPtr syncTimer
pointer when it's initialized only. (inside stereo.cpp file).
--- a/depthai-ros/depthai_ros_driver/src/dai_nodes/stereo.cpp
+++ b/depthai-ros/depthai_ros_driver/src/dai_nodes/stereo.cpp
@@ -332,12 +332,14 @@ void Stereo::closeQueues() {
if(ph->getParam<bool>("i_publish_topic")) {
stereoQ->close();
}
- if(ph->getParam<bool>("i_publish_left_rect") || ph->getParam<bool>("i_publish_synced_rect_pair")) {
- syncTimer->reset();
+ if(ph->getParam<bool>("i_publish_left_rect")) {
+ if (ph->getParam<bool>("i_publish_synced_rect_pair"))
+ syncTimer->reset();
leftRectQ->close();
}
- if(ph->getParam<bool>("i_publish_right_rect") || ph->getParam<bool>("i_publish_synced_rect_pair")) {
- syncTimer->reset();
+ if(ph->getParam<bool>("i_publish_right_rect")) {
+ if (ph->getParam<bool>("i_publish_synced_rect_pair"))
+ syncTimer->reset();
rightRectQ->close();
}
Then the issue is gone:
[component_container-2] [ERROR] [1697100499.066037746] [oak]: Camera diagnostics error: No Data
[component_container-2] [ERROR] [1697100499.066213891] [oak]: Restarting camera
[component_container-2] [INFO] [1697100499.066299101] [oak]: Stopping camera.
[component_container-2] [INFO] [1697100499.066380097] [oak]: Node Name close Queue rgb
[component_container-2] [INFO] [1697100499.141800867] [oak]: Node Name close Queue stereo
[component_container-2] [INFO] [1697100499.141912322] [oak]: [Hiep] Stopping left.
[component_container-2] [INFO] [1697100499.185807208] [oak]: [Hiep] Stopping right.
[component_container-2] [INFO] [1697100499.257864673] [oak]: [Hiep] Stopping stereoQ.
[component_container-2] [INFO] [1697100499.293451897] [oak]: [Hiep] Stopping leftRectQ.
[component_container-2] [INFO] [1697100499.370740181] [oak]: [Hiep] Stopping rightRectQ.
[component_container-2] [INFO] [1697100499.371812074] [oak]: Node Name close Queue imu
[component_container-2] [INFO] [1697100499.373055422] [oak]: Node Name close Queue sys_logger
[component_container-2] [INFO] [1697100501.682835897] [oak]: set cam running to false.
[component_container-2] [INFO] [1697100501.682884645] [oak]: Starting camera.
...
[component_container-2] [INFO] [1697100505.943126608] [oak]: Camera ready!
Normally, the shared_ptr
doesn't need initialization to be reset. But I don't know why with this rclcpp: TimerBase::SharedPtr syncTimer
shared pointer, it stucks at reset()
when the shared pointer is not initialized.
The diagnostic_msgs
show the status level error with an empty message every 10-15 minutes. Therefore, the Watchdog catches and restarts the camera every 10-15 minutes, even when I leave the camera to run alone without touching it.
I think this annoying issue should be fixed in the future.
Hi, thanks for the report, bugfix related to timer reset is included on this branch, can you test and verify it also works for you? Regarding camera restarts, does this happen on FFC device, or OAK-D-PRO as well?
Hi @Serafadam,
With the new change, I confirm that the Watchdog can restart the device when it caught diagnostic_msgs with status level = ERROR.
[component_container-2] [INFO] [1697260526.170789994] [oak]: Device type: OAK-D-PRO-FF
[component_container-2] [INFO] [1697260526.174930563] [oak]: Pipeline type: RGBD
[component_container-2] [INFO] [1697260526.439236006] [oak]: Finished setting up pipeline.
[component_container-2] [INFO] [1697260528.102532038] [oak]: Published URDF
[component_container-2] [INFO] [1697260528.104386311] [oak]: Camera ready!
...
[component_container-2] [ERROR] [1697261116.308186974] [oak]: Camera diagnostics error: No Data
[component_container-2] [ERROR] [1697261116.308232156] [oak]: Restarting camera
[component_container-2] [INFO] [1697261116.308244264] [oak]: Stopping camera.
[component_container-2] [INFO] [1697261118.720689267] [oak]: Starting camera.
[component_container-2] [INFO] [1697261118.720831163] [oak]: No ip/mxid specified, connecting to the next available device.
[component_container-2] [INFO] [1697261121.230660273] [oak]: Camera with MXID: 1844301021BE850E00 and Name: 1.1 connected!
...
[component_container-2] [INFO] [1697261122.894858264] [oak]: Camera ready!
Regarding the code changes, the leftRectQ->close();
and rightRectQ->close();
are called twice, in 2 case :
i_publish_synced_rect_pair
is set to True
and i_publish_left_rect
and i_publish_right_rect
are set to False
. i_publish_synced_rect_pair
is set to True
and i_publish_left_rect
and i_publish_right_rect
are set to True
.
I think it's quite redundant, I believe it can remove i_publish_synced_rect_pair
OR checking to reduce redundant close call.Regarding camera restarts issue, it happens both on FFC device and OAK-D-Pro device when the device is running around 10-15 minutes. I investigated that issue and found the root cause is relate to /diagnostics topic.
By default, the /diagnostics's diagnostic_updater
update each 1 second (line 00557) to get System Information
to put to loggerQ
, then produceDiagnostics
callback function check the queue, if the logData
is not available in 1 second, it escalates diagnostic_msgs
status level to ERROR
and Watchdog catch that status to restart the camera.
I think the 1 second
here is quite large time to read System Information
, but in some moment, the /diagnostics
topic can't read the System Information
, hence the camera is restarted.
For me, to avoid this issue, instead of use loggerQ->tryGet
, I used loggerQ->get
with timeout is 5 seconds (quite magic number actually), to ensure that the camera is only restarted when it's not available in 5 seconds.
Hi @michaelnguyen11, the code has been updated to reflect those changes, you can refer to the newest branches
I @Serafadam ,
I've left the camera running overnight and don't see the watchdog restart the camera. I believe that issue is fixed. Thank you !
Describe the bug I ran
camera.launch.py
with theuse_rviz:=true
option and just let it run for around 10 minutes without touching it. The Watchdog (diagCB) caught thesys_logger
error and restarted the camera. As the code shows, therestart function
shouldstop
andstart
'the camera again. But in my case, the watchdog can't start the camera again.Minimal Reproducible Example Run
camera.launch.py
indepthai_ros_driver
with the OAK-D Pro camera in around 10–15 minutes, leaving it to run alone without touching it.Expected behavior The Watchdog (diagCB) should restart the camera when error happens.
Screenshots
Attach system log
camera.launch.py
system logs :/diagnostics
topic logs:Additional context I observed the same issue with our custom camera based on the OAK-FFC-4P when running the camera for around 10–15 minutes.