stereolabs / zed-ros-wrapper

ROS wrapper for the ZED SDK
https://www.stereolabs.com/docs/ros/
MIT License

[Question] Zed_wrapper shuts down after some time #674

Closed NikoBach2 closed 3 years ago

NikoBach2 commented 3 years ago

Hi

We are running 7 ZED2 cameras on 7 Jetson TX2 boards with JetPack 4.3. The ROS master is running on an external machine with Ubuntu 16.04. Lately we have seen the zed_wrapper nodes shut down after some time. The nodes do not necessarily shut down at the same time; it can take anywhere from less than an hour to a couple of days after the last reboot. It also varies whether it is the publish_state node or the zed_node that shuts down.

We get the following message:

[roslaunch][ERROR] 2021-02-22 12:37:40,620: ================================================================================REQUIRED process [portal_1_camera_0/zed_node-2] has died!
process has died [pid 7475, exit code -9, cmd /home/nvidia/catkin_ws/devel/lib/zed_wrapper/zed_wrapper_node __name:=zed_node __log:=/home/nvidia/.ros/log/a6d0f202-7509-11eb-94e4-98eecb97be3c/portal_1_camera_0-zed_node-2.log].
log file: /home/nvidia/.ros/log/a6d0f202-7509-11eb-94e4-98eecb97be3c/portal_1_camera_0-zed_node-2*.log
Initiating shutdown!
================================================================================ 

Is there a way to tell why the process has died?

Please ask if you need further information.

Best regards Nikolaj

Myzhar commented 3 years ago

Hi @NikoBach2, is there any information in the log files? Log file: /home/nvidia/.ros/log/a6d0f202-7509-11eb-94e4-98eecb97be3c/portal_1_camera_0-zed_node-2*.log

NikoBach2 commented 3 years ago

Hi @Myzhar, the only file in that folder is the roslaunch.log file.

Myzhar commented 3 years ago

From what I know, exit code -9 means the process was killed with SIGKILL, which usually points to "Out of memory". What kind of task are the ZED nodes performing? What modules have you activated?

NikoBach2 commented 3 years ago

We are running the zed_wrapper zed2.launch, and once in a while we start and stop an SVO recording through the zed_node/start_svo_recording and zed_node/stop_svo_recording services. At the moment we are recording every hour, and in the meantime the zed node is just running without recording anything.

Myzhar commented 3 years ago

I suggest you monitor the memory usage to see if the crashes are in some way related to the status of the recording services.
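As a rough illustration of this kind of monitoring, here is a minimal Python sketch that samples the resident memory (VmRSS) of a process from `/proc/<pid>/status` on Linux. The function names, the sampling interval, and the output format are assumptions made for this example; they are not part of the ZED wrapper or ROS.

```python
import re
import time


def parse_vm_rss_kib(status_text):
    """Return the VmRSS value in KiB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid):
    """Read the resident set size of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vm_rss_kib(status_file.read())


def log_rss(pid, interval_s=60.0, samples=10):
    """Print a timestamped RSS sample every interval_s seconds (hypothetical helper)."""
    for _ in range(samples):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} pid={pid} rss_kib={read_rss_kib(pid)}")
        time.sleep(interval_s)
```

Calling `log_rss` with the PID of the zed_wrapper_node process and comparing the timestamps with the hourly recording schedule should show whether memory grows around the start and stop of each SVO recording.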

NikoBach2 commented 3 years ago

I will try to do that.

NikoBach2 commented 3 years ago

You were right, it was a memory issue, which occurred when I ran the zed_wrapper node in debugger mode. I was using the debugger mode to find out why the nodes broke down. Now I have a new error message, with exit code -11.

[roslaunch][ERROR] 2021-02-24 04:00:08,366: ================================================================================REQUIRED process [portal_1_camera_0/zed_node-2] has died!
process has died [pid 3844, exit code -11, cmd /home/nvidia/catkin_ws/devel/lib/zed_wrapper/zed_wrapper_node __name:=zed_node __log:=/home/nvidia/.ros/log/b7cdfa5e-75c3-11eb-927c-98eecb97be3c/portal_1_camera_0-zed_node-2.log].
log file: /home/nvidia/.ros/log/b7cdfa5e-75c3-11eb-927c-98eecb97be3c/portal_1_camera_0-zed_node-2*.log
Initiating shutdown!
================================================================================

And another camera has this error:

[roslaunch][ERROR] 2021-02-23 18:00:02,582: ================================================================================REQUIRED process [portal_1_camera_2/zed_node-2] has died!
process has died [pid 21830, exit code -6, cmd /home/nvidia/catkin_ws/devel/lib/zed_wrapper/zed_wrapper_node __name:=zed_node __log:=/home/nvidia/.ros/log/b7cdfa5e-75c3-11eb-927c-98eecb97be3c/portal_1_camera_2-zed_node-2.log].
log file: /home/nvidia/.ros/log/b7cdfa5e-75c3-11eb-927c-98eecb97be3c/portal_1_camera_2-zed_node-2*.log
Initiating shutdown!
================================================================================

What does that mean?

NikoBach2 commented 3 years ago

@Myzhar do you have a clue?

Myzhar commented 3 years ago

The -11 code means "Invalid access to storage" (SIGSEGV, a segmentation fault); check that the ROS log folder has not filled the TX2 system space. The -6 code means "Abnormal termination" (SIGABRT); there can be many causes for this, so you should investigate the status of the TX2 board to better understand what's happening. I suggest you try jtop, a very useful tool for monitoring the status of Nvidia Jetson boards.
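To check whether the log folder is filling the disk, a small Python sketch like the following could be run on the board. The 10% threshold and the function names are assumptions chosen for illustration; the `~/.ros/log` location matches the log path shown in the error messages above.

```python
import os
import shutil


def free_fraction(path):
    """Return the fraction of free space on the filesystem that holds path."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def is_low_on_space(path, min_free_fraction=0.10):
    """True when free space drops below the (assumed) threshold."""
    return free_fraction(path) < min_free_fraction


def check_ros_log_dir():
    """Report free space for the default ROS log folder (hypothetical helper)."""
    log_dir = os.path.expanduser("~/.ros/log")
    print(f"{log_dir}: {free_fraction(log_dir):.1%} free, low={is_low_on_space(log_dir)}")
```

Running `check_ros_log_dir()` periodically, or right after a crash, gives a quick indication of whether disk space is a plausible factor.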

A tip: to better understand the meaning of the exit codes, you can look at the files signum.h and signum-generic.h in the usr folder.
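As an alternative to reading the headers, the signal behind a negative exit code can be looked up with Python's standard `signal` module. This is a minimal sketch that assumes the negative codes reported by roslaunch follow the Python `subprocess` convention, where a return code of -N means the process was terminated by signal N; the function name is made up for this example.

```python
import signal


def describe_exit_code(code):
    """Translate a process exit code into a short readable description."""
    if code >= 0:
        return f"exited with status {code}"
    try:
        name = signal.Signals(-code).name
    except ValueError:
        return f"terminated by unknown signal {-code}"
    return f"terminated by signal {-code} ({name})"


# The three codes reported in this thread.
for exit_code in (-9, -11, -6):
    print(exit_code, describe_exit_code(exit_code))
```

On Linux this maps -9 to SIGKILL, -11 to SIGSEGV, and -6 to SIGABRT.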

NikoBach2 commented 3 years ago

Thank you very much. I will try looking at it. :+1: