`velodyne_hw_ros_wrapper_node` sometimes dies when launching sensors

NilaySener commented 1 month ago

Description

While running the Nebula driver on Leo Drive's Autonomous Test Vehicle, which is equipped with 4 Velodyne VLP-16 and 1 Velodyne VLS-128 sensors, the velodyne_hw_ros_wrapper_node dies randomly.

Expected Behavior

All LIDAR sensors (4 x Velodyne VLP-16 and 1 x Velodyne VLS-128) should publish ROS2 messages consistently and reliably during the operation of the Nebula driver on the Autonomous Test Vehicle.

Actual Behavior

After the lidars are launched, either all of them come up without any problems or some of the lidar component containers die at random. Some of the test output from the node failures is shown below:

Test 1: The Velodyne VLP-16 node is dying


1722353283.5760190 [component_container-7] VelodyneHwInterface::StringCallback: {"volt_temp":{"bot":{"i_out":2099,"pwr_1_2v":986,"lm20_temp":1110,"pwr_5v":2065,"pwr_2_5v":2048,"pwr_3_3v":2706,"pwr_v_in":936,"pwr_1_25v":0},"top":{"hv":2685,"ad_temp":614,"lm20_temp":1099,"pwr_5v":2070,"pwr_2_5v":2047,"pwr_3_3v":2690,"pwr_5v_raw":2182,"pwr_vccint":974}},"vhv":353,"adc_nf":[14],"adc_stats":[{"mean":14.3,"stddev":0.578}],"ixe":1}
1722353283.5872929 [component_container-7] VelodyneHwInterface::StringCallback: {"gps":{"pps_state":"Locked","position":"41 01.28703566N 028 53.20751111"},"motor":{"state":"On","rpm":602,"lock":"On","phase":26967},"laser":{"state":"On"}}
1722353283.5899334 [component_container-7] [0m[INFO 1722353283.589781410] [sensing.lidar.middle_right.velodyne_hw_interface_ros_wrapper_node]: UDP Driver Started (VelodyneHwInterfaceRosWrapper() at /home/golf/projects/autoware.golf.ups/src/sensor_component/external/nebula/nebula_ros/src/velodyne/velodyne_hw_interface_ros_wrapper.cpp:51)
1722353284.6931584 [ERROR] [component_container-7]: process has died [pid 147464, exit code -11, cmd '/opt/ros/humble/lib/rclcpp_components/component_container --ros-args -r __node:=pointcloud_container -r __ns:=/sensing/lidar/middle_right/pointcloud_preprocessor -p use_sim_time:=False -p wheel_radius:=0.315 -p wheel_width:=0.1 -p wheel_base:=2.64 -p wheel_tread:=1.75 -p front_overhang:=0.99 -p rear_overhang:=0.81 -p left_overhang:=0.14 -p right_overhang:=0.14 -p vehicle_height:=1.86 -p max_steer_angle:=0.6105'].

Test 2: The Velodyne VLS-128 node is dying


1722351592.2855313 [component_container_mt-3] VelodyneHwInterface::StringCallback: {"volt_temp":{"bot":{"pwr_1_0v":1646,"pwr_1_1v":1774,"pwr_1_2v":1966,"pwr_2_5v":4080,"lm20_temp":1047,"valid":true},"top":{"hv":2078,"ad_temp":603,"lm20_temp":1064,"pwr_5v":2051,"pwr_2_5v":2065,"pwr_3_3v":2739,"pwr_raw":1558,"pwr_vccint":607}},"ixe":1}
1722351592.7853172 [component_container_mt-3] expired...
1722351592.7855875 [component_container_mt-3] asyncOnConnect: Operation canceled
1722351592.7884018 [component_container_mt-3] [0m[INFO 1722351592.788190431]
1722351592.7895777 [component_container_mt-3] *** Aborted at 1722351592 (unix time) try "date -d @1722351592" if you are using GNU date ***
1722351592.7930715 [component_container_mt-3] PC: @                0x0 (unknown
1722351592.7945938 [component_container_mt-3] *** SIGSEGV (@0x0) received by PID 58105 (TID 0x7ffba97da640) from PID 0; stack trace: ***
1722351592.7977173 [component_container_mt-3]     @     0x7ffbe006e006 google::(anonymous namespace)::FailureSignalHandler()
1722351592.7981749 [component_container_mt-3]     @     0x7ffbe4c42520 (unknown)
1722351677.3886361 [component_container_mt-3]     @                0x0 (unknown)
1722351593.8865998 [ERROR] [component_container_mt-3]: process has died [pid 58105, exit code -11, cmd '/opt/ros/humble/lib/rclcpp_components/component_container_mt --ros-args -r __node:=pointcloud_container -r __ns:=/sensing/lidar/top/pointcloud_preprocessor -p use_sim_time:=False -p wheel_radius:=0.315 -p wheel_width:=0.1 -p wheel_base:=2.64 -p wheel_tread:=1.75 -p front_overhang:=0.99 -p rear_overhang:=0.81 -p left_overhang:=0.14 -p right_overhang:=0.14 -p vehicle_height:=1.86 -p max_steer_angle:=0.6105'].
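
For triage across many launch logs, the crash lines can be extracted automatically. The following is a minimal Python sketch, not part of Nebula or the ROS 2 launch tooling: the regular expression is inferred only from the two excerpts above, and the names `DIED_RE`, `describe_exit`, and `find_crashes` are hypothetical. It relies on the convention that a negative exit code is the number of the signal that ended the process, so `-11` maps to `SIGSEGV`.

```python
import re
import signal

# Pattern inferred from the two log excerpts above; other launch
# versions may format this line differently.
DIED_RE = re.compile(
    r"\[ERROR\] \[(?P<proc>[^\]]+)\]: process has died "
    r"\[pid (?P<pid>\d+), exit code (?P<code>-?\d+)"
)


def describe_exit(code: int) -> str:
    """Return a signal name for a negative exit code, else the plain code."""
    if code < 0:
        try:
            return signal.Signals(-code).name
        except ValueError:
            # Signal number unknown on this platform.
            return f"signal {-code}"
    return f"exit {code}"


def find_crashes(log_text: str) -> list[tuple[str, int, str]]:
    """Return (process name, pid, exit description) per 'process has died' line."""
    return [
        (m["proc"], int(m["pid"]), describe_exit(int(m["code"])))
        for m in DIED_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    # Shortened copy of the Test 1 crash line; the cmd string is abbreviated.
    sample = (
        "1722353284.6931584 [ERROR] [component_container-7]: process has died "
        "[pid 147464, exit code -11, cmd '...']."
    )
    for proc, pid, why in find_crashes(sample):
        print(f"{proc} (pid {pid}) died with {why}")
```

Both excerpts report exit code -11, so both containers were terminated by a segmentation fault. That matches the `SIGSEGV` line in the Test 2 trace.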

Complete Log Files

If you would like to examine the given scenarios in more detail, you can access the launch logs of the scenarios from the links below:

Test1

Test2

Additional Information

Sensor Kit: golf_sensor_kit_launch
Nebula Driver Version: d9aaefc9a4c06f6dae86cd7ef22f6353f1379e4f
Test Vehicle Details: https://github.com/autowarefoundation/autoware.universe/issues/8114

Please let me know if additional information is required or if there are any specific tests that should be performed to help identify the root cause of this issue.

knzo25 commented 1 month ago

@NilaySener Thanks for raising this issue. We run 1xVLS128 + 2-3 VLP16 and currently do not face this issue.

Things that could give us insight into this issue:

Does this happen if the driver itself is not in a container? (Containers are intrinsically more difficult to debug.)
Can you compile just the driver in debug mode or with debug symbols? This will give us more info on where it actually dies.
If this does not happen all the time, would it happen when you replay a rosbag or pcap? (This way we could reproduce it locally.)

As a note: I see you are using the GPS's PPS, right? @drwnz, we do not currently use it, right?

NilaySener commented 1 month ago

Hi @knzo25, Thank you for the quick response. Here are the answers to the questions you raised:

Does the issue occur when the driver itself is not in the containers?

Yes, the problem is still observed. Here are the log files and the launch.xml file I prepared to check this: launch_log_0.txt launch_log_1.txt all_lidar.launch.xml

Can you compile just the driver with debug symbols?

The driver is currently compiled with debug symbols. It was built with the following command:

colcon build --symlink-install --cmake-args -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_EXPORT_COMPILE_COMMANDS=1

Does the issue occur when replaying a ROS bag or pcap file?

Here you can download the pcap file I recorded with 4 x VLP-16. I did not record the VLS-128 because it was not on the same interface as these lidars. But in this scenario (4 x VLP-16) I also observed the launch problem.

Regarding the GPS PPS signal usage

Yes, all sensors are fed with GPRMC and PPS signals.

If there is anything I need to provide additional information about, please let me know.

drwnz commented 1 month ago

We do use PPS signals to synchronize the LiDAR, but they are generated from an ECU GPIO rather than from GNSS. However, we don't use GPRMC, and timestamping is done from UDP packet header timestamps. Do you still get the same issue if you remove the HW monitor in the launch?

knzo25 commented 1 month ago

@NilaySener I just tried to reproduce the error with the data and launcher provided, but it works without issues on my end.

My setup:

main branch
your data
your launcher (except I changed all the IPs to 127.0.0.1)
Replayed with: https://github.com/tier4/pcap_replay

The logs only tell us that the hw interface dies, but not where. Since the error can be reproduced with isolated examples (without Autoware, for example), I think you could try https://github.com/pal-robotics/backward_ros to get more information about where the crash happens.

NilaySener commented 4 weeks ago

Hi, thank you very much for your answers and suggestions.

I will remove the HW monitor from the launch file and share the results.

I also noticed that when the node dies, it only goes into the following callback once. https://github.com/tier4/nebula/blob/d9aaefc9a4c06f6dae86cd7ef22f6353f1379e4f/nebula_ros/src/velodyne/velodyne_hw_interface_ros_wrapper.cpp#L228-L235

As for the pcap file, thank you for testing it @knzo25 but I have a question:

I've had to launch it twenty times or more in a row to reproduce it on the vehicle. Have you had the opportunity to repeat it that many times?
I wanted to use this repo to match your testing method, but I don't have permission. When I fed the .pcap file to Nebula using tcpreplay, I encountered a problem, so I cannot feed Nebula with the .pcap file right now. How can I use the repo you used for the replay?
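
Repeating the launch twenty or more times can be automated. The following is a minimal Python sketch under stated assumptions, not a tool from Nebula or ROS 2: it starts a launch command, lets it run for a fixed observation window, stops it with `SIGINT`, and checks the captured output for the `process has died` message seen in the logs above. The function names `launch_once` and `count_failures`, the 30-second window, and the example command are all hypothetical.

```python
import signal
import subprocess
import sys


def launch_once(cmd: list[str], duration_s: float) -> bool:
    """Run cmd for up to duration_s seconds; return True if a crash line was logged."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    try:
        output, _ = proc.communicate(timeout=duration_s)
    except subprocess.TimeoutExpired:
        # Still running after the observation window: stop it like Ctrl-C would.
        proc.send_signal(signal.SIGINT)
        try:
            output, _ = proc.communicate(timeout=30)
        except subprocess.TimeoutExpired:
            # The process ignored SIGINT; force it down and collect the output.
            proc.kill()
            output, _ = proc.communicate()
    return "process has died" in output


def count_failures(cmd: list[str], runs: int, duration_s: float) -> int:
    """Launch cmd `runs` times in a row and count the runs that logged a crash."""
    return sum(launch_once(cmd, duration_s) for _ in range(runs))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Everything after the script name is treated as the launch command.
    failures = count_failures(sys.argv[1:], runs=20, duration_s=30.0)
    print(f"{failures} of 20 launches logged a crash")
```

Assuming the script is saved as `repeat_launch.py` (a hypothetical name), it could be run as `python3 repeat_launch.py ros2 launch ./all_lidar.launch.xml`, with the path adjusted to wherever the launch file lives. Matching on the log message rather than on the launcher's own exit status is deliberate: in the logs above it is a child container that dies, not the launch process itself.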
