ros2 / rosbag2

Apache License 2.0
285 stars 251 forks source link

Segfault when calling `rosbag2_transport::Recorder::stop()` #1852

Open rg-gravis opened 1 week ago

rg-gravis commented 1 week ago

Description

We occasionally have an issue where a rosbag2 Recorder node segfaults and dies when stop() is called. This ends up truncating the .mcap file being written, resulting in it being corrupted because it lacks trailing magic bytes.

Expected Behavior

No segfault, the bag is closed and the .mcap has its trailing bytes written.

Actual Behavior

The process dies with code -11. Logs:

[INFO] [1730717198.622841633] [main_bag_rosbag_recorder]: Listening for topics... [rosbag_run_recorder_node-9]
[INFO] [1730717198.622848700] [main_bag_rosbag_recorder]: Event publisher thread: Starting [rosbag_run_recorder_node-9]
[INFO] [1730717198.823090723] [main_bag_rosbag_recorder]: Subscribed to topic '/visualization/machine_interface/target_terrain' [rosbag_run_recorder_node-9]
....
[INFO] [1730717199.253250365] [main_bag_rosbag_recorder]: Overriding subscription profile for /rosout [rosbag_run_recorder_node-9]
[INFO] [1730717199.258032766] [main_bag_rosbag_recorder]: Subscribed to topic '/rosout' [rosbag_run_recorder_node-9]
[INFO] [1730717199.268296194] [main_bag_rosbag_recorder]: Recording... [rosbag_run_recorder_node-9]
...
Error:  [1730717288.011213136] [main_bag_rosbag_recorder]: discovery_future_.wait_for(100) return status: timeout [rosbag_run_recorder_node-9]
Error:  [rosbag_run_recorder_node-9]: process has died [pid 111, exit code -11, cmd '/home/dev_ws/install/rosbag_run_recorder/lib/rosbag_run_recorder/rosbag_run_recorder_node --ros-args --log-level INFO --ros-args'].

Logging from other nodes and many "subscribed to topic" messages have been excluded.

To Reproduce

We have not been able to consistently reproduce, and this error occurs only once out of every 50 or so calls to stop(). Contributing factors seem to be heavy CPU load and larger bags, and it does seem to be somewhat device-dependent. Unfortunately I cannot share the full bag, but its mcap summary is

library:   mcap go v1.4.1; libmcap 0.8.0                               
profile:   ros2                                                        
messages:  743886                                                      
duration:  1m28.990235383s                                             
start:     2024-11-04T11:46:38.837867611+01:00 (1730717198.837867611)  
end:       2024-11-04T11:48:07.828102994+01:00 (1730717287.828102994)  
compression:
    zstd: [36/36 chunks] [143.28 MiB/26.77 MiB (81.31%)] [308.08 KiB/sec] 

and it was recording on 59 topics.

We construct the node roughly as follows:

  rosbag2_storage::StorageOptions storage_options = {
      bagPath.string(), "mcap", 0, 0, 0,
      "zstd_fast",
      "", false};
  rosbag2_transport::RecordOptions record_options;
  record_options.topics = /* our own list of ~100 topics */;
  record_options.topic_qos_profile_overrides = {{"/rosout", rclcpp::RosoutQoS()}};
  record_options.rmw_serialization_format = std::string(rmw_get_serialization_format());

  auto writer = rosbag2_transport::ReaderWriterFactory::make_writer(record_options);

  return std::make_shared<rosbag2_transport::Recorder>(std::move(writer), std::make_shared<KeyboardHandler>(false), storage_options, record_options,"main_bag_rosbag_recorder");

and execute it with a MultiThreadedExecutor constructed using the default arguments and 4 threads (the executor is also running 2 other Recorder nodes but they record far less data and we haven't had any issues with them).

System (please complete the following information)

Additional context

This problem had also occurred when running via ros2 bag record, but there it manifested as the recorder apparently shutting down in response to SIGINT, but for some reason not closing the bag properly.