ros2 / rosbag2

Apache License 2.0

Possible data loss in case of large amount of topics and high publishing frequency #1430

Closed gtep96 closed 9 months ago

gtep96 commented 1 year ago

Description

I've encountered data loss when using ROS 2 bag recording: I have ~20 publishers, each publishing at 1000 msgs/sec per topic, and the first several messages can be lost during the initial recording phase. This results in missing data at the beginning of the bag file.

Expected Behavior

The ROS 2 bag recording should start capturing data immediately upon execution, and the recorded bag file should contain all relevant data from the beginning of the recording session. Are there any limitations on the number of topics, publishing frequency, or data size per message for the bag recorder?

Actual Behavior

Upon initiating the ROS2 bag recording, it takes some time before data is captured in the bag file. As a result, the initial moments of data are not properly recorded.

To Reproduce

  1. Run ros2 bag record --all
  2. Launch a node with many (20 in my case) publishers with a timer period = 1/1000
  3. Observe the recorded bag file and check for missing data at the beginning of the file.


Additional context

I've tried running rosbag2 with --include-unpublished-topics, but with no success; it can still lose some messages. Running with --no-discovery and explicitly specified topic names leads to an empty bag:


rosbag2_bagfile_information:
  version: 5
  storage_identifier: sqlite3
  duration:
    nanoseconds: 0
  starting_time:
    nanoseconds_since_epoch: 9223372036854775807
  message_count: 0
  topics_with_message_count:
    []
  compression_format: ""
  compression_mode: ""
  relative_file_paths:
    - rosbag2_2023_07_23-12_55_55_0.db3
  files:
    - path: rosbag2_2023_07_23-12_55_55_0.db3
      starting_time:
        nanoseconds_since_epoch: 9223372036854775807
      duration:
        nanoseconds: 0
      message_count: 0

Is there a way to initialize rosbag2 in advance or how can I solve this problem?

Falimonda commented 1 year ago

The request to capture immediately upon execution sounds a bit unreasonable - how many milliseconds pass between execution and the first message being logged?

Can you provide details about the missing messages? Do messages on a given topic partially begin to get logged, or do they abruptly but consistently begin logging at a point in time?

Is the behavior consistent across all topics? Do some topics begin logging prior to others, and how does this compare to the initial millisecond value?

gtep96 commented 1 year ago

Can you provide details about the missing messages? Do messages on a given topic partially begin to get logged, or do they abruptly but consistently begin logging at a point in time?

It seems like rosbag2 connects to topics in random order, and while it is starting, some topics can lose anywhere from 1 message up to 50%+ of their first messages. I performed many tests, and the only way I found to prevent this behavior is to add a delay (e.g. 1 sec) after the first message on each topic. In this case the first message is sometimes lost, but all the others are recorded correctly.

MichaelOrlov commented 1 year ago

@gtep96 To avoid losing messages at the beginning due to the non-determinism of discovery, the expected workflow would be:

  1. Create publishers for the topics that are expected to be recorded.
  2. Start the rosbag2 recorder with the explicitly specified list of topics that are expected to be recorded.
  3. Wait on the publisher's side until the number of subscriptions on each topic increases by 1, i.e. until you have confirmation that rosbag2 has successfully subscribed to the specified topics.
  4. Start publishing on the topics. This is how we test rosbag2 in our unit tests.
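Step 3 can be sketched as a small handshake: poll the topic's subscription count until the recorder shows up, then start publishing. In a real node the count would come from rclpy's Publisher.get_subscription_count(); the FakePublisher class below is an illustrative stand-in so the pattern runs without ROS installed.

```python
import time

class FakePublisher:
    """Stand-in for rclpy's Publisher; only models get_subscription_count().
    In a real node, rclpy.publisher.Publisher provides this method."""
    def __init__(self):
        self._t0 = time.monotonic()

    def get_subscription_count(self) -> int:
        # Pretend the recorder's subscription appears after ~0.2 s of discovery.
        return 1 if time.monotonic() - self._t0 > 0.2 else 0

def wait_for_recorder(pub, expected_count=1, timeout_s=5.0, poll_s=0.05) -> bool:
    """Block until the topic has at least `expected_count` subscribers
    (i.e. rosbag2 has subscribed), or until the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if pub.get_subscription_count() >= expected_count:
            return True
        time.sleep(poll_s)
    return False

pub = FakePublisher()
assert wait_for_recorder(pub), "recorder never subscribed"
# Only now start publishing at full rate; nothing is lost to discovery.
```

The same loop works per topic: gate each publisher's timer on its own subscription count before enabling it.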

Another alternative would be to use Durability: transient_local with History: keep_last QoS settings on the publishers' side and increase the DDS queue depth, to be able to catch up on missed messages if discovery did not happen in time.
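On the recorder side, a related knob is rosbag2's --qos-profile-overrides-path option, which takes a YAML file mapping topic names to QoS policies, so the recorder's subscriptions can also request transient_local durability and a deeper queue. A minimal sketch, assuming a hypothetical topic /sensor_data and a depth sized for ~100 ms of backlog at 1000 msg/s:

```yaml
# Hypothetical topic name; depth and policies are illustrative.
/sensor_data:
  history: keep_last
  depth: 100
  reliability: reliable
  durability: transient_local
```

This only helps in combination with matching publisher-side settings: a transient_local subscription can only catch up on samples that a transient_local publisher has kept.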

gtep96 commented 1 year ago

@gtep96 To avoid losing messages at the beginning due to the non-determinism of discovery, the expected workflow would be:

  1. Create publishers for the topics that are expected to be recorded.
  2. Start the rosbag2 recorder with the explicitly specified list of topics that are expected to be recorded.
  3. Wait on the publisher's side until the number of subscriptions on each topic increases by 1, i.e. until you have confirmation that rosbag2 has successfully subscribed to the specified topics.
  4. Start publishing on the topics. This is how we test rosbag2 in our unit tests.

OK, the first solution works, thank you very much. But what if the topics are created dynamically during node execution, with dynamic names, and I cannot specify in advance which of them I want to record? Or, in another case, what if some of the topics are only defined depending on runtime conditions?

Another alternative would be to use Durability: transient local with 'History: keep last QoS settings on the publishers' side and increase DDS queue Depth to be able to catch up on missed topics if the discovery did not happen in time.

Unfortunately, this didn't help.

MichaelOrlov commented 1 year ago

@gtep96 As regards the second option. May be publishing rate is too fast and DDS queue Depth is not enough.

MichaelOrlov commented 9 months ago