ros2 / rosbag2


Cache buffer dropping significant messages while splitting the bag files. #1579

Closed: yashmewada9618 closed this issue 4 months ago

yashmewada9618 commented 6 months ago

System: Intel NUC with 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
OS: Ubuntu 22.04
GPU: NVIDIA GeForce RTX 2060 / 115W / 6GB (although not used for this case)
ROS: Humble
RAM: 64GB
Install method: Deb package

Hello,

Description

ros2 bag record is dropping a significant number of messages while recording ~35 topics. It loses a combined total of around 15k messages:

[ros2-10]   /camera/camera/color/metadata: 329
[ros2-10]   /ouster/range_image: 219
[ros2-10]   /camera/camera/gyro/metadata: 2184
[ros2-10]   /ouster/nearir_image: 219
[ros2-10]   /ouster/reflec_image: 219
[ros2-10]   /ouster/signal_image: 219
[ros2-10]   /camera/camera/gyro/sample: 2184
[ros2-10]   /camera/camera/accel/metadata: 1110
[ros2-10]   /camera/camera/imu: 2184
[ros2-10]   /ouster/scan: 219
[ros2-10]   /camera/camera/color/image_raw: 329
[ros2-10]   /ouster/points: 727
[ros2-10]   /camera/camera/aligned_depth_to_color/image_raw: 327
[ros2-10]   /camera/camera/aligned_depth_to_color/camera_info: 327
[ros2-10]   /camera/camera/depth/image_rect_raw: 328
[ros2-10]   /camera/camera/depth/color/points: 327
[ros2-10]   /ouster/imu: 1935
[ros2-10]   /camera/camera/depth/camera_info: 328
[ros2-10]   /camera/camera/color/camera_info: 329
[ros2-10]   /camera/camera/accel/sample: 1110
[ros2-10]   /camera/camera/depth/metadata: 328
[ros2-10] Total lost: 15481

ros2 bag info <bag_file> tells me I lost around 800 LiDAR point cloud messages and 4k IMU messages over roughly 95 seconds of bag duration.

Details

So I have a small AMR with an Intel RealSense D455i and an Ouster OS0-32U LiDAR. Both sensors have IMUs, and I am recording those too, along with other lightweight topics like robot_description. The incoming sensor data looks like this:

/camera/camera/[images/depth_cloud]: 30hz
/ouster/points: 20hz (1024 points each channel with 32 channels)
/camera/camera/imu: 200hz
/ouster/[nearir_image/reflec_image/signal_image]: 20hz
/ouster/imu: 100hz

# Other topics are also present, e.g. the aligned depth from the RealSense.
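
For reference, these publish rates can be spot-checked with the ros2 CLI; a minimal sketch using the topic names above:

ros2 topic hz /ouster/points       # measured publish rate of the LiDAR cloud
ros2 topic hz /camera/camera/imu   # measured publish rate of the unified camera IMU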

My sensor launch configuration looks something like this:

Ouster Lidar

    ouster_launch = IncludeLaunchDescription(
        XMLLaunchDescriptionSource(
            [
                PathJoinSubstitution(
                    [FindPackageShare("ouster_ros"), "launch", "sensor.launch.xml"]
                )
            ]
        ),
        launch_arguments={
            "viz": "false",
            "sensor_hostname": "os-122212000733.local",  # Replace with the actual sensor hostname
            "lidar_mode": "1024x20",
            "udp_profile_lidar": "RNG19_RFL8_SIG16_NIR16",
            "metadata": "/home/scout/autonomy-stack/src/os-122212000733-metadata.json",  # Replace with relative path to metadata file
            "timestamp_mode": "TIME_FROM_ROS_TIME",
        }.items(),
    )

Realsense Launch

    realsense_launch = IncludeLaunchDescription(
        PythonLaunchDescriptionSource(
            [
                PathJoinSubstitution(
                    [FindPackageShare("realsense2_camera"), "launch", "rs_launch.py"]
                )
            ]
        ),
        launch_arguments={
            "initial_reset": "true",
            "pointcloud.enable": "true",
            "enable_infra1": "false",
            "enable_infra2": "false",
            "enable_sync": "true",
            "align_depth.enable": "true",
            "enable_gyro": "true",
            "enable_accel": "true",
            "gyro_fps": "200",
            "accel_fps": "200",
            "unite_imu_method": "2",
        }.items(),
    )

Now to the actual problem: this amount of data is huge, around 32 GB for 5 minutes of recording. To keep the CPU load and memory consumption low, I am splitting the bag files every 3.2 GB or 1800 seconds, whichever occurs first (the size limit is always hit first). With this configuration the CPU load is minimal and RAM consumption is around 4% of total RAM.

I know that working with a remote setup can be annoying because of the DDS settings in ROS 2, but after setting up Cyclone DDS and changing the buffer parameters I was able to receive all the data with almost zero latency and no drops, as long as rosbag record is not running.

But things break down when I start ros2 bag record with the command below:

# No compression of bag file, just splitting.
    rosbag_record = ExecuteProcess(
        cmd=[
            "ros2",
            "bag",
            "record",
            "-o",
            directory,
            "--qos-profile-overrides-path",
            Qos_override,
            "--max-cache-size",
            "1000000000",  # 1 GB
            "--max-bag-size",
            "3221225472",  # 3Gb of raw data
            "--max-bag-duration",
            "1800",  # seconds
            "-s",
            "sqlite3",
            "-a",
        ],
        shell=True,
    )

I am not compressing the bag files, as compression was not helpful and it still dropped messages, even more than in the log above. The reported bandwidth is around 192 MB/s for /camera/camera/depth/color/points and 32 MB/s for /ouster/points (averaged over 100 messages).
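
The bandwidth figures above look like ros2 topic bw output; for reference, a minimal sketch (the averaging window defaults to 100 messages):

ros2 topic bw /camera/camera/depth/color/points   # RealSense point cloud bandwidth
ros2 topic bw /ouster/points                      # Ouster point cloud bandwidth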

I have tried many different suggestions, from the issues on the rosbag2 GitHub page to the Ouster driver GitHub, and I still cannot solve this issue. Furthermore, I need the bag splitting for testing purposes, otherwise the bag size grows huge.

It looks like ros2 bag is not able to record this much data properly, or am I missing something here? Has anyone faced this kind of issue?

Thanks.

MichaelOrlov commented 6 months ago

Hi @yashmewada9618, if you are experiencing message drops from the rosbag2 cache with the SQLite3 storage plugin during bag splits, this is a known issue. I fixed it recently in https://github.com/ros2/rosbag2/pull/1516, but the fix is currently only available for Rolling.

For Humble, I can recommend using the MCAP storage format. It does not have this issue with messages being dropped during bag splits, and it is known to be more performant.
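
As a minimal sketch of recording with MCAP on Humble (assuming the plugin is installed from the ros-humble-rosbag2-storage-mcap Debian package):

sudo apt install ros-humble-rosbag2-storage-mcap   # MCAP storage plugin for rosbag2
ros2 bag record -s mcap -a -o my_bag               # mcap backend instead of sqlite3; my_bag is a placeholder output name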

In general, to avoid message drops you need to make sure that the CPU cores running the rosbag2 recorder are not occupied more than 90% and that the hard drive can keep up with the data throughput. Hard drive speed should be tested with block sizes corresponding to the message sizes and with the total throughput in mind.
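
As a rough sketch of such a disk test, dd can be pointed at the drive that holds the bag, with a block size in the ballpark of a single point cloud message (the 4 MiB block size and the path below are only example assumptions):

dd if=/dev/zero of=/path/to/bag/drive/dd_test bs=4M count=512 oflag=direct conv=fsync   # write 2 GiB, bypassing the page cache
rm /path/to/bag/drive/dd_test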

yashmewada9618 commented 6 months ago

Hi @MichaelOrlov, thank you for getting back to me. I tried what you suggested and the results are the same, except that I no longer see the dropped-cache log message; the messages still appear to be dropped when I run ros2 bag info <bag_file>.

I also checked the CPU usage: CPU consumption is below 50% and RAM consumption is also minimal.

Results from ros2 bag info ...

Files:             Sun_Mar-03-2024_12-40-43_PM_0.mcap
                   Sun_Mar-03-2024_12-40-43_PM_1.mcap
                   Sun_Mar-03-2024_12-40-43_PM_2.mcap
                   Sun_Mar-03-2024_12-40-43_PM_3.mcap
                   Sun_Mar-03-2024_12-40-43_PM_4.mcap
                   Sun_Mar-03-2024_12-40-43_PM_5.mcap
                   Sun_Mar-03-2024_12-40-43_PM_6.mcap
Bag size:          22.3 GiB
Storage id:        mcap
Duration:          83.49s
Start:             Mar  3 2024 12:40:48.803 (1709487648.803)
End:               Mar  3 2024 12:42:11.853 (1709487731.853)
Messages:          79567
Topic information: Topic: /camera/camera/accel/imu_info | Type: realsense2_camera_msgs/msg/IMUInfo | Count: 0 | Serialization Format: cdr
                   Topic: /camera/camera/accel/metadata | Type: realsense2_camera_msgs/msg/Metadata | Count: 6402 | Serialization Format: cdr
                   Topic: /camera/camera/accel/sample | Type: sensor_msgs/msg/Imu | Count: 6402 | Serialization Format: cdr
                   Topic: /camera/camera/aligned_depth_to_color/camera_info | Type: sensor_msgs/msg/CameraInfo | Count: 1789 | Serialization Format: cdr
                   Topic: /camera/camera/aligned_depth_to_color/image_raw | Type: sensor_msgs/msg/Image | Count: 1789 | Serialization Format: cdr
                   Topic: /camera/camera/color/camera_info | Type: sensor_msgs/msg/CameraInfo | Count: 1877 | Serialization Format: cdr
                   Topic: /camera/camera/color/image_raw | Type: sensor_msgs/msg/Image | Count: 1877 | Serialization Format: cdr
                   Topic: /camera/camera/color/metadata | Type: realsense2_camera_msgs/msg/Metadata | Count: 1876 | Serialization Format: cdr
                   Topic: /camera/camera/depth/camera_info | Type: sensor_msgs/msg/CameraInfo | Count: 1984 | Serialization Format: cdr
                   Topic: /camera/camera/depth/color/points | Type: sensor_msgs/msg/PointCloud2 | Count: 1789 | Serialization Format: cdr
                   Topic: /camera/camera/depth/image_rect_raw | Type: sensor_msgs/msg/Image | Count: 1984 | Serialization Format: cdr
                   Topic: /camera/camera/depth/metadata | Type: realsense2_camera_msgs/msg/Metadata | Count: 1984 | Serialization Format: cdr
                   Topic: /camera/camera/extrinsics/depth_to_accel | Type: realsense2_camera_msgs/msg/Extrinsics | Count: 1 | Serialization Format: cdr
                   Topic: /camera/camera/extrinsics/depth_to_color | Type: realsense2_camera_msgs/msg/Extrinsics | Count: 2 | Serialization Format: cdr
                   Topic: /camera/camera/extrinsics/depth_to_gyro | Type: realsense2_camera_msgs/msg/Extrinsics | Count: 1 | Serialization Format: cdr
                   Topic: /camera/camera/gyro/imu_info | Type: realsense2_camera_msgs/msg/IMUInfo | Count: 0 | Serialization Format: cdr
                   Topic: /camera/camera/gyro/metadata | Type: realsense2_camera_msgs/msg/Metadata | Count: 12578 | Serialization Format: cdr
                   Topic: /camera/camera/gyro/sample | Type: sensor_msgs/msg/Imu | Count: 12578 | Serialization Format: cdr
                   Topic: /camera/camera/imu | Type: sensor_msgs/msg/Imu | Count: 12578 | Serialization Format: cdr
                   Topic: /events/write_split | Type: rosbag2_interfaces/msg/WriteSplitEvent | Count: 6 | Serialization Format: cdr
                   Topic: /ouster/imu | Type: sensor_msgs/msg/Imu | Count: 5385 | Serialization Format: cdr
                   Topic: /ouster/metadata | Type: std_msgs/msg/String | Count: 1 | Serialization Format: cdr
                   Topic: /ouster/nearir_image | Type: sensor_msgs/msg/Image | Count: 1076 | Serialization Format: cdr
                   Topic: /ouster/os_driver/transition_event | Type: lifecycle_msgs/msg/TransitionEvent | Count: 3 | Serialization Format: cdr
                   Topic: /ouster/points | Type: sensor_msgs/msg/PointCloud2 | Count: 1076 | Serialization Format: cdr
                   Topic: /ouster/range_image | Type: sensor_msgs/msg/Image | Count: 1076 | Serialization Format: cdr
                   Topic: /ouster/reflec_image | Type: sensor_msgs/msg/Image | Count: 1076 | Serialization Format: cdr
                   Topic: /ouster/scan | Type: sensor_msgs/msg/LaserScan | Count: 1076 | Serialization Format: cdr
                   Topic: /ouster/signal_image | Type: sensor_msgs/msg/Image | Count: 1076 | Serialization Format: cdr
                   Topic: /parameter_events | Type: rcl_interfaces/msg/ParameterEvent | Count: 139 | Serialization Format: cdr
                   Topic: /robot_description | Type: std_msgs/msg/String | Count: 1 | Serialization Format: cdr
                   Topic: /rosout | Type: rcl_interfaces/msg/Log | Count: 78 | Serialization Format: cdr
                   Topic: /tf | Type: tf2_msgs/msg/TFMessage | Count: 0 | Serialization Format: cdr
                   Topic: /tf_static | Type: tf2_msgs/msg/TFMessage | Count: 7 | Serialization Format: cdr

The 3D point clouds from the LiDAR and the RealSense are short by roughly 600 and 700 messages respectively (for example, /ouster/points at 20 Hz over the 83.5 s duration should be about 1670 messages, but only 1076 were recorded), and the LiDAR IMU is missing almost 3k messages (about 8350 expected at 100 Hz vs. 5385 recorded).

yashmewada9618 commented 6 months ago

One thing I noticed in these bag files (and it does happen occasionally) is that the split size is set to 3.2 GB, but the 6th bag file here contained 4.2 GB of data before the recorder moved on to the 7th file.

fujitatomoya commented 6 months ago

after setting up Cyclone DDS and changing the buffer parameters I was able to receive all the data with almost zero latency and no drops, as long as rosbag record is not running.

@yashmewada9618 Do you set the same QoS settings for rosbag2?

More specifically, are the QoS settings passed via the override below the same as in the "almost zero latency and no drops" case?

            "--qos-profile-overrides-path",
            Qos_override,
yashmewada9618 commented 6 months ago

Yes, I am using my sensors' QoS in the QoS override file. My file looks something like this:

# QOS override for rosbag record

/ouster/points:
  history: keep_last
  depth: 5
  reliability: best_effort
  durability: volatile
  deadline:
    # unspecified = inf
    sec: 0
    nsec: 0
  lifespan:
    # unspecified = inf
    sec: 0
    nsec: 0  
  liveliness: automatic
  liveliness_lease_duration:
    # unspecified = inf
    sec: 0
    nsec: 0
  avoid_ros_namespace_conventions: false

/ouster/imu:
  history: keep_last
  depth: 5
  reliability: best_effort
  durability: volatile
  deadline:
    # unspecified = inf
    sec: 0
    nsec: 0
  lifespan:
    # unspecified = inf
    sec: 0
    nsec: 0  
  liveliness: automatic
  liveliness_lease_duration:
    # unspecified = inf
    sec: 0
    nsec: 0
  avoid_ros_namespace_conventions: false

/camera/imu:
  history: keep_last
  depth: 5
  reliability: reliable
  durability: volatile
  deadline:
    # unspecified = inf
    sec: 0
    nsec: 0
  lifespan:
    # unspecified = inf
    sec: 0
    nsec: 0  
  liveliness: automatic
  liveliness_lease_duration:
    # unspecified = inf
    sec: 0
    nsec: 0
  avoid_ros_namespace_conventions: false

/camera/depth/color/points:
  history: keep_last
  depth: 10
  reliability: reliable
  durability: volatile
  deadline:
    # unspecified = inf
    sec: 0
    nsec: 0
  lifespan:
    # unspecified = inf
    sec: 0
    nsec: 0  
  liveliness: automatic
  liveliness_lease_duration:
    # unspecified = inf
    sec: 0
    nsec: 0
  avoid_ros_namespace_conventions: false

And the sensor QoS for the Ouster cloud and the RealSense cloud is as follows:

Type: sensor_msgs/msg/PointCloud2

Publisher count: 1

Node name: os_driver
Node namespace: /ouster
Topic type: sensor_msgs/msg/PointCloud2
Endpoint type: PUBLISHER
GID: 01.10.e2.d9.fd.2b.76.71.fb.c4.b3.50.00.00.2d.03.00.00.00.00.00.00.00.00
QoS profile:
  Reliability: BEST_EFFORT
  History (Depth): KEEP_LAST (5)
  Durability: VOLATILE
  Lifespan: Infinite
  Deadline: Infinite
  Liveliness: AUTOMATIC
  Liveliness lease duration: Infinite

Subscription count: 0

RS_QoS

Type: sensor_msgs/msg/PointCloud2

Publisher count: 1

Node name: camera
Node namespace: /camera
Topic type: sensor_msgs/msg/PointCloud2
Endpoint type: PUBLISHER
GID: 01.10.a3.ed.28.0f.f4.42.0a.7e.85.d7.00.00.15.03.00.00.00.00.00.00.00.00
QoS profile:
  Reliability: RELIABLE
  History (Depth): KEEP_LAST (10)
  Durability: VOLATILE
  Lifespan: Infinite
  Deadline: Infinite
  Liveliness: AUTOMATIC
  Liveliness lease duration: Infinite

Subscription count: 0

Yes, the zero-latency case was checked with the ros2 CLI (echo, hz, bw) and RViz, and I am not using any QoS settings in those commands.
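
For reference, per-publisher QoS dumps like the ones above can be obtained with the ros2 CLI as well, e.g.:

ros2 topic info --verbose /ouster/points   # prints each endpoint's QoS profile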

MichaelOrlov commented 6 months ago

@yashmewada9618 If you no longer see messages like the following after stopping the recording:

Cache buffers lost messages per topic:
[ros2-10]   /camera/camera/color/metadata: 329
[ros2-10]   /ouster/range_image: 219
[ros2-10]   /camera/camera/gyro/metadata: 2184
[ros2-10]   /ouster/nearir_image: 219
[ros2-10]   /ouster/reflec_image: 219
[ros2-10]   /ouster/signal_image: 219
[ros2-10]   /camera/camera/gyro/sample: 2184
[ros2-10]   /camera/camera/accel/metadata: 1110
[ros2-10]   /camera/camera/imu: 2184
[ros2-10]   /ouster/scan: 219
[ros2-10]   /camera/camera/color/image_raw: 329
[ros2-10]   /ouster/points: 727
[ros2-10]   /camera/camera/aligned_depth_to_color/image_raw: 327
[ros2-10]   /camera/camera/aligned_depth_to_color/camera_info: 327
[ros2-10]   /camera/camera/depth/image_rect_raw: 328
[ros2-10]   /camera/camera/depth/color/points: 327
[ros2-10]   /ouster/imu: 1935
[ros2-10]   /camera/camera/depth/camera_info: 328
[ros2-10]   /camera/camera/color/camera_info: 329
[ros2-10]   /camera/camera/accel/sample: 1110
[ros2-10]   /camera/camera/depth/metadata: 328
[ros2-10] Total lost: 15481

then it means that you are losing messages not in the rosbag2 recorder but rather in the transport layer or on the publisher side. There is one exception to this statement.

yashmewada9618 commented 6 months ago

I am not seeing any message drops or lag when I am not recording these topics. Even while recording the bag files, these drivers seem to work flawlessly. I was able to get rid of the total-lost-messages log by switching from sqlite3 to mcap, but when I play the recorded bag file or run ros2 bag info ..., the duration and the number of messages don't make sense.

Regarding the transport layer, I am using Cyclone DDS, which was recommended here, and my profile.xml looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<CycloneDDS xmlns="https://cdds.io/config" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://cdds.io/config
https://raw.githubusercontent.com/eclipse-cyclonedds/cyclonedds/master/etc/cyclonedds.xsd">
    <Domain id="any">
        <Internal>
            <SocketReceiveBufferSize min="200MB" />
        </Internal>
        <General>
            <Interfaces>
                <NetworkInterface name="lo" />
            </Interfaces>
            <AllowMulticast>
                false
            </AllowMulticast>
        </General>
        <Discovery>
            <ParticipantIndex>
                auto
            </ParticipantIndex>
            <Peers>
                <Peer Address="localhost" />
            </Peers>
            <MaxAutoParticipantIndex>
                120
            </MaxAutoParticipantIndex>
        </Discovery>
    </Domain>
</CycloneDDS>

I tried a FastRTPS profile, but it seemed to drop messages because the profile I wrote did not do what I wanted, namely unicast (only deliver messages to localhost and not to any other networked PC) with a bigger socket buffer size.
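
Independent of the DDS vendor, a large socket receive buffer only takes effect if the kernel limits allow it; a minimal check, with example values:

sysctl net.core.rmem_max net.core.rmem_default   # current kernel ceiling and default for receive buffers
sudo sysctl -w net.core.rmem_max=268435456       # example: allow a ~256 MB receive buffer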

Hommus commented 5 months ago

I'm running into a similar issue (Cache buffers lost messages per topic) and I would appreciate guidance on how to debug it. As far as I can tell, the CPU and memory are underutilised (system specs provided at the bottom of this comment).

So far, I've run the experiments below.

Experiments

Topics are published at 10Hz. Each experiment is a 1 minute rosbag2 recording. Experiment results are averaged over 5 runs.

# CycloneDDS + allocated cpus
$ ros2 bag record /os_top/points /os_dome/points /tf /tf_static
...
[rosbag2_cpp]: Cache buffers lost messages per topic: 
    /os_dome/points: 200
    /os_top/points: 199
[run 1] Total lost: 399
[run 2] Total lost: 196
[run 3] Total lost: 368
[run 4] Total lost: 369
[run 5] Total lost: 435
[ avg ] Total lost: 353 (29.42% of 1200 expected msgs)
# Enabled mcap compression and QoS profile
$ ros2 bag record --storage-config-file mcap_config.yaml --qos-profile-overrides-path=qos.yaml /os_top/points /os_dome/points /tf /tf_static
[run 1] Total lost: 436
[run 2] Total lost: 443
[run 3] Total lost: 328
[run 4] Total lost: 82
[run 5] Total lost: 519
[ avg ] Total lost: 362 (30.17% of 1200 expected msgs)
# Split every 60 seconds
$ ros2 bag record --storage-config-file mcap_config.yaml --qos-profile-overrides-path=qos.yaml -d 60 /os_top/points /os_dome/points /tf /tf_static
[1st min] Total lost: 406
[2nd min] Total lost: 697
[3rd min] Total lost: 1060
[4th min] Total lost: 1391
[5th min] Total lost: 1485
24.75% of 6000 expected msgs

System

System: Jetson Orin AGX
ROS: Iron
RAM: 64GB
Write speed: 1.1 GB/s

$ dd if=/dev/zero of=/tmp/output bs=5MB count=200; rm -f /tmp/output
200+0 records in
200+0 records out
1000000000 bytes (1.0 GB, 954 MiB) copied, 0.90935 s, 1.1 GB/s
yashmewada9618 commented 5 months ago

Hi @Hommus, I was able to reduce the number of dropped messages to almost 1/2% by disabling compression. The reason, I suspect, is that splitting, compressing and then writing the data to disk gets backed up when there is this much incoming data (in my case, 3D points from the RealSense and the LiDAR).

Also, if I stop splitting the bag files, or reduce the split threshold to a more reasonable value, the dropped messages stay at around 1/2%.

But when I record the RealSense data together with the Ouster LiDAR data, the number of dropped messages is still significant.

Hommus commented 5 months ago

@yashmewada9618 The problem on my side turned out to be much simpler. The hardware I'm using has two storage devices, and the mcap files were being written to the wrong one (i.e. the slow one).

The fast storage (NVMe) was mounted at /mnt/ssd, but a Docker volume was overriding one of its subdirectories, /mnt/ssd/logging, with a path from the slow storage (eMMC) file system.
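
A quick way to catch this kind of mixup is to check which device actually backs the recording directory; a sketch using the path above:

findmnt --target /mnt/ssd/logging   # shows the source device and filesystem for this path
df -h /mnt/ssd/logging              # shows the mounted device and free space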

It's worth noting that with this problem fixed, I get 0 dropped messages, even with mcap compression and splitting enabled. I removed the QoS settings on the subscriber side, and the Ouster lidars are set to the default (reliable).

I also experimented with the CycloneDDS and sysctl settings, but I'm not sure whether they helped:

<?xml version="1.0" encoding="UTF-8" ?>
<CycloneDDS xmlns="https://cdds.io/config" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://cdds.io/config
https://raw.githubusercontent.com/eclipse-cyclonedds/cyclonedds/master/etc/cyclonedds.xsd">
    <Domain id="any">
        <General>
            <AllowMulticast>spdp</AllowMulticast>
            <MaxMessageSize>65500B</MaxMessageSize>
        </General>
        <Internal>
            <SocketReceiveBufferSize min="20MB"/>
            <Watermarks>
                <WhcHigh>500kB</WhcHigh>
            </Watermarks>
        </Internal>
        <Discovery>
            <ParticipantIndex>auto</ParticipantIndex>
            <MaxAutoParticipantIndex>100</MaxAutoParticipantIndex>
        </Discovery>
    </Domain>
</CycloneDDS>
# Added the following to the bottom of /etc/sysctl.conf
net.core.wmem_max=67108864
net.core.wmem_default=67108864
net.core.rmem_max=67108864
net.core.rmem_default=67108864
net.ipv4.ipfrag_time=3
net.ipv4.ipfrag_high_thresh=134217728
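# A minimal sketch of applying and verifying the settings above without a reboot:
sudo sysctl -p                               # reload /etc/sysctl.conf
sysctl net.core.rmem_max net.core.wmem_max   # confirm the new limits are in effect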
Hommus commented 5 months ago

@MichaelOrlov It does make me wonder whether rosbag2 could provide a more explicit warning for this case:

Cache buffers lost messages per topic (due to insufficient write speed of drive)

MichaelOrlov commented 5 months ago

@MichaelOrlov It does make me wonder whether rosbag2 could provide a more explicit warning for this case:

Cache buffers lost messages per topic (due to insufficient write speed of drive)

@Hommus The "insufficient write speed of drive" is one of the many possible reasons.

BTW, we have the rosbag2_performance_benchmarking package, which is intended to help when you encounter performance issues or message drops in complicated scenarios. TL;DR: it allows you to simulate complicated scenarios with messages coming from multiple nodes, with different message sizes and throughputs. For instance, you can refer to the https://github.com/ros2/rosbag2/blob/rolling/rosbag2_performance/rosbag2_performance_benchmarking/config/producers/automotive.yaml config.

MichaelOrlov commented 5 months ago

@yashmewada9618 Can we close this issue? It seems there is nothing wrong with rosbag2 itself.

MichaelOrlov commented 4 months ago