ros2 / rosbag2

Apache License 2.0
272 stars 245 forks source link

More consistent playback/processing of bags when also recording #1243

Open antrose99 opened 1 year ago

antrose99 commented 1 year ago

Hi!

I have observed a very weird behavior where doing recording improves the performance when running my ros2 pipeline on recorded data.

I have a pipeline where i get LiDAR pointclouds at a rate of 10hz, which is processed by a ros2 node. I also have a pipeline that takes in various inputs at a rate of 100hz and estimates velocity from that. The outputs from these nodes are published at their respective frequency. A third node then subscribes to both of these and via callbacks get their respective outputs, with the goal of this third node to ultimately collect all estimations between each LiDAR outputs and do further processing. See flowchart flow

Now, you would assume that since the pointclouds are captured and processed at a rate of 10hz, and the velocity estimations at a rate of 100hz that you would consistently get ~10 calls to the estimation callback before the lidar callback is called. This behavior is what we see when running the system live, however, when running the system on played back bags this gets very inconsistent, with it sometimes collecting twice or even three times the number of expected estimation calls before the lidar callback (and then 0 for the next one).

The real confusing thing is that if you play back the bag whilst also recording then this does not happen. See screenshots below mcap

We are using the mcap file format for both playback and recording. This behavior appears both when running the pipeline in docker containers and locally, we are also using CycloneDDS.

Any thought or explanation to why recording is changing the behavior would be very much appreciated.

Thanks!

MichaelOrlov commented 1 year ago

Hi @antrose99 There are too many moving things in your configuration and it is almost impossible to correctly guess what is going on behind the scene.
It could be some concurrency issue in velocity estimation node and due to the extra context switches or different cache utilization when recording running in parallel or some issue related to the executor.

To isolate issue I would recommend to introspect your recording first to see if interval between recorded messages (i.e differences between message's timestamps from the same topic) are consistent and like you would expect.
Then try to measure the same interval between messages on receiver side during playback.

There are not too much magic on rosbag2 player side. Here is the place where we are taking next message from pre-buffered storage queue, waiting for its time and publishing those message: https://github.com/ros2/rosbag2/blob/8044dba28aa7f99247ccf29861f349969be07117/rosbag2_transport/src/rosbag2_transport/player.cpp#L570-L592 And all those operations happening in one thread sequentially. All extra conditions with skip_message_in_main_play_loop_ and cancel_wait_for_next_message_ works when we need to forcefully interrupt playback. This is happening in case of playback pause/resume or after calling play_next() API in pause mode.

a-krawciw commented 1 year ago

@antrose99 We have observed similar, confusing behaviour with Lidar rates, although we see the same stabilizing effect from a ros2 topic echo, so I don't think it's necessarily related to rosbag2 itself, although I'm not sure.

Did you ever resolve these issues? We would be interested to know if you were able: feel free to connect outside of this issue.

MichaelOrlov commented 1 year ago

I have a guess that it might be related to some sub-optimal implementation of the SingleThreadedExecutor which is used in rosbag2 player to run node and process publishers requests. Hopefully situation will be improved in Iron Irwin release. We are expecting some performance improvements to be made in SingleThreadedExecutor soon. cc: @clalancette @mjcarroll