High latency / Lost messages: Pub/Sub 10B at high pub frequency.

mauropasse commented 3 weeks ago

Hi! I've running some ROS2 benchmarks on rmw_zenoh (and other rmw's) using the iRobot ros2-performance framework, this time on a Raspberry Pi 4B and ROS2 Iron. The system tested consists on 2 processes, one for the publisher, one for the subscription.

- Pub/Sub - 10B - 2KHz
- Pub/Sub - 100KB - 250Hz
- Pub/Sub - 1MB - 50Hz
- Pub/Sub - 4MB - 10Hz

I noticed a spike on latency for the pub/sub - 10B - 2KHz scenario. I've been able to reproduce these results using also lower pub frequencies, and is noticeable also on x86.

From x86 benchmark output logs I got: On RPi4B, inspecting the events.txt I see every message being late, like if it has lost sync and a msg is read when a new one arrives:

Time_ms     Caller                   Code  Description         
1474        test->sub_node           1     msg 0 late. 1699us > 400us
1492        test->sub_node           1     msg 9 late. 1691us > 400us
1492        test->sub_node           1     msg 10 late. 491us > 400us
1495        test->sub_node           1     msg 11 late. 519us > 400us
1496        test->sub_node           1     msg 12 late. 480us > 400us
1499        test->sub_node           1     msg 13 late. 528us > 400us
1500        test->sub_node           1     msg 14 late. 421us > 400us
1502        test->sub_node           1     msg 15 late. 440us > 400us
1504        test->sub_node           1     msg 16 late. 451us > 400us
1506        test->sub_node           1     msg 17 late. 458us > 400us

So it looks like something starts failing when pub frequency gets high enough.

Something strange is that I run also a single-process test, with rclcpp intra-process disabled (pub/sub in same process) and the issue goes away, so it seems the issue is with multi-process applications.

mjcarroll commented 3 weeks ago

Something strange is that I run also a single-process test, with rclcpp intra-process disabled (pub/sub in same process) and the issue goes away, so it seems the issue is with multi-process applications.

So just to be clear:

pub/sub in same process, intraprocess on is fine (expected)
pub/sub in same process, intraprocess off is fine (reasonably expected)
pub/sub in different processes is not fine (unexpected)

kydos commented 3 weeks ago

How are you measuring the latency? If you do one way measurements, then for small data packets the behaviour you are seeing is probably due to batching (at Zenoh and TCP/IP level). If you want to skip batching please leverage the ultra-low latency transport (see https://zenoh.io/blog/2023-10-03-zenoh-dragonite/#support-for-ultra-low-latency).

Don't hesitate to reach us out on Zenoh's Discord server if you have any questions on configuration.

We'll try to reproduce on our side to see if we get a similar behaviour. On hour side it would be good to see if you have a similar behaviour running the latency test shipped with Zenoh (See https://github.com/eclipse-zenoh/zenoh/tree/main/examples/examples)

-- kydos

mauropasse commented 3 weeks ago

Hi, thanks for the suggestions. Before running my ROS2 app, I start zenoh with ./rmw_zenohd. Does this executable take config files as input? Like: ./rmw_zenohd -c lowlatency.json5, or there is other way? I'm asking because I don't get any output from ./rmw_zenohd --help. I'll try to get data from the latency test you mention.

How are you measuring the latency?

The iRobot ros2-performance benchmark shows in console and generates files with info about resources, latency, publish duration, CPU, memory, etc. We write the timestamp to the message, so when the subscription receives the message, it can obtain the latency.

Yadunund commented 3 weeks ago

Does this executable take config files as input? Like: ./rmw_zenohd -c lowlatency.json5, or there is other way?

You can find information in our README here on how to configure the session/router with non-default config files.

kydos commented 3 weeks ago

Hello @mauropasse, the lowlatency.json5 configuration should be passed to the ROS applications. It is not essential for the router as in the default configuration it is used only for off the robot communication. This can be done by means of session configuration as explained on the "Session and Router Configs" section of the Zenoh RMW README. Let me know if this does the trick.

imstevenpmwork commented 3 weeks ago

Hello @mauropasse and @mjcarroll, Would you mind explain us the configuration of the test you ran so we can replicate it on our side? Ideally the topology configuration file :D

mauropasse commented 3 weeks ago

I have tried using the configuration file to enable low latency mode, and it seems to have fixed the issue for the 10B message.

I have used the DEFAULT_RMW_ZENOH_SESSION_CONFIG.json5 but setting lowlatency:true and qos:disabled.

But for some reason, it doesn't work for bigger messages (100KB, 1MB, etc): The subscription doesn't get any messages, like if the nodes were not being discovered. Besides this problem, the publisher process seems to hang indefinitely.

I still see the router messages from both processes, [rmw_zenoh_cpp]: Successfully connected to a Zenoh router with id.

I've experienced this behavior both on x86 and in RPi4.

@imstevenpmwork the test topology is just a subscription to a 100KB message in one process, and a publisher at 250Hz in the other process. Let me know if you need more details.

imstevenpmwork commented 3 weeks ago

Hello @mauropasse , The lowlatency feature is specifically designed to function within the 64K size limit. It provides a latency optimisation for handling small messages without fragmentation support. Regarding the topology, would it be possible to send us the .json file you are loading to the test? So we make sure we're running exactly the same test setup 😄

mauropasse commented 3 weeks ago

Hi @imstevenpmwork, yes I was just testing and found that 64K is the limit in which things don't work. So what happens on a system having both small messages at high freq and big messages? Either we miss the small messages or the big ones?

... would it be possible to send us the .json ...

# Publisher json
{"nodes": [{"node_name": "pub_node","publishers": [{"topic_name": "test","msg_type": "stamped_vector","msg_size": 64000,"freq_hz": 250}]}]}

# Subscriber json
{"nodes": [{"node_name": "sub_node","subscribers":[{"topic_name":"test", "msg_type":"stamped_vector"}]}]}

imstevenpmwork commented 3 weeks ago

Hi @mauropasse

I believe the configuration name is a bit confusing and may need to be changed in the future. Essentially, it offers an option to slightly optimize for latency, provided you're dealing with small messages only. If this option is disabled, as it should be when handling large messages, the optimization won't occur. However, this doesn't mean that Zenoh (and Zenoh RMW) in its default configuration is slow for small messages.

Since Zenoh RMW is still under development, we're constantly identifying and fixing issues. That's why reports like yours are incredibly valuable to us 😄

I'll test the topology configuration on our end and let you know if we get similar results. If we do, we'll investigate further to identify the cause and develop a fix to ensure Zenoh RMW is as performant as Zenoh itself despite the message size used

fujitatomoya commented 1 week ago

/assign @clalancette

ros2 / rmw_zenoh

High latency / Lost messages: Pub/Sub 10B at high pub frequency. #198