Open mauropasse opened 3 weeks ago
So just to be clear:
How are you measuring the latency? If you do one-way measurements, then for small data packets the behaviour you are seeing is probably due to batching (at the Zenoh and TCP/IP levels). If you want to skip batching, please use the ultra-low latency transport (see https://zenoh.io/blog/2023-10-03-zenoh-dragonite/#support-for-ultra-low-latency).
Don't hesitate to reach out to us on Zenoh's Discord server if you have any questions on configuration.
We'll try to reproduce on our side to see if we get similar behaviour. On your side, it would be good to see whether you get similar behaviour running the latency test shipped with Zenoh (see https://github.com/eclipse-zenoh/zenoh/tree/main/examples/examples).
-- kydos
Hi, thanks for the suggestions.
Before running my ROS2 app, I start zenoh with ./rmw_zenohd.
Does this executable take config files as input? Like: ./rmw_zenohd -c lowlatency.json5, or is there another way?
I'm asking because I don't get any output from ./rmw_zenohd --help.
I'll try to get data from the latency test you mention.
How are you measuring the latency?
The iRobot ros2-performance benchmark prints to the console and generates files with info about resources, latency, publish duration, CPU, memory, etc. We write the timestamp into the message, so when the subscription receives the message, it can compute the latency.
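As a rough illustration of the timestamp-in-message approach described above, here is a minimal Python sketch. It is not the ros2-performance implementation: the `StampedMessage` type and both helper functions are hypothetical names, and the sketch assumes publisher and subscriber read the same clock (for example, two processes on the same machine).

```python
import time
from dataclasses import dataclass


@dataclass
class StampedMessage:
    """Hypothetical message type: a send timestamp plus an opaque payload."""
    send_time_ns: int
    payload: bytes


def make_message(size_bytes, now_ns=time.monotonic_ns):
    # Publisher side: stamp the message right before it is handed to the transport.
    return StampedMessage(send_time_ns=now_ns(), payload=bytes(size_bytes))


def latency_us(msg, now_ns=time.monotonic_ns):
    # Subscriber side: one-way latency is receive time minus the embedded send time.
    # Only meaningful if both sides read the same clock (an assumption of this sketch).
    return (now_ns() - msg.send_time_ns) / 1000.0
```

Because this is a one-way measurement, any batching delay added by the transport shows up directly in the reported latency.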
Does this executable take config files as input? Like: ./rmw_zenohd -c lowlatency.json5, or is there another way?
You can find information in our README here on how to configure the session/router with non-default config files.
Hello @mauropasse, the lowlatency.json5 configuration should be passed to the ROS applications. It is not essential for the router, since in the default configuration the router is used only for off-robot communication. This can be done through the session configuration, as explained in the "Session and Router Configs" section of the Zenoh RMW README. Let me know if this does the trick.
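A minimal sketch of handing a session config to a ROS application from Python. It assumes the session config path is read from the `ZENOH_SESSION_CONFIG_URI` environment variable, which is what the README section mentioned above describes as far as I recall; please verify against the README for your version. The helper name and the package/node names in the comment are hypothetical.

```python
import os

# Assumption: rmw_zenoh picks up the session config path from this variable
# (check the "Session and Router Configs" section of the README).
SESSION_CONFIG_ENV_VAR = "ZENOH_SESSION_CONFIG_URI"


def env_with_session_config(config_path, base_env=None):
    """Return a copy of the environment with the session config path set."""
    env = dict(os.environ if base_env is None else base_env)
    env[SESSION_CONFIG_ENV_VAR] = str(config_path)
    return env


# Hypothetical usage (package and node names are placeholders):
#   import subprocess
#   subprocess.run(["ros2", "run", "my_pkg", "my_node"],
#                  env=env_with_session_config("lowlatency.json5"))
```

Each ROS process needs this environment, not just the router, since the setting applies per session.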
Hello @mauropasse and @mjcarroll, would you mind explaining the configuration of the test you ran so we can replicate it on our side? Ideally the topology configuration file :D
I have tried using the configuration file to enable low latency mode, and it seems to have fixed the issue for the 10B message.
I have used the DEFAULT_RMW_ZENOH_SESSION_CONFIG.json5 but setting lowlatency:true and qos:disabled.
But for some reason, it doesn't work for bigger messages (100KB, 1MB, etc.): the subscription doesn't get any messages, as if the nodes were not being discovered. Besides this problem, the publisher process seems to hang indefinitely.
I still see the router messages from both processes: [rmw_zenoh_cpp]: Successfully connected to a Zenoh router with id.
I've experienced this behavior both on x86 and on RPi4.
@imstevenpmwork the test topology is just a subscription to a 100KB message in one process, and a publisher at 250Hz in the other process. Let me know if you need more details.
Hello @mauropasse ,
The lowlatency feature is specifically designed to function within the 64K size limit. It provides a latency optimisation for handling small messages, without fragmentation support. Regarding the topology, would it be possible to send us the .json file you are loading into the test, so we can make sure we're running exactly the same test setup? 😄
Hi @imstevenpmwork, yes, I was just testing and found that 64K is the limit above which things stop working. So what happens on a system that has both small messages at high frequency and big messages? Do we either miss the small messages or the big ones?
... would it be possible to send us the .json ...
# Publisher json
{"nodes": [{"node_name": "pub_node","publishers": [{"topic_name": "test","msg_type": "stamped_vector","msg_size": 64000,"freq_hz": 250}]}]}
# Subscriber json
{"nodes": [{"node_name": "sub_node","subscribers":[{"topic_name":"test", "msg_type":"stamped_vector"}]}]}
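To relate a topology file like the publisher one above to the 64K limit discussed in this thread, a small Python sketch can list each publisher's message size and raw data rate. The limit value used here (64 * 1024 bytes) is an assumption: the thread only says "64K", and serialization or protocol overhead may push a payload just under that value over the real limit. The function name is hypothetical.

```python
import json

# Assumption: "64K" taken as 64 * 1024 bytes; the exact threshold and any
# per-message overhead are not stated in the thread.
ASSUMED_LOWLATENCY_LIMIT_BYTES = 64 * 1024


def summarize_publishers(topology_json, limit_bytes=ASSUMED_LOWLATENCY_LIMIT_BYTES):
    """List size, payload data rate, and limit check for every publisher."""
    topology = json.loads(topology_json)
    rows = []
    for node in topology.get("nodes", []):
        for pub in node.get("publishers", []):
            size = pub["msg_size"]
            rows.append({
                "topic": pub["topic_name"],
                "msg_size": size,
                # Payload bytes per second, ignoring headers and serialization.
                "bytes_per_second": size * pub["freq_hz"],
                "fits_assumed_limit": size <= limit_bytes,
            })
    return rows


PUBLISHER_JSON = (
    '{"nodes": [{"node_name": "pub_node","publishers": '
    '[{"topic_name": "test","msg_type": "stamped_vector",'
    '"msg_size": 64000,"freq_hz": 250}]}]}'
)

for row in summarize_publishers(PUBLISHER_JSON):
    print(row)
```

For the publisher above, 64000 bytes at 250 Hz works out to 16,000,000 payload bytes per second.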
Hi @mauropasse
I believe the configuration name is a bit confusing and may need to be changed in the future. Essentially, it offers an option to slightly optimize for latency, provided you're dealing with small messages only. If this option is disabled, as it should be when handling large messages, the optimization won't occur. However, this doesn't mean that Zenoh (and Zenoh RMW) in its default configuration is slow for small messages.
Since Zenoh RMW is still under development, we're constantly identifying and fixing issues. That's why reports like yours are incredibly valuable to us 😄
I'll test the topology configuration on our end and let you know if we get similar results. If we do, we'll investigate further to identify the cause and develop a fix to ensure Zenoh RMW is as performant as Zenoh itself, regardless of the message size used.
/assign @clalancette
Hi! I've been running some ROS2 benchmarks on rmw_zenoh (and other rmw's) using the iRobot ros2-performance framework, this time on a Raspberry Pi 4B and ROS2 Iron. The system tested consists of 2 processes, one for the publisher and one for the subscription. I noticed a spike in latency for the pub/sub - 10B - 2KHz scenario.
I've been able to reproduce these results using lower pub frequencies as well, and it is also noticeable on x86.
From x86 benchmark output logs I got:
On RPi4B, inspecting the events.txt, I see every message being late, as if it had lost sync and a msg is read only when a new one arrives:
So it looks like something starts failing when the pub frequency gets high enough.
Something strange is that I also ran a single-process test, with rclcpp intra-process disabled (pub/sub in the same process), and the issue goes away, so it seems the issue is with multi-process applications.