Closed AlexisTM closed 4 years ago
You have essentially run out of memory in your SHM pool. That's why the error message says it is unable to allocate space for a new message.
The way I usually work around this is by increasing the shared memory pool size: https://github.com/eclipse/iceoryx/blob/master/iceoryx_posh/source/mepoo/mepoo_config.cpp#L44-L54
The last entry there can be set to 10/20, which gives you more chunks for large messages.
This is unfortunately far from optimal, as these values are currently hardcoded. We have it on our list of features to make this more flexible and configurable at startup.
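For reference, the mempool table in mepoo_config.cpp looks roughly like the sketch below. The sizes, counts, and exact API here are illustrative assumptions, not the actual defaults — check the linked file for the real values. Bumping the count of the largest pool is the workaround described above:

```cpp
// Illustrative sketch of the hardcoded default mempool table
// (values are assumptions -- see the linked mepoo_config.cpp):
m_mempoolConfig.push_back({128, 10000});           // many small chunks
m_mempoolConfig.push_back({1024 * 16, 1000});      // medium chunks
m_mempoolConfig.push_back({1024 * 1024 * 32, 5});  // large 32 MiB chunks
// Workaround: raise the last count to 10 or 20 for more large chunks.
```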
With that configuration, I could send around 4K images without any delay. I am a bit puzzled about the increasing delay though. @michael-poehnl any ideas?
@AlexisTM is NOT using the loaned messages API extension. So I would assume the increasing delay comes from the serialization, which takes longer the bigger the payload is. Our serialization in rmw_iceoryx is currently quite slow, I guess :-(.
In rmw_iceoryx we take the queue size from the provided history qos. @Karsten1987 is there another source for the queue size? Should the "10" that @AlexisTM provides in the create_publisher call be propagated down to the rmw layer as history qos? Maybe the warning comes from another (built-in) subscriber that is using a queue size of 1000?
The warning we see comes from a history qos which is 1000 but the maximum constant in iceoryx is set to 256. So I'm wondering if this is your subscriber and we are not using the right parameter or if this is another subscriber.
If you want a queue size of 10 and can live with losing older chunks when the queue overflows, then this issue can be solved by increasing the number of 32 MB chunks (e.g. to 20). If we then ensure that your desired queue size of 10 is used on the iceoryx side (and not the 1000 coming from your subscriber), it should no longer crash. Currently we have a fail-fast strategy: if your memory pool configuration is not sufficient to handle all the chunks that are held by queues and on the user side, allocation fails and the publisher terminates.
Having the memory pool configuration as a config file and not only as compile time setting is a feature that is quite on top of the stack.
From not using loaned messages, I expect the delays to come from serialization. The typical delays with other middlewares are (18 MBytes):

- rmw_fastrtps_cpp: 25 ms
- rmw_cyclonedds_cpp: 18 ms
- rmw_iceoryx_cpp with fixed sizes and loaned messages: 0.1 ms (Awesome!)
- rmw_iceoryx_cpp with dynamic sizes and without loaned messages: > 1 second

The reason it crashes is a lack of memory: the requested history depth keeps buffering messages because the listener doesn't receive the data fast enough (too high delay).
For the queue size of 1000, the subscriptions are using the depth: https://github.com/ros2/rmw_iceoryx/blob/a8c95d42de562ecab12f0173e9ea34a694521b66/rmw_iceoryx_cpp/src/rmw_subscription.cpp#L98
But there is no mention of it on the publisher side: https://github.com/ros2/rmw_iceoryx/blob/master/rmw_iceoryx_cpp/src/rmw_publisher.cpp
So the good news is that we are 100 times faster with loaned messages. The bad news is that our "hack a thing to support non-memcpy-able messages" serialization is 100 times slower.
I'll check with @Karsten1987 if we can find another solution there by reusing things that are already available in ROS 2.
We currently have no use for the queue size on the publisher side. We plan to support a history QoS there in the future, but this is not a queue but rather a cache for messages. Currently we only support caching one message on the subscriber side, which corresponds to a latched topic in ROS 1.
Could you check whether it no longer crashes when you increase the number of chunks in the 32 MB mempool? https://github.com/eclipse/iceoryx/blob/master/iceoryx_posh/source/mepoo/mepoo_config.cpp#L44-L54
NOTE: 0.1 ms is the fastest we could get on a non-RT-patched Linux.
@AlexisTM Could you share some code you used for benchmarking? I am trying to take a shot at this. I'd love to have a similar setup as yours to see how you'd produced these numbers to come up with comparable ones on my end.
I am out of office (and don't have the code with me). It basically was: subscriber and publisher with both a queue of 10, sending a struct as:
```
struct BigData {
    uint8[33000000];
}
```
This was using the ROS2 API (no loaned messages)
@AlexisTM we modified our ROSCon demo a little bit to cope with loaned messages as well as the "classic" transport method. In neither case were we able to reproduce the behavior you describe.
It would be great if you could give that demo a shot and post some of the results you get here.
To give you an idea on what we see on our machines:
When sending 4k images at 15 Hz with loaned messages:

```
[INFO] [image_transport_subscriber]: Received 75 messages
[INFO] [image_transport_subscriber]: Average round time 0.124256 milliseconds
```

When sending 4k images at 15 Hz using the classic approach:

```
[INFO] [image_transport_subscriber]: Received 104 messages
[INFO] [image_transport_subscriber]: Average round time 2.615024 milliseconds
```
Even when adding a string field to the 4k fixed-size messages to force serialization, we get round trip times of about 20 milliseconds. Can you try to reproduce this?
@AlexisTM I am going to close this issue because I consider this problem addressed. Please feel free to re-open this ticket if you have further questions about it.
I sent messages of 33MBytes but the publishers/subscribers were set with a queue size of 10 without loaned messages, and the publisher crashed.
This means: if we are not using the zero-copy capability (the loaned-message methodology) in all nodes, the nodes will crash; I would therefore expect the global/local planner and the default mapping nodes to have problems when running over iceoryx.
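For contrast, the zero-copy path on the publisher side looks roughly like the sketch below in rclcpp. This is a fragment, not a full node: it requires a ROS 2 distribution with loaned-message support, a fixed-size POD message type, and `publisher_` / `fillImage` are assumed names for illustration.

```cpp
// Sketch: zero-copy publishing via rclcpp loaned messages.
// Assumes publisher_ is an rclcpp::Publisher for a fixed-size POD message.
auto loaned = publisher_->borrow_loaned_message(); // chunk comes from iceoryx SHM
fillImage(loaned.get());                           // hypothetical helper writing the payload in place
publisher_->publish(std::move(loaned));            // hands the chunk over, no copy
```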
Publisher was:
Subscriber was:
When starting, it says the following but I expect the queue size to be 10.
After a few messages, the node crashes due to lack of memory to be allocated.
This last error is (for me) due to delays on the subscriber side that prevent RouDi from repurposing memory, making the publisher die even though it is the subscriber's fault. Note that when I was doing tests on raw iceoryx, the typical delay was 50-150 μs (18 MB messages), but using rmw_iceoryx_cpp I get a steadily increasing delay, up to being a few messages late (at 1 Hz, without any processing, on an i9 machine).