threefoldtech / rmb-rs

RMB implementation in Rust
Apache License 2.0

Stress testing RMB shows missing responses under heavy load #161

Open sameh-farouk opened 1 year ago

sameh-farouk commented 1 year ago

While stress testing RMB, I observed the following behavior, which can be consistently reproduced.

To summarize: when I send a large number of messages quickly from twin 1 to twin 2, responses start going missing once the number of messages exceeds 2000. In some tests, I received only about 2700 responses out of 10,000 requests.

The TTL for the test messages is set high enough, so it should not be an issue. Also, no errors were found in the trace logs of either peer or the relay, which is quite strange.

It is worth noting that this issue occurs on both the main branch and the last two releases, including the one deployed on mainnet.

sameh-farouk commented 1 year ago

After debugging and tracking, I observed that the responses were successfully received from the relay and subsequently pushed to Redis. However, they then mysteriously vanished into thin air before being picked up by the handler process!

It’s interesting to learn that what was initially perceived as a bug turned out to be a feature! 😄 However, it seems like this feature could benefit from some rethinking. Under high load, the process responsible for handling responses seems to be competing with our storage implementation, which unconditionally truncates lists with more than 500 items. While this design choice was intended to prevent potential attacks, it didn’t account for situations where this value might be too low.
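For anyone reading along, the trimming behaviour boils down to something like the following. This is a minimal sketch using the `redis` crate, not the actual rmb-rs storage code; only the 500 limit comes from the discussion here, the rest is illustrative:

```rust
use redis::Commands;

const MAX_QUEUE_LEN: isize = 500; // the limit discussed above

// Sketch of the capped push: LPUSH the new envelope, then unconditionally
// trim the list so only MAX_QUEUE_LEN entries survive. Under heavy load,
// whatever the trim discards is exactly the "missing responses".
fn push_capped(con: &mut redis::Connection, queue: &str, payload: &[u8]) -> redis::RedisResult<()> {
    let _len: i64 = con.lpush(queue, payload)?; // LPUSH returns the new list length
    let _: () = con.ltrim(queue, 0, MAX_QUEUE_LEN - 1)?; // keep only the newest 500 entries
    Ok(())
}
```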

Revisiting the code comments related to storage gives a deeper understanding of the rationale behind this decision, but it might be worth exploring alternative approaches.

To illustrate, the current implementation does not adequately protect against the use of multiple queues to send messages. Since the number of queues/keys is unlimited and each one can hold up to 500 messages, an attack is still possible.
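To put rough, made-up numbers on it (every figure below is hypothetical): the cap is per key, so the total is unbounded once the attacker controls the queue names.

```rust
/// Back-of-the-envelope worst case: the 500-entry cap is per key, so total
/// memory grows with the number of distinct queue names an attacker uses.
fn worst_case_bytes(queues: u64, per_queue_cap: u64, avg_msg_bytes: u64) -> u64 {
    queues * per_queue_cap * avg_msg_bytes
}

// e.g. 10_000 attacker-chosen queues * 500 messages * 1 KiB each
// = worst_case_bytes(10_000, 500, 1024) = 5_120_000_000 bytes ≈ 4.8 GiB
```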

One approach that could be considered is to set `maxmemory` and rely on Redis key eviction policies.
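As a sketch of that direction only (the 256mb limit and the `allkeys-lru` policy below are arbitrary placeholders, and note that eviction drops whole keys, i.e. entire queues at once; the same settings can of course live in `redis.conf` instead):

```rust
/// Sketch: cap Redis memory and let it evict whole keys under pressure.
/// The limit and policy are arbitrary examples, not a recommendation.
fn configure_eviction(con: &mut redis::Connection) -> redis::RedisResult<()> {
    redis::cmd("CONFIG").arg("SET").arg("maxmemory").arg("256mb").query::<()>(con)?;
    redis::cmd("CONFIG").arg("SET").arg("maxmemory-policy").arg("allkeys-lru").query::<()>(con)?;
    Ok(())
}
```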

muhamadazmy commented 12 months ago

Honestly, I would avoid relying on Redis configuration to prevent attacks; instead, this should be built into RMB somehow. My idea was simply to trim slow queues, so older messages are dropped once a Redis queue exceeds 500 messages. This small limit was set to prevent malicious clients from filling up peer memory with garbage messages sitting in queues that have no consumer.

One of the major flaws in the RMB design is that clients local to the peer can use any queue for commands, which causes problems:

Unlike a synchronous RPC protocol such as HTTP, where the client keeps a connection open for the response, the server here cannot abort or simply drop the message if the client connection is lost. HTTP also allows validating the request "command", which is equivalent to the path (and of course the method) of an HTTP request.

One solution is for command handlers to tell rmb-peer that they are the consumers of their handler queues, which means that if rmb-peer receives a message for an unknown handler, it can return a NotFound-like error.

The problem is that this requires cooperation from the handler SDKs (to tell RMB), which means older SDKs that do not implement this would suddenly stop receiving messages unless they are updated to the latest RMB.
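To make the idea concrete, here is a rough sketch of what such a registration check could look like on the rmb-peer side; the registry type, the method names, and the error are all hypothetical and not existing rmb-rs code:

```rust
use std::collections::HashSet;

/// Hypothetical registry of commands that local handlers have declared
/// they consume, e.g. via an explicit "register" call made by the SDK.
struct HandlerRegistry {
    known: HashSet<String>,
}

impl HandlerRegistry {
    fn new() -> Self {
        Self { known: HashSet::new() }
    }

    fn register(&mut self, cmd: &str) {
        self.known.insert(cmd.to_string());
    }

    /// Instead of queueing a request for a command nobody consumes
    /// (where it would sit until trimmed or expired), reject it up front.
    fn route(&self, cmd: &str) -> Result<(), &'static str> {
        if self.known.contains(cmd) {
            Ok(()) // push onto the command's queue as usual
        } else {
            Err("NotFound: no local handler registered for this command")
        }
    }
}
```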

Unfortunately, this only handles (request) messages. Response messages are another problem:

I can only think of changing this value to something bigger, like 10,000, but that only pushes the problem a bit further.

sameh-farouk commented 12 months ago

> While we do track which messages are expected to have responses, so we can drop the ones that are responses to an unknown queue (that is already implemented), we can NOT know in advance how many responses are expected, so again a malicious server can still answer a simple request with millions of big messages that fill up the response queue.

I had anticipated that sending 100 requests with some reply queue would result in 100 entries in the backlog, each with a unique ID but the same reply queue. However, I need to review that part of the code to better understand how it works; I don't think I ever gave it much attention. If it is not working like this, then maybe it should be reconsidered, no?
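For what it's worth, my mental model of the backlog is roughly the following; this is a hypothetical sketch, not the actual storage code, and all names are made up:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical backlog entry: what the peer remembers about an outgoing
/// request while it waits for the corresponding response.
struct BacklogEntry {
    reply_queue: String, // e.g. "msgbus.some.reply.queue" (illustrative name)
    deadline: Instant,   // derived from the envelope TTL
}

/// Keyed by the message UID, so many requests may share one reply queue
/// while still being tracked (and expired) individually.
type Backlog = HashMap<String, BacklogEntry>;

fn track(backlog: &mut Backlog, uid: &str, reply_queue: &str, ttl: Duration) {
    backlog.insert(
        uid.to_string(),
        BacklogEntry {
            reply_queue: reply_queue.to_string(),
            deadline: Instant::now() + ttl,
        },
    );
}
```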

Regarding your suggestion of increasing the maximum queue limit, I would suggest implementing response timeouts instead, to prevent stalled data from affecting system performance or causing Redis to crash. Let me explain why I think this could be the better option:

If a response doesn't get consumed by the sender within the lifetime of the request, this means two things:

To clean up such stalled data, we should **set a TTL for reply queues to something like envelope TTL × X**. X could be strict, like 1, or a bit looser.
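Concretely, something along these lines (a sketch only; the slack factor and key handling are placeholders): whenever a response is pushed onto a reply queue, also (re)set an expiry derived from the envelope TTL, so a queue nobody drains cleans itself up:

```rust
use redis::Commands;

/// Sketch: push a response and bound the reply queue's lifetime by the
/// envelope TTL times a slack factor X (strict here, i.e. X = 1).
fn push_response_with_ttl(
    con: &mut redis::Connection,
    reply_queue: &str,
    payload: &[u8],
    envelope_ttl_secs: i64,
) -> redis::RedisResult<()> {
    let x: i64 = 1; // could be loosened, e.g. 2
    let _: i64 = con.lpush(reply_queue, payload)?;
    // Refresh the expiry on every push, so an undrained queue expires on its own.
    redis::cmd("EXPIRE")
        .arg(reply_queue)
        .arg(envelope_ttl_secs * x)
        .query::<i64>(con)?;
    Ok(())
}
```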

Also, it’s advisable to have monitoring and logging capabilities in the storage implementation to detect excessive accumulation of requests for unhandled commands or responses. I would then adjust the current logic to log such behavior (instead of dealing with it silently) to help identify bottlenecks or performance issues and detect any unusual patterns or anomalies.
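As a sketch of that logging idea (using the `log` crate; the threshold and the message are illustrative): since LPUSH returns the new list length, the storage layer could warn whenever a push overflows the cap instead of trimming silently:

```rust
use redis::Commands;

const MAX_QUEUE_LEN: i64 = 500;

/// Sketch: same capped push as earlier, but the data loss is made visible
/// instead of happening silently.
fn push_capped_logged(
    con: &mut redis::Connection,
    queue: &str,
    payload: &[u8],
) -> redis::RedisResult<()> {
    let len: i64 = con.lpush(queue, payload)?; // LPUSH returns the new list length
    if len > MAX_QUEUE_LEN {
        log::warn!(
            "queue '{}' holds {} entries (cap {}); older messages will be dropped",
            queue,
            len,
            MAX_QUEUE_LEN
        );
    }
    let _: () = con.ltrim(queue, 0, (MAX_QUEUE_LEN - 1) as isize)?;
    Ok(())
}
```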

muhamadazmy commented 12 months ago

Yes, I agree that a TTL needs to be set on the queues themselves, and tbh I thought I did, but it seems I only set the TTL on the backlog tracker. The TTL can always be updated from the TTL set on the message (whether request or response).

I didn't get the part about using the reply queue for multiple messages. That's totally fine and is allowed by design; it means something like gridproxy can have a single worker to receive responses while broadcasting messages to multiple nodes, and their responses will end up in the same response queue for that worker handler.
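To illustrate that pattern with a generic sketch (queue name handling and payloads are placeholders, not the actual rmb wire format): many requests go out naming the same reply queue, and one worker drains that single queue:

```rust
/// Sketch: one worker drains a single shared reply queue, no matter how many
/// broadcast requests named it as their return queue.
fn drain_replies(
    con: &mut redis::Connection,
    reply_queue: &str,
    expected: usize,
) -> redis::RedisResult<Vec<Vec<u8>>> {
    let mut replies = Vec::with_capacity(expected);
    while replies.len() < expected {
        // BRPOP blocks for up to 5 seconds; None means we timed out waiting.
        let item: Option<(String, Vec<u8>)> = redis::cmd("BRPOP")
            .arg(reply_queue)
            .arg(5)
            .query(con)?;
        match item {
            Some((_queue, payload)) => replies.push(payload),
            None => break, // give up on the missing responses after the timeout
        }
    }
    Ok(replies)
}
```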