slow_mode_delay for groups and supergroups caused problem for bots

halaei commented 3 years ago

Our Architecture

We are using long-polling and PHP [Laravel queues](https://laravel.com/docs/8.x/queues) for handling updates. Updates are fetched in batches by getUpdates() and each one is pushed as a job (message) onto a queue. A limited number of workers (processes), e.g. 50, handle the jobs in the queue, and each worker handles one job at a time. So when all 50 workers are busy, the remaining jobs must wait in the queue until a worker finishes its current job.

Issue

A while ago our bot became too slow. We found the reason was that it had been added to a crowded supergroup with slow_mode_delay enabled. The problem with slow mode is that, when it is enabled and the time remaining before the bot can send another message is less than 10 seconds, Telegram will hold the message for 10 seconds, send it, and return the HTTP response only after this long delay.

Here is what happens: a user in the group sends a command to our bot. Our bot answers with a message that has an InlineKeyboardMarkup with some callback buttons. Now the bot is banned for 10 seconds. Because the group is large, many users start to click on the callback buttons and our bot receives many requests, each requiring the bot to send another message to the chat.

What is the bug

Because the bot is banned for only 10 seconds, Telegram kindly holds each message for 10 seconds, sends it to the chat, and then answers the bot with a successful response after the 10-second delay. The good thing is that Telegram doesn't send us "too many requests" errors. The bad thing is that all 50 of our workers get stuck waiting for the sendMessage() response for 10 seconds each, so our queue overflows, leading to a huge delay in answering requests from other chats and users.
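To put rough numbers on the paragraph above, here is a minimal worst-case sketch. It assumes every job blocks for the full hold time on sendMessage(), which overstates the damage because only updates from the slow-mode group are held. The function name is hypothetical; the 50 workers, the 10-second hold, and the roughly 2000 updates a minute are the figures reported in this issue.

```php
<?php
// Worst-case throughput of a fixed pool of blocking workers.
// Assumption: every job spends $secondsPerJob waiting on sendMessage().
function maxJobsPerMinute(int $workers, float $secondsPerJob): float
{
    return $workers * 60.0 / $secondsPerJob;
}

$capacity = maxJobsPerMinute(50, 10.0);  // 300 jobs per minute
$incoming = 2000;                        // updates per minute, as reported
$backlogGrowth = $incoming - $capacity;  // jobs piling up per minute

printf("capacity=%d/min, backlog grows by %d/min\n", $capacity, $backlogGrowth);
```

Even with the 8-second upper bound, the same pool tops out at 375 jobs a minute in this worst case, far below the incoming rate.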

The bug is the conflict between 2 assumptions we have:

As a bot developer (client side), I assume that Telegram handles our HTTP requests as fast as possible, so there will be no congestion in any kind of queue that we use. It doesn't matter whether I am a PHP developer working with a blocking synchronous architecture, use a non-blocking event loop, or even use webhooks. There is always a limit on how many concurrent updates I can hold in memory while processing.
As a Telegram Bot API server developer, you think it is nicer to hold HTTP requests, as long as they don't require more than a 10-second delay, instead of immediately rejecting the request and asking the client to try later.

What I have done to fix the issue temporarily

Until the issue is solved by Telegram, if it is ever going to be solved, I have no choice but to somehow detect such groups and ban them from our service. Here is the piece of code that removes just one group, for a bot that receives about 2000 updates a minute:

if ($request->update->getChat() && $request->update->getChat()->getId() == -100******) {
    // Ban spam supergroup
    return;
}

Let me know if you want the id of the mentioned chat, because I am a bit suspicious of it. And here is the speed-up in job processing (almost 4x) after the code change:

Also I don't know why one bad supergroup slows down runtime of other requests from other chats as well. I am not sure, but it looks like a potential DoS vulnerability to me.

What I think Telegram can change

IMHO, Telegram can simply stop being nice and reject the HTTP requests as soon as possible with 429 status code instead of holding requests for 10 seconds. Another thing that could be done is to put some information in the update about how many seconds the bot must wait before sending a message. Currently there is slow_mode_delay, which is available only in the response of the getChat() method. That is not convenient: not only does accessing the value require an extra call to getChat(), it also doesn't show how many seconds the bot actually needs to wait at the time of receiving an update.

Thanks for your help.

levlam commented 3 years ago

> Telegram will hold the message for 10 seconds

Yes, but for at most 8 seconds.

> Also I don't know why one bad supergroup slows down runtime of other requests from other chats as well.

It shouldn't. Your graph doesn't show that other requests began to be processed faster. It shows that they are processed faster than the average banned request.

> Telegram can simply stop being nice and reject the HTTP requests as soon as possible with 429 status code instead of holding requests for 10 seconds.

This would break delivery of a lot of otherwise sent messages. It could solve this specific issue, but there can be a lot of other issues that cause slow responses: the network connection to the Bot API servers can be slow or unreliable, the Bot API server can be overloaded, Telegram servers can be overloaded. This implies that there is no way to guarantee that requests are handled fast, so a robust architecture shouldn't depend on that.

The most robust solution is to always send network requests asynchronously, so that network requests never block other network requests or the processing of new updates.

Given your architecture with PHP and 50 workers that make synchronous network requests to the Telegram API, I would suggest switching to webhooks. Updates from the same chat are delivered to the webhook successively to guarantee that they are processed in the correct order, so such a chat can block no more than 1 worker at a time.
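The effect of per-chat ordering can be illustrated with a small in-memory sketch. This is not how the Bot API server implements webhook delivery. It only shows why serializing updates per chat caps the damage at one worker per chat. The class and method names are hypothetical.

```php
<?php
// Hypothetical helper: at most one in-flight update per chat.
final class PerChatDispatcher
{
    /** @var array<int, bool> chats that currently occupy a worker */
    private array $busy = [];
    /** @var array<int, list<string>> updates waiting for their chat to become free */
    private array $waiting = [];

    /** Returns true if the update may be handed to a worker right now. */
    public function offer(int $chatId, string $update): bool
    {
        if (isset($this->busy[$chatId])) {
            $this->waiting[$chatId][] = $update; // keep order within the chat
            return false;
        }
        $this->busy[$chatId] = true;
        return true;
    }

    /** Called when a worker finishes; returns the chat's next update, if any. */
    public function done(int $chatId): ?string
    {
        if (!empty($this->waiting[$chatId])) {
            return array_shift($this->waiting[$chatId]); // chat stays busy
        }
        unset($this->busy[$chatId], $this->waiting[$chatId]);
        return null;
    }

    /** Number of chats that currently hold a worker. */
    public function busyChats(): int
    {
        return count($this->busy);
    }
}

$dispatcher = new PerChatDispatcher();
$dispatcher->offer(-1001, 'click 1'); // handed to a worker
$dispatcher->offer(-1001, 'click 2'); // held back: the chat already has a worker
$dispatcher->offer(42, 'hello');      // another chat is unaffected
echo $dispatcher->busyChats(), " workers in use\n";
```

However many callback clicks arrive from the slow-mode group, they occupy a single worker here, and the rest of the pool stays available for other chats.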

To fix this specific problem, you can call getChat after receiving a 429 error (you should receive it when the second message is sent within two seconds of the first one) and block processing of updates from the chat for slow_mode_delay seconds. You can save the received slow_mode_delay value and refresh it once every 1-2 hours using getChat. But more general solutions should also be simpler.
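The workaround described above could be sketched as follows. This is a minimal in-memory version under stated assumptions: the class name, method names, and the one-hour refresh interval are hypothetical, and the callable passed to the constructor stands in for a real getChat call that returns the chat's slow_mode_delay. A real bot would keep this state in a store shared by all workers.

```php
<?php
// Hypothetical helper; $fetchSlowModeDelay stands in for a real getChat call.
final class SlowModeGate
{
    private const REFRESH_AFTER = 3600; // seconds; lower end of the suggested 1-2 hours

    /** @var array<int, array{delay: int, fetchedAt: int}> cached slow_mode_delay per chat */
    private array $cache = [];
    /** @var array<int, int> unix time until which each chat is skipped */
    private array $blockedUntil = [];
    /** @var callable(int): int */
    private $fetchSlowModeDelay;

    public function __construct(callable $fetchSlowModeDelay)
    {
        $this->fetchSlowModeDelay = $fetchSlowModeDelay;
    }

    /** Call when sendMessage to $chatId came back with a 429 error. */
    public function onTooManyRequests(int $chatId, int $now): void
    {
        $entry = $this->cache[$chatId] ?? null;
        if ($entry === null || $now - $entry['fetchedAt'] >= self::REFRESH_AFTER) {
            // Cache miss or stale value: ask for slow_mode_delay again.
            $entry = ['delay' => ($this->fetchSlowModeDelay)($chatId), 'fetchedAt' => $now];
            $this->cache[$chatId] = $entry;
        }
        $this->blockedUntil[$chatId] = $now + $entry['delay'];
    }

    /** True while updates from $chatId should be skipped or postponed. */
    public function isBlocked(int $chatId, int $now): bool
    {
        return ($this->blockedUntil[$chatId] ?? 0) > $now;
    }
}

// Usage with a fake fetcher that always reports a 30-second slow mode.
$gate = new SlowModeGate(fn (int $chatId): int => 30);
$gate->onTooManyRequests(-1001, 1000);
var_dump($gate->isBlocked(-1001, 1010)); // bool(true)
var_dump($gate->isBlocked(-1001, 1030)); // bool(false)
```

Timestamps are passed in explicitly so the logic can be exercised without waiting; in a worker they would come from time().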

halaei commented 3 years ago

Thanks for your response and suggestions:

> Yes, but for at most 8 seconds.

For me, it makes development a bit more complicated than it should be. I can work around it one way or another, but usually no one even considers the chance of an 8-second delay for many requests, other than in some rare and temporary cases.

> This (sending immediate 429) would break delivery of a lot of otherwise sent messages.

I think no one would care about this break :) Telegram bots can also handle an immediate 429 and retry if they really need to send the message. In case your concern is about non-bot users, I don't think this change would affect them either; Android clients show a timer to the users, so they practically can't even try sending messages before they are unblocked. In case you are afraid of breaking something, you could add an extra parameter to sendMessage() to let the bot developer choose which option they prefer, or how many seconds they are willing to wait. Or maybe a setting could be added to BotFather?

Also, using webhooks doesn't make much of a difference, because there has to be a max limit on the number of processes in the php-fpm pool as well. Increasing the number of processes, with queues or webhooks, will certainly help, at the cost of using more memory. Also, in my experience, using a queue system instead of webhooks makes PHP-based bots faster, for the following reasons:

Workers spend less time and CPU because they don't have extra process initiation and framework bootstrap delay.
More importantly, queue workers can reuse opened HTTPS connections from the previous jobs, but web processes can't. Sending a Telegram request over a new connection is much slower (~340ms from our servers) than sending it over a previously opened connection (~80ms).

levlam commented 3 years ago

> I think no one would care about this break

You underestimate how many messages it helps to deliver.

> Android clients show a timer to the users

And bots can do the same using getChat (this is what the Android client does to show the timer).

> Also, using webhooks doesn't make much of a difference

I gave the exact reason why it helps a lot in this particular case. You would need 50+ such groups to block all 50 workers.

> Workers spend less time and CPU because they don't have extra process initiation and framework bootstrap delay.

True, but this should be highly optimized in modern frameworks and very fast in the latest PHP versions.

> Queue workers can reuse opened HTTPS connections from the previous jobs

True, but you can use a local forward proxy to send requests to api.telegram.org instead.

halaei commented 3 years ago

I missed your reason why webhooks work better in this case. I see it now. Thanks for letting us know about it. The local forward proxy is also an interesting idea. I don't know if it is a common pattern or a new idea :) Thanks again for sharing so much helpful information.

I'll leave my final feature request here. I understand it might not be a high priority, or you may have reasons why it's not a good feature: let the bot developer somehow disable blocking of sendMessage() requests, either by a parameter in sendMessage or by a setting in BotFather.

skrtdev commented 3 years ago

So @levlam, you said that returning a 429 would break other things, so it seems that you can't know whether slow mode is enabled, nor whether a message can't be sent because of slow mode. But you CAN know it, this is absolutely obvious, so why can't you return a 429 error ONLY when the message can't be sent because of slow mode, and not every time the request execution time exceeds 8 seconds?

tdlib / telegram-bot-api