me-no-dev / ESPAsyncWebServer

Async Web Server for ESP8266 and ESP32
3.78k stars 1.23k forks source link

AsyncWebSocketClient forgets to dequeue control messages. Fills up with "pings" and prints "ERROR: Too many messages queued". #1389

Open 2bam opened 7 months ago

2bam commented 7 months ago

EDIT: Already found the problem. Added a PR with the full description and solution https://github.com/me-no-dev/ESPAsyncWebServer/pull/1390


I think it's an omission here

https://github.com/me-no-dev/ESPAsyncWebServer/blob/f71e3d427b5be9791a8a2c93cf8079792c3a9a26/src/AsyncWebSocket.cpp#L527

I believe it should do the same cleanup as it does for message packets, also for control packets.

Or maybe the following if could be a while to flush all of them out.

https://github.com/me-no-dev/ESPAsyncWebServer/blob/f71e3d427b5be9791a8a2c93cf8079792c3a9a26/src/AsyncWebSocket.cpp#L498

My communication pattern is receiving human pressed keys which could be very fast and then very slow. Each key press gets a response sent. And the pings are more frequent and used to calculate round trip time. I start seeing problems when using pings/messages together and no problems at all if they are used separately (i.e. only pings, or only messages), regardless of the comm frequency. I'll keep investigating.

2bam commented 7 months ago

After further investigation, it seems at some point AsyncWebSocketClient::_onAck stops getting called. And as messages are never flushed control pings never get sent neither (finished=0) so they never get an ack neither. This happens after a burst of messages arrive very fast/at the same time. After every message received, I send an "OK" response. A ping and these "OK"s might get intertwined.

If I keep receiving messages (and responding "OK") at one point the message queue fills up, start getting ERROR: Too many messages queued message, but the message queue doesn't need to be full to create this stuck state.

I've added this debug code to look into the state of the queue each time I ping

void AsyncWebSocketClient::ping(uint8_t *data, size_t len)
{
  ets_printf("ping\n");
  int i = 0;
  for (auto it = _controlQueue.begin(); it != _controlQueue.end(); ++it)
  {
    ets_printf("ping ctrl q %d) fin%d op%d\n", i++, (*it)->finished() ? 1 : 0, (*it)->opcode());
  }
  i = 0;
  for (auto it = _messageQueue.begin(); it != _messageQueue.end(); ++it)
  {
    auto *x = (AsyncWebSocketBasicMessage *)(*it);
    ets_printf("ping mesg q %d) fin%d bwf%d len%d sent%d ack%d acked%d\n", i++, (*it)->finished() ? 1 : 0, (*it)->betweenFrames(), x->_len, x->_sent, x->_ack, x->_acked);
  }

  if (_status == WS_CONNECTED)
    _queueControl(new AsyncWebSocketControl(WS_PING, data, len));
}

The result:

ping ctrl q 0) fin0 op9
ping ctrl q 1) fin0 op9
ping ctrl q 2) fin0 op9
ping ctrl q 3) fin0 op9
ping ctrl q 4) fin0 op9
ping ctrl q 5) fin0 op9
ping ctrl q 6) fin0 op9
ping ctrl q 7) fin0 op9
ping ctrl q 8) fin0 op9
ping ctrl q 9) fin0 op9
ping ctrl q 10) fin0 op9
ping ctrl q 11) fin0 op9
ping ctrl q 12) fin0 op9
ping ctrl q 13) fin0 op9
ping ctrl q 14) fin0 op9
ping ctrl q 15) fin0 op9
[NOTE: this ctrl list keeps growing and growing with each ping! ]

ping mesg q 0) fin0 bwf0 len2 sent0 ack0 acked2
ping mesg q 1) fin0 bwf1 len2 sent0 ack0 acked0
ping mesg q 2) fin0 bwf1 len2 sent0 ack0 acked0
ping mesg q 3) fin0 bwf1 len2 sent0 ack0 acked0
ping mesg q 4) fin0 bwf1 len2 sent0 ack0 acked0
ping mesg q 5) fin0 bwf1 len2 sent0 ack0 acked0
ping mesg q 6) fin0 bwf1 len2 sent0 ack0 acked0
ping mesg q 7) fin0 bwf1 len2 sent0 ack0 acked0
ping mesg q 8) fin0 bwf1 len2 sent0 ack0 acked0
ping mesg q 9) fin0 bwf1 len2 sent0 ack0 acked0
ping mesg q 10) fin0 bwf1 len2 sent0 ack0 acked0
ping mesg q 11) fin0 bwf1 len2 sent0 ack0 acked0

It seems the message at front of the queue is in an ill state, having acked for more than it was actually sent.