me-no-dev / ESPAsyncWebServer

Async Web Server for ESP8266 and ESP32

Dead-Lock of async_tcp task and tcp/ip task when _async_queue fills up completely #876

Open ul-gh opened 3 years ago

ul-gh commented 3 years ago

Hi,

when using an AsyncEventSource instance on the ESP32 platform, multiple issues arise due to multi-threaded access to the AsyncTCP event queue (_async_queue) and to the AsyncEventSource-internal _messageQueue, and due to locked usage of the LWIP callback API.

One of the issues was already addressed by Arjan Filius by adding mutex locking for access to the AsyncEventSource _messageQueue (see iafilius/ESPAsyncWebServer/commit/d924de1..).
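The pattern of that fix is roughly the following; this is only an illustrative sketch with made-up names, not the code of the commit itself:

// Illustrative sketch (not the actual commit): every access to the shared
// _messageQueue is wrapped in a FreeRTOS mutex so that the user application
// task and the async_tcp task cannot modify the list concurrently.
#include <freertos/FreeRTOS.h>
#include <freertos/semphr.h>
#include <ESPAsyncWebServer.h>

static SemaphoreHandle_t _messageQueueLock = xSemaphoreCreateMutex();

void queueMessageLocked(AsyncEventSourceMessage *msg) {
    xSemaphoreTake(_messageQueueLock, portMAX_DELAY);
    // ... _messageQueue.add(msg); then try to deliver pending messages ...
    xSemaphoreGive(_messageQueueLock);
}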

In my AJAX API project, I tried the above commit plus multiple other fixes, but after extensive tracing of heap memory issues etc. in AsyncTCP and ESPAsyncWebServer when using the SSE event source, I have now run into a dead-lock issue which causes the async_tcp task watchdog timer to reboot the whole system.

As far as my observation is correct, the issue is this: enabling the AsyncEventSource causes arbitrarily timed accesses to the AsyncTCP event queue from the LWIP tcp/ip task (via the activated periodic _tcp_poll callback and data arriving, time-interleaved, from the network/IP layer), while at the same time the LWIP API is called by the async_tcp task (sending responses) and also by the user application task calling the AsyncEventSource send() methods (AsyncEventSourceClient::_queueMessage() ... ultimately resulting in an AsyncClient client->write() call etc.).

==> I recorded an example call trace using a hardware JTAG debugger; please see the attached PNG image: AsyncEventSource_multithreading_issue

For the application code triggering the issue, please excuse that I have not yet prepared a minimal code example; please let me know if one would be helpful.

The relevant changes I made to ESPAsyncWebServer and AsyncTCP are on the "dev" branches at https://github.com/ul-gh/ESPAsyncWebServer/tree/dev and https://github.com/ul-gh/AsyncTCP/tree/dev

The whole application using the ESPAsyncWebServer is: https://github.com/ul-gh/esp_ajax_if

ul-gh commented 3 years ago

Looking further into this, I realise the dead-lock is just what happens when the _async_queue of the async_tcp event loop fills up 100%, causing xQueueGenericSend() to block indefinitely.

Anyway, the _async_queue filling up completely seems to happen easily when periodic push messages are sent via SSE / the AsyncEventSource. I still have to look into why that happens, but I suspect one of the default or user-defined AsyncCallbackWebHandlers blocks for a certain amount of time while SSE push messages are being acknowledged.

If this is correct, then the event queue filling up should be handled gracefully (e.g. a sensible error message plus discarding further events).

Also, what helps in my case is increasing the _async_queue length, which by default is only 32 (this could be a configure option). I set it to 256, and I now see up to approx. 100 events queuing up from time to time, finally without the dead-lock issue.
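For illustration, my local change looks roughly like this (the surrounding function in AsyncTCP.cpp is paraphrased from memory, and CONFIG_ASYNC_TCP_QUEUE_SIZE is just a name I made up, not an existing option):

// Sketch of a build-time configurable queue length in AsyncTCP.cpp instead of
// the hard-coded 32 entries; the macro name is only a suggestion.
#ifndef CONFIG_ASYNC_TCP_QUEUE_SIZE
#define CONFIG_ASYNC_TCP_QUEUE_SIZE 256
#endif

static inline bool _init_async_event_queue(){
    if(!_async_queue){
        _async_queue = xQueueCreate(CONFIG_ASYNC_TCP_QUEUE_SIZE, sizeof(lwip_event_packet_t *));
        if(!_async_queue){
            return false;
        }
    }
    return true;
}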

Please let me know if you need more input.

BlueAndi commented 3 years ago

You could measure in _handle_async_event() how long an event takes to be handled, especially in the case of LWIP_TCP_RECV. If there are peaks greater than 300 ms (as I see in your JavaScript), you could dig in there. SPIFFS access usually takes quite long.
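A rough (untested) sketch of such a measurement inside _handle_async_event():

// Rough, untested sketch: time each event dispatch in _handle_async_event()
// (AsyncTCP.cpp) and flag handlers that block for more than ~300 ms.
#include "esp_timer.h"  // for esp_timer_get_time()

static void _handle_async_event(lwip_event_packet_t *e){
    int64_t t_start = esp_timer_get_time();  // microseconds since boot

    // ... existing dispatch on e->event (LWIP_TCP_RECV, LWIP_TCP_SENT, ...) ...

    int64_t t_ms = (esp_timer_get_time() - t_start) / 1000;
    if(t_ms > 300){
        log_e("slow async event %d: %lld ms", (int)e->event, (long long)t_ms);
    }
}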

ul-gh commented 3 years ago

Thank you Andi, I did just that and the guess was correct.

I noticed blocking times of up to seven (!) seconds for some of the initial requests in the AsyncClient::_s_recv() handler. And yes, ultimately I could trace it down to the AsyncStaticWebHandler doing SPIFFS access. What happened was that I had the wrong registration order for the AsyncWebHandlers: the static handler serving the root (/) URI was registered first, before the specialised API endpoint handlers. The massive timeouts happened downstream of the .canHandle() method of the AsyncStaticWebHandler, which was called first in line for every API endpoint, looking for a local SPIFFS file with the same name, which of course did not exist.

I did not go into more detail on why the search for a nonexistent SPIFFS file could cause a multi-second timeout - but at least the result is clear now: the event timer task kept pushing SSE events for the whole duration of that timeout, and the _async_queue was filling up.
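For anyone running into the same thing: the workaround on my side was simply to register the specific handlers before the catch-all static handler, roughly like this (the paths are only examples, not my actual project code):

// Example registration order (example paths only): specific API endpoints and
// the SSE event source first, the catch-all SPIFFS static handler last, so
// AsyncStaticWebHandler::canHandle() no longer probes the filesystem for every
// API request.
#include <SPIFFS.h>
#include <ESPAsyncWebServer.h>

AsyncWebServer server(80);
AsyncEventSource events("/events");

void setupServer(){
    // API endpoints first
    server.on("/cmd", HTTP_GET, [](AsyncWebServerRequest *request){
        request->send(200, "text/plain", "OK");
    });
    // then the SSE event source
    server.addHandler(&events);
    // static files from SPIFFS last, as a fallback for everything else
    server.serveStatic("/", SPIFFS, "/www/").setDefaultFile("index.html");
    server.begin();
}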

So this needs to be fixed on the library side: the async_tcp task must not dead-lock when the _async_queue fills up.

A configure option for the queue length would also be nice to have, I guess.

I realize this is more than one issue in a single thread, but if it is appreciated, I can open individual issues for these and maybe even provide a fix or two in a PR next week.

ul-gh commented 3 years ago

I need to add that a lot of what is reported in issue #825 looks very much like the same issue. See e.g. #825 (comment)

So fixing this will likely also fix #825.

stale[bot] commented 3 years ago

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

ul-gh commented 3 years ago

So although the original issue is the _async_queue filling up due to some SPIFFS or other time lag, it would be nice to know if there will be any official fix for this dead-lock issue.

stale[bot] commented 3 years ago

[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.

brandonros commented 3 years ago

How can I tell whether https://github.com/me-no-dev/ESPAsyncWebServer/issues/921 is the same issue as this one?

ul-gh commented 3 years ago

Hi Brandon,

you could try whether the version from my fork fixes your issue: https://github.com/ul-gh/ESPAsyncWebServer/

Regards, Ulrich

brandonros commented 3 years ago

@ul-gh your fork seems to be identical to master?

ul-gh commented 3 years ago

That was a fast reply..

Pardon me, there is a "dev" branch which combines the patches for #837, #884 and #888. https://github.com/ul-gh/ESPAsyncWebServer/tree/dev

ul-gh commented 3 years ago

@brandonros, I have now had a look at the changes I made: I did /not/ commit a fix for the async_tcp task dead-lock issue, because in my case there was a different underlying issue - the SPIFFS driver looking for nonexistent files caused the long delays.

However, if you have a look at the file AsyncTCP.cpp, lines 105...120:

static inline bool _send_async_event(lwip_event_packet_t ** e_buf){
    return _async_queue && xQueueSend(_async_queue, e_buf, portMAX_DELAY) == pdPASS;
}

You could add some debugging here to see if this is your issue: if you change portMAX_DELAY to zero or some small value, the behaviour of _send_async_event() changes so that it no longer blocks indefinitely.

You could also check the return value of the FreeRTOS xQueueSend() call and see if the queue overflows in your application.
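A minimal debugging variant could look like this (just a sketch, not something I have committed anywhere):

// Debugging sketch only: bound the send timeout instead of portMAX_DELAY and
// log + drop the event when the queue is full, so a completely filled
// _async_queue no longer blocks the calling task forever.
// Note: a complete fix would also need to free the dropped packet.
static inline bool _send_async_event(lwip_event_packet_t ** e_buf){
    if(!_async_queue){
        return false;
    }
    if(xQueueSend(_async_queue, e_buf, pdMS_TO_TICKS(100)) != pdPASS){
        log_e("_async_queue full, event dropped");
        return false;
    }
    return true;
}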

Regards, Ulrich

brandonros commented 3 years ago

It seems to happen whenever I try to transmit more than 40-50 bytes.

I am not sure if the WiFi part is choking or the TCP part is fighting for contention with my “CAN/GPIO” section.

I will try these fixes. Your help is beyond appreciated. I have tried async, not async, a scheduler with a timeout, and pinning tasks to cores. I am all ears for any possible solutions. I will link the problem code for a skim if you don't mind.

https://gist.github.com/brandonros/c4288c12beb171747258d6a1120b22bc

justoke commented 3 years ago

"If you change portMAX_DELAY to zero or some small value, the behaviour of _send_async_event() changes so that it no longer blocks indefinitely."

Hi. I've been battling stability issues around ESPAsyncWebServer and BLE (related to what h2zero posted above), with crashes usually related to errors in AsyncTCP. I have tried your suggestion and it seems to have made an immediate difference; it may even have resolved the issue. I'll leave it running overnight and report back. Thank you.

justoke commented 3 years ago

Well, the performance is much improved. Requests and responses are near-instant. However, it did not resolve the seemingly random crashes. This is a common error trace:

(screenshot of the crash backtrace attached)

ul-gh commented 3 years ago

Glad to hear this is helping.

But I am afraid the call trace above does not show any unusual behavior.

It does not say why the panic handler was called; that should have been in one of the adjacent lines of text output. Also, even though this might be the task that ultimately crashed, it is not necessarily the root cause of the crash, as any other task could have caused memory corruption etc.

If you open a new issue thread with more information, we might be able to help.

CelliesProjects commented 3 years ago

My observation is that using String in AsyncTCP or AsyncWebServer will at some point result in a crash.

stale[bot] commented 3 years ago

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

avillacis commented 3 years ago

Still not fixed (possibly unfixable) in the main branch.

I am using the following combination for my projects involving ESP32 and web server:

Both are needed - one or the other is not enough. As I wrote AsyncTCPSock specifically to be a drop-in replacement for AsyncTCP, I am dogfooding it on my projects.

stale[bot] commented 3 years ago

[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.

stale[bot] commented 3 years ago

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

avillacis commented 3 years ago

Still not fixed.

stale[bot] commented 3 years ago

[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.

rdnn commented 3 years ago

Thank you @avillacis and @ul-gh for your work here. I've been chasing an issue where large events were not being sent, and your work has helped me immensely.

stale[bot] commented 3 years ago

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

avillacis commented 3 years ago

Still not fixed.

stale[bot] commented 3 years ago

[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.

stale[bot] commented 2 years ago

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

avillacis commented 2 years ago

Still not fixed.

stale[bot] commented 2 years ago

[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.

stale[bot] commented 2 years ago

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

CelliesProjects commented 2 years ago

Keep open!

stale[bot] commented 2 years ago

[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.

Adesin-fr commented 2 years ago

I also have a problem with this! I have a simple sketch with multiple web entry points. If I hit one endpoint with a curl bash loop it is OK, but when I hit two entry points (both of which just reply "OK"), it crashes with a watchdog not reset on async_tcp...

stale[bot] commented 1 year ago

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

avillacis commented 1 year ago

Still not fixed. Feeding the stalebot.

stale[bot] commented 1 year ago

[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.

stale[bot] commented 1 year ago

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

Bevanj86 commented 9 months ago

@avillacis, thanks for the modified libs. I've been having a hard time tracking down a problem with an ESP-hosted websocket glitching to high latency at random. It's meant to spit out ~500 bytes at 100 ms intervals, but every now and then it would glitch out for a second or two. Lots of mucking around with Wireshark and writing diagnostic code; I was convinced it was something to do with TCP.

Whatever you've changed in those modified libraries, it looks to have fixed things.