Closed matt123p closed 5 years ago
How about some comments that explain what all this is for?
The basic idea behind the code is to stop a problem that occurs when the client disconnects asynchronously. When this happens the lwip_fin function is called and this closes the pcb.
If this close happens at just the wrong time then it occurs between a tcp* function being scheduled and executed. When the tcp* function is finally executed it uses the pcb handle it was given, which is now invalid.
The way this patch solves the problem is to have a flag (one per pcb connection) which when zero means the pcb is OK to use and when non-zero indicates the pcb has been closed asynchronously.
Before using the pcb handle the flag is checked and if the pcb has already been closed the tcp_* function is not called and the crash avoided.
In the lwip_fin function the flag is set to non zero.
To cope with multiple pcbs open at the same time, instead of a single flag there is an array of flags. At construction the AsyncClient picks a slot in the array to use as its flag. It picks a slot that has been used the longest time ago. It knows this because it will have the lowest value in it. It must do this because we do not want to reuse a slot that has just been closed because there might still be a call scheduled using that slot.
I hope this makes sense. If you want to share your code then I could take a look at it with you.
The only change you really need to make is to make sure you check the closed slot array before using the pcb handle.
Thanks, Matt.
Thanks for your work with this @matt123p - I was seeing issues related to this that I could not figure out.
Thanks for the explanations, Matt! What do you think of putting them into a comment in the code?
Where this is causing headaches is here https://github.com/me-no-dev/AsyncTCP/pull/48/files#diff-5c977e380a3eb2cac7b8881fd6fd9dcbR550-R576, it's not the functionality but the nightmare of needing to call a _tcp function from a C callback in the TLS code. I'm not a C++ whizz, so most likely there is a better solution.
Still getting some random heap crashes with this when loading a page.
This is the sequence that caused this crash:
CORRUPT HEAP: Bad tail at 0x3ffd6440. Expected 0xbaad5678 got 0x00000000
assertion "head != NULL" failed: file "/Users/ficeto/Desktop/ESP32/ESP32/esp-idf-public/components/heap/multi_heap_poisoning.c", line 214, function: multi_heap_free
abort() was called at PC 0x400f8e63 on core 1
Backtrace: 0x4008c814:0x3ffd5c40 0x4008ca45:0x3ffd5c60 0x400f8e63:0x3ffd5c80 0x4008c489:0x3ffd5cb0 0x40084e6a:0x3ffd5cd0 0x400864ad:0x3ffd5cf0 0x4000bec7:0x3ffd5d10 0x400d8abb:0x3ffd5d30 0x400887b9:0x3ffd5d60
Rebooting...
which decodes to:
0x4008c814: invoke_abort at /Users/ficeto/Desktop/ESP32/ESP32/esp-idf-public/components/esp32/panic.c line 155
0x4008ca45: abort at /Users/ficeto/Desktop/ESP32/ESP32/esp-idf-public/components/esp32/panic.c line 170
0x400f8e63: __assert_func at ../../../.././newlib/libc/stdlib/assert.c line 63
0x4008c489: multi_heap_free at /Users/ficeto/Desktop/ESP32/ESP32/esp-idf-public/components/heap/multi_heap_poisoning.c line 214
0x40084e6a: heap_caps_free at /Users/ficeto/Desktop/ESP32/ESP32/esp-idf-public/components/heap/heap_caps.c line 268
0x400864ad: _free_r at /Users/ficeto/Desktop/ESP32/ESP32/esp-idf-public/components/newlib/syscalls.c line 42
0x400d8abb: _async_service_task(void*) at C:\Users\John\Documents\Arduino\libraries\AsyncTCP\src\AsyncTCP.cpp line 182
0x400887b9: vPortTaskWrapper at /Users/ficeto/Desktop/ESP32/ESP32/esp-idf-public/components/freertos/port.c line 143
I have a ws.cleanupClients(); in my webserver loop as well.
Also seeing a lot of [W][AsyncTCP.cpp:942] _poll(): pcb is NULL messages
The only other variable I can think of is I have a captive portal, which is maybe why it's crashing by just opening the browser (it doesn't redirect with Android).
@circuitsetup could you raise this as an issue and include the smallest amount of code that reproduces the problem?
I am currently using this code with websockets, file serving from SPIFF and serving custom HTTP requests and I have found it very stable.
@matt123p I did some more testing and it appears it is an issue with the captive portal (https://github.com/espressif/arduino-esp32/tree/master/libraries/DNSServer). As soon as I removed that code, I couldn't get it to crash. It's a bit tricky because it seems to crash at a different point every time - I guess depending on how the requests are handled.
I made a mistake with the last PR in which it was possible for a closed_slot to not be freed. This version fixes that and also is generally tidier and makes it slightly easier to read the code.