mathieucarbou / ESPAsyncWebServer

Asynchronous HTTP and WebSocket Server Library for ESP32, ESP8266 and RP2040
https://mathieu.carbou.me/ESPAsyncWebServer/
GNU Lesser General Public License v3.0

[BUG] CORRUPT HEAP: Bad tail on clearQueue() when freeing buffer (vector of bytes) #120

Closed MikeSoperRubicon closed 1 month ago

MikeSoperRubicon commented 1 month ago

Hi there, thank you for your wonderful library that allows one to upgrade to use ArduinoJson7...

Description

platform = espressif32@6.8.1
framework = espidf, arduino

My issue is as follows: after switching from https://github.com/esphome/ESPAsyncWebServer to https://github.com/mathieucarbou/ESPAsyncWebServer, opening a websocket connection crashes the device as soon as the websocket is open. This did not happen with the esphome version. Comparing the two libraries, the issue appears to be in the _onAck function, where your library runs _clearQueue(). The esphome library runs _cleanBuffers() instead, which checks whether a buffer can be deleted before doing so. I suspect the crash occurs when freeing the memory for a websocket event in the queue.

Board: esp32s3

Ethernet adapter used: W5500

Stack trace

CORRUPT HEAP: Bad tail at 0x3fccb2b4. Expected 0xbaad5678 got 0xbaad5600

assert failed: multi_heap_free IDF\components\heap\multi_heap_poisoning.c:259 (head != NULL)

Backtrace: 0x40379bb6:0x3fcb8b40 0x403808c9:0x3fcb8b60 0x4038c29d:0x3fcb8b80 0x403880fd:0x3fcb8ca0 0x4037ac2d:0x3fcb8cc0 0x4037b1fd:0x3fcb8ce0 0x4037b2dd:0x3fcb8d10 0x4214d859:0x3fcb8d30 0x4203575e:0x3fcb8d50 0x420375e1:0x3fcb8d70 0x4203765d:0x3fcb8d90 0x4216c9a1:0x3fcb8db0 0x4216cc05:0x3fcb8df0
  #0  0x40379bb6 in panic_abort at C:/Users/michael.soper/.platformio/packages/framework-espidf/components/esp_system/panic.c:408
  #1  0x403808c9 in esp_system_abort at C:/Users/michael.soper/.platformio/packages/framework-espidf/components/esp_system/esp_system.c:137
  #2  0x4038c29d in __assert_func at C:\Users\michael.soper\.platformio\packages\framework-espidf\components\newlib/assert.c:85
  #3  0x403880fd in multi_heap_free at C:\Users\michael.soper\.platformio\packages\framework-espidf\components\heap/multi_heap_poisoning.c:259 (discriminator 1)
  #4  0x4037ac2d in heap_caps_free at C:\Users\michael.soper\.platformio\packages\framework-espidf\components\heap/heap_caps.c:382
  #5  0x4037b1fd in trace_free at C:\Users\michael.soper\.platformio\packages\framework-espidf\components\heap\include/heap_trace.inc:118
  #6  0x4037b2dd in __wrap_free at C:\Users\michael.soper\.platformio\packages\framework-espidf\components\heap\include/heap_trace.inc:163
  #7  0x4214d859 in operator delete(void*) at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp32s3-elf/src/gcc/libstdc++-v3/libsupc++/del_op.cc:49
  #8  0x4203575e in __gnu_cxx::new_allocator<unsigned char>::deallocate(unsigned char*, unsigned int) at c:\users\michael.soper\.platformio\packages\toolchain-xtensa-esp32s3\xtensa-esp32s3-elf\include\c++\8.4.0\ext/new_allocator.h:125
      (inlined by) std::allocator_traits<std::allocator<unsigned char> >::deallocate(std::allocator<unsigned char>&, unsigned char*, unsigned int) at c:\users\michael.soper\.platformio\packages\toolchain-xtensa-esp32s3\xtensa-esp32s3-elf\include\c++\8.4.0\bits/alloc_traits.h:462
      (inlined by) std::_Vector_base<unsigned char, std::allocator<unsigned char> >::_M_deallocate(unsigned char*, unsigned int) at c:\users\michael.soper\.platformio\packages\toolchain-xtensa-esp32s3\xtensa-esp32s3-elf\include\c++\8.4.0\bits/stl_vector.h:304
      (inlined by) std::_Vector_base<unsigned char, std::allocator<unsigned char> >::~_Vector_base() at c:\users\michael.soper\.platformio\packages\toolchain-xtensa-esp32s3\xtensa-esp32s3-elf\include\c++\8.4.0\bits/stl_vector.h:285
      (inlined by) std::vector<unsigned char, std::allocator<unsigned char> >::~vector() at c:\users\michael.soper\.platformio\packages\toolchain-xtensa-esp32s3\xtensa-esp32s3-elf\include\c++\8.4.0\bits/stl_vector.h:570
      (inlined by) void __gnu_cxx::new_allocator<std::vector<unsigned char, std::allocator<unsigned char> > >::destroy<std::vector<unsigned char, std::allocator<unsigned char> > >(std::vector<unsigned char, std::allocator<unsigned char> >*) at c:\users\michael.soper\.platformio\packages\toolchain-xtensa-esp32s3\xtensa-esp32s3-elf\include\c++\8.4.0\ext/new_allocator.h:140
      (inlined by) void std::allocator_traits<std::allocator<std::vector<unsigned char, std::allocator<unsigned char> > > >::destroy<std::vector<unsigned char, std::allocator<unsigned char> > >(std::allocator<std::vector<unsigned char, std::allocator<unsigned char> > >&, std::vector<unsigned char, std::allocator<unsigned char> >*) at c:\users\michael.soper\.platformio\packages\toolchain-xtensa-esp32s3\xtensa-esp32s3-elf\include\c++\8.4.0\bits/alloc_traits.h:487
      (inlined by) std::_Sp_counted_ptr_inplace<std::vector<unsigned char, std::allocator<unsigned char> >, std::allocator<std::vector<unsigned char, std::allocator<unsigned char> > >, (__gnu_cxx::_Lock_policy)2>::_M_dispose() at c:\users\michael.soper\.platformio\packages\toolchain-xtensa-esp32s3\xtensa-esp32s3-elf\include\c++\8.4.0\bits/shared_ptr_base.h:554
  #9  0x420375e1 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() at c:\users\michael.soper\.platformio\packages\toolchain-xtensa-esp32s3\xtensa-esp32s3-elf\include\c++\8.4.0\bits/shared_ptr_base.h:155
      (inlined by) std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() at c:\users\michael.soper\.platformio\packages\toolchain-xtensa-esp32s3\xtensa-esp32s3-elf\include\c++\8.4.0\bits/shared_ptr_base.h:148
      (inlined by) std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() at c:\users\michael.soper\.platformio\packages\toolchain-xtensa-esp32s3\xtensa-esp32s3-elf\include\c++\8.4.0\bits/shared_ptr_base.h:728
      (inlined by) std::__shared_ptr<std::vector<unsigned char, std::allocator<unsigned char> >, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() at c:\users\michael.soper\.platformio\packages\toolchain-xtensa-esp32s3\xtensa-esp32s3-elf\include\c++\8.4.0\bits/shared_ptr_base.h:1167
      (inlined by) std::shared_ptr<std::vector<unsigned char, std::allocator<unsigned char> > >::~shared_ptr() at c:\users\michael.soper\.platformio\packages\toolchain-xtensa-esp32s3\xtensa-esp32s3-elf\include\c++\8.4.0\bits/shared_ptr.h:103
      (inlined by) AsyncWebSocketMessage::~AsyncWebSocketMessage() at .pio/libdeps/apex_edge/ESP Async WebServer/src/AsyncWebSocket.h:130
      (inlined by) void __gnu_cxx::new_allocator<AsyncWebSocketMessage>::destroy<AsyncWebSocketMessage>(AsyncWebSocketMessage*) at c:\users\michael.soper\.platformio\packages\toolchain-xtensa-esp32s3\xtensa-esp32s3-elf\include\c++\8.4.0\ext/new_allocator.h:140
      (inlined by) void std::allocator_traits<std::allocator<AsyncWebSocketMessage> >::destroy<AsyncWebSocketMessage>(std::allocator<AsyncWebSocketMessage>&, AsyncWebSocketMessage*) at c:\users\michael.soper\.platformio\packages\toolchain-xtensa-esp32s3\xtensa-esp32s3-elf\include\c++\8.4.0\bits/alloc_traits.h:487
      (inlined by) std::deque<AsyncWebSocketMessage, std::allocator<AsyncWebSocketMessage> >::pop_front() at c:\users\michael.soper\.platformio\packages\toolchain-xtensa-esp32s3\xtensa-esp32s3-elf\include\c++\8.4.0\bits/stl_deque.h:1594
      (inlined by) AsyncWebSocketClient::_clearQueue() at .pio/libdeps/apex_edge/ESP Async WebServer/src/AsyncWebSocket.cpp:305
      (inlined by) AsyncWebSocketClient::_onAck(unsigned int, unsigned int) at .pio/libdeps/apex_edge/ESP Async WebServer/src/AsyncWebSocket.cpp:334      
  #10 0x4203765d in std::_Function_handler<void (void*, AsyncClient*, unsigned int, unsigned int), AsyncWebSocketClient::AsyncWebSocketClient(AsyncWebServerRequest*, AsyncWebSocket*)::{lambda(void*, AsyncClient*, unsigned int, unsigned int)#2}>::_M_invoke(std::_Any_data const&, void*&&, AsyncClient*&&, unsigned int&&, AsyncClient*&&) at .pio/libdeps/apex_edge/ESP Async WebServer/src/AsyncWebSocket.cpp:282
      (inlined by) _M_invoke at c:\users\michael.soper\.platformio\packages\toolchain-xtensa-esp32s3\xtensa-esp32s3-elf\include\c++\8.4.0\bits/std_function.h:297
  #11 0x4216c9a1 in std::function<void (void*, AsyncClient*, unsigned int, unsigned int)>::operator()(void*, AsyncClient*, unsigned int, unsigned int) const at c:\users\michael.soper\.platformio\packages\toolchain-xtensa-esp32s3\xtensa-esp32s3-elf\include\c++\8.4.0\bits/std_function.h:687
      (inlined by) AsyncClient::_sent(tcp_pcb*, unsigned short) at .pio/libdeps/apex_edge/Async TCP/src/AsyncTCP.cpp:973
  #12 0x4216cc05 in AsyncClient::_s_sent(void*, tcp_pcb*, unsigned short) at .pio/libdeps/apex_edge/Async TCP/src/AsyncTCP.cpp:1385
      (inlined by) _handle_async_event at .pio/libdeps/apex_edge/Async TCP/src/AsyncTCP.cpp:170
      (inlined by) _async_service_task at .pio/libdeps/apex_edge/Async TCP/src/AsyncTCP.cpp:199

ELF file SHA256: 03e348b959072f30
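
For readers unfamiliar with the "Bad tail" message in the log above: with heap poisoning enabled, ESP-IDF keeps a canary word right after each allocation and checks it on free. The sketch below is only a host-side simulation, not ESP-IDF code. It assumes a little-endian layout (as on the ESP32-S3), and `tailAfterOffByOne`, `storeLE32` and `loadLE32` are hypothetical helper names. It shows that a single zero byte written one position past the end of a buffer would turn 0xbaad5678 into the 0xbaad5600 seen in the log, which is one plausible cause among others.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Canary value copied from the "Expected 0xbaad5678" part of the log.
static const uint32_t kTailCanary = 0xbaad5678u;

// Store a 32-bit word in little-endian byte order (lowest byte first).
static void storeLE32(uint8_t* p, uint32_t v) {
  for (int i = 0; i < 4; ++i) p[i] = static_cast<uint8_t>(v >> (8 * i));
}

// Load a 32-bit word stored in little-endian byte order.
static uint32_t loadLE32(const uint8_t* p) {
  uint32_t v = 0;
  for (int i = 0; i < 4; ++i) v |= static_cast<uint32_t>(p[i]) << (8 * i);
  return v;
}

// Simulates a `size`-byte user buffer followed by a tail canary, then
// performs an off-by-one write of a NUL byte just past the user region.
// Returns the canary value as the allocator would read it back on free.
uint32_t tailAfterOffByOne(std::size_t size) {
  std::vector<uint8_t> block(size + 4);          // user bytes + simulated canary
  storeLE32(block.data() + size, kTailCanary);   // canary sits right after the buffer
  std::memset(block.data(), 'A', size);          // fill the whole user region
  block[size] = '\0';                            // the bug: terminator lands on the canary
  return loadLE32(block.data() + size);
}
```

Calling `tailAfterOffByOne(16)` returns 0xbaad5600: only the lowest-addressed canary byte is clobbered, which is the pattern a string terminator or a length that is off by one would leave behind.
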

mathieucarbou commented 1 month ago

You have a corrupted heap. This typically happens when something accesses a pointer to an object that has already been freed by another task.

If you think there is a bug in this library, you need to check your code and work on a minimal reproducible case that does not involve your app code.

There have been so many performance and memory improvements that running faster might also expose issues you never saw with the ESPHome fork. Also, this fork uses shared pointers: the queue is a list of websocket messages, and each websocket message holds a buffer, which is a shared pointer to a vector of bytes. This solves a number of memory problems that occurred in the original fork when optimizing websocket buffer sending and cleanup.

I will keep this issue open for now, but looking at your stack trace and error, I am pretty sure that you have concurrent access to a pointer that is going out of scope.

When the message is removed from the queue, it is freed and its destructor is called. The shared pointer sees that there are no more owners, so it allows the destruction of the vector, which fails.

So maybe you had a vector that was created without a shared pointer and added to the WS queue, and then it went out of scope?
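
The ownership chain described above can be sketched on a host machine as follows. This is a simplified illustration, not the library's code: `Message`, `clearQueue`, `Outcome` and `popFinished` are hypothetical stand-ins, and only the loop shape of `clearQueue` mirrors the `_clearQueue()` snippet quoted in this thread. A `std::weak_ptr` is used purely to observe whether the vector is still alive.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <memory>
#include <utility>
#include <vector>

using Bytes = std::vector<uint8_t>;

// Hypothetical stand-in for a queued websocket message: it co-owns its payload.
struct Message {
  std::shared_ptr<Bytes> buffer;
  bool done;
  Message(std::shared_ptr<Bytes> b, bool d) : buffer(std::move(b)), done(d) {}
  bool finished() const { return done; }
};

// Same loop shape as the fork's _clearQueue(): pop finished messages from the front.
void clearQueue(std::deque<Message>& queue) {
  while (!queue.empty() && queue.front().finished())
    queue.pop_front();
}

struct Outcome {
  long ownersWhileQueued;  // shared_ptr owner count while the message is queued
  bool aliveAfterPop;      // whether the vector still exists after the pop
};

// Queues one finished message and pops it, optionally keeping a second owner.
Outcome popFinished(bool keepExtraOwner) {
  auto buffer = std::make_shared<Bytes>(std::size_t{16}, uint8_t{0x2a});
  std::weak_ptr<Bytes> watcher = buffer;  // observes the vector without owning it
  std::deque<Message> queue;
  if (keepExtraOwner)
    queue.emplace_back(buffer, true);             // local `buffer` stays an owner
  else
    queue.emplace_back(std::move(buffer), true);  // the queue is the only owner
  long owners = watcher.use_count();
  clearQueue(queue);                              // destroys the Message
  return Outcome{owners, !watcher.expired()};
}
```

With `popFinished(false)` the queue holds the last owner, so popping the message destroys the vector at that point, which is where the crash in the stack trace surfaces. With `popFinished(true)` the vector survives the pop because another owner remains. In both cases the free itself is well defined, so a failure there suggests the memory was damaged earlier by something else.
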

Also, you can have a look at the recommended settings to tweak AsyncTCP and ESPAsyncWebServer (flags) in the readme, and you can also post a link to your GitHub project so that we can check your code.

mathieucarbou commented 1 month ago

About what you saw:

this fork:

void AsyncWebSocketClient::_clearQueue() {
  while (!_messageQueue.empty() && _messageQueue.front().finished())
    _messageQueue.pop_front();
}

esphome:

void AsyncWebSocket::_cleanBuffers()
{
  AsyncWebLockGuard l(_lock);

  for(AsyncWebSocketMessageBuffer * c: _buffers){
    if(c && c->canDelete()){
        _buffers.remove(c);
    }
  }
}

They do the same thing, except on different data structures.
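
To make the comparison concrete, here is a small host-side sketch of the two cleanup styles. It is an assumption-laden illustration rather than either library's code: `Item`, `clearFront` and `cleanAll` are hypothetical names, and a `std::list` stands in for whatever container the other fork uses. Both release entries that are no longer needed; the sketch only shows that a front-only loop stops at the first unfinished entry, while a full scan does not.

```cpp
#include <cstddef>
#include <deque>
#include <list>

// Hypothetical queue entry: `releasable` plays the role of finished() / canDelete().
struct Item {
  int id;
  bool releasable;
};

// Front-only style: drop releasable entries from the front and stop at the
// first entry that is still in use. Returns how many entries were dropped.
std::size_t clearFront(std::deque<Item>& q) {
  std::size_t dropped = 0;
  while (!q.empty() && q.front().releasable) {
    q.pop_front();
    ++dropped;
  }
  return dropped;
}

// Full-scan style: visit every entry and drop each releasable one,
// wherever it sits in the container. Returns how many entries were dropped.
std::size_t cleanAll(std::list<Item>& l) {
  const std::size_t before = l.size();
  l.remove_if([](const Item& i) { return i.releasable; });
  return before - l.size();
}
```

For entries `{1, true}, {2, false}, {3, true}`, `clearFront` drops one entry and `cleanAll` drops two. For a FIFO send queue where messages finish in order, the two styles end up releasing the same entries.
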

MikeSoperRubicon commented 1 month ago

Thank you very much for your fast response. I will work on a minimal reproducible code sample to see where I went wrong.

MikeSoperRubicon commented 1 month ago

The problem has been solved, thank you. It did indeed turn out to be a corrupted pointer. Thanks for your speedy assistance.