shogunxam / ESP32_BLETracker

ESP32 Bluetooth LE Tracker
MIT License
61 stars 13 forks source link

FHEM Mode exceptions #20

Closed lucky2k12 closed 3 years ago

lucky2k12 commented 3 years ago

Thank you @shogunxam for the nice firmware.

I compiled the newest git version with platform.io. Decvices I use:

I am trying to use the FHEM collectord mode, but I get lots of reboots: " 2021-02-17 18:30:28 BLETracker initialized
2021-02-17 18:30:27 Software reset due to exception/panic
2021-02-17 18:30:27 Connected to ...
2021-02-17 18:29:25 Software reset due to exception/panic
2021-02-17 18:29:25 Connected to ...
2021-02-17 18:28:11 Software reset due to exception/panic
2021-02-17 18:28:11 Connected to ...
2021-02-17 18:27:43 Software reset due to exception/panic

" Otherwise the funcion is good, but stability is not :(

How can I contribute to the debugging process?

shogunxam commented 3 years ago

Hi @lucky2k12 thankyou to give a try to my firmware. In order to debug the issue I need to have the call stack of the crash. You should connect the ESP32 to the PC and flash the debug variant of the firmware and monitor the serial console. When the crash occurs you will see on the console the backtrace. Unfortunately this is not enough, you can try to use this script EspArduinoExceptionDecoder to extract the callstack. Alternativelly:

Some tests you can do to help me to identify the problem are:

Best Regards

lucky2k12 commented 3 years ago

Thank yoou for your help!

I flashed the debug version but I don't get any messages in the console. What could I be doing wrong here?

Also I had already disabled the battery monitoring and reduced the list to one device. But I'm still getting a reboot each 1-5 min. In the Logs window when refreshing quite often a warning "Error Reading logs!!!". When pushing back and going to logs again, it's mostly ok. But sometimes the esp32 is completely dead and I have to power cycle.

One other thing: I get 2 warnings about "ESP32_BLETracker\include" and "packages}framework-arduinoespressif32\tools\sdk\include\nimble" not found. Otherwise the compile runs through, not sure what this means.

Sorry to waste your time, but I am somehow lost.

Edit: I managed to build/flash in virtualbox and now I can see what goes on in the console. This time the firmware seems much more stable too. I'll keep you posted, Thank you once again :)

shogunxam commented 3 years ago

Hi lucky2k12, about the warning related to ESP32_BLETracker\include I'm aware about it, PlatformIo search for that folder by default but I'm not using it. About the other related to 'nimble' I don't know why it's shown, I'm not using it at all, and this is the reason why it build without errors. Let me know about your discoveries.

lucky2k12 commented 3 years ago

Hi shogunxam, THX for the information, good to see that I can probably ignore those warnings.

I noticed that when the browser does not respond any more, the serial console continues to send messages, So it seems that the WiFi connection dies after some time, while the BLE connection ist still working. The Wifi-Signal is stable, since the module is only about 1m away from my Router.

I watched a tutorial video from Andreas Spiess at YT (#176) regarding Wifi and BLE on ESP32 modules. His recommendation was to switch on BLE or Wifi, but not both at the same time, since they use the same Antenna. Could that be the reason for my wifi problem? Also see his sketch here: https://github.com/SensorsIot/Bluetooth-BLE-on-Arduino-IDE/blob/master/BLE_Proximity_Sensor/BLE_Proximity_Sensor.ino

Sorry, I was not yet able to get a backtrace, I'm hopefully be able to look into that backtrace / XTENSA_GDB stuff this weekend,

shogunxam commented 3 years ago

Hi lucky2k12, yes on the ESP32 Wi-Fi and BT share the same channel. Espressif did some work to mitigate the problem in the ESP-IDF 4.2, currently the firmware is built against a version of framework-arduinoespressif32 using ESP-IDF 3.3. Anyway turn on and off the Wi-Fi and the BT is not an option, I'm using different thread working in parallel and they all expect to have a live Wi-Fi connection (the threads managing the collectord's connections, the WebServer ... ) in the example you posted the code involving the BT and the code involving the WiF-Fi are sequential so it' more easy turn on and off them when needed. Moreover turning off the Wi-Fi will lead to continues disconnection form the collectord. I'll try to do some tests using the framework-arduinoespressif32 using ESP-IDF 4.2. If you can provide the backtrace it would be great.

shogunxam commented 3 years ago

Another useful information could be your collectord's configuration. Which timeouts have you set in your configuration?

lucky2k12 commented 3 years ago

Thank you for the explanation.

I'm working on the installation of esp-idf. I got some backtrace copy/pasted from the serial log, but don't know how to examine it further.

I have the following timeout parameters set in collectord: presence_timeout=180 absence_timeout=20

shogunxam commented 3 years ago

I have integrated latest release of arduino-esp32 (v1.0.5-rc7). I have let run the firmware for 3 hours without any crash and it's still running. Some time the Web Page is not fully loaded because of some socket issue, in case you have to refresh the page. Give a try to this branch Integrate_arduino-esp32_1.0.5-rc7.

To examine the backtrace just copy in a file all the text after backtrace: , it's something like: 0x400e14ed:0x3ffb5030 0x400d0802:0x3ffb5050 then as I wrote in the previous comment you can use the EspArduinoExceptionDecoder (follow the instructions in the project page) or you can use the script (linux only) esp32-backtrace.sh present in my project ./esp32-backtrace.sh .pio/build/esp32dev-ble-fhem-debug/firmware.elf <your_backtrace_file> it will output something like;

BT-0: 0x400dbfd4 is in WiFiClient::read(unsigned char*, unsigned int) (/home/ryo/.platformio/packages/framework-arduinoespressif32@src-6b51b0b6f41eb69bf7b0ccc64ebe7f33/libraries/WiFi/src/WiFiClient.cpp:107).
107             if(!dst || !len || (_pos == _fill && !fillBuffer())){
BT-1: 0x401ea9be is in WiFiClient::read() (/home/ryo/.platformio/packages/framework-arduinoespressif32@src-6b51b0b6f41eb69bf7b0ccc64ebe7f33/libraries/WiFi/src/WiFiClient.cpp:345).
345         int res = read(&data, 1);
BT-2: 0x400ea435 is in Stream::timedRead() (/home/ryo/.platformio/packages/framework-arduinoespressif32@src-6b51b0b6f41eb69bf7b0ccc64ebe7f33/cores/esp32/Stream.cpp:36).
36              c = read();
BT-3: 0x400ea4cd is in Stream::readStringUntil(char) (/home/ryo/.platformio/packages/framework-arduinoespressif32@src-6b51b0b6f41eb69bf7b0ccc64ebe7f33/cores/esp32/Stream.cpp:330).
330         int c = timedRead();
BT-4: 0x400e320e is in WebServer::_parseRequest(WiFiClient&) (/home/ryo/.platformio/packages/framework-arduinoespressif32@src-6b51b0b6f41eb69bf7b0ccc64ebe7f33/libraries/WebServer/src/Parsing.cpp:70).
70        String req = client.readStringUntil('\r');
BT-5: 0x400e13c9 is in WebServer::handleClient() (/home/ryo/.platformio/packages/framework-arduinoespressif32@src-6b51b0b6f41eb69bf7b0ccc64ebe7f33/libraries/WebServer/src/WebServer.cpp:309).
309             if (_parseRequest(_currentClient)) {
BT-6: 0x400d2316 is in WebServerLoop(void*) (main/OTAWebServer.cpp:606).
606           server->handleClient();
BT-7: 0x4008ff3a is in vPortTaskWrapper (esp-idf-public/components/freertos/port.c:143).

Check only inside the script if the path to the debugger is correct XTENSA_GDB=~/.platformio/packages/toolchain-xtensa32/bin/xtensa-esp32-elf-gdb it should if you have done a default installation of PlatformIo (if you are not used to linux environment, ~/ it's your current user home folder)

EDIT: I found (and fixed) an issue in the FHEM flow causing the missing reconnection to the Wi-Fi in case the connection is lost.

lucky2k12 commented 3 years ago

Thank you, with that branch it is more stable, but I still get exception/panic after around 15min and the webserver is not very stable too. I had no success with esp-32-backtrace.sh, some problems with librarys ncurses. How could I send you the *.elf file?

Backtrace: 0x400dbe94:0x3ffde0f0 0x401eb846:0x3ffde130 0x400ea3a5:0x3ffde160 0x400ea43d:0x3ffde180 0x400e3112:0x3ffde1a0 0x400e12c5:0x3ffde3c0 0x400d2322:0x3ffde410 0x4009008a:0x3ffde430 Backtrace: 0x4008f55f:0x3ffe7580 0x4008f7dd:0x3ffe75a0 0x401b857f:0x3ffe75c0 0x401b85c6:0x3ffe75e0 0x401b7a35:0x3ffe7600 0x401b7780:0x3ffe7620 0x401b7eed:0x3ffe7640 0x400e45f7:0x3ffe7660 0x400e46de:0x3ffe7690 0x400e47cc:0x3ffe76d0 0x400e4a21:0x3ffe76f0 0x400e8135:0x3ffe7740 0x400e5e02:0x3ffe7830 0x400f64b1:0x3ffe7880 0x400f00b2:0x3ffe78a0 0x4009008a:0x3ffe78d0 Backtrace: 0x4008f55f:0x3ffe72b0 0x4008f7dd:0x3ffe72d0 0x401b857f:0x3ffe72f0 0x401b85c6:0x3ffe7310 0x401b7677:0x3ffe7330 0x401b778e:0x3ffe7350 0x400e80f2:0x3ffe7370 0x400e5e02:0x3ffe7460 0x400f64b1:0x3ffe74b0 0x400f00b2:0x3ffe74d0 0x4009008a:0x3ffe7500 Backtrace: 0x400dbe94:0x3ffde0d0 0x401eb846:0x3ffde110 0x400ea3a5:0x3ffde140 0x400ea43d:0x3ffde160 0x400e3112:0x3ffde180 0x400e12c5:0x3ffde3a0 0x400d2322:0x3ffde3f0 0x4009008a:0x3ffde410 Backtrace: 0x4008f55f:0x3ffe7680 0x4008f7dd:0x3ffe76a0 0x401b857f:0x3ffe76c0 0x401b85c6:0x3ffe76e0 0x401b7a35:0x3ffe7700 0x401b7780:0x3ffe7720 0x400e80f2:0x3ffe7740 0x400e5e02:0x3ffe7830 0x400f64b1:0x3ffe7880 0x400f00b2:0x3ffe78a0 0x4009008a:0x3ffe78d0 Backtrace: 0x4008f55f:0x3ffe7630 0x4008f7dd:0x3ffe7650 0x401b857f:0x3ffe7670 0x401b85c6:0x3ffe7690 0x401b7a35:0x3ffe76b0 0x401b7780:0x3ffe76d0 0x401b7eed:0x3ffe76f0 0x400e45f7:0x3ffe7710 0x400e82c5:0x3ffe7740 0x400e5e02:0x3ffe7830 0x400f64b1:0x3ffe7880 0x400f00b2:0x3ffe78a0 0x4009008a:0x3ffe78d0 Backtrace: 0x4008f55f:0x3ffe70b0 0x4008f7dd:0x3ffe70d0 0x401b857f:0x3ffe70f0 0x401b85c6:0x3ffe7110 0x401b7677:0x3ffe7130 0x401b778e:0x3ffe7150 0x400e80f2:0x3ffe7170 0x400e5e02:0x3ffe7260 0x400f64b1:0x3ffe72b0 0x400f00b2:0x3ffe72d0 0x4009008a:0x3ffe7300

shogunxam commented 3 years ago

I think you can attach it to the comment

lucky2k12 commented 3 years ago

finally i got something out of the debugger: BT-0: 0x4008f55f is in invoke_abort (esp-idf-public/components/esp32/panic.c:156). 156 esp-idf-public/components/esp32/panic.c: Datei oder Verzeichnis nicht gefunden. BT-1: 0x4008f7dd is in abort (esp-idf-public/components/esp32/panic.c:171). 171 esp-idf-public/components/esp32/panic.c: Datei oder Verzeichnis nicht gefunden. BT-2: 0x401b856f is in __cxxabiv1::__terminate(void (*)()) (/builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/eh_terminate.cc:47). 47 /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/eh_terminate.cc: Datei oder Verzeichnis nicht gefunden. BT-3: 0x401b85b6 is in std::terminate() (/builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/eh_terminate.cc:57). 57 /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/eh_terminate.cc: Datei oder Verzeichnis nicht gefunden. BT-4: 0x401b7667 is in __cxxabiv1::__cxa_throw(void*, std::type_info*, void (*)(void*)) (/builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/eh_throw.cc:87). 87 /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/eh_throw.cc: Datei oder Verzeichnis nicht gefunden. BT-5: 0x401b777e is in operator new(unsigned int) (/builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/new_op.cc:54). 54 /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/new_op.cc: Datei oder Verzeichnis nicht gefunden. BT-6: 0x400e80e2 is in BLEScan::handleGAPEvent(esp_gap_ble_cb_event_t, esp_ble_gap_cb_param_t*) (/home/ludwig/.platformio/packages/framework-arduinoespressif32/libraries/BLE/src/BLEScan.cpp:112). 112 BLEAdvertisedDevice *advertisedDevice = new BLEAdvertisedDevice(); BT-7: 0x400e5df2 is in BLEDevice::gapEventHandler(esp_gap_ble_cb_event_t, esp_ble_gap_cb_param_t*) (/home/ludwig/.platformio/packages/framework-arduinoespressif32/libraries/BLE/src/BLEDevice.cpp:270). 270 BLEDevice::getScan()->handleGAPEvent(event, param); BT-8: 0x400f64a1 is in btc_gap_ble_cb_handler (esp-idf-public/components/bt/bluedroid/btc/profile/std/gap/btc_gap_ble.c:54). 54 esp-idf-public/components/bt/bluedroid/btc/profile/std/gap/btc_gap_ble.c: Datei oder Verzeichnis nicht gefunden. BT-9: 0x400f00a2 is in btc_task (esp-idf-public/components/bt/common/btc/core/btc_task.c:163). 163 esp-idf-public/components/bt/common/btc/core/btc_task.c: Datei oder Verzeichnis nicht gefunden. BT-10: 0x4009008a is in vPortTaskWrapper (esp-idf-public/components/freertos/port.c:143). 143 esp-idf-public/components/freertos/port.c: Datei oder Verzeichnis nicht gefunden. sorry, formating looks awful. I don't want to attach the file public, because wifi data inside :(

shogunxam commented 3 years ago

This should be enough

lucky2k12 commented 3 years ago

This one looks a little bit different: T-0: 0x4008f55f is in invoke_abort (esp-idf-public/components/esp32/panic.c:156). 156 esp-idf-public/components/esp32/panic.c: Datei oder Verzeichnis nicht gefunden. BT-1: 0x4008f7dd is in abort (esp-idf-public/components/esp32/panic.c:171). 171 esp-idf-public/components/esp32/panic.c: Datei oder Verzeichnis nicht gefunden. BT-2: 0x401b856f is in __cxxabiv1::__terminate(void (*)()) (/builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/eh_terminate.cc:47). 47 /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/eh_terminate.cc: Datei oder Verzeichnis nicht gefunden. BT-3: 0x401b85b6 is in std::terminate() (/builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/eh_terminate.cc:57). 57 /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/eh_terminate.cc: Datei oder Verzeichnis nicht gefunden. BT-4: 0x401b7a25 is in __cxxabiv1::__cxa_allocate_exception(std::size_t) (/builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/eh_alloc.cc:268). 268 /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/eh_alloc.cc: Datei oder Verzeichnis nicht gefunden. BT-5: 0x401b7770 is in operator new(unsigned int) (/builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/new_op.cc:54). 54 /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/new_op.cc: Datei oder Verzeichnis nicht gefunden. BT-6: 0x401b7edd is in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_create(unsigned int&, unsigned int) (/builds/idf/crosstool-NG/.build/xtensa-esp32-elf/build/build-cc-gcc-final/xtensa-esp32-elf/libstdc++-v3/include/ext/new_allocator.h:104). 104 /builds/idf/crosstool-NG/.build/xtensa-esp32-elf/build/build-cc-gcc-final/xtensa-esp32-elf/libstdc++-v3/include/ext/new_allocator.h: Datei oder Verzeichnis nicht gefunden. BT-7: 0x400e45e7 is in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char*>(char*, char*, std::forward_iterator_tag) (/home/ludwig/.platformio/packages/toolchain-xtensa32/xtensa-esp32-elf/include/c++/5.2.0/bits/basic_string.tcc:223). 223 _M_data(_M_create(__dnew, size_type(0))); BT-8: 0x400e82b5 is in BLEScan::handleGAPEvent(esp_gap_ble_cb_event_t, esp_ble_gap_cb_param_t*) (/home/ludwig/.platformio/packages/toolchain-xtensa32/xtensa-esp32-elf/include/c++/5.2.0/bits/basic_string.h:195). 195 _M_construct(__beg, __end, _Tag()); BT-9: 0x400e5df2 is in BLEDevice::gapEventHandler(esp_gap_ble_cb_event_t, esp_ble_gap_cb_param_t*) (/home/ludwig/.platformio/packages/framework-arduinoespressif32/libraries/BLE/src/BLEDevice.cpp:270). 270 BLEDevice::getScan()->handleGAPEvent(event, param); BT-10: 0x400f64a1 is in btc_gap_ble_cb_handler (esp-idf-public/components/bt/bluedroid/btc/profile/std/gap/btc_gap_ble.c:54). 54 esp-idf-public/components/bt/bluedroid/btc/profile/std/gap/btc_gap_ble.c: Datei oder Verzeichnis nicht gefunden. BT-11: 0x400f00a2 is in btc_task (esp-idf-public/components/bt/common/btc/core/btc_task.c:163). 163 esp-idf-public/components/bt/common/btc/core/btc_task.c: Datei oder Verzeichnis nicht gefunden. BT-12: 0x4009008a is in vPortTaskWrapper (esp-idf-public/components/freertos/port.c:143). 143 esp-idf-public/components/freertos/port.c: Datei oder Verzeichnis nicht gefunden.

shogunxam commented 3 years ago

Looking to your backtrace seems you have finished the memory, there is no more memory to allocate the BLEAdvertisedDevice. Is this happening while you are playing with the WebServer? Or is it happening also if you don't do anything?

lucky2k12 commented 3 years ago

It's happening also when only watching.

lucky2k12 commented 3 years ago

This one just happened with closed browser BT-0: 0x4008f55f is in invoke_abort (esp-idf-public/components/esp32/panic.c:156). 156 esp-idf-public/components/esp32/panic.c: Datei oder Verzeichnis nicht gefunden. BT-1: 0x4008f7dd is in abort (esp-idf-public/components/esp32/panic.c:171). 171 esp-idf-public/components/esp32/panic.c: Datei oder Verzeichnis nicht gefunden. BT-2: 0x401b856f is in __cxxabiv1::__terminate(void (*)()) (/builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/eh_terminate.cc:47). 47 /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/eh_terminate.cc: Datei oder Verzeichnis nicht gefunden. BT-3: 0x401b85b6 is in std::terminate() (/builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/eh_terminate.cc:57). 57 /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/eh_terminate.cc: Datei oder Verzeichnis nicht gefunden. BT-4: 0x401b7667 is in __cxxabiv1::__cxa_throw(void*, std::type_info*, void (*)(void*)) (/builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/eh_throw.cc:87). 87 /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/eh_throw.cc: Datei oder Verzeichnis nicht gefunden. BT-5: 0x401b777e is in operator new(unsigned int) (/builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/new_op.cc:54). 54 /builds/idf/crosstool-NG/.build/src/gcc-5.2.0/libstdc++-v3/libsupc++/new_op.cc: Datei oder Verzeichnis nicht gefunden. BT-6: 0x400e80e2 is in BLEScan::handleGAPEvent(esp_gap_ble_cb_event_t, esp_ble_gap_cb_param_t*) (/home/ludwig/.platformio/packages/framework-arduinoespressif32/libraries/BLE/src/BLEScan.cpp:112). 112 BLEAdvertisedDevice *advertisedDevice = new BLEAdvertisedDevice(); BT-7: 0x400e5df2 is in BLEDevice::gapEventHandler(esp_gap_ble_cb_event_t, esp_ble_gap_cb_param_t*) (/home/ludwig/.platformio/packages/framework-arduinoespressif32/libraries/BLE/src/BLEDevice.cpp:270). 270 BLEDevice::getScan()->handleGAPEvent(event, param); BT-8: 0x400f64a1 is in btc_gap_ble_cb_handler (esp-idf-public/components/bt/bluedroid/btc/profile/std/gap/btc_gap_ble.c:54). 54 esp-idf-public/components/bt/bluedroid/btc/profile/std/gap/btc_gap_ble.c: Datei oder Verzeichnis nicht gefunden. BT-9: 0x400f00a2 is in btc_task (esp-idf-public/components/bt/common/btc/core/btc_task.c:163). 163 esp-idf-public/components/bt/common/btc/core/btc_task.c: Datei oder Verzeichnis nicht gefunden. BT-10: 0x4009008a is in vPortTaskWrapper (esp-idf-public/components/freertos/port.c:143). 143 esp-idf-public/components/freertos/port.c: Datei oder Verzeichnis nicht gefunden.

shogunxam commented 3 years ago

It's always the same problem some one try to allocate the memory but it's not enough. In the log on the console you should have also some prints with the available heap size could you check if the size is decreasing?. Meanwhile I'm trying to understand the root of the issue. EDIT: The memory seems to decrease dramatically after every publication on the client like the WiFiClient is not releasing it.

lucky2k12 commented 3 years ago

Yes, the heap space gets used up really fast, then there are some Client Disconnect Messages just before the restart.

`Free heap: 26068 Number device discovered: 0 New connection from 192.168.178.101:46926 7C:2F:80:94:A6:A7|20 on request (7C2F8094A6A7): absence;rssi=unreachable;daemon=BLETracker V3.0R

New connection from 192.168.178.101:46964 ED:E9:B6:8C:A5:9E|20 on request (EDE9B68CA59E): absenhable;daemon=BLETracker V3.0R

NTracker V3.0R

New connection from 192.168.178.101:46972 7C:2F:80:94:A6:3B|20 on request (7C2F8094A63B): absence;rssi=unreachable;daemon=BLETracker V3.0R

periodic report (7C2F8094A654): absence;rssi=unreachable;daemon=BLETracker V3.0R

INFO: Running mainloop Free heap: 1684 Number device discovered: 0 periodic report (D66DF0E97C0E): absence;rssi=unreachable;daemon=BLETracker V3.0R

Client Disconnect Client Disconnect Client Disconnect Client Disconnect Client Disconnect Client Disconnect

..............................................................Failed connecting to the network: timeout error!!! ets Jun 8 2016 00:22:57

rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT) configsip: 0, SPIWP:0xee clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00 mode:DIO, clock div:2 load:0x3fff0018,len:4 load:0x3fff001c,len:1100 load:0x40078000,len:10924 ho 0 tail 12 room 4 load:0x40080400,len:6360 entry 0x400806b4 INFO: Running setup SPIFFSLogger first_record 144, num_written_record 200, maxLogsNum 200

.-------------------- Connected to ... IP address: 192.168.178.129 Software reset via esp_restart WebServer Setup mDNS responder started INFO: Running mainloop Free heap: 35964 Number device discovered: 0 INFO: Running mainloop Free heap: 36516 Number device discovered: 0`

shogunxam commented 3 years ago

I have found the problem, the collectord generate several connections, usually one for each tracked device, and every time the socket timeout close the connection it reopen another one. From my side this means create a new WiFiClient and create a new thread to manage the connection. Each thread generate a memory consumption and at the end this lead to absence of memory causing the crash. I'm working in order to manage the connection in a static pool avoiding the memory allocation for each client and to handle all the connections sequentially in a single thread, this should mitigate the problem.

shogunxam commented 3 years ago

Hi lucky2k12, I made some changes in this branch fhem_client_pool. Could you give a try to it? I suspect there are still some issues but in 3rd party libraries I use. Let me know. Best regards.

lucky2k12 commented 3 years ago

Thank you for the quick update. First impression is that the heap space seems to be much more stable.

The webserver still seems to be unstable, do I have to erase memory somehow? Sometimes I even get the config-page for the MQTT configuration with my username and IP, which is not compiled into the binary, so I suspect it to linger around somewhere and maybe affecting stability.

At least I did not see some errors or reboots in the console until now.

shogunxam commented 3 years ago

MQTT is disabled. You see them in the log just because those informations are present in the settings and they are print to console when you acces some pages in example the configuration page. The maximum number of clients running at same time is set by default to 4. I have raised that number to 10 to allow to track at maximum 10 devices, but I'm not sure the esp32 can handle all those connections plus the webserver. I tried to open some pages while the fhem module was running and they worked, I haven't executed an heavy stress test. You could try to disable some modules from the configuration (persistent logs, webserver) to see if you get some improvements. Try to build also the firmware in release mode, and switch to debug only if you need to get logs and backtrace.

shogunxam commented 3 years ago

Hi lucky2k12, in your previous comment you wrote:

Sometimes I even get the config-page for the MQTT configuration with my username and IP, which is not compiled into the binary

This happens because by default the page is loaded with the MQTT form, when the page finishes to load a JS script is executed and it makes another request to the ESP32 to get the current configuration, in this configuration there is a flag to hide the MQTT form when the FHEM module is built. Because of some socket issue the connection is closed/interrupted and the configuration is not received by the browser so you will continue to see the MQTT form until you refresh the page. If you enable the developer tools of the browser you will see see the error in the browser console.

shogunxam commented 3 years ago

Hi @lucky2k12 , could you confirm the WebServer become stuck when you try to read the logs in the web panel?

lucky2k12 commented 3 years ago

Hi @shogunxam ,

Sorry for replying so late, I'm quite loaded with work. Thank you for the explanations, that sounds logical.

I am doing some tests with disabled OTA and logfile (and ENABLE_BLE_TRACKER_WHITELIST) in user_config.h But I was not able to reach long term stability until now. I suspect an NTP issue and config not saved properly, maybe I am doing something wrong. webserver still unstable after changing some config, no reload possible serial console reports apparently w/o issues

I will investigate further, but have limited time for this right now, sorry :(

shogunxam commented 3 years ago

Most of the crashes happens because a memory issue. I fear there is a memory fragmentation caused by WiFiClient module, so, after few time, there is no more a free memory block of the proper size to satisfy the allocation request. I have limited as much as possible all the dynamic allocation in my code but seems not to be enough.

lucky2k12 commented 3 years ago

After some days of running I can report that the serial console is still reporting BLE Devices and the free heap is between 12k-14k, also I observed no backtraces.

I can reach the device by ping, but the webserver seems to be dead and my FHEM doesn't receive messages as well.

If you have an idea how I could help you with the further debugging, please advise.

shogunxam commented 3 years ago

Hi, I have just pushed a new change in the branch stabilize_fhemmodule#20. I have removed the thread and now all the operations are synchronous. So now the firmware scan for the device and when the scan is finished it handle the FHEM's clients. This means collectord has to wait for the duration of scan before get a response. I'm not an experienced FHEM user (I have just created a VM on the fly to test it), so I hope this is not an issue. Could you please give a try and let me know if the situation now is better?

lucky2k12 commented 3 years ago

Thank you, I compiled the new branch and It ran fine for some minutes, then I changed the configuration scan-period to 20, hit "submit" and got a crash. But the change was persistent. The output in the serial console:

rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT) configsip: 0, SPIWP:0xee clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00 mode:DIO, clock div:2 load:0x3fff0018,len:4 load:0x3fff001c,len:1100 load:0x40078000,len:10924 ho 0 tail 12 room 4 load:0x40080400,len:6360 entry 0x400806b4 INFO: Running setup SPIFFSLogger first_record 22, num_written_record 200, maxLogsNum 200

.-------------------- Connected to ABC IP address: 192.168.178.129 Software reset via esp_restart WebServer Setup mDNS responder started INFO: Running mainloop Main loop Free heap: 34048 Number device discovered: 0 INFO: Device discovered, Address: D3524C0D8341 , RSSI: -64 INFO: Running mainloop Main loop Free heap: 33828 Number device discovered: 1 INFO: Tracked device discovered, Address: D3524C0D8341 , RSSI: -70 New connection from 192.168.178.101:38802 7C:2F:80:94:A6:54|20 on request (7C2F8094A654): absence;rssi=unreachable;daemon=BLETracker V3.0R

Edit: After the crash the webserver and the connection to FHEM was broken, but ping was working just as in the other fhem branch.

shogunxam commented 3 years ago

Thai is not a crash. When you change the configuration I have to reset the esp to make the changes effective. Edit:

New connection from 192.168.178.101:38802 7C:2F:80:94:A6:54|20 on request (7C2F8094A654):

Means it's connected to collectord

lucky2k12 commented 3 years ago

Ah, ok, thank you for the explanation :) Those FHEM Messeges dissapear after some time, and so does the FHEM Connection. I'm not shure why/when this happens. The webserver is still running this time, but there is an Error displayed in the logs section "Error Reading logs!!! Retring in 5 seconds..." and no logs displayed. I will experiment and watch, then inform you again about my findings.

shogunxam commented 3 years ago

The message

"Error Reading logs!!! Retring in 5 seconds..."

is shown because there is a mismatch between the size of the data send and the data received, I haven't yet understand the cause of the problem, it should depend from the WiFiClient ESP32-Arduino library implementation. I have managed it so that, as the message states, the page try to reload the data automatically after 5 seconds.

lucky2k12 commented 3 years ago

Ok, after 1.5h FHEM does not connect any more, maybe thats also a misconfiguration of my collectord.conf, I will have to dig further. I restarted the collectord already wo success :( But the webserver is still accessible, only the log refuses to show anything except the error message.

shogunxam commented 3 years ago

I'm running the firmware since 3h without crash and the webserver is up and running (I can see the logs as well), but from the fhem console the last update is one hour old. I fear waiting for the BLE completion causes some kind of disconnection (timeout perhaps) and after that fhem is not able to get the state, this was the main reason to use a separate thread for the activity related to the fhem, but this causes crashes. Unfortunately I have disabled the logs on the console, do you continue to see messages like the ones I listed below?

New connection from 192.168.178.101:38802
7C:2F:80:94:A6:54|20
on request (7C2F8094A654):

or

periodic report ....
lucky2k12 commented 3 years ago

no lepresenced packet for FHEM and no such messages since 13:22 as far as I can tell. webserver and serial console seem still fine, apart of the log problem.

lucky2k12 commented 3 years ago

Wait. I restarted fhem and now periodic reports are arising, but only of absent devices. The FHEM New connecitons are still missing and still no report since 13:22 from my present device. The webserver is also absent. Not shure if this has any connection to the other issue though.

maybe something happened here in the serial console, (could you introduce timestamps? I dont know when this happened.) NFO: Running mainloop Main loop Free heap: 21292 Number device discovered: 3 INFO: Tracked device discovered, Address: D3524C0D8341 , RSSI: -71 periodic report (7C2F8094A63B): absence;rssi=unreachable;daemon=BLETracker V3.0R

periodic report (7C2F8094A698): absence;rssi=unreachable;daemon=BLETracker V3.0R

INFO: Running mainloop Main loop Free heap: 21292 Number device discovered: 3 INFO: Tracked device discovered, Address: D3524C0D8341 , RSSI: -72 Client Disconnect 192.168.178.101:38924 for device 7C:2F:80:94:XX:XX Client Disconnect 192.168.178.101:38986 for device 7C:2F:80:94:YY:YY Client Disconnect 192.168.178.101:39072 for device FB:8F:C4:77:ZZZ:ZZ Client Disconnect 192.168.178.101:39126 for device 7C:2F:80:94:CC:CC Client Disconnect 192.168.178.101:39212 for device F8:78:AE:40:AA:AA Client Disconnect 192.168.178.101:39286 for device D6:6D:F0:E9:AA:AA INFO: Running mainloop Main loop Free heap: 33332 Number device discovered: 3 INFO: Tracked device discovered, Address: D3524C0D8341 , RSSI: -65 INFO: Running mainloop Main loop Free heap: 33364 Number device discovered: 3 INFO: Tracked device discovered, Address: D3524C0D8341 , RSSI: -67 INFO: Running mainloop Main loop Free heap: 33364 Number device discovered: 3 INFO: Tracked device discovered, Address: D3524C0D8341 , RSSI: -72 New connection from 192.168.178.101:46406 7C:2F:80:94:A6:98|20 on request (7C2F8094A698): absence;rssi=unreachable;daemon=BLETracker V3.0R

shogunxam commented 3 years ago

Have you tried to set the scan period lower than the timeout expected by FHEM?. Because now the flow is synchronous so if FHEM expects a connection in x seconds now for sure the esp cannot respond before y seconds where y is the scan period. Yes I will add time stamps, me too I need them 😃 otherwise I become mad. FYI you can dump the console to a text file, open a Platformio CLI and type

pio device monitor > log.txt

to see in on the screen open another bash and type

tail -f log.txt
shogunxam commented 3 years ago

After 1h and half I have your same situation, I got a new connection for a device I have already, and the old connection is closed but after i got a crash and esp stop to receive from the Wifi BT-0: 0x4008f593 is in invoke_abort (esp-idf-public/components/esp32/panic.c:156). BT-1: 0x4008f811 is in abort (esp-idf-public/components/esp32/panic.c:171). BT-2: 0x4008c6af is in lock_init_generic (esp-idf-public/components/newlib/locks.c:81). BT-3: 0x4008c6d5 is in lock_acquire_generic (esp-idf-public/components/newlib/locks.c:134). BT-4: 0x4008c829 is in _lock_acquire_recursive (esp-idf-public/components/newlib/locks.c:171). BT-5: 0x4017d496 is in _vfprintf_r (../../../.././newlib/libc/stdio/vfprintf.c:860). BT-6: 0x40180705 is in vprintf (../../../.././newlib/libc/stdio/vprintf.c:39). BT-7: 0x4008474d is in esp_log_writev (esp-idf-public/components/log/log.c:214). BT-8: 0x40174f21 is in esp_log_write (esp-idf-public/components/log/log.c:223). BT-9: 0x401630b6 is in _udp_recv (esp-idf-public/components/mdns/mdns_networking.c:123). BT-10: 0x4019427d is in udp_input (esp-idf-public/components/lwip/lwip/src/core/udp.c:401). BT-11: 0x4019791d is in ip4_input (esp-idf-public/components/lwip/lwip/src/core/ipv4/ip4.c:744). BT-12: 0x4019c362 is in ethernet_input (esp-idf-public/components/lwip/lwip/src/netif/ethernet.c:184). BT-13: 0x4018ac03 is in tcpip_thread (esp-idf-public/components/lwip/lwip/src/api/tcpip.c:135). BT-14: 0x400900a6 is in vPortTaskWrapper (esp-idf-public/components/freertos/port.c:143).

lucky2k12 commented 3 years ago

Have you tried to set the scan period lower than the timeout expected by FHEM?. Because now the flow is synchronous so if FHEM expects a connection in x seconds now for sure the esp cannot respond before y seconds where y is the scan period.

Yes, I had set it back to default (10s) yesterday, so this shold not be the problem.

Yes I will add time stamps, me too I need them 😃 otherwise I become mad. FYI you can dump the console to a text file, open a Platformio CLI and type pio device monitor > log.txt

Thank you for this hint, It will help with searching throgh the log. I'm happy to read that you can now also reproduce the connection loss after some time.

shogunxam commented 3 years ago

I made another change in the branch (https://github.com/shogunxam/ESP32_BLETracker/tree/stabilize_fhem_module_%2320) Now ESP is up and running since 4h and half, the WebServer is responsive and FHEM is receiving the updates as well. I tried to implement a progressive scan for the BLE devices in order to make the connections to FHEM more responsive. Now it scans for 1 second, then handle the connections and continue to scan and so on until the scan period is reached, than check for the batteries and the loop restart. I hope you don't loose some devices in this way. Let me know. I have added the timestamp to the logs, if your router acts as NTP server I suggest to replace the NTP_SERVER with it's address because more responsive, otherwise you have to wait for the pool.ntp.org sever response and some logs at the beginning can have the wrong time. Note, the right time, can be available only after the WiFi connection.

lucky2k12 commented 3 years ago

That sounds good, I flashed the new version, but now I'm getting this in the console:

st:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0018,len:4
load:0x3fff001c,len:1100
load:0x40078000,len:10924
ho 0 tail 12 room 4
load:0x40080400,len:6360
entry 0x400806b4
shogunxam commented 3 years ago

Yes I have removed the debug prints from the release variant if you want them build the debug variant. Crash will be anyway logged on the console (they are out of my control 😃)

lucky2k12 commented 3 years ago

Got it running again. First I was looking for brownouts, which I found on google, but I did not change anything in my environment. Then I discovered my user_config.h had been overwritten. Still new to that git stuff :P Put in credentials and good to go again :) I will keep you posted.

lucky2k12 commented 3 years ago

Hi @shogunxam , Thand you for the change to progressive scan; It seems to be a huge improvement:

I would be happy to contribute to the further debugging.

shogunxam commented 3 years ago

Thank you to have pointed out the typo, I'll fix it. I f you want contribute with your commits you have to:

Moreover, the part related to html stuff has to be preprocessed before commit, because project link a compressed version of the html pages ,.css and .js so you have to uncomment the following lines in platform.io

;extra_scripts =
;  pre:htmlcompress.py

this will run the script htmlcompress.py, it will compress the files and convert them into a byte array that can be included in the sources. So that you have to commit the *.gz.h as well

😃

lucky2k12 commented 3 years ago

Thank you for the quick tutorial! Indeed I managed to make my first pull-request this way. Only strange thing I don't yet understand that the config.js is not listed as changed.

Do I interpret it right that all files in the html directory get compressed before getting flashed to save some space?

shogunxam commented 3 years ago

Perhaps because I have fixed it before you made your PR 🤣 . The gz.h are changed because gzip introduce a time stamp. Any way I will accept your PR. Thanks to contributing.

lucky2k12 commented 3 years ago

Thank you for accepting my PR and the explanation :) No issues any more (so far) with the FHEM connection. I'm tracking 5 MiBands and 5 GTags now. Should I close this issue and open another one for the webserver crashes?

My observation: When I add 5 new mac entrys and hit Submit, the esp restarts, but the webserver does not come back. When I add 1-2 new mac addresses it worked most of the time. It seems not to be possible to copy/paste mac values (at least with contained : and editing afterwards) and just hit enter to add them to the list. you have to click on add.

I see in the sourcecode of OTAWebServer::postUpdateConfig() that mqtt*, scanperiod and whiteList gets handled, not shure how the mac list gets saved though. Is this in OTAWebServer::getSysInfoData()?

Also when playing with logs I had some instability of the web server. It never came back without power cycling the device.

shogunxam commented 3 years ago

Fine, I merge the PR into develop for the next release and I close this issue. Yes please open another issue for the webserver giving as much details as possible and how I can try to reproduce it.

It seems not to be possible to copy/paste mac values (at least with contained : and editing afterwards) and just hit enter to add them to the list. you have to click on add.

Yes this is to avoid to insert invalid format (invalid at least for my code 😝)

I see in the sourcecode of OTAWebServer::postUpdateConfig() that mqtt*, scanperiod and whiteList gets handled, not shure how the mac list gets saved though. Is this in OTAWebServer::getSysInfoData()?

OTAWebServer::postUpdateConfig() is called when you save the configuration, MAC addresses are passed as key of a boolean parameter i.e "AB12CD34EF562":"true" in this way with one line I get both the MAC and the 'Read-Battery' value, this is the piece of code

    else //other are mac address with battery check in the form "AE13FCB45BAD":"true"
    {
      bool batteryCheck = server.arg(i) == "true";
      newSettings.AddDeviceToWhiteList(server.argName(i), batteryCheck);
    }