stuartpittaway / diyBMSv4ESP32

diyBMS v4 code for the ESP32 and new controller hardware

WIFI Issue - random reboot when STA is lost (investigation) #276

Open stuartpittaway opened 5 months ago

stuartpittaway commented 5 months ago

This ticket is to investigate seemingly random reboots of the controller (often coinciding with loss of the WIFI STA connection) with the latest firmware version, Release-2023-12-27-12-02.

May be related to #239

stuartpittaway commented 5 months ago

Test on my development rig:

ESP32 connected to a WIFI hot spot on a mobile phone (Android). On boot-up, the controller reports the following (filtered for WIFI events only, with MQTT logging increased to DEBUG level):

D (6469) diybms: starting wifi_init_sta
I (6492) diybms: WIFI SSID: XXXXXXXXXXXXXXXX
I (6569) diybms: Hostname: DIYBMS-005CED90
D (6570) diybms: wifi_init_sta finished
I (13606) diybms: WIFI_EVENT_STA_START
D (13707) diybms: total_free_byte=156976 total_allocated_byte=132724 largest_free_blk=110580 min_free_byte=154632 alloc_blk=360 free_blk=5 total_blk=365
I (15392) diybms: WIFI_EVENT_STA_DISCONNECTED
I (15395) diybms: WIFI connect quick retry 1
I (17809) diybms: WIFI_EVENT_STA_DISCONNECTED
I (17812) diybms: WIFI connect quick retry 2
I (17941) diybms: WIFI_EVENT_STA_CONNECTED channel=11, rssi=-41
I (17966) diybms: IP ADDRESS HAS CHANGED
I (17969) diybms: Request time from time.google.com
I (17970) diybms: Timezone=UTC0DST
I (17971) diybms: The current date/time is: Thu Jan  1 00:00:10 1970
I (17996) diybms: You can access DIYBMS interface at http://DIYBMS-005CED90.local or http://192.168.1.87
W (18512) diybms-mqtt: MQTT enabled, but not yet init
W (19608) diybms-mqtt: MQTT enabled, but not yet init
I (43745) diybms-mqtt: MQTT counters: Err_Con=0,Err_Trans=0,Conn=0,Disc=0
I (43746) diybms-mqtt: esp_mqtt_client_init
I (43750) diybms-mqtt: esp_mqtt_client_start
I (44254) diybms-mqtt: MQTT_EVENT_CONNECTED
I (46619) diybms-mqtt: Rule status payload
D (46627) diybms-mqtt: Topic:emon/diybms2/rule, ID:0, Length:103
I (46628) diybms-mqtt: Outputs status payload
D (46634) diybms-mqtt: Topic:emon/diybms2/output, ID:0, Length:25
I (48542) diybms-mqtt: MQTT Payload for cell data

Data is successfully transmitted to MQTT server and web interface is working as expected.

Upon terminating the WIFI hot spot on the Android phone:

I (284130) diybms: WIFI_EVENT_STA_DISCONNECTED
E (284132) TRANSPORT_BASE: poll_read select error 113, errno = Software caused connection abort, fd = 51
E (284133) MQTT_CLIENT: Poll read error: 119, aborting connection
I (284140) diybms-mqtt: MQTT_EVENT_DISCONNECTED
I (284207) diybms-mqtt: MQTT counters: Err_Con=0,Err_Trans=0,Conn=1,Disc=1
I (284233) diybms-mqtt: Stopping MQTT client
W (286282) diybms-mqtt: MQTT enabled, but not connected
W (289710) diybms-mqtt: MQTT enabled, but not connected
W (291285) diybms-mqtt: MQTT enabled, but not connected
W (291285) diybms-mqtt: MQTT enabled, but not connected
W (291286) diybms-mqtt: MQTT enabled, but not connected
I (299155) diybms: WIFI connect quick retry 1
W (301288) diybms-mqtt: MQTT enabled, but not yet init
I (301569) diybms: WIFI_EVENT_STA_DISCONNECTED
I (301571) diybms: WIFI connect quick retry 2
I (301709) diybms-rules: Set error 2:ModuleCountMismatch
I (301710) diybms: Active errors=1
W (301711) diybms-mqtt: MQTT enabled, but not yet init
I (303985) diybms: WIFI_EVENT_STA_DISCONNECTED
I (303988) diybms: WIFI connect quick retry 3

** removed similar messages **

I (313650) diybms: WIFI_EVENT_STA_DISCONNECTED
I (313653) diybms: WIFI connect quick retry 7
I (313713) diybms-rules: Set error 2:ModuleCountMismatch
I (313714) diybms: Active errors=1
W (313715) diybms-mqtt: MQTT enabled, but not yet init
I (314266) diybms: Trying to connect WIFI
E (314267) wifi:sta is connecting, return error
ESP_ERROR_CHECK_WITHOUT_ABORT failed: esp_err_t 0x3007 (ESP_ERR_WIFI_CONN) at 0x4008ea0b
file: "src/main.cpp" line 4187
func: void loop()
expression: esp_wifi_connect()
I (316066) diybms: WIFI_EVENT_STA_DISCONNECTED

** removed similar messages **
I (554684) diybms: Trying to connect WIFI
I (436909) diybms: WIFI_EVENT_STA_DISCONNECTED
E (436910) diybms: Connect to WIFI AP failed, tried 28 times

Upon re-enabling the WIFI hot spot on the Android phone:

I (765022) diybms: Trying to connect WIFI
I (765122) diybms: WIFI_EVENT_STA_CONNECTED channel=11, rssi=-48
I (765150) diybms: IP ADDRESS HAS CHANGED
I (765150) diybms: Request time from time.google.com
I (765151) diybms: Timezone=UTC0DST
I (765152) diybms: The current date/time is: Tue Feb  6 11:56:59 2024
I (765174) diybms: You can access DIYBMS interface at http://DIYBMS-005CED90.local or http://192.168.1.87
I (795081) diybms-mqtt: MQTT counters: Err_Con=0,Err_Trans=0,Conn=0,Disc=0
I (795081) diybms-mqtt: esp_mqtt_client_init
I (795086) diybms-mqtt: esp_mqtt_client_start
I (795276) diybms-mqtt: MQTT_EVENT_CONNECTED

The code in the controller is designed to take the following actions when a loss of WIFI is detected (event WIFI_EVENT_STA_DISCONNECTED):

Quick reconnect attempts are made first, reported as "WIFI connect quick retry N".

After 25 failed attempts, the message reported is "Connect to WIFI AP failed, tried XXX times".

Once the 25 attempts have failed, esp_wifi_connect() is called inside the main loop, approx. every 30 seconds, reported as "Trying to connect WIFI".

As can be seen from the above logs, the development rig environment as described appears to work correctly and recovers from WIFI disconnection and errors successfully.
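
As a rough illustration of the retry behaviour described above, here is a minimal host-side model in plain C++. It is a sketch only and not the actual diyBMS code: the type and member names (WifiReconnectPolicy, onDisconnected, loopTick and so on) are hypothetical, and startConnect() is a stand-in for the point where firmware would call esp_wifi_connect(). The `connecting` guard in loopTick() is an assumption added here because the log above shows esp_wifi_connect() returning ESP_ERR_WIFI_CONN when called while the station is already connecting; the source does not say whether the firmware has such a guard.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the reconnect policy described above. The names
// are illustrative and are not taken from the diyBMS source.
enum class WifiAction { QuickRetry, GiveUp };

struct WifiReconnectPolicy {
    static constexpr int kMaxQuickRetries = 25;              // "After 25 times..."
    static constexpr uint32_t kSlowRetryIntervalMs = 30000;  // "approx. every 30 seconds"

    int quickRetries = 0;
    bool connected = false;
    bool connecting = false;        // true while a connect attempt is outstanding
    uint32_t lastSlowAttemptMs = 0;

    // Stand-in for the place where firmware would call esp_wifi_connect().
    void startConnect() { connecting = true; }

    // Call on WIFI_EVENT_STA_DISCONNECTED.
    WifiAction onDisconnected() {
        connected = false;
        connecting = false;  // the previous attempt (if any) has ended
        if (quickRetries < kMaxQuickRetries) {
            ++quickRetries;
            startConnect();  // "WIFI connect quick retry N"
            return WifiAction::QuickRetry;
        }
        return WifiAction::GiveUp;  // "Connect to WIFI AP failed, tried N times"
    }

    // Call on WIFI_EVENT_STA_CONNECTED.
    void onConnected() {
        connected = true;
        connecting = false;
        quickRetries = 0;
    }

    // Call from the main loop; returns true if a slow retry was started.
    bool loopTick(uint32_t nowMs) {
        assert(quickRetries <= kMaxQuickRetries);
        // Assumed guard: never start a connect while one is still outstanding.
        if (connected || connecting || quickRetries < kMaxQuickRetries) return false;
        // Unsigned subtraction stays correct if the millisecond counter wraps.
        if (nowMs - lastSlowAttemptMs < kSlowRetryIntervalMs) return false;
        lastSlowAttemptMs = nowMs;
        startConnect();  // "Trying to connect WIFI"
        return true;
    }
};
```

In this model the event handler only updates state and starts at most one attempt per disconnect event, while the main loop takes over once the quick retries are used up.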

stuartpittaway commented 5 months ago

Related to #220

stuartpittaway commented 5 months ago

OK, I managed to trigger a Guru Meditation error by repeatedly disabling the wifi hotspot and quickly re-enabling it.

I (1759137) diybms: WIFI connect quick retry 1
Guru Meditation Error: Core  0 panic'ed (LoadProhibited). Exception was unhandled.

Core  0 register dump:
PC      : 0x401b5f3e  PS      : 0x00060a30  A0      : 0x801b6023  A1      : 0x3ffd8f00
A2      : 0x3ffb62d4  A3      : 0xffffffff  A4      : 0x00000000  A5      : 0xffffffff  
A6      : 0x00000000  A7      : 0x3ffe3458  A8      : 0x3ffdae70  A9      : 0x3ffd8e70
A10     : 0x00000000  A11     : 0x00000001  A12     : 0x3ffe2928  A13     : 0x3ffe2928  
A14     : 0x3ffe3428  A15     : 0x3ffe3462  SAR     : 0x00000004  EXCCAUSE: 0x0000001c
EXCVADDR: 0x00000000  LBEG    : 0x4008c0e1  LEND    : 0x4008c0f1  LCOUNT  : 0xfffffffe  

Backtrace: 0x401b5f3b:0x3ffd8f00 0x401b6020:0x3ffd8f50

  #0  0x401b5f3b:0x3ffd8f00 in handler_execute at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_event/esp_event.c:139
      (inlined by) esp_event_loop_run at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_event/esp_event.c:590
  #1  0x401b6020:0x3ffd8f50 in esp_event_loop_run_task at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_event/esp_event.c:115 (discriminator 15)

stuartpittaway commented 5 months ago

Possible fix, experimental firmware: diybms_controller_firmware_experimental_bug276.zip

jetronic18s commented 5 months ago

Hello Stuart, I also noticed that the DIYBMS (Firmware 2023-11-28) was restarting. It seems to have restarted 3 times in a very short time. Unfortunately, I cannot yet say whether this is related to the WLAN. I will try to do tests with WLAN until the end of the week.

Screenshot_20240206_143232

Screenshot_20240206_143246

I could see from the uptime of the controller that it has really restarted.
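
For anyone who wants to check for restarts the same way without watching the screen, here is a small sketch of the idea in C++. It assumes uptime is sampled at regular intervals by some external means (how the samples are collected is not covered here), and it treats any drop in uptime between two consecutive samples as one reboot. The function name and the sample values are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts how often the uptime went backwards between consecutive samples.
// Each drop is taken to mean the controller restarted in between.
std::size_t countReboots(const std::vector<uint64_t>& uptimeSeconds) {
    std::size_t reboots = 0;
    for (std::size_t i = 1; i < uptimeSeconds.size(); ++i) {
        if (uptimeSeconds[i] < uptimeSeconds[i - 1]) {
            ++reboots;
        }
    }
    return reboots;
}

// Usage example with invented samples taken roughly once a minute.
void countRebootsExample() {
    const std::vector<uint64_t> samples = {3600, 3660, 12, 72, 5, 65, 125};
    assert(countReboots(samples) == 2);  // drops at 3660 -> 12 and 72 -> 5
}
```

Note the limitation: several restarts between two samples count as one, and a restart goes unnoticed if the new uptime has already passed the previous sample by the time of the next reading.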

stuartpittaway commented 5 months ago

> It seems to have restarted 3 times in a very short time.

It seems to trigger a reboot if the WIFI connection is lost and restored within a second or two, but it looks like a bug in the controller code (as expected!) so I'm hoping this version works as expected.

jetronic18s commented 5 months ago

A few days ago I also observed an internal BMS error, which is really strange. I have never seen such errors before. I have been using the system for over a year without ever seeing anything like this. It may be important for the analysis.

Screenshot_20240129_221621_nl victronenergy_edit_1054046487623017

red0909 commented 5 months ago

the experimental firmware does not start on my esp, black screen. tried with two different esp32 and two different computers

stuartpittaway commented 5 months ago

> the experimental firmware does not start on my esp, black screen. tried with two different esp32 and two different computers

This isn't a complete flash image. Re-flash the "release" version first, then use the over-the-air upgrade feature to apply this experimental one.

red0909 commented 5 months ago

ok now it is running. disconnected wifi several times, no reboot. now i need to wait some days and watch how my inverter behaves

red0909 commented 5 months ago

it has now been running for two days with no issues so far. but i noticed that the controller refuses to connect to a network with a hidden ssid. this was possible with the december firmware, but there the reconnect problem occurred even if the wifi ssid was not hidden.

if this is the trade-off for a stable running controller i can live with it, but maybe not all users can?

stuartpittaway commented 5 months ago

I've not made any changes to the wifi stack - so a hidden SSID shouldn't be a problem.

I've a log file from another user who has tested this firmware and unfortunately it didn't solve his reboot. He uses a Fritzbox which does appear to be a common problem with ESP32 hardware.

CONTROLLER - ver:cbe2f3314cf6ac9e3db3e1cdb27aa386e6facbcc compiled 2024-02-06T12:40:00.542Z
ESP32 Chip model = 1, Rev 1, Cores=2, Features=50

I (245621) diybms: WIFI_EVENT_STA_DISCONNECTED
I (245621) diybms: ShutdownAllNetworkServices
I (245621) diybms-web: httpd_stop
I (245722) diybms: stop mdns
I (245734) diybms: WIFI connect quick retry 1
Guru Meditation Error: Core  0 panic'ed (LoadProhibited). Exception was unhandled.

Core  0 register dump:
PC      : 0x401b5f92  PS      : 0x00060030  A0      : 0x801b6077  A1      : 0x3ffd8da0  
A2      : 0x3ffb6328  A3      : 0xffffffff  A4      : 0x00000000  A5      : 0xffffffff  
A6      : 0x00000000  A7      : 0x3ffe2fc8  A8      : 0x3ffdad40  A9      : 0x3ffd8d10  
A10     : 0x00000000  A11     : 0x00000001  A12     : 0x3ffe2438  A13     : 0x3ffe2438  
A14     : 0x3ffe2f98  A15     : 0x3ffe2fd2  SAR     : 0x00000004  EXCCAUSE: 0x0000001c  
EXCVADDR: 0x00000000  LBEG    : 0x4008c0e1  LEND    : 0x4008c0f1  LCOUNT  : 0xfffffffe  

Backtrace: 0x401b5f8f:0x3ffd8da0 0x401b6074:0x3ffd8df0

which decodes as

0x401b5f92: handler_execute at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_event/esp_event.c:145
0x401b5f92: esp_event_loop_run at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_event/esp_event.c:590
0x401b5f8f: handler_execute at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_event/esp_event.c:139
0x401b5f8f: esp_event_loop_run at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_event/esp_event.c:590
0x401b6074: esp_event_loop_run_task at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_event/esp_event.c:115
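
For reference, the "Backtrace:" line in these dumps is a list of PC:SP pairs, and the PC values are what get decoded into file and line. Below is a minimal sketch of pulling those pairs out of the line, written in plain C++ for use on a PC. The names (Frame, parseBacktrace, isNullPointerLoad) are hypothetical helpers, not part of ESP-IDF or diyBMS. The check at the end only restates what both dumps show: EXCCAUSE 0x1c is reported as LoadProhibited, and EXCVADDR 0x00000000 means the load was from address zero, which points to a null pointer read.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

struct Frame {
    uint32_t pc;  // program counter of the frame
    uint32_t sp;  // stack pointer of the frame
};

// Parses a line such as "Backtrace: 0x401b5f8f:0x3ffd8da0 0x401b6074:0x3ffd8df0".
// Tokens that are not of the form 0xPC:0xSP (including the label) are skipped.
std::vector<Frame> parseBacktrace(const std::string& line) {
    std::vector<Frame> frames;
    std::istringstream in(line);
    std::string token;
    while (in >> token) {
        const std::size_t colon = token.find(':');
        if (colon == std::string::npos || colon + 1 >= token.size()) continue;
        if (token.compare(0, 2, "0x") != 0) continue;
        const auto pc = static_cast<uint32_t>(std::stoul(token.substr(0, colon), nullptr, 16));
        const auto sp = static_cast<uint32_t>(std::stoul(token.substr(colon + 1), nullptr, 16));
        frames.push_back({pc, sp});
    }
    return frames;
}

// EXCCAUSE 0x1c is the value printed alongside LoadProhibited in the dumps above.
constexpr uint32_t kExcCauseLoadProhibited = 0x1c;

// True when the dump indicates a load from address zero.
bool isNullPointerLoad(uint32_t excCause, uint32_t excVaddr) {
    return excCause == kExcCauseLoadProhibited && excVaddr == 0;
}

// Usage example with the values from the dump above.
void backtraceExample() {
    const auto frames = parseBacktrace("Backtrace: 0x401b5f8f:0x3ffd8da0 0x401b6074:0x3ffd8df0");
    assert(frames.size() == 2);
    assert(frames[0].pc == 0x401b5f8f);
    assert(frames[1].sp == 0x3ffd8df0);
    assert(isNullPointerLoad(0x1c, 0x00000000));
}
```

Each extracted pc can then be passed, together with the ELF file of the exact firmware build, to a tool such as xtensa-esp32-elf-addr2line to get a listing like the one above.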

N1c084 commented 5 months ago

Hi Stuart, do you think an ESP32 with an Ethernet port could solve the problem?

stuartpittaway commented 5 months ago

> Do you think an ESP32 with an Ethernet port could solve the problem?

No idea, I don't have one and it would also need significant code changes to make it work

HerrFrodo1 commented 5 months ago

YES! LAN is the solution!!! ;-)

red0909 commented 5 months ago

> I've a log file from another user who has tested this firmware and unfortunately it didn't solve his reboot. He uses a Fritzbox which does appear to be a common problem with ESP32 hardware.

has he tested this with another esp32?

well i dont have a fritzbox, but i also had to replace my wifi router because some esp32 would not connect to my previous one...

Linusten commented 5 months ago

Sadly i have no logs, but also a Fritz!Box and the same issues.

red0909 commented 5 months ago

> Sadly i have no logs, but also a Fritz!Box and the same issues.

try to make a wifi hotspot on your phone and connect to that. if it does not reboot, then the fritzbox is the issue

HerrFrodo1 commented 4 months ago

@red0909

> well i dont have a fritzbox, but i also had to replace my wifi router because some esp32 would not connect to my previous one...

Which other router did you buy?

HerrFrodo1 commented 4 months ago

@red0909

> try to make a wifi hotspot on your phone and connect to that. if it does not reboot, then the fritzbox is the issue

The hotspot on an iPhone is not the right way to test. It only shares the Internet with a connected WiFi subscriber. It does not create an internal network that can be accessed. Calling the web app seems to be a possible source of the problem, possibly in connection with MQTT. I tried it.

Yesterday I switched off the Fritz!Box WiFi and tested a TP-Link access point (TL-WR841N). The ESP32 still crashed. Interestingly, with the Fritz!WLAN Repeater the crashes usually occurred after the WiFi was switched off. With the TP-Link, the crashes now happen when the WiFi is turned on, and only after the web app is opened.

stuartpittaway commented 4 months ago

> It only shares the Internet with a connected WiFi subscriber. It does not create an internal network that can be accessed.

You can access the DIYBMS web interface directly from the phone web browser, when testing in this fashion.

red0909 commented 4 months ago

> Which other router did you buy?

dlink dsr-250n

it needed a tricky series of 5 fw updates to get to the newest fw, but this router is no longer supported and should not be used for internet access.

i use it for an offline network for my inverters, and this bms is offline too. i could use just an 8 port switch, but the diy bms requires wifi; it's the only device in my network using wifi. i dont trust wifi for critical devices. the diy bms needs a password too, or at least a simple 4-digit pin.

@stuartpittaway i disconnect wifi sometimes to see what happens. this experimental fw still running good, no reboots here. on a cheap fake esp32

HerrFrodo1 commented 4 months ago

@red0909

> ... i dont trust wifi for critical devices, ...

It's the same with me. diyBMS is the only device on my network without LAN :-( Our WiFi is switched off from 9 p.m. to 6 a.m. During that time the most important data from the diyBMS comes from the Victron Cerbo GX via the battery CAN bus.

stuartpittaway commented 4 months ago

> the diy bms needs a password too, or at least a simple 4-digit pin.

Security isn't really possible on this sort of device (ESP32), at least not without a full TLS encryption layer/certificates. Otherwise any sort of password or PIN is pointless, as it could be sniffed off the network.

red0909 commented 4 months ago

so 14 days now with experimental fw, no reboot no problems with the wifi.

stuartpittaway commented 3 months ago

Hi @red0909, it's been 3 weeks now, what's the feedback?

red0909 commented 3 months ago

no problems as far as i can see, but i am not using mqtt or home assistant. running stable, no reboot with a cheap esp32 module. canbus signal is stable too.

jetronic18s commented 2 months ago

Hello Stuart,

I installed a DIYBMS a few days ago. A controller board v4.5 on a 18s1p battery.

I have installed the last 4 official releases on the controller, and whenever the Fritzbox was rebooted or the wifi was turned off, the controller board restarted.

I then installed the beta "diybms_controller_firmware_experimental_bug276.zip" and the problem was gone. I must have restarted the Fritzbox 2-3 times without a problem.

Today the power was probably off for about 1h during the installation. So the Fritzbox was off and the controller board restarted. I was able to determine this through the uptime and also the undervoltage error (relay dropped out briefly).

The DIYBMS is connected to the router as follows (MESH is active): Fritzbox <--> Repeater 1750e <--> DIYBMS

Unfortunately I have no access to the serial console of the controller.

Nobody wants to hear that here, sorry.

red0909 commented 2 months ago

@jetronic18s what power supply do you have for the controller? have you measured the voltage at the controller screws? i think this is some sort of a power issue

jetronic18s commented 2 months ago

@red0909 I supply the controller via a DCDC (Mean Well DDR-30L-5) from the battery. I have exactly 5V in idle mode. In the event of a fault, I would not be able to measure the voltage.

stuartpittaway commented 2 months ago

> The DIYBMS is connected to the router as follows (MESH is active): Fritzbox <--> Repeater 1750e <--> DIYBMS

Yes, that appears to be the common pattern of failure - using Fritzbox along with mesh/repeater wifi units.

Very similar to this problem... https://github.com/arendst/Tasmota/discussions/14986