tbnobody / OpenDTU

Software for ESP32 to talk to Hoymiles/TSUN/Solenso Inverters
GNU General Public License v2.0
1.82k stars 508 forks source link

Crash when loading app.js #1705

Closed ahanak closed 7 months ago

ahanak commented 9 months ago

What happened?

I have an olimex esp32 poe board. When I access the websites over ethernet, the board crashes.

Accessing the web-interface with a browser only shows the title "OpenDTU" in the browser-tab and an empty page. Further investigation with firefox debugging tools shows me that the html is downloaded but the request to app.js times out.

To Reproduce Bug

  1. Select env:olimex_esp32_poe
  2. Flash the firmware using vscode
  3. Open Serial Console
  4. Run curl --output - http://[ETHERNET_IP]/js/app.js

What I see is some binary output (what i do not expect). The device reboots and later on the connection times out: curl: (56) Recv failure: Die Verbindung wurde vom Kommunikationspartner zurückgesetzt

Expected Behavior

I would expect the device to show the webinterface also when accessed via Ethernet.

Install Method

Self-Compiled

What git-hash/version of OpenDTU?

21ec72f4c00fab1385b1fb93780b52cd6d4e4ff7

Relevant log/trace output

20:55:15.614 > E (117779) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
20:55:15.622 > E (117779) task_wdt:  - async_tcp (CPU 0/1)
20:55:15.625 > E (117779) task_wdt: Tasks currently running:
20:55:15.630 > E (117779) task_wdt: CPU 0: IDLE
20:55:15.633 > E (117779) task_wdt: CPU 1: loopTask
20:55:15.636 > E (117779) task_wdt: Aborting.
20:55:15.639 > 
20:55:15.639 > abort() was called at PC 0x4011a5cc on core 0
20:55:15.644 > 
20:55:15.644 > 
20:55:15.644 > Backtrace: 0x40083a41:0x3ffbec6c |<-CORRUPTED
20:55:15.710 > 
20:55:15.710 >   #0  0x40083a41:0x3ffbec6c in panic_abort at /Users/ficeto/Desktop/ESP32/ESP32S2/esp-idf-public/components/esp_system/panic.c:408
20:55:15.710 > 
20:55:15.710 > 
20:55:15.710 > 
20:55:15.710 > 
20:55:15.710 > ELF file SHA256: e2aa60b703a4c3f0
20:55:15.710 > 
20:55:15.710 > E (5452) esp_core_dump_flash: Core dump flash config is corrupted! CRC=0x7bd5c66f instead of 0x0
20:55:15.710 > Rebooting...
20:55:15.710 > ets Jul 29 2019 12:21:46
20:55:15.710 > 
20:55:15.710 > rst:0xc (SW_CPU_RESET),boot:0x1b (SPI_FAST_FLASH_BOOT)
20:55:15.710 > configsip: 0, SPIWP:0xee
20:55:15.710 > clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
20:55:15.710 > mode:DIO, clock div:2
20:55:15.710 > load:0x3fff0030,len:1184
20:55:15.710 > load:0x40078000,len:13232
20:55:15.710 > load:0x40080400,len:3028
20:55:15.710 > entry 0x400805e4
20:55:16.327 > E (664) esp_core_dump_flash: No core dum����ѥѥ���found!
20:55:16.330 > E (664) esp_core_dump_flash: No core dump partition found!
20:55:16.357 > 
20:55:16.358 > Starting OpenDTU
20:55:16.358 > Initialize FS... done
20:55:16.360 > Reading configuration... done
20:55:16.493 > Reading PinMapping... [   174][E][vfs_api.cpp:105] open(): /littlefs/pin_mapping.json does not exist, no permits for creation
20:55:16.504 > using default config done
20:55:16.508 > Initialize Network... ETH start
20:55:18.344 > ETH connected
20:55:18.394 > done
20:55:18.395 > Setting Hostname... Initialize NTP... done
20:55:18.397 > Initialize SunPosition... done
20:55:18.400 > Initialize MqTT... done
20:55:18.402 > Initialize WebApi... done
20:55:18.409 > Initialize Display... done
20:55:18.412 > Initialize LEDs... done
20:55:18.412 > Check for default DTU serial... done
20:55:18.417 > Initialize Hoymiles interface... NRF: Connection successful
20:55:18.423 >   Setting radio PA level... 
20:55:18.426 >   Setting DTU serial... 
20:55:18.429 >   Setting poll interval... 
20:55:18.432 > done
20:55:18.432 > Switch to Ethernet mode
20:55:18.434 > Configuring Ethernet DHCP IP... done
20:55:18.442 > Setting Hostname... done
20:55:18.949 > ETH got IP: [ETHERNET_IP]
20:55:18.949 > Network connected

Anything else?

tbnobody commented 9 months ago

The interesting part comes before:

20:55:15.614 > E (117779) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
20:55:15.622 > E (117779) task_wdt:  - async_tcp (CPU 0/1)
20:55:15.625 > E (117779) task_wdt: Tasks currently running:
20:55:15.630 > E (117779) task_wdt: CPU 0: IDLE
20:55:15.633 > E (117779) task_wdt: CPU 1: loopTask
20:55:15.636 > E (117779) task_wdt: Aborting.

And it seems that it crashes in the async_tcp task which is a 3rd party library. 20:55:15.622 > E (117779) task_wdt: - async_tcp (CPU 0/1)

I also tested it with my Olimex board and an NRF24... Works perfectly without problems.

ahanak commented 9 months ago

There is nothing "before" this. I repeated the curl command multiple times, but I just get a loop of what I posted.

Now that I've loaded my config using WIFI, the log only shows additional lines for TX and missing RX. (I think because it's dark outside):

1:54:11.387 > RX Period End
21:54:11.387 > All missing
21:54:11.387 > Nothing received, resend whole request
21:54:11.393 > TX AlarmData Channel: 3 --> XXX
21:54:12.194 > RX Period End
21:54:12.194 > All missing
21:54:12.194 > Nothing received, resend count exeeded
21:54:12.194 > Fetch inverter: XXX
21:54:14.738 > E (104564) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
21:54:14.747 > E (104564) task_wdt:  - async_tcp (CPU 0/1)
21:54:14.749 > E (104564) task_wdt: Tasks currently running:
21:54:14.755 > E (104564) task_wdt: CPU 0: IDLE
21:54:14.758 > E (104564) task_wdt: CPU 1: IDLE
21:54:14.760 > E (104564) task_wdt: Aborting.

Now I've also tried a brand-new Olimex board that I had lying around with nothing connected (no NRF24). The error is the same.

Maybe I've messed up something when compiling it myself? Is there maybe a binary that I can try?

I have to say that: This project is really great work :1st_place_medal:. I know so many people using it and I became really addicted analyzing graphs generated from the data.

tbnobody commented 9 months ago

I am just using the pre-compiled binary from github (generic_esp32.bin) and the device profile called olimex_esp32_poe.json

Please see this manual

vaterlangen commented 9 months ago

Same happens with my Olimex ESP32 POE board (CMT connected to GPIO 3/4/32/33/34/35 with multi uF capacitor in supplies) using generic v24.1.26 pre-compiled binary. As it is a little bit tricky to reach, I cannot connect directly to it - so no access to HW console

BluGen commented 9 months ago

I struggled with the same issue. It turned out the image got corrupted during flashing. Tested self compiled image as well as the released ones. You can fix your board by erasing the flash memory and reflashing it. Described here: Flash ESP32 Before erasing, I needed to jumper GPIO0 to GND. I used some tweezers. But be carefull, 3.3V and 5V pads are right next to it.

What caused it? My guess is a noise USB Connection. Hope this helps you out

BluGen commented 9 months ago

I have to revise my last answer. In the past two weeks I tinkered a lot with OpenDTU and an Olimex ESP32-POE with WROVER module and managed to reproduce the exact same issue. openDTU works fine if connected via WIFI, but crashes if webinterface is opened. A continuous ping command is handled correctly. The strange part is, if javascript is disabled, the facicon.ico and facicon.png will load fine and no crash on ESP side.

Aside from this, there are two versions of the Olimex ESP32-PoE, one with ESP32 WROOM chip and one with ESP32 WROVER chip. The later one uses a slightly different pin-out, GPIO16 and GPIO17 are used internally for the PSRAM. GPIO0 will enable the ethernet chip. This can be fixed in the pin_mapping.json with clk_mode = 1 in the eth section.

I will see if I manage to get JTAG debugging working and then share further details.

evanjarrett commented 9 months ago

I ran into this issue today as well. When using esp32-poe-wroom, powered and networked through the ethernet, and reaching the web interface over a device also connected to ethernet on my network, the device crashes and reboots.

Accessing the webpage over any device using wifi works fine, no crashes. (ie, laptop, phone)

My crash logs look the same as the one in the original report. ping always responds, until you try and reach the web interface either via curl or browser, then you can see opendtu go offline.

[1708122035.844975] 64 bytes from opendtu: icmp_seq=3 ttl=255 time=0.393 ms
[1708122036.868926] 64 bytes from opendtu: icmp_seq=4 ttl=255 time=0.406 ms
[1708122037.892933] 64 bytes from opendtu: icmp_seq=5 ttl=255 time=0.418 ms
[1708122045.060601] From ubuntu-server icmp_seq=9 Destination Host Unreachable
[1708122045.060613] From ubuntu-server icmp_seq=10 Destination Host Unreachable
[1708122045.060616] From ubuntu-server icmp_seq=11 Destination Host Unreachable
[1708122045.968229] 64 bytes from opendtu: icmp_seq=12 ttl=255 time=908 ms
[1708122046.061316] 64 bytes from opendtu: icmp_seq=13 ttl=255 time=0.803 ms

edit: Just for the record, accessing the API /api/livedata/status works fine from any device

BluGen commented 7 months ago

It took longer then expected, but I traced the reset of the device to the watchdog timer. Only when generating a large response with 100KB+ data, the reset occurs. I tested this by limiting the response size in WebApi_webapp.cpp for the "/js/app.js" ressource. This tells me the code in general works, but it takes to long to reach the point where the watchdog timer is reset. The thing I do not understand, how is the behavior different between wifi and ethernet. I dug through the code, but found no switch to a different lib for the external ethernet chip.

I am not familiar with the scheduler framework for the ESP32s realtime os, but is it possible to wrap the webapi ressource handler into a scheduler tasks so it will get interrupted near the end of the watchdog?

evanjarrett commented 7 months ago

Based on that research I've found a potential solution. I created a build that splits the app.js into vendor && app with vite's splitVendorChunkPlugin. This creates 2 files that are both under 100kb gzipped.

✨ [vite-plugin-compression]:algorithm=gzip - compressed file successfully: 
webapp_dist//home/data/OpenDTU/webapp_dist/index.html.gz     0.73kb / gzip: 0.40kb
webapp_dist//home/data/OpenDTU/webapp_dist/zones.json.gz     15.24kb / gzip: 3.96kb
webapp_dist//home/data/OpenDTU/webapp_dist/js/vendor.js.gz   259.89kb / gzip: 86.28kb
webapp_dist//home/data/OpenDTU/webapp_dist/js/app.js.gz      511.19kb / gzip: 90.51kb

For me, I am no longer getting any crashing when visiting the page. This still doesn't explain why there is a 100kb limit, and why it only happens when using strictly ethernet (as in both esp board AND pc are on the same eth network)

That said, the limit exists in AsyncTCP. So at best my recommendation for this repo would be to implement chunking to keep files under 100kb.


As a tangent, but something that didn't help me, You can turn off async_tcp's watchdog by passing in this build flag -DCONFIG_ASYNC_TCP_RUNNING_CORE=-1

But that just means that when async_tcp crashes, there is no rebooting. I was still getting the same problem, but would have to manually reset the board.

tbnobody commented 7 months ago

If there would be a limit of 100kb, why is it then working here without any problems on all modules (olimex, wt32-eth01, TTGO T-Internet-POE and T-ETH-Lite)

evanjarrett commented 7 months ago

The 100kb isn't a hard limit, it just seems that's about where I start running into issues.

I unfortunately don't know how to debug these esp boards. Something in the web libraries is locking up. I thought there was maybe a memory alloc issue but it should be pulling from progmem.

I have found I can sometimes get the app.js to load when throttling chrome (chromium) to "slow 3G" via network tools. Screenshot from 2024-03-25 10-29-57

In the meantime, splitting the web into chunks seems to help the problem (I might have had 1 crash on this build) My other solution would be to move the web interface completely off the hardware, and just make api calls to the esp32 to get the data

BluGen commented 7 months ago

There is no hardcoded 100kb limit. It was just a rough limit I noticed during testing, 100kb still worked, 150kb crashed. Could be somewhere in the middle. I did some further testing and tinkering, it crashes above 125kb (130kb is crashing).

Another point i noticed, it happens independently of the esp32's clock frequency. Tested with 240MHz, 80MHz and 10MHz. With 10MHz, it already crashes during boot with a watchdog timeout in the wifi setup routines. But that does not affect us. With 80 or 240Mz the size limit remains the same. That points to me, that the watchdog is directly tied to the cpu clock and not some external RTC clock.

For debugging I used a custom board profile for my olimex esp32 poe wrvoer model and set the core debug level to its highest value.

[env:olimex_esp32_poe_wrover] ; https://www.olimex.com/Products/IoT/ESP32/ESP32-POE/open-source-hardware build_type = debug board = esp32-poe ;board_build.f_cpu = 240000000L ;debug_tool = esp-usb-bridge ;upload_protocol = esp-usb-bridge build_flags = ${env.build_flags} -DHOYMILES_PIN_MISO=-1 -DHOYMILES_PIN_MOSI=-1 -DHOYMILES_PIN_SCLK=-1 -DHOYMILES_PIN_IRQ=-1 -DHOYMILES_PIN_CE=-1 -DHOYMILES_PIN_CS=-1 -DOPENDTU_ETHERNET -DETH_PHY_TYPE=ETH_PHY_LAN8720 -DETH_PHY_POWER=12 -DETH_PHY_MDC=23 -DETH_PHY_MDIO=18 -DETH_CLK_MODE=ETH_CLOCK_GPIO0_OUT -DBOARD_HAS_PSRAM -mfix-esp32-psram-cache-issue -DCORE_DEBUG_LEVEL=5

The professional way to debug the ESP32 would be by using the JTAG interface. You can even use a raspberry pi as jtag adapter. I found a guide for this here: https://blog.wokwi.com/gdb-debugging-esp32-using-raspberry-pi/

But unfortunately the Olimex ESP32 PoE boards are using one of the necessary pins to enable the ethernet chip. It would require a small modification to "fix" this.

Currently I'm digging through the AsyncTCP implementation, as the original watchdog timeout results from here. But that code is not so easy to understand than the WebApi code. Just a sidenote, I tried to run the response handler for the webapi inside a scheduler task. It worked, but did not solve the issue.

To me it is still a mystery why it works on some boards, but not on others. Your WT32-ETH01 and TTGO T-Internet-POE use the same network chip as the Olimex board. So it uses the same driver and network code too. I have tested it with your precompiled generic build as well as a self compiled version (for more logging output).

MartinHL commented 7 months ago

Just wanted to tell that I also ran into this issue. I'm using v24.3.22 on a ESP32-S (from AZ delivery) with a connected LAN8720 ETH Board. Accessing http://[IP-ADDRESS]/api/livedata/status works with no issue but accessing the "normal" Webpage results in a crash and a reboot. Needless to say that via Wifi it works without issues. I'm sorry that I'm not able to provide more valuable information...

vaterlangen commented 7 months ago

Same happens with my Olimex ESP32 POE board (CMT connected to GPIO 3/4/32/33/34/35 with multi uF capacitor in supplies) using generic v24.1.26 pre-compiled binary. As it is a little bit tricky to reach, I cannot connect directly to it - so no access to HW console

I am able to use the WebUI on my Android/iOS Phones (connected to LAN via WiFi AP) without any issues, but always end up with a DTU crash/restart when trying to access the WebUI on my Linux PC with wired LAN connection - OpenDTU is directly connected to wired LAN. I already compared the HTTP traffic and found no real difference between PC and Phone - maybe there is a difference in the TCP layer. I will check this next week.

Update: Becoming even more interesting when using a Linux Laptop. When connected via WiFi, everything works fine. When (same device, even same browser tab) connected via wired LAN, OpenDTU crashes when WebUI is forced reloaded. Further analysis of captured network traffic will be done later.

vaterlangen commented 7 months ago

Finally, I created dumps for wireless (~320Mb/s) [WiFi] and wired (10Mb/s, 100Mb/s and 1000Mb/s - set on the Laptop interface) [LAN] connection. Thereof, loading app.js worked for WiFi and 10M LAN - for 100M and 1G LAN result in OpenDTU crash/reboot. OpenDTU is running on Olimex ESP32 POE and is connected via the embedded 100M LAN port (switch also reports 100M connection speed).

Packet Size

The first noticeable difference is in the packet size sent by OpenDTU, which is much larger when using WiFi. Notice: 10M LAN suffers retransmissions, so packet count may be misleading.

packet_size Upper Left: WiFi | Lower Left: 1G LAN | Upper Right: 10M LAN | Lower Right: 100M LAN

Speed

The second one is - obviously - the packet rate round trip time between WiFi and LAN connection. Using 1G LAN, almost the full transfer (149 of 189 kB) is done in less than 100 ms, while the full transfer (189 kB) via WiFi takes over 4 seconds. Using 100M LAN seems to be slightly better for OpenDTU - maybe the slightly higher latency between laptop and switch. Notice: 10M LAN not shown due to retransmissions.

1G LAN (the flat line at 0,1 s marks the time of crash (TCP retransmission and RST take multiple seconds from there)

1g_seq 1g_rtt

100M LAN (the flat line at 0,1 s marks the time of crash (TCP retransmission and RST take multiple seconds from there)

100m_seq 100m_rtt

WiFi

wifi_seq wifi_rtt

Wrap Up

The smaller TCP packets may indicate that the network stack of OpenDTU is running at the performance limit when connected via 1G/100M LAN, as the stack/app seems unable to deliver more data to be sent within one packet. The higher latency when using WiFi connection seems to allow OpenDTU to prepare more data for the next packet - resulting in larger packet sizes. Maybe, the watchdog is only fed, when some watermark in the output buffer is reached? Or only, when processing returns - what never happens with low latency traffic? Both would result in a watchdog reset.

Update: Testing was done with v24.3.31

MartinHL commented 7 months ago

@vaterlangen: Thanks for your analysis. I can confirm this. Once I have my DTU connected to the same Gigabit Ethernet Switch as my workstation (Linux) I have these crashes when connecting to the DTU via Firefox browser. But if I connect from my mobile phone via WIFI it actually works. Tested with v24.3.22.

tbnobody commented 7 months ago

Could you try this? https://github.com/me-no-dev/AsyncTCP/issues/121

evanjarrett commented 7 months ago

Awesome that fix appears to work for me on latest master branch! Tried with 128, 256 and 512 all of those work for me. changing back to 32 causes the crash.

My platformio pulls in the esphome fork of AsyncTCP https://github.com/esphome/AsyncTCP But they also have the same 32 value in there

tbnobody commented 7 months ago

That sounds promising! I think the explanation is this: https://arduino.stackexchange.com/a/90131

Will implement a patch which will automatically applied before compiling.

Can maybe someone else verify that this change is working? (I cannot reproduce this with my ethernet boards)

BluGen commented 7 months ago

I can confirm this fix works for my Olimex ESP32-PoE Wrover Board too.

Tested this with the currently pulled in esphome fork of AsyncTCP.

@vaterlangen Thank you for your in depth analysis. I would never thought about looking at the packet size itself.

vaterlangen commented 7 months ago

Can confirm fix in v24.4.12 for my setup. Thanks, @tbnobody !

Now the request was served successfully via wired LAN in less than 200ms:

seq_fixed

MartinHL commented 6 months ago

Looks like the issue is present again with v24.4.24. I see daily reboots as my Cerbo GX keeps pinging the DTU constantly all the time. With version v24.4.12 I had a runtime of 12 days without reset until the update to v24.4.24. Anybody who can confirm this?

tbnobody commented 6 months ago

Are you using the generic or the generic_esp32 firmware?

MartinHL commented 6 months ago

I'm using generic (honestly I don't understand the difference to generic_esp32).

tbnobody commented 6 months ago

I'm using generic

But you are using the precompiled version right? Please try this one and look wether it fixes the problem.

generic.zip

honestly I don't understand the difference to generic_esp32

Did you read the documentation? https://www.opendtu.solar/firmware/howto/migrate_generic/ Whats unclear so far? How to write it better?

evanjarrett commented 6 months ago

In the latest update which updates the ESPAsyncWebServer version. AsyncTCP-esphome has been replaced with a new fork: https://github.com/mathieucarbou/ESPAsyncWebServer/commit/6670c2e4e413c4746bfbc54d8c1dc339d77f61f3#diff-64cc7cdde6789dc1c1c4a6f406cd1ff6a6a00027097788e2d52fc27b0d75502e

This fork has the new patch applied downstream, but with a slightly different variable name https://github.com/mathieucarbou/AsyncTCP/commit/de284cc5cc8a56839dfb3a49958e0ba33ab87408

Basically OpenDTU shouldn't need to patch, just need to change: -DCONFIG_ASYNC_TCP_EVENT_QUEUE_SIZE=128 with -DCONFIG_ASYNC_TCP_QUEUE_SIZE=128 in the build_flags.

tbnobody commented 6 months ago

Basically OpenDTU shouldn't need to patch, just need to change:

Thats what I did in the provided generic.bin mentioned above. Just want to doublecheck that its really working again

But @evanjarrett if you can confirm that this change fixes the problem I will include it in the next release.

evanjarrett commented 6 months ago

I'm having a hard time recreating this, I see the code changes, and my initial comment seems like the answer, but I can't recreate.

For me, the code runs fine using ESPAsyncWebServer 2.9.3 and 2.9.4 with the patch_apply and the the old CONFIG_ASYNC_TCP_EVENT_QUEUE_SIZE in place. As soon as I try and change the variable to what is in the new versions, OR remove the patch_apply script from the build, it breaks again.

I've full cleaned each time before rebuilding, and even tried wiping the flash on the olimex poe.

MartinHL commented 6 months ago

Tested it but the device still reboots after some hours. Bildschirmfoto vom 2024-04-30 21-37-30

BTW: Thanks for the link to the explanation generic vs. generic_esp32. This difference is now clear to me. Is the table "Overview of pre-compiled binaries" up to date? It is stated that opendtu-generic.bin is deprecated but it is still updated. Maybe a few words to why it is important which type of USB connection I have would help. I'm using a PCB with USB-serial converter. So should I use opendtu-generic_esp32s3.bin then? But I'm using generic.bin until now.

I'm using generic

But you are using the precompiled version right? Please try this one and look wether it fixes the problem.

generic.zip

github-actions[bot] commented 5 months ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new discussion or issue for related concerns.