The interesting part comes before:
20:55:15.614 > E (117779) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
20:55:15.622 > E (117779) task_wdt: - async_tcp (CPU 0/1)
20:55:15.625 > E (117779) task_wdt: Tasks currently running:
20:55:15.630 > E (117779) task_wdt: CPU 0: IDLE
20:55:15.633 > E (117779) task_wdt: CPU 1: loopTask
20:55:15.636 > E (117779) task_wdt: Aborting.
And it seems that it crashes in the async_tcp task, which is a third-party library.
20:55:15.622 > E (117779) task_wdt: - async_tcp (CPU 0/1)
I also tested it with my Olimex board and an NRF24... Works perfectly without problems.
There is nothing "before" this. I repeated the curl
command multiple times, but I just get a loop of what I posted.
Now that I've loaded my config via WiFi, the log only shows additional lines for TX and missing RX (I think because it's dark outside):
21:54:11.387 > RX Period End
21:54:11.387 > All missing
21:54:11.387 > Nothing received, resend whole request
21:54:11.393 > TX AlarmData Channel: 3 --> XXX
21:54:12.194 > RX Period End
21:54:12.194 > All missing
21:54:12.194 > Nothing received, resend count exeeded
21:54:12.194 > Fetch inverter: XXX
21:54:14.738 > E (104564) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
21:54:14.747 > E (104564) task_wdt: - async_tcp (CPU 0/1)
21:54:14.749 > E (104564) task_wdt: Tasks currently running:
21:54:14.755 > E (104564) task_wdt: CPU 0: IDLE
21:54:14.758 > E (104564) task_wdt: CPU 1: IDLE
21:54:14.760 > E (104564) task_wdt: Aborting.
Now I've also tried a brand-new Olimex board that I had lying around with nothing connected (no NRF24). The error is the same.
Maybe I've messed up something when compiling it myself? Is there maybe a binary that I can try?
I have to say: this project is really great work :1st_place_medal:. I know so many people using it, and I've become really addicted to analyzing graphs generated from the data.
I am just using the pre-compiled binary from github (generic_esp32.bin) and the device profile called olimex_esp32_poe.json
Please see this manual
Same happens with my Olimex ESP32 POE board (CMT connected to GPIO 3/4/32/33/34/35 with multi-µF capacitors in the supplies) using the generic v24.1.26 pre-compiled binary. As it is a little bit tricky to reach, I cannot connect directly to it, so I have no access to the HW console.
I struggled with the same issue. It turned out the image got corrupted during flashing. I tested a self-compiled image as well as the released ones. You can fix your board by erasing the flash memory and reflashing it, as described here: Flash ESP32. Before erasing, I needed to jumper GPIO0 to GND; I used some tweezers. But be careful, the 3.3V and 5V pads are right next to it.
What caused it? My guess is a noisy USB connection. Hope this helps you out.
I have to revise my last answer. In the past two weeks I tinkered a lot with OpenDTU and an Olimex ESP32-POE with WROVER module and managed to reproduce the exact same issue. OpenDTU works fine if connected via WiFi, but crashes if the web interface is opened. A continuous ping command is handled correctly. The strange part is: if JavaScript is disabled, favicon.ico and favicon.png load fine and there is no crash on the ESP side.
Aside from this, there are two versions of the Olimex ESP32-PoE: one with an ESP32 WROOM chip and one with an ESP32 WROVER chip. The latter uses a slightly different pin-out; GPIO16 and GPIO17 are used internally for the PSRAM, and GPIO0 enables the ethernet chip. This can be fixed in the pin_mapping.json with clk_mode = 1 in the eth section.
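For illustration, such a pin_mapping.json entry for the WROVER variant could look roughly like this (a sketch only — the field names follow OpenDTU's pin-mapping scheme, but the profile name and the non-eth values here are assumptions to check against your board and the OpenDTU docs):

```json
[
  {
    "name": "olimex_esp32_poe_wrover",
    "eth": {
      "enabled": true,
      "phy_addr": 0,
      "power": 12,
      "mdc": 23,
      "mdio": 18,
      "type": 0,
      "clk_mode": 1
    }
  }
]
```

Here clk_mode = 1 selects the clock output on GPIO0 instead of the default clock input on GPIO0, which is what the WROVER board needs to enable the ethernet chip.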
I will see if I manage to get JTAG debugging working and then share further details.
I ran into this issue today as well. When using esp32-poe-wroom, powered and networked through the ethernet, and reaching the web interface over a device also connected to ethernet on my network, the device crashes and reboots.
Accessing the webpage over any device using wifi works fine, no crashes. (ie, laptop, phone)
My crash logs look the same as the one in the original report. ping always responds until you try to reach the web interface via curl or a browser; then you can see OpenDTU go offline.
[1708122035.844975] 64 bytes from opendtu: icmp_seq=3 ttl=255 time=0.393 ms
[1708122036.868926] 64 bytes from opendtu: icmp_seq=4 ttl=255 time=0.406 ms
[1708122037.892933] 64 bytes from opendtu: icmp_seq=5 ttl=255 time=0.418 ms
[1708122045.060601] From ubuntu-server icmp_seq=9 Destination Host Unreachable
[1708122045.060613] From ubuntu-server icmp_seq=10 Destination Host Unreachable
[1708122045.060616] From ubuntu-server icmp_seq=11 Destination Host Unreachable
[1708122045.968229] 64 bytes from opendtu: icmp_seq=12 ttl=255 time=908 ms
[1708122046.061316] 64 bytes from opendtu: icmp_seq=13 ttl=255 time=0.803 ms
edit: Just for the record, accessing the API /api/livedata/status works fine from any device.
It took longer than expected, but I traced the reset of the device to the watchdog timer. The reset only occurs when generating a large response with 100KB+ of data. I tested this by limiting the response size in WebApi_webapp.cpp for the "/js/app.js" resource. This tells me the code in general works, but it takes too long to reach the point where the watchdog timer is reset. The thing I do not understand is how the behavior differs between WiFi and ethernet. I dug through the code but found no switch to a different lib for the external ethernet chip.
I am not familiar with the scheduler framework of the ESP32's real-time OS, but is it possible to wrap the webapi resource handler in a scheduler task so it will get interrupted before the watchdog expires?
Based on that research I've found a potential solution.
I created a build that splits app.js into vendor and app chunks with vite's splitVendorChunkPlugin.
This creates 2 files that are both under 100kb gzipped.
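A minimal vite.config.js for this kind of split might look like the sketch below (illustrative only — it assumes a Vite version where splitVendorChunkPlugin is still exported; the plugin was deprecated in Vite 5 in favor of manualChunks):

```javascript
// vite.config.js - sketch of vendor/app chunk splitting
import { defineConfig, splitVendorChunkPlugin } from 'vite'

export default defineConfig({
  plugins: [
    // moves node_modules code into a separate vendor chunk, so the
    // emitted app chunk and vendor chunk each stay below the size
    // that triggers the watchdog on the ESP32
    splitVendorChunkPlugin(),
  ],
})
```

The same effect can be had on newer Vite versions via build.rollupOptions.output.manualChunks, if the plugin is no longer available.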
✨ [vite-plugin-compression]:algorithm=gzip - compressed file successfully:
webapp_dist//home/data/OpenDTU/webapp_dist/index.html.gz 0.73kb / gzip: 0.40kb
webapp_dist//home/data/OpenDTU/webapp_dist/zones.json.gz 15.24kb / gzip: 3.96kb
webapp_dist//home/data/OpenDTU/webapp_dist/js/vendor.js.gz 259.89kb / gzip: 86.28kb
webapp_dist//home/data/OpenDTU/webapp_dist/js/app.js.gz 511.19kb / gzip: 90.51kb
For me, I am no longer getting any crashes when visiting the page. This still doesn't explain why there is a 100kb limit, and why it only happens when strictly using ethernet (as in, both the ESP board AND the PC are on the same eth network).
That said, the limit exists in AsyncTCP. So at best my recommendation for this repo would be to implement chunking to keep files under 100kb.
As a tangent (though it didn't help me): you can turn off async_tcp's watchdog by passing in this build flag
-DCONFIG_ASYNC_TCP_RUNNING_CORE=-1
But that just means that when async_tcp crashes, there is no reboot. I was still getting the same problem, but would have to manually reset the board.
If there were a limit of 100kb, why does it work here without any problems on all modules (Olimex, WT32-ETH01, TTGO T-Internet-POE and T-ETH-Lite)?
The 100kb isn't a hard limit; it just seems that's about where I start running into issues.
I unfortunately don't know how to debug these ESP boards. Something in the web libraries is locking up. I thought there was maybe a memory allocation issue, but it should be pulling from PROGMEM.
I have found I can sometimes get the app.js to load when throttling Chrome (Chromium) to "slow 3G" via the network tools.
In the meantime, splitting the web app into chunks seems to help the problem (I might have had 1 crash on this build). My other solution would be to move the web interface completely off the hardware and just make API calls to the ESP32 to get the data.
There is no hardcoded 100kb limit. It was just a rough limit I noticed during testing: 100kb still worked, 150kb crashed, so the threshold could be somewhere in the middle. I did some further testing and tinkering; it crashes above 125kb (130kb is crashing).
Another point I noticed: it happens independently of the ESP32's clock frequency. Tested with 240MHz, 80MHz and 10MHz. With 10MHz, it already crashes during boot with a watchdog timeout in the WiFi setup routines, but that does not affect us. With 80 or 240MHz the size limit remains the same. That suggests to me that the watchdog is tied directly to the CPU clock and not to some external RTC clock.
For debugging I used a custom board profile for my Olimex ESP32 POE WROVER model and set the core debug level to its highest value.
[env:olimex_esp32_poe_wrover] ; https://www.olimex.com/Products/IoT/ESP32/ESP32-POE/open-source-hardware
build_type = debug
board = esp32-poe
;board_build.f_cpu = 240000000L
;debug_tool = esp-usb-bridge
;upload_protocol = esp-usb-bridge
build_flags = ${env.build_flags}
  -DHOYMILES_PIN_MISO=-1
  -DHOYMILES_PIN_MOSI=-1
  -DHOYMILES_PIN_SCLK=-1
  -DHOYMILES_PIN_IRQ=-1
  -DHOYMILES_PIN_CE=-1
  -DHOYMILES_PIN_CS=-1
  -DOPENDTU_ETHERNET
  -DETH_PHY_TYPE=ETH_PHY_LAN8720
  -DETH_PHY_POWER=12
  -DETH_PHY_MDC=23
  -DETH_PHY_MDIO=18
  -DETH_CLK_MODE=ETH_CLOCK_GPIO0_OUT
  -DBOARD_HAS_PSRAM
  -mfix-esp32-psram-cache-issue
  -DCORE_DEBUG_LEVEL=5
The professional way to debug the ESP32 would be by using the JTAG interface. You can even use a raspberry pi as jtag adapter. I found a guide for this here: https://blog.wokwi.com/gdb-debugging-esp32-using-raspberry-pi/
But unfortunately the Olimex ESP32 PoE boards are using one of the necessary pins to enable the ethernet chip. It would require a small modification to "fix" this.
Currently I'm digging through the AsyncTCP implementation, as the original watchdog timeout originates from there. But that code is not as easy to understand as the WebApi code. Just a sidenote: I tried to run the response handler for the webapi inside a scheduler task. It worked, but did not solve the issue.
To me it is still a mystery why it works on some boards but not on others. Your WT32-ETH01 and TTGO T-Internet-POE use the same network chip as the Olimex board, so they use the same driver and network code too. I have tested it with your precompiled generic build as well as a self-compiled version (for more logging output).
Just wanted to mention that I also ran into this issue. I'm using v24.3.22 on an ESP32-S (from AZ-Delivery) with a connected LAN8720 ETH board. Accessing http://[IP-ADDRESS]/api/livedata/status works with no issue, but accessing the "normal" webpage results in a crash and a reboot. Needless to say, via WiFi it works without issues. I'm sorry that I'm not able to provide more valuable information...
I am able to use the WebUI on my Android/iOS phones (connected to the LAN via a WiFi AP) without any issues, but always end up with a DTU crash/restart when trying to access the WebUI on my Linux PC with a wired LAN connection - OpenDTU is directly connected to the wired LAN. I already compared the HTTP traffic and found no real difference between PC and phone; maybe there is a difference in the TCP layer. I will check this next week.
Update: It becomes even more interesting when using a Linux laptop. When connected via WiFi, everything works fine. When connected via wired LAN (same device, even same browser tab), OpenDTU crashes when the WebUI is force-reloaded. Further analysis of the captured network traffic will be done later.
Finally, I created dumps for wireless (~320Mb/s) [WiFi] and wired (10Mb/s, 100Mb/s and 1000Mb/s - set on the laptop interface) [LAN] connections. Loading app.js worked for WiFi and 10M LAN, while 100M and 1G LAN result in an OpenDTU crash/reboot. OpenDTU is running on an Olimex ESP32 POE and is connected via the embedded 100M LAN port (the switch also reports 100M connection speed).
The first noticeable difference is the packet size sent by OpenDTU, which is much larger when using WiFi. Notice: 10M LAN suffers retransmissions, so the packet count may be misleading.
Upper Left: WiFi | Lower Left: 1G LAN | Upper Right: 10M LAN | Lower Right: 100M LAN
The second one is - obviously - the round trip time between the WiFi and LAN connections. Using 1G LAN, almost the full transfer (149 of 189 kB) is done in less than 100 ms, while the full transfer (189 kB) via WiFi takes over 4 seconds. Using 100M LAN seems to be slightly better for OpenDTU - maybe due to the slightly higher latency between laptop and switch. Notice: 10M LAN is not shown due to retransmissions.
The smaller TCP packets may indicate that the network stack of OpenDTU is running at its performance limit when connected via 1G/100M LAN, as the stack/app seems unable to deliver more data to be sent within one packet. The higher latency of the WiFi connection seems to allow OpenDTU to prepare more data for the next packet, resulting in larger packet sizes. Maybe the watchdog is only fed when some watermark in the output buffer is reached? Or only when processing returns, which never happens with low-latency traffic? Both would result in a watchdog reset.
Update: Testing was done with v24.3.31
@vaterlangen: Thanks for your analysis. I can confirm this. Once I have my DTU connected to the same Gigabit ethernet switch as my workstation (Linux), I get these crashes when connecting to the DTU via the Firefox browser. But if I connect from my mobile phone via WiFi, it actually works. Tested with v24.3.22.
Could you try this? https://github.com/me-no-dev/AsyncTCP/issues/121
Awesome, that fix appears to work for me on the latest master branch! I tried 128, 256 and 512; all of those work for me. Changing back to 32 causes the crash.
My platformio pulls in the esphome fork of AsyncTCP (https://github.com/esphome/AsyncTCP), but they also have the same value of 32 in there.
That sounds promising! I think the explanation is this: https://arduino.stackexchange.com/a/90131
I will implement a patch which will be applied automatically before compiling.
Can someone else verify that this change works? (I cannot reproduce this with my ethernet boards.)
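For anyone who wants to try the workaround locally before a patched release lands, the queue size can be raised via a build flag; a platformio.ini sketch might look like this (the flag name is the one discussed above for the AsyncTCP-esphome fork in use at the time; the env name here is just an example):

```ini
; platformio.ini (excerpt) - sketch of the workaround, not the official patch
[env:olimex_esp32_poe]
build_flags = ${env.build_flags}
  ; enlarge the AsyncTCP event queue (default 32) so LwIP events are
  ; not lost under fast wired connections, which starves the task
  ; that feeds the watchdog
  -DCONFIG_ASYNC_TCP_EVENT_QUEUE_SIZE=128
```

Note that, depending on the AsyncTCP fork, the macro only takes effect if the library actually reads it; a full clean build is advisable after changing it.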
I can confirm this fix works for my Olimex ESP32-PoE Wrover Board too.
Tested this with the currently pulled in esphome fork of AsyncTCP.
@vaterlangen Thank you for your in-depth analysis. I would never have thought about looking at the packet size itself.
Can confirm fix in v24.4.12 for my setup. Thanks, @tbnobody !
Now the request was served successfully via wired LAN in less than 200ms:
Looks like the issue is present again with v24.4.24. I see daily reboots as my Cerbo GX keeps pinging the DTU constantly all the time. With version v24.4.12 I had a runtime of 12 days without reset until the update to v24.4.24. Anybody who can confirm this?
Are you using the generic or the generic_esp32 firmware?
I'm using generic (honestly I don't understand the difference to generic_esp32).
I'm using generic
But you are using the precompiled version, right? Please try this one and see whether it fixes the problem.
honestly I don't understand the difference to generic_esp32
Did you read the documentation? https://www.opendtu.solar/firmware/howto/migrate_generic/ What's unclear so far? How could it be written better?
In the latest update, which bumps the ESPAsyncWebServer version, AsyncTCP-esphome has been replaced with a new fork: https://github.com/mathieucarbou/ESPAsyncWebServer/commit/6670c2e4e413c4746bfbc54d8c1dc339d77f61f3#diff-64cc7cdde6789dc1c1c4a6f406cd1ff6a6a00027097788e2d52fc27b0d75502e
This fork has the new patch applied downstream, but with a slightly different variable name https://github.com/mathieucarbou/AsyncTCP/commit/de284cc5cc8a56839dfb3a49958e0ba33ab87408
Basically OpenDTU shouldn't need the patch anymore; it just needs to replace
-DCONFIG_ASYNC_TCP_EVENT_QUEUE_SIZE=128
with
-DCONFIG_ASYNC_TCP_QUEUE_SIZE=128
in the build_flags.
Basically OpenDTU shouldn't need to patch, just need to change:
That's what I did in the provided generic.bin mentioned above. I just want to double-check that it's really working again.
But @evanjarrett, if you can confirm that this change fixes the problem, I will include it in the next release.
I'm having a hard time recreating this. I see the code changes, and my initial comment seems like the answer, but I can't recreate it.
For me, the code runs fine using ESPAsyncWebServer 2.9.3 and 2.9.4 with the patch_apply and the old CONFIG_ASYNC_TCP_EVENT_QUEUE_SIZE in place.
As soon as I change the variable to what is in the new versions, OR remove the patch_apply script from the build, it breaks again.
I've done a full clean each time before rebuilding, and even tried wiping the flash on the Olimex PoE.
Tested it but the device still reboots after some hours.
BTW: Thanks for the link to the explanation of generic vs. generic_esp32. The difference is now clear to me. Is the table "Overview of pre-compiled binaries" up to date? It states that opendtu-generic.bin is deprecated, but it is still updated. Maybe a few words about why the type of USB connection matters would help. I'm using a PCB with a USB-serial converter, so should I use opendtu-generic_esp32s3.bin then? I've been using generic.bin until now.
What happened?
I have an olimex esp32 poe board. When I access the websites over ethernet, the board crashes.
Accessing the web interface with a browser only shows the title "OpenDTU" in the browser tab and an empty page. Further investigation with Firefox's debugging tools shows me that the HTML is downloaded but the request to app.js times out.
To Reproduce Bug
env:olimex_esp32_poe
curl --output - http://[ETHERNET_IP]/js/app.js
What I see is some binary output (which I do not expect). The device reboots and later on the connection times out:
curl: (56) Recv failure: Connection reset by peer
Expected Behavior
I would expect the device to show the webinterface also when accessed via Ethernet.
Install Method
Self-Compiled
What git-hash/version of OpenDTU?
21ec72f4c00fab1385b1fb93780b52cd6d4e4ff7
Relevant log/trace output
Anything else?