tbnobody / OpenDTU

Software for ESP32 to talk to Hoymiles/TSUN/Solenso Inverters
GNU General Public License v2.0
1.82k stars 509 forks source link

since 23.12.18 I see "Software reset CPU" once a day at random time #1603

Open chris299 opened 10 months ago

chris299 commented 10 months ago

What happened?

since I tested and updated to solve #1397 with 23.12.18 I see now "Software reset CPU" every day at random times, which erases all values to 0.

I already changed the power source but the problem persists, anyway, a power-related reset shows a "Vbat power on reset" instead, so I assume a software issue within the firmware.

grafik

To Reproduce Bug

happens every day....

Expected Behavior

no reset without reason.... (as it was before)

Install Method

Pre-Compiled binary from GitHub

What git-hash/version of OpenDTU?

23.12.28

Relevant log/trace output

No response

Anything else?

No response

schlimmchen commented 10 months ago

Do you have the ability to capture the serial output of your device? There most certainly is valuable information printed through the serial connection right before the device reboots. Please try to capture that moment and post the excerpt from the serial output.

chris299 commented 10 months ago

@schlimmchen thank you for your answer. I'm willing to try, but I have no idea how to do this. I assume, I need to connect the ESP32 to my computer, but how to proceed then? (I'm using Ubuntu) Is there a command to create a serial connection and dump the received content to a file?

schlimmchen commented 10 months ago

I'm using Ubuntu

Great!

You will see a device pop up when connecting your ESP32. It may be /dev/ttyUSB0 (or USB1, USB2, etc.) or /dev/ttyACM0 (or 1 or 2 etc.). Connect to it using a serial terminal. I recommend minicom for this application (sudo apt install minicom). This should do (in a terminal):

minicom -D /dev/ttyUSB0 -C /tmp/capture.txt -b 115200

You should see the output you also see in the Web console. If that's true, just leave it like that until the unwanted reboot occurs. You can also look at /tmp/capture.txt in parallel to make sure the messages are captured there. Then have a look at the file /tmp/capture.txt and isolate the messages right before the unwanted reboot. Paste them here as a code block.

chris299 commented 10 months ago

@schlimmchen thank you for your advice, however, ~~I can't get it to work.... I attached the ESP32 to the computer but all I get from dmesg is~~ EDIT: got it to work by removing brltty, however... I'll paste the log as soon as it happens again...

stefan123t commented 10 months ago

@chris299 yes as some people on Discord (you?) mentioned under Linux Mint brltty (Braille Text Console Service) gets installed initially for Accessibility reasons, i.e. Blind people need that input initially to install / use a computer! Hence if you need that port /dev/ttyUSB0 free for minicom or other console processes to bind to you may have to stop / uninstall brltty

chris299 commented 10 months ago

@stefan123t it was not me on Discord... it took a while but it just happened again attached the full log, reset took place just a few minutes before end of log opendtulog.tar.gz

schlimmchen commented 10 months ago
RX Period End
All missing
Nothing received, resend whole request
TX RealTimeRunData 865.00 MHz --> 15 83 74 81 74 80 16 22 96 80 0B 00 65 99 82 AF 00 00 00 00 00 00 00 00 0D E1 83 
RX Period End
All missing
Nothing received, resend whole request
TX RealTimeRunData 865.00 MHz --> 15 83 74 81 74 80 16 22 96 80 0B 00 65 99 82 AF 00 00 00 00 00 00 00 00 0D E1 83 
RX Period End
All missing
Nothing received, resend whole request
TX RealTimeRunData 865.00 MHz --> 15 83 74 81 74 80 16 22 96 80 0B 00 65 99 82 AF 00 00 00 00 00 00 00 00 0D E1 83 
RX Period End
All missing
Nothing received, resend whole request
TX RealTimeRunData 865.00 MHz --> 15 83 74 81 74 80 16 22 96 80 0B 00 65 99 82 AF 00 00 00 00 00 00 00 00 0D E1 83 
RX Period End
All missing
Nothing received, resend count exeeded
TX SystemConfigPara 865.00 MHz --> 15 83 74 81 74 80 16 22 96 80 05 00 65 99 82 AF 00 00 00 00 00 00 00 00 C3 EE 4C 
RX Period End
All missing
Nothing received, resend whole request
TX SystemConfigPara 865.00 MHz --> 15 83 74 81 74 80 16 22 96 80 05 00 65 99 82 AF 00 00 00 00 00 00 00 00 C3 EE 4C 
Websocket: [/livedata][42] connect
RX Period End
All missing
Nothing received, resend whole request
TX SystemConfigPara 865.00 MHz --> 15 83 74 81 74 80 16 22 96 80 05 00 65 99 82 AF 00 00 00 00 00 00 00 00 C3 EE 4C 
RX Period End
All missing
Nothing received, resend whole request
TX SystemConfigPara 865.00 MHz --> 15 83 74 81 74 80 16 22 96 80 05 00 65 99 82 AF 00 00 00 00 00 00 00 00 C3 EE 4C 
CORRUPT HEAP: Bad head at 0x3ffde188. Expected 0xabba1234 got 0x00000000

assert failed: multi_heap_free multi_heap_poisoning.c:259 (head != NULL)

Backtrace: 0x40083a41:0x3ffdcfc0 0x4008eca1:0x3ffdcfe0 0x40094f59:0x3ffdd000 0x40094b51:0x3ffdd130 0x40083fe1:0x3ffdd150 0x40094f89:0x3ffdd170 0x4019e2dd:0x3ffdd190 0x400f937a:0x3ffdd1b0 0x400f990b:0x3ffdd1d0 0x400fbb45:0x3ffdd1f0 0x400f6fe7:0x3ffdd210 0x400f6ff3:0x3ffdd230 0x401a16d6:0x3ffdd250 0x401a19f2:0x3ffdd280 0x401a1a08:0x3ffdd2a0 0x401a1b8e:0x3ffdd2c0

ELF file SHA256: 73d74efa8cfd9041

E (9515) esp_core_dump_flash: Core dump flash config is corrupted! CRC=0x7bd5c66f instead of 0x0
Rebooting...

Good job.

  1. What exact version were you on when this happened (GIT commit SHA)?
  2. Please have a look at the System Info page of the web app. How much heap is used?
  3. Does the heap usage change over a period of a couple of hours?
  4. Do you have TLS enabled for MQTT?
chris299 commented 10 months ago

@schlimmchen

  1. I currently run version 23.12.31 But it happened also with the version before
  2. I see now (after the reset) 44% or 123kB used.
  3. I looked now and then in the past after the memory usage but I did not notice that it changed too much. I'll keep an eye out for it now...
  4. I don't use MQTT. I run the IoBroker adapter which pools the data as far as I know
chris299 commented 10 months ago

@schlimmchen as restarts did not happen the last two days I assume the problem is related to the HMS restart in the afternoon / daily yield correction which did not occur as well...

fred777 commented 10 months ago

I'm running v24.1.14 with two HMS and have software resets on both cpus too....

(didn't notice until I stumbled upon this ticket)

....and I do use MQTT, but without TLS

....and I have never experienced values being reset to zero unexpectedly

image

chris299 commented 9 months ago

With 24.1.21 I still see the issue, but with a slightly different log output:

assert failed: tcp_update_rcv_ann_wnd IDF/components/lwip/lwip/src/core/tcp.c:951 (new_rcv_ann_wnd <= 0xffff)

Backtrace: 0x40083a41:0x3ffb2cc0 0x4008eca1:0x3ffb2ce0 0x40094f59:0x3ffb2d00 0x4012f4a6:0x3ffb2e30 0x4012f554:0x3ffb2e50 0x401a188a:0x3ffb2e70 0x4012bd60:0x3ffb2e90

ELF file SHA256: e7eb5e57151e0ce0

E (11950) esp_core_dump_flash: Core dump flash config is corrupted! CRC=0x7bd5c66f instead of 0x0
Rebooting...
chris299 commented 9 months ago

issue still there with current version 24.1.26:

TX SystemConfigPara 865.00 MHz --> 15 84 67 55 36 80 16 22 96 80 05 00 65 BC 0E 59 00 00 00 00 00 00 00 00 BC BC BC 
CORRUPT HEAP: Bad tail at 0x3ffde5a4. Expected 0xbaad5678 got 0x00000000

assert failed: multi_heap_free multi_heap_poisoning.c:259 (head != NULL)

Backtrace: 0x40083a41:0x3ffdd360 0x4008eca1:0x3ffdd380 0x40094f59:0x3ffdd3a0 0x40094b51:0x3ffdd4d0 0x40083fe1:0x3ffdd4f0 0x40094f89:0x3ffdd510 0x4010a66d:0x3ffdd530 0x4010a67d:0x3ffdd550 0x400e0cbd:0x3ffdd570 0x400e0cdd:0x3ffdd590 0x400e0ce9:0x3ffdd5b0 0x400f9e72:0x3ffdd5d0 0x400fc05d:0x3ffdd5f0 0x400f74ff:0x3ffdd610 0x400f750b:0x3ffdd630 0x401a243e:0x3ffdd650 0x401a275a:0x3ffdd680 0x401a2770:0x3ffdd6a0 0x401a28f6:0x3ffdd6c0

ELF file SHA256: ba393eda5744a754

E (599) esp_core_dump_flash: Core dump flash config is corrupted! CRC=0x7bd5c66f instead of 0x0
Rebooting...
chris299 commented 9 months ago

it continues to reboot now and then....

Nothing received, resend whole request
TX AlarmData 865.00 MHz --> 15 83 74 81 74 80 16 22 96 80 11 00 65 C2 CD 00 00 00 00 00 00 00 00 00 E1 55 7A 
RX Period End
All missing
Nothing received, resend whole request
TX AlarmData 865.00 MHz --> 15 83 74 81 74 80 16 22 96 80 11 00 65 C2 CD 00 00 00 00 00 00 00 00 00 E1 55 7A 
Websocket: [/livedata][29] connect
CORRUPT HEAP: Bad tail at 0x3ffde4c0. Expected 0xbaad5678 got 0x00000000

assert failed: multi_heap_free multi_heap_poisoning.c:259 (head != NULL)

Backtrace: 0x40083a41:0x3ffd0290 0x4008eca1:0x3ffd02b0 0x40094f59:0x3ffd02d0 0x40094b51:0x3ffd0400 0x40083fe1:0x3ffd0420 0x40094f89:0x3ffd0440 0x401c71c6:0x3ffd0460 0x40191987:0x3ffd0480 0x401c5d6a:0x3ffd04a0

ELF file SHA256: ba393eda5744a754

E (12014) esp_core_dump_flash: Core dump flash config is corrupted! CRC=0x7bd5c66f instead of 0x0
Rebooting...
schlimmchen commented 9 months ago

Given that your issue appears in multiple different ways and that not a lot of other people report the same problems, I want to suggest to check your hardware. Is the power supply good? Can you switch the ESP32 with a different one to confirm it's not an issue with the ESP32?

chris299 commented 9 months ago

@schlimmchen I understand that the issue is difficult to track down.

Currently the ESP32 is powered by the usb of the computer for logging, but I had the issue before with two other power supplies with very short cables. IMHO With an insufficient power supply the ESP32 would not give a debug output on memory issues, it would just reboot.

And I am pretty sure that the issue came with the versions that introduced the daily yield fix...

Unfortunately I can not test it with a different hardware for now.

Sn0w3y commented 9 months ago

IMHO With an insufficient power supply the ESP32 would not give a debug output on memory issues, it would just reboot.

This is not true.. If the Device gets Powered down because of a Bad Power Connection it also writes "Software Reset CPU"

chris299 commented 9 months ago

@Sn0w3y I'll try other power sources the next days (currently running on USB of my computer, which should be stable an powefull enough actually). But would a power down really give you a message on the serial port CORRUPT HEAP: Bad tail at 0x3ffde4c0. Expected 0xbaad5678 got 0x00000000 as I see it before rebooting?

chris299 commented 8 months ago

@Sn0w3y @schlimmchen I tried now 3 other power supplies, one of them from Apple πŸ™‚ and it keeps happening. However, I think there is a noticeable difference between versions regarding the frequency of crashes: 24.1.26 reboots more often than later versions

chris299 commented 8 months ago

I ran 24.2.16 for a couple of days now before it crashed...

mr-p666 commented 8 months ago

Unfortunately similar for me. Everything went well for 9 days, although individual packages were missing here and there. After that it got worse day by day and yesterday I had to restart. Since then everything has been back to normal. What I notice is the Time correction error messages that always accompany the problem. As soon as I find more than two of these in the daily log, I know it's time to restart the DTU.

mr-p666 commented 8 months ago

Since last update to 24.3.15 it seems to have gotton worse again. Just after one day I had to restart the DTU again. However, the problem always occurs along with the 'Time calibration'-messages. As a workaround, it might be possible to restart the DTU automatically as soon as this message appears.

Sn0w3y commented 8 months ago

Since last update to 24.3.15 it seems to have gotton worse again. Just after one day I had to restart the DTU again. However, the problem always occurs along with the 'Time calibration'-messages. As a workaround, it might be possible to restart the DTU automatically as soon as this message appears.

Ahhhh, now its getting closer. Do you use the API of the DTU to regulate the Inverters?

mr-p666 commented 8 months ago

No, I regulate it via MQTT. The only time I use the API is to invoke a reboot instead of going to the DTU and doing a power cycle. :wink: Just FYI: I also often use my desktop and mobile browsers to check the web interface. Maybe it could be a WebSocket disconnection issue?

Sn0w3y commented 8 months ago

Nope, Time calibration as often as in your case for me only appeared ONLY if i wrote 0% Power for example at night or during times where the Inverters have not been reachable. Maybe try to stop regulating them, set them to 100% and see if the Error still persists ;)

chris299 commented 8 months ago

@Sn0w3y I use IOBroker with the OpenDTU-Adapter. AFAIK it uses only the API, not MQTT. I also don't regulate the Inverters I have at all. I just watch...

mr-p666 commented 8 months ago

To avoid problems, I only send commands to the inverter when it tells me it's producing. Which, looking at my HA log files, works well. Edit: I could change the permanent limitation for a while and stop sending temporary limitations for testing purposes, but regarding of @chris299's post I assume that this wouldn't change the behavior.

Sn0w3y commented 7 months ago

Edit: I could change the permanent limitation for a while and stop sending temporary limitations for testing purposes, but regarding of @chris299's post I assume that this wouldn't change the behavior.

Try it and report :) Do NOT regulate anything, restart the openDTU and report :) In my case it fixed it ;D

mr-p666 commented 7 months ago

@Sn0w3y If it had worked, it would be a workaround in the best case but no fix. πŸ˜‰ But for me it didn't help. Changed the permanent limit, deactivated all automations and this morning: Screenshot_20240323-075606

Sn0w3y commented 7 months ago

@Sn0w3y If it had worked, it would be a workaround in the best case but no fix. πŸ˜‰ But for me it didn't help. Changed the permanent limit, deactivated all automations and this morning: Screenshot_20240323-075606

In Germany we say "a try was it worth" πŸ˜‚

mr-p666 commented 7 months ago

Or as an Austrian put it: "So that nobody can say afterwards... See, I told you it would work!" πŸ˜‰

MetaChuh commented 7 months ago

@mr-p666 fein aus gedrΓΌckt und besser als "oida geh ..." πŸ˜‰ ehrenmann und danke πŸ™