lumapu / ahoy

Various tools, examples, and documentation for communicating with Hoymiles microinverters
https://ahoydtu.de

[ESP8266] WDT reset issue #392

Closed SvenLuebke closed 1 year ago

SvenLuebke commented 1 year ago

Platform

ESP8266

Model name

LoLin NodeMCU V3 (AliExpr.) 4MB

nRF24L01+ Module

nRF24L01+ plus

Antenna

external antenna

Power Stabilization

nothing

Connection diagram

Connection diagram I used:

| nRF24L01+ Pin | ESP8266/32 GPIO |
| ------------- | --------------- |
| Pin 1 GND [] | GND |
| Pin 2 +3.3V | +3.3V |
| Pin 3 CE | GPIO4 CE |
| Pin 4 CSN | GPIO15 CS |
| Pin 5 SCK | GPIO14 SCLK |
| Pin 6 MOSI | GPIO13 MOSI |
| Pin 7 MISO | GPIO12 MISO |
| Pin 8 IRQ | GPIO5 IRQ |

Connection picture

Version

0.5.28

Github Hash

2e08ee0

Build & Flash Method

ESP Tools (flash)

Desktop

Linux

Setup

Device Host Name

- Device Name: AHOY-DTU

WiFi

- SSID: YOUR_WIFI_SSID *don't paste here*
- Password: YOUR_WIFI_PWD *don't paste here*

Inverter

Inverter 0

- Address: 116181853696
- Name: HM-1500
- Active Power Limit: 65535
- Active Power Limit Control Type: no powerlimit
- Max Module Power (Wp): 410
- Module Name: link, rech

General

- Interval [s]: 10
- Max retries per Payload: 5

NTP Server

- NTP Server / IP: pool.ntp.org (tried also fritz.box)
- NTP Port: 123

MQTT

- Broker / Server IP: 
- Port: 1883
- Username (optional): 
- Password (optional): 
- Topic: inverter

System Config

Pinout (Wemos)

- CS: D8 (GPIO15)
- CE: D2 (GPIO4)
- IRQ: D1 (GPIO5)

Radio (NRF24L01+)

- Amplifier Power Level: LOW

Serial Console

- print inverter data: [x]
- Serial Debug: [x]
- Interval [s]: 2

Debug Serial Log output

I: procPyld: cmd:  11
I: procPyld: txid: 0x95
I: Payload (62): 00 01 00 09 00 01 00 01 00 00 00 00 00 00 00 C8 00 00 00 03 00 00 00 00 01 71 00 02 00 04 00 09 00 10 00 00 00 59 00 00 2A 1D 00 03 02 E3 09 37 13 88 00 00 00 D9 00 00 00 00 00 C2 00 10 
I: resetPayload: id: 0
I: Requesting Inverter SN 116181853696
I: enqueuedCmd: 11
I: sendTimePacket
I: TX 27B Ch40 | 15 81 85 36 96 81 72 89 27 80 0B 00 63 66 53 6E 00 00 00 10 00 00 00 00 28 13 74 
I: RX 27B Ch3 | 95 81 85 36 96 81 85 36 96 01 00 01 00 09 00 01 00 01 00 00 00 00 00 00 00 C8 54 
I: RX 27B Ch3 | 95 81 85 36 96 81 85 36 96 02 00 00 00 03 00 00 00 00 01 71 00 02 00 04 00 09 EB 
I: RX 27B Ch3 | 95 81 85 36 96 81 85 36 96 84 13 8A 00 00 00 D5 00 00 00 00 00 C1 00 10 82 13 1D 
W: while retrieving data: Frame 3 missing: Request Retransmit
I: TX 11B Ch61 | 15 81 85 36 96 81 72 89 27 83 6F 
I: RX 27B Ch3 | 95 81 85 36 96 81 85 36 96 03 00 10 00 00 00 59 00 00 2A 1D 00 03 02 E3 09 20 23 
I: RX 27B Ch23 | 95 81 85 36 96 81 85 36 96 03 00 10 00 00 00 59 00 00 2A 1D 00 03 02 E3 09 20 23 

 ets Jan  8 2013,rst cause:4, boot mode:(3,7)

wdt reset
load 0x4010f000, len 3460, room 16 
tail 4
chksum 0xcc
load 0x3fff20b8, len 40, room 4 
tail 4
chksum 0xc9
csum 0xc9
v0006a2b0
~ld
I: resetPayload: id: 0
I: resetPayload: id: 0
I: resetPayload: id: 0
I: resetPayload: id: 0
I: connect to network 'MYWIFINETWORK' ...
...........
I: 

----------------------------------------
I: Welcome to AHOY!
I: 
point your browser to http://192.168.1.56
I: to configure your device
I: ----------------------------------------

I: RF24 Amp Pwr: RF24_PA_I: LOW
I: Radio Config:
SPI Frequency= 1 Mhz
Channel= 3 (~ 2403 MHz)
Model= nRF24L01+
RF Data Rate= 250 KBPS
RF Power Amplifier= PA_LOW
RF Low Noise Amplifier= Enabled
CRC Length= 16 bits
Address Length= 5 bytes
Static Payload Length= 32 bytes
Auto Retry Delay= 250 microseconds
Auto Retry Attempts= 0 maximum
Packets lost on
    current channel= 0
Retry attempts made for
    last transmission= 5
Multicast= Disabled
Custom ACK Payload= Disabled
Dynamic Payloads= Enabled
Auto Acknowledgment= Disabled
Primary Mode= RX
TX address= 0xdeadbeef01
pipe 0 (closed) bound= 0xdeadbeef01
pipe 1 ( open ) bound= 0x2789728101
pipe 2 (closed) bound= 0xc3
pipe 3 (closed) bound= 0xc4
pipe 4 (closed) bound= 0xc5
pipe 5 (closed) bound= 0xc6
I: [NTP]: 2022-11-05 12:13:45 UTC
I: enqueued cmd failed/timeout
I: Inverter #0 I: no Payload received! (retransmits: 0)
I: resetPayload: id: 0
I: Requesting Inverter SN 116181853696
I: enqueuedCmd: 11
I: enqueuedCmd: 1
I: enqueuedCmd: 5
I: sendTimePacket
I: TX 27B Ch40 | 15 81 85 36 96 81 72 89 27 80 0B 00 63 66 53 79 00 00 00 00 00 00 00 00 1B 39 6A
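
The "Frame 3 missing" warning in the log above can be reproduced offline from the RX lines. The following is a minimal sketch, not AhoyDTU's actual code. It assumes, judging only from this log, that byte 9 of each received packet carries the frame number and that the high bit (0x80) marks the final frame. The names `RX_LINE` and `missing_frames` are hypothetical.

```python
import re

# Matches log lines such as "I: RX 27B Ch3 | 95 81 ..." and captures the hex bytes.
RX_LINE = re.compile(r"RX \d+B Ch\d+ \| (?P<data>[0-9A-Fa-f ]+)")


def missing_frames(log_lines):
    """Return the frame numbers not seen, or None if no final frame was received.

    Assumption (inferred from the log, not from the firmware source):
    byte 9 is the frame number and bit 0x80 flags the last frame.
    """
    seen = set()
    total = None
    for line in log_lines:
        match = RX_LINE.search(line)
        if not match:
            continue
        payload = [int(byte, 16) for byte in match.group("data").split()]
        if len(payload) < 10:
            continue
        frame_id = payload[9]
        number = frame_id & 0x7F
        seen.add(number)
        if frame_id & 0x80:
            total = number  # the final frame tells us how many frames to expect
    if total is None:
        return None
    return sorted(set(range(1, total + 1)) - seen)


LOG = [
    "I: RX 27B Ch3 | 95 81 85 36 96 81 85 36 96 01 00 01 00 09 00 01 00 01 00 00 00 00 00 00 00 C8 54 ",
    "I: RX 27B Ch3 | 95 81 85 36 96 81 85 36 96 02 00 00 00 03 00 00 00 00 01 71 00 02 00 04 00 09 EB ",
    "I: RX 27B Ch3 | 95 81 85 36 96 81 85 36 96 84 13 8A 00 00 00 D5 00 00 00 00 00 C1 00 10 82 13 1D ",
]
print(missing_frames(LOG))  # prints [3]
```

In the log, the retransmit request that follows the warning ends in `83 6F`, which is consistent with the same frame-number convention.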

Error description

Approximately every hour, my ESP8266 flashed with AhoyDTU resets. See the logs! Might this be a hardware issue? Is anyone else experiencing this? Besides that, the software runs quite nicely and I am very satisfied!

Thanks!
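
For reference, the boot ROM line `rst cause:4, boot mode:(3,7)` in the log can be decoded with a small script. This is a minimal sketch; the cause table follows the boot-message table in the ESP8266 Arduino core documentation as commonly cited, so treat it as an assumption and check it against your core version. The names `RESET_CAUSES` and `decode_boot_line` are hypothetical.

```python
import re

# Assumed mapping, taken from the ESP8266 Arduino core boot-message documentation.
RESET_CAUSES = {
    0: "unknown",
    1: "normal boot",
    2: "reset pin",
    3: "software reset",
    4: "watchdog reset",
}

BOOT_LINE = re.compile(r"rst cause:\s*(\d+),\s*boot mode:\((\d+),(\d+)\)")


def decode_boot_line(line):
    """Return (cause description, (boot mode values)) or None if the line does not match."""
    match = BOOT_LINE.search(line)
    if not match:
        return None
    cause = int(match.group(1))
    boot_mode = (int(match.group(2)), int(match.group(3)))
    return RESET_CAUSES.get(cause, f"unlisted cause {cause}"), boot_mode


print(decode_boot_line(" ets Jan  8 2013,rst cause:4, boot mode:(3,7)"))
```

The boot mode values are returned undecoded, because only the reset cause matters for this report.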

lumapu commented 1 year ago

Do you have a stack trace of the WDT reset? I can't see any cause in the attached log. Have you tried a different pin for the IRQ?

As I can see in your settings, both intervals are shorter than the defaults. Can you increase them to check whether they are causing the WDT reset?

SvenLuebke commented 1 year ago

Unfortunately I have no stack trace! I remember that I saw it in another project. Shouldn't this be a standard output? Probably wrong baud rate?

I tried some things to get rid of this issue. I just now changed the IRQ pin again...let's see. The change of the intervals was something I did to check whether a buffer overrun appeared. I reverted the values back to original.

I saw yesterday that the system survives more than 9 h in night-time mode, without any SPI traffic and with reduced serial traffic, so the hardware looks OK.

lumapu commented 1 year ago

If the log in your initial post is complete, then there is no software issue. The baud rate was OK; all the logs are printed at the same baud rate. Hopefully the IRQ pin change helps.

SvenLuebke commented 1 year ago

Of course it's not the complete log, which would be ~800 KB of data, but it is the complete log for the time around the issue (before and after). I hope that is fine?! I remember that I selected "Printable output" for session logging in PuTTY. I have now changed that to "All session output" to see whether other output (at a different baud rate) is generated.

Regarding the baud rate, I remember that the ESP8266 switches to 74880 baud directly after a reset. But I guess the missing stack output is not a core function but some kind of user software function.

The issue just happened again, so changing the IRQ pin didn't help.

Just for fun I will try to flash my own build, but I don't think this will change anything...

lumapu commented 1 year ago

Are you using MQTT? As far as I can see in your settings, MQTT does not seem to be configured. I ran into an issue during testing yesterday and fixed it in the latest development build. Can you try installing that version on your ESP?

64fb587c from the build Action will be firmware version 0.5.32.

stefan123t commented 1 year ago

@lumapu we had several reports (at least two or three on the Discord) from users without MQTT who experienced such WDT issues repeatedly in versions prior to 0.5.32. So if you fixed anything in this regard, I would say this is a strong case for @SvenLuebke and others to retry with the latest development build / release. Thanks!

lumapu commented 1 year ago

@stefan123t yes, during development I saw an issue regarding MQTT. It happened directly at boot and ended in a boot loop. Maybe it helps others to get their system more stable starting with version 0.5.32.

SvenLuebke commented 1 year ago

Unfortunately the proposed version didn't help. I also activated MQTT, which didn't help either. The WDT resets were still happening. After flashing 0.5.32 I tried https://github.com/lumapu/ahoy/commits/4093be7, which seems to be stable now: Uptime: 4 Days, 12:16:05

lumapu commented 1 year ago

So can this be closed now? Can you verify with the release version?

SvenLuebke commented 1 year ago

Let's wait another day. I flashed 4c52e9c before, which seems to be stable, but I wasn't able to update the system to dec333f via web update. It just said "failed" and 0.5.40 started again (although no reboot happened according to the uptime).

I flashed it via USB serial and it seems to be working for now (Uptime: 0 Days, 04:47:02). Before, the WDT only rebooted the system while NRF24L01 traffic was happening.

SvenLuebke commented 1 year ago

dec333f restarted yesterday at ~10 PM (when no traffic happened) and some seconds ago. 4c52e9c was more stable for some reason. But are there so many differences? I guess not, right?

roku133 commented 1 year ago

I cannot confirm stability issues with dec333f. My ESP8266-based DTU (with CE and IRQ swapped, however) has been stable for more than four days now. 👍 Perhaps it makes sense to change the power supply. Are you using a capacitor to stabilize the 3.3 V supply?

stefan123t commented 1 year ago

@SvenLuebke do you have the option to change the power source and/or the Micro USB cable? It has been reported that the power supply is a major cause of WDT resets on ESPs in general.

Here is a blog post from a Makerlab in Hannover about tracing the ESP power supply using an oscilloscope with revealing results: https://arduino-hannover.de/2018/07/25/die-tuecken-der-esp32-stromversorgung/

roku133 commented 1 year ago

@SvenLuebke A power bank providing a USB 5 V output may also be helpful to check power adapter issues.

stefan123t commented 1 year ago

@SvenLuebke can you give an update on stability with the latest development or release version?

SvenLuebke commented 1 year ago

Hi!

@stefan123t I exchanged

and soldered a 2200µF capacitor to the 3.3V power pins. The software still reboots as soon as SPI traffic is happening. Really strange! This is happening with all the versions I tested up to 0.5.76.

stefan123t commented 1 year ago

Why do you use a 2200µF capacitor? We recommend a 10µF to 100µF cap for smoothing voltage ripple and sustaining the 3.3V at the NRF module. Yours is 22 times larger than the top of that range; might this be the reason too?

SvenLuebke commented 1 year ago

To be honest, this capacitor was what I had available in my box. Do you think a 2200µF cap will smooth the voltage worse than a 100µF one? It might be a little bit slower. I thought it was for stabilizing the 3.3 V supply of the ESP8266. I'll try to find a 10µF one and will also attach a ceramic cap.
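
As a rough aid for the question above, the voltage drop of an ideal capacitor that alone supplies a constant current I for a time Δt is ΔV = I · Δt / C. The sketch below evaluates this for the three capacitor values mentioned in the thread. The burst current and duration are hypothetical numbers chosen for illustration, not measurements of an ESP8266 or nRF24L01+, and the model ignores ESR and the regulator's own response.

```python
def droop_volts(current_a, duration_s, capacitance_f):
    """Voltage drop of an ideal capacitor supplying a constant current on its own."""
    return current_a * duration_s / capacitance_f


BURST_CURRENT_A = 0.1       # hypothetical extra load current during a burst
BURST_DURATION_S = 100e-6   # hypothetical burst length

for microfarads in (10, 100, 2200):
    drop = droop_volts(BURST_CURRENT_A, BURST_DURATION_S, microfarads * 1e-6)
    print(f"{microfarads:>5} uF: {drop * 1000:.1f} mV")
```

Under this idealized model a larger capacitor always droops less, so by this measure 2200µF does not smooth worse than 100µF. What a large capacitor does change is the charging current at power-on, which this sketch does not model.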

SvenLuebke commented 1 year ago

I just saw a new reboot_reason (copied the rest for some system information):

sdk: 2.2.2-dev(38a443e)
cpu_freq: 80
heap_free: 16720
sketch_used: 486
version: 0.5.66
wifi_rssi: -53
ts_uptime: 31
esp_type: ESP8266
core_version: 3.0.2
flash_size: 4096
heap_frag: 14
max_free_blk: 7080
reboot_reason: Software/System restart
Radio: nrf24l01+ is connected
Datarate: 250 kbps
Power Level: MIN

I didn't trigger the "Software/System restart". What is the reason for that? I tried to open the "live" website and then it restarted.

Argafal commented 1 year ago

Software/system restart could be an indication of a NullPointerException or OOM. I had seen both with 0.5.66 as well, and I documented them in other bug reports. I believe most of the issues I documented are fixed in 0.5.92. Have you tried it already?

Having said all that, without a stack trace I think it's just guessing. What does your serial output look like when the reboot happens?

SvenLuebke commented 1 year ago

Hey @Argafal, thanks for your message! I tried different versions after 0.5.66, and they behaved even more strangely: after some uptime, nearly all pages couldn't be displayed anymore. The menu bar on the left contained only one entry (I don't remember which one) and the rest vanished. A page refresh often took more than 10 s. I tried some things and then flashed back to 0.5.66, which restarts ~3 times a day but doesn't show these vanishing pages.

I just installed 0.5.92. It looks much better, but I have to wait for the sun.

BTW: I noticed that WiFi between ESP8266 and my router is not stable (also have this with my laptop). There are more than 20 reconnects a day. Could that lead to my reported reset behaviour?

Argafal commented 1 year ago

The first thing you describe about the webUI sounds like issue #660. Is that what it looks like? This should be much better again in 0.5.92/93.

I would hope that an unstable WiFi connection would not cause random reboots of ahoy. I don't think it does. But without a stack trace it is pure guesswork. So I think you need to find a way to record a stack trace if you want to look into this further. For that I would connect the ESP via USB to a computer; that might be the easiest way.
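
Once the serial output is being captured to a file (with all session output logged, as discussed earlier in the thread), any stack dumps can be pulled out of the capture with a short script. This is a minimal sketch. It assumes the crash output uses the `>>>stack>>>` and `<<<stack<<<` markers that the ESP8266 Arduino core prints around a stack dump. The function name `extract_stack_dumps` and the file name in the usage note are hypothetical.

```python
def extract_stack_dumps(log_text):
    """Return every block that runs from '>>>stack>>>' to '<<<stack<<<' inclusive.

    Assumption: the capture contains the stack markers printed by the
    ESP8266 Arduino core. Unterminated dumps are ignored.
    """
    dumps = []
    current = None
    for line in log_text.splitlines():
        if ">>>stack>>>" in line:
            current = [line]           # start a new dump
        elif current is not None:
            current.append(line)
            if "<<<stack<<<" in line:  # dump complete
                dumps.append("\n".join(current))
                current = None
    return dumps
```

A possible use is `extract_stack_dumps(open("putty.log", errors="replace").read())`, where `putty.log` stands for whatever file the terminal program writes. The extracted blocks can then be pasted into an exception decoder. If the list comes back empty, as would be the case for the log in the initial post, the reset produced no stack dump in the captured output.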

SvenLuebke commented 1 year ago

Yes, that was exactly my issue! I didn't want to create another issue for this, because I thought, I'm the only one having this issue. Nice, thank you!

I now think these reconnect messages were just a consequence of the hourly resets, because with 0.5.92 the disconnects have vanished. Yes, it really looks promising: Uptime: 0 Days, 14:37:37

SvenLuebke commented 1 year ago

Yippee! Uptime: 1 Day, 16:31:07...never saw "1 Day" before...if it reaches 4 I guess we can close the issue.

lumapu commented 1 year ago

Did it reach 4?

SvenLuebke commented 1 year ago

Yes, it reached 4 days and ~18 hours, then it reset again, but that's long enough for me. After 3 days I got behaviour similar to https://github.com/lumapu/ahoy/issues/660 again. I had to press refresh two or three times and then it worked again.

Shall I close the ticket?

lumapu commented 1 year ago

cool, seems that we fixed something. I will close this issue with the next release.

SvenLuebke commented 1 year ago

But I'm still thinking about why I was - more or less - the only one with this issue: Are some versions of the ESP8266 less stable? Are some RAM cells (in some memory area...for example at the end) not stable or dead? Are some PCBs less stable? Currently I don't have an explanation for that.

Argafal commented 1 year ago

I don't think you are the only one. I have opened a few issues reporting reboots and/or exceptions running ahoy on ESP8266. As to why that doesn't happen to everyone on an ESP8266, I don't know.

My current status: With the current dev 0.5.98, ahoy runs stable for me as long as I don't use the WebUI. If I use the WebUI it occasionally reboots.

lumapu commented 1 year ago

do you have a capacitor placed to your circuit? I had a very unstable ESP8266 which became stable the moment I placed a capacitor next to its 3.3V pin.

SvenLuebke commented 1 year ago

> My current status: With the current dev 0.5.98, ahoy runs stable for me as long as I don't use the WebUI. If I use the WebUI it occasionally reboots.

Same here...it didn't survive a day.

> do you have a capacitor placed to your circuit?

I have two of them connected to 3.3V, one small and one big one. But that didn't change anything regarding the reset behaviour.

SvenLuebke commented 1 year ago

I tried 0.5.104 yesterday and it didn't survive 20 minutes. So, for now, 0.5.96 is the most recent stable version for me.

lumapu commented 1 year ago

@SvenLuebke could you please briefly list your setup (number of inverters, ESP type, capacitor, web interface in use or not, heap fragmentation)? Are there any indications of why the ESP conks out?

Argafal commented 1 year ago

@SvenLuebke And could you please also mention whether or not you use MQTT, at what interval the inverters are polled (see settings), and at what interval MQTT messages are sent (see settings)? Thanks.