rvdbreemen / OTGW-firmware

A ESP8266 devkit firmware for the Nodoshop version of the Opentherm Gateway (OTGW)
MIT License

Bootloop #196

Open LacsapOV opened 1 year ago

LacsapOV commented 1 year ago

Wemos D1 reboots about every 10 seconds

Reboot log
2023-02-02 19:48:17 - reboot cause: Exception (2) - Access to invalid address (28) ESP register contents: epc1=0x40241688, epc2=0x00000000, epc3=0x00000000, excvaddr=0x144c0000, depc=0x00000000
2023-02-02 19:48:07 - reboot cause: Exception (2) - Access to invalid address (28) ESP register contents: epc1=0x40241688, epc2=0x00000000, epc3=0x00000000, excvaddr=0x144c0000, depc=0x00000000
2023-02-02 19:47:57 - reboot cause: Exception (2) - Access to invalid address (28) ESP register contents: epc1=0x40241688, epc2=0x00000000, epc3=0x00000000, excvaddr=0x144c0000, depc=0x00000000

Firmware Version 0.10.0+eeeb22c

PIC Firmware Version 6.4

Settings { "hostname": "OTGW", "MQTTenable": true, "MQTTbroker": "192.168.2.13", "MQTTbrokerPort": 1883, "MQTTuser": "", "MQTTpasswd": "", "MQTTtoptopic": "otgw", "MQTThaprefix": "homeassistant", "MQTTuniqueid": "otgw", "MQTTOTmessage": true, "MQTTharebootdetection": true, "NTPenable": true, "NTPtimezone": "Europe/Amsterdam", "NTPhostname": "pool.ntp.org", "LEDblink": true, "GPIOSENSORSenabled": true, "GPIOSENSORSpin": 13, "GPIOSENSORSinterval": 20, "S0COUNTERenabled": false, "S0COUNTERpin": 12, "S0COUNTERdebouncetime": 80, "S0COUNTERpulsekw": 1000, "S0COUNTERinterval": 60, "OTGWcommandenable": false, "OTGWcommands": "GW=1", "GPIOOUTPUTSenabled": false, "GPIOOUTPUTSpin": 16, "GPIOOUTPUTStriggerBit": 0 }
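The reboot log in this report can be checked mechanically to confirm that every restart is the same crash. The following is a minimal sketch, assuming the log format shown above; the regular expression and the helper names (`parse_reboot_log`, `crash_signatures`, `seconds_between`) are illustrative and not part of the firmware.

```python
import re
from datetime import datetime

# One reboot-log entry, in the format pasted in this report (assumed stable).
ENTRY_RE = re.compile(
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - reboot cause: "
    r"Exception \((?P<reason>\d+)\) - .*? \((?P<cause>\d+)\) "
    r"ESP register contents: epc1=(?P<epc1>0x[0-9a-f]{8}), .*?"
    r"excvaddr=(?P<excvaddr>0x[0-9a-f]{8})"
)

REGS = ("epc1=0x40241688, epc2=0x00000000, epc3=0x00000000, "
        "excvaddr=0x144c0000, depc=0x00000000")
PREFIX = " - reboot cause: Exception (2) - Access to invalid address (28) ESP register contents: "
LOG = "\n".join(
    stamp + PREFIX + REGS
    for stamp in ("2023-02-02 19:48:17", "2023-02-02 19:48:07", "2023-02-02 19:47:57")
)


def parse_reboot_log(text):
    """Return one dict per log entry found in the text."""
    return [m.groupdict() for m in ENTRY_RE.finditer(text)]


def crash_signatures(entries):
    """Distinct (exception cause, faulting PC, faulting address) triples."""
    return {(e["cause"], e["epc1"], e["excvaddr"]) for e in entries}


def seconds_between(entries):
    """Gaps in seconds between consecutive entries, in the order logged."""
    times = [datetime.strptime(e["time"], "%Y-%m-%d %H:%M:%S") for e in entries]
    return [int(abs((a - b).total_seconds())) for a, b in zip(times, times[1:])]


entries = parse_reboot_log(LOG)
print(crash_signatures(entries))  # a single triple means one repeating crash
print(seconds_between(entries))   # gaps between reboots
```

A single signature across all entries, with evenly spaced reboots, points to a deterministic crash rather than a random power glitch.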

DaveDavenport commented 1 year ago

It looks like the flashing went wrong. Can you remove the wemos from the board and reflash it using a usb cable directly from a pc ?

LacsapOV commented 1 year ago

> It looks like the flashing went wrong. Can you remove the wemos from the board and reflash it using a usb cable directly from a pc ?

Thank you, Dave. That's how I did it the first time around. I tried again but no luck; the issue remains. It does not reboot when it's connected to the PC, only when connected to the board.

DaveDavenport commented 1 year ago

That is odd. Which Nodo-shop board version do you have?

(I wouldn't expect to see this error if the problem were with the board or the power supply.)

Did you try to do a full flash erase before flashing? I normally do this when having odd issues with an esp8266.

LacsapOV commented 1 year ago

> That is odd. Which Nodo-shop board version do you have?
>
> (I wouldn't expect to see this error if the problem were with the board or the power supply.)
>
> Did you try to do a full flash erase before flashing? I normally do this when having odd issues with an esp8266.

The latest. I got it on Wednesday, soldered and ready to go.

I tried your suggestion of a full flash erase. I even dropped the baud rate, same issue. Finally I took another Wemos D1 mini (Adafruit) I had lying around, still the same issue. :-)

DaveDavenport commented 1 year ago

Weird. I flashed mine recently without any problems.

JvHummel commented 1 year ago

Hi,

Ended up here after having the same issue. Updated from 0.9.5 to 0.10.

2023-02-02 23:49:12 - reboot cause: Exception (2) - Access to invalid address (28) ESP register contents: epc1=0x40241688, epc2=0x00000000, epc3=0x00000000, excvaddr=0x144c0000, depc=0x00000000
2106-02-07 07:28:19 - reboot cause: Exception (2) - Access to invalid address (28) ESP register contents: epc1=0x40241688, epc2=0x00000000, epc3=0x00000000, excvaddr=0x144c0000, depc=0x00000000
2023-02-02 23:48:53 - reboot cause: Exception (2) - Access to invalid address (28) ESP register contents: epc1=0x40241688, epc2=0x00000000, epc3=0x00000000, excvaddr=0x144c0000, depc=0x00000000

etc....

Took the OTGW out from the OpenTherm bus and the system seems stable now.

DaveDavenport commented 1 year ago

https://www.espressif.com/sites/default/files/documentation/esp8266_reset_causes_and_common_fatal_exception_causes_en.pdf

So it sounds like we are following an invalid pointer. I wonder why this happens, and why not for everybody.
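To make the "invalid pointer" reading concrete, the register values from the reboot log can be checked against the exception cause numbers and a rough ESP8266 memory map. This is a sketch under stated assumptions: the cause names are the commonly documented ones from the Espressif guide linked above, and the address ranges in `REGIONS` are approximate and given only for illustration; `region_of` is a hypothetical helper.

```python
# Common ESP8266 fatal exception causes (subset, names as usually documented).
EXCCAUSE = {
    0: "IllegalInstruction",
    3: "LoadStoreError",
    6: "IntegerDivideByZero",
    9: "LoadStoreAlignment",
    28: "LoadProhibited",
    29: "StoreProhibited",
}

# Approximate address ranges (assumption, for illustration only).
REGIONS = [
    ("data RAM", 0x3FFE8000, 0x40000000),
    ("boot ROM", 0x40000000, 0x40010000),
    ("instruction RAM", 0x40100000, 0x40108000),
    ("flash-mapped code", 0x40200000, 0x40300000),
]


def region_of(addr):
    """Return the name of the region containing addr, or None if unmapped."""
    for name, start, end in REGIONS:
        if start <= addr < end:
            return name
    return None


cause, epc1, excvaddr = 28, 0x40241688, 0x144C0000
print(EXCCAUSE.get(cause, "unknown"))  # kind of fault
print(region_of(epc1))                 # where the faulting instruction lives
print(region_of(excvaddr))             # None: the address being read is not mapped
```

Under these assumptions, `epc1` lies in ordinary flash-mapped code while `excvaddr` falls outside every mapped region, which is consistent with code dereferencing a corrupted or uninitialised pointer.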

JvHummel commented 1 year ago

Since the resets stop happening for me when I disconnect the OpenTherm connection, I'd wager it must be something device-specific, i.e. the thermostat or boiler sends data the firmware doesn't like.

I'm headed to bed for today but if you're interested, I'll try connecting boiler/thermostat separately tomorrow and see if either of them triggers the issue.

DaveDavenport commented 1 year ago

That would be useful, also please report your setup (boiler/thermostat).

Roos-AID commented 1 year ago

Can you change the setting "GPIOSENSORSenabled": true to false? What type of sensors are connected to GPIO 13?

JvHummel commented 1 year ago

Hi Dave,

Yes, of course. The following is my setup:

Just did some tests.

I reset the power on the boiler and OTGW every time I performed a test, so message intervals from any module shouldn't affect the tests, if my reasoning is correct.

I should also provide my settings; they are as follows.

{ "hostname": "OTGW", "MQTTenable": false, "MQTTbroker": "192.168.178.120", "MQTTbrokerPort": 1883, "MQTTuser": "xxx", "MQTTpasswd": "xxx", "MQTTtoptopic": "OTGW", "MQTThaprefix": "homeassistant", "MQTTuniqueid": "otgw-XXXXXXXXXXXX", "MQTTOTmessage": false, "MQTTharebootdetection": true, "NTPenable": true, "NTPtimezone": "Europe/Amsterdam", "NTPhostname": "pool.ntp.org", "LEDblink": true, "GPIOSENSORSenabled": false, "GPIOSENSORSpin": 13, "GPIOSENSORSinterval": 20, "S0COUNTERenabled": false, "S0COUNTERpin": 12, "S0COUNTERdebouncetime": 80, "S0COUNTERpulsekw": 1000, "S0COUNTERinterval": 60, "OTGWcommandenable": false, "OTGWcommands": "GW=1", "GPIOOUTPUTSenabled": false, "GPIOOUTPUTSpin": 16, "GPIOOUTPUTStriggerBit": 0 }

The system was stable with 0.9.5 but since I run PIC FW 6.4, I figured I'd update to 0.10 since the changelog mentions improved compatibility.

If there's anything else I can provide or try out, please let me know.

Roos-AID commented 1 year ago

Thanks, this one has no GPIO sensors attached. Great, so we can rule that out. I have seen a problem where OneWire detecting a strange device caused this, but with "GPIOSENSORSenabled": false that code is not executed.

I have tested with the Honeywell ChronoTherm Touch Modulation as well, but different boiler. No problem there.

I think we need at least a telnet trace or, better, a trace of the OpenTherm traffic captured with OTmonitor.

Suggestion: can we give a version compiled with 2.7.4 a try?
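To see at a glance which settings differ between the two reports in this thread, the posted JSON can be compared key by key. This is a minimal sketch using an abridged subset of the two settings dumps; `diff_settings` is a hypothetical helper, not something the firmware provides.

```python
import json

# Abridged copies of the two settings dumps posted in this thread.
FIRST_REPORT = json.loads("""
{ "hostname": "OTGW", "MQTTenable": true, "MQTTOTmessage": true,
  "NTPenable": true, "GPIOSENSORSenabled": true, "GPIOSENSORSpin": 13,
  "S0COUNTERenabled": false, "OTGWcommandenable": false,
  "GPIOOUTPUTSenabled": false }
""")
SECOND_REPORT = json.loads("""
{ "hostname": "OTGW", "MQTTenable": false, "MQTTOTmessage": false,
  "NTPenable": true, "GPIOSENSORSenabled": false, "GPIOSENSORSpin": 13,
  "S0COUNTERenabled": false, "OTGWcommandenable": false,
  "GPIOOUTPUTSenabled": false }
""")


def diff_settings(a, b):
    """Map each key whose value differs to the pair (value in a, value in b)."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


for key, (first, second) in diff_settings(FIRST_REPORT, SECOND_REPORT).items():
    print(f"{key}: {first} -> {second}")
```

Since both setups crash in the same way, a setting that differs between them (such as the GPIO sensor or MQTT flags here) is an unlikely common cause, which matches the reasoning above.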

hvxl commented 1 year ago

As this seems to be caused by some specific data the ESP receives from the PIC, it may be interesting to see what the PIC is sending. With at least hardware v2.3 and later, it is possible to power the board from a USB port of the PC and receive the serial data there, in addition to having a Wemos installed on the OTGW. Running a terminal emulator (or even OTmonitor) on the USB port may provide some valuable insights.

JvHummel commented 1 year ago

@hvxl Can you confirm that this will work? Based on the manual for HW rev. 2.3, it does not seem to be meant for this use case: "Do not connect a Micro USB cable to the WeMos D1 Mini while it is connected to the gateway!", so I am a bit wary of destroying my new toy 😉

@Roos-AID Does 2.7.4 refer to a version of a library or of the core, or something like that? Either way, I'd be happy to try.

DaveDavenport commented 1 year ago

I think @hvxl is talking about the USB port on the main board, not the one on the Wemos.

Roos-AID commented 1 year ago

You can display the debug log by connecting with telnet to the device's IP address. Alternatively, use OTmonitor and connect to port 25238.

If you use telnet, open the session before you connect power; otherwise you might miss the first messages.
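For a device that keeps restarting, a capture script that reconnects automatically may catch more output than an interactive session. The following is a rough sketch using only the Python standard library; the host name in the usage comment is a placeholder, port 25238 is the one mentioned above, and `capture` is a hypothetical helper rather than a tool shipped with the firmware.

```python
import socket
import time


def capture(host, port, duration=30.0, retry_delay=0.5, connect_timeout=2.0):
    """Collect raw bytes from host:port for `duration` seconds.

    Reconnects whenever the connection is refused, reset or closed,
    which is what happens each time the device reboots.
    """
    deadline = time.monotonic() + duration
    chunks = []
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=connect_timeout) as sock:
                sock.settimeout(1.0)
                while time.monotonic() < deadline:
                    try:
                        data = sock.recv(4096)
                    except socket.timeout:
                        continue  # nothing received yet, keep waiting
                    if not data:
                        break  # peer closed the connection
                    chunks.append(data)
        except OSError:
            pass  # refused or reset: fall through and retry
        time.sleep(retry_delay)
    return b"".join(chunks)


# Example use (placeholder host name, adjust to your own device):
# log = capture("otgw.local", 25238, duration=60)
# open("otgw-capture.log", "wb").write(log)
```

With a reboot every 10 seconds the window for a TCP connection is short, so this approach may still miss most traffic.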

hvxl commented 1 year ago

Sorry, I should have been clearer. Yes, what I meant was to power the OTGW board from a USB port on the PC.

@Roos-AID It's not possible to connect telnet before connecting the power. With the ESP rebooting every 10 seconds, there is hardly any chance to connect via TCP at all. That's why I suggested monitoring via USB.

JvHummel commented 1 year ago

Hi all,

I did a USB/TTY readout as @hvxl suggested. I've attached a log file. Hopefully it can shed some light on the situation.

putty.log

LacsapOV commented 1 year ago

In my case it's connected to a Honeywell Chronotherm Touch Modulation and an Atlantic Loria heat pump. GPIO is also off after I flashed it again.

According to the OTGW documentation, error 03 suggests a voltage issue. That could have explained my issue, since my board is new, but not JvHummel's.

I'll do my best to dump a log.

JvHummel commented 1 year ago

@LacsapOV My board is also new, I soldered it just 2 nights ago :) But it shipped with v0.9.5. My theory is that Nodo-shop pre-programs them in batches ahead of time.

Anyhow, you are right that it doesn't explain why 0.9.5 was stable for me.

hvxl commented 1 year ago

When I replay that I also get exception 28:

--------------- CUT HERE FOR EXCEPTION DECODER ---------------

Exception (28):
epc1=0x40241688 epc2=0x00000000 epc3=0x00000000 excvaddr=0x144c0000 depc=0x00000000

>>>stack>>>

ctx: cont
sp: 3ffffc50 end: 3fffffc0 offset: 0190
3ffffde0:  4028616c 0000001a 3ffffe84 3fff3690  
3ffffdf0:  00000001 144c0000 3ffffe84 402371fb  
3ffffe00:  00000001 3fff5678 3fff3670 4023722d  
3ffffe10:  3fff36cc 3fff5678 3fff3670 4020cebb  
3ffffe20:  34303742 30353036 00090030 70460500  
3ffffe30:  05460701 00000000 00000f66 00000000  
3ffffe40:  00000046 00000000 00000006 3ffe9477  
3ffffe50:  3ffe9480 3ffe9fcd 00000000 3fff3a88  
3ffffe60:  00000000 4bc6a7f0 472b020c 3fff5768  
3ffffe70:  00000000 00000000 2d000000 2d2d2d2d  
3ffffe80:  002d2d2d 00000000 001a001f 00000000  
3ffffe90:  00000000 2d2d2d2d 00000000 b65d1ae8  
3ffffea0:  0000002a ff000000 3fff4c08 00000200  
3ffffeb0:  3fff5678 3fff3690 3fff3670 4021ab1e  
3ffffec0:  34303742 30353036 00090030 70460500  
3ffffed0:  05460701 00000000 00000f66 00000000  
3ffffee0:  00000046 00000000 00000006 3ffe9477  
3ffffef0:  3ffe9480 3ffe9fcd 3fffff2c b65d1ae8  
3fffff00:  0000002a 3fff3630 3ffffee0 00000009  
3fffff10:  40000200 00000000 67617373 00000000  
3fffff20:  00000000 30323030 00000000 b65d1ae8  
3fffff30:  3fff57d4 3fff341c 00000010 3fff3414  
3fffff40:  3fff57d4 3fff341c 3fff4c08 4021b108  
3fffff50:  3fff6844 40221738 030207e7 3fff3a88  
3fffff60:  3fff3b40 3fff3b70 3fff3ba0 4021d29f  
3fffff70:  3fff3b40 3fff3b70 3fff3ba0 4021dcde  
3fffff80:  00000000 00000000 00000001 401004a8  
3fffff90:  3fffdad0 00000000 3fff5d84 3fff5d98  
3fffffa0:  3fffdad0 00000000 3fff5d84 40238460  
3fffffb0:  feefeffe feefeffe 3ffe86a8 401013b1  
<<<stack<<<

--------------- CUT HERE FOR EXCEPTION DECODER ---------------

DaveDavenport commented 1 year ago

Ah, nice. We should be able to translate that back into a readable backtrace (if we have the original ELF).

(I think there is a plugin for this: https://github.com/me-no-dev/EspExceptionDecoder)
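As a rough illustration of the first step such a decoder performs, the sketch below scans a stack dump for words that look like code addresses, which would then be resolved against the matching ELF file. The address range used by `code_addresses` is an assumption (instruction RAM plus flash-mapped code), and the helper is hypothetical; the real plugin may filter differently.

```python
import re

# A stack dump line: an address, a colon, then four 32-bit hex words.
STACK_LINE_RE = re.compile(r"^[0-9a-f]{8}:\s+((?:[0-9a-f]{8}\s*){4})$")

# A few lines taken from the dump posted above.
DUMP = """\
3ffffde0:  4028616c 0000001a 3ffffe84 3fff3690
3ffffdf0:  00000001 144c0000 3ffffe84 402371fb
3ffffe00:  00000001 3fff5678 3fff3670 4023722d
3ffffe10:  3fff36cc 3fff5678 3fff3670 4020cebb
3fffffb0:  feefeffe feefeffe 3ffe86a8 401013b1
"""


def code_addresses(dump, lo=0x40100000, hi=0x40300000):
    """Return stack words that fall inside the assumed code address range."""
    found = []
    for line in dump.splitlines():
        match = STACK_LINE_RE.match(line.strip())
        if not match:
            continue  # skip markers such as >>>stack>>> or ctx: cont
        for word in match.group(1).split():
            value = int(word, 16)
            if lo <= value < hi and value not in found:
                found.append(value)
    return found


for addr in code_addresses(DUMP):
    print(hex(addr))
```

This is a heuristic: any stack word that happens to fall in the code range is reported, whether or not it is a real return address, so the resulting list can contain stale or unrelated entries. Resolving the addresses against an ELF from a different build gives misleading function names as well.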

hvxl commented 1 year ago

Strangely I can't reproduce the issue on v0.10.0rc5 or when I compile v0.10.0 myself.

rvdbreemen commented 1 year ago

Which core are you compiling with, 3.0.2 or 2.7.4?

DaveDavenport commented 1 year ago

I cannot reproduce with 3.0.2 or 2.7.4.

hvxl commented 1 year ago

I tried both. However, comparing the reported debug information, I seem to end up with a different binary than you:

Firmware Version         0.10.0+36108cf
Free Heap Mem (bytes)    13488
Max. Free Mem (bytes)    12464
Arduino Core Version     3.0.2
Espressif SDK Version    2.2.2-dev(38a443e)
CPU speed (MHz)          160
Sketch Size (bytes)      601072
Sketch Free (bytes)      1495040
Flash ID                 001620C2
Flash Chip Size (MB)     4
Real Flash Chip (MB)     4
LittleFSsize             1
Flash Chip Speed (MHz)   40
Flash Mode               DIO
Board Type               WEMOS_D1MINI

Firmware version is probably different because I didn't use autoinc-semver. Heap usage changes dynamically. But I expected the sketch size to be the same. There's probably a difference in the libraries we use. I have the impression your "How to compile the OTGW firmware" wiki page is not current.

Did you manage to run the stack trace through the exception decoder?

DaveDavenport commented 1 year ago

I think we need the original ELF to decode the stack trace.

DaveDavenport commented 1 year ago

I can reproduce the crash with the release build, by the way:

-----------DER ---------------

Exception (28):
epc1=0x40241688 epc2=0x00000000 epc3=0x00000000 excvaddr=0x144c0000 depc=0x00000000

>>>stack>>>

ctx: cont
sp: 3ffffc50 end: 3fffffc0 offset: 0190
3ffffde0:  4028616c 0000001a 3ffffe84 3fff3690
3ffffdf0:  00000001 144c0000 3ffffe84 402371fb
3ffffe00:  00000001 3fff5678 3fff3670 4023722d
3ffffe10:  3fff36cc 3fff5678 3fff3670 4020cebb
3ffffe20:  34303742 30353036 00090030 70460500
3ffffe30:  05460701 00000000 000159eb 00000000
3ffffe40:  00000046 00000000 00000006 3ffe9477
3ffffe50:  3ffe9480 3ffe9fcd 00000000 3fff59b8
3ffffe60:  00000000 4bc6a7f0 0c49ba5e 3fff5768
3ffffe70:  00000000 00000000 2d000000 2d2d2d2d
3ffffe80:  002d2d2d 00000000 001a001f 00000000
3ffffe90:  00000000 2d2d2d2d 00000000 4637f0eb
3ffffea0:  0000002a ff000000 3fff3670 00000200
3ffffeb0:  3fff5678 3fff3690 3fff3670 4021ab1e
3ffffec0:  34303742 30353036 00090030 70460500
3ffffed0:  05460701 00000000 000159eb 00000000
3ffffee0:  00000046 00000000 00000006 3ffe9477
3ffffef0:  3ffe9480 3ffe9fcd 3fffff2c 3fff58ac
3fffff00:  0000002a 00000000 3ffffee0 00000009
3fffff10:  80000200 00000001 00000010 4010158c
3fffff20:  00000000 3fff341c 00000000 4637f0eb
3fffff30:  3fff57d4 3fff341c 00000010 3fff3414
3fffff40:  3fff57d4 3fff341c 3fff4c08 4021b108
3fffff50:  3fff6844 40221738 040207e7 3fff3a88
3fffff60:  3fff3b40 3fff3b70 3fff3ba0 4021d29f
3fffff70:  3fff3b40 3fff3b70 3fff3ba0 4021dcde
3fffff80:  00000000 00000000 00000001 401004a8
3fffff90:  3fffdad0 00000000 3fff5d84 3fff5d98
3fffffa0:  3fffdad0 00000000 3fff5d84 40238460
3fffffb0:  feefeffe feefeffe 3ffe86a8 401013b1
<<<stack<<<

--------------- CUT HERE FOR EXCEPTION DECODER ---------------

DaveDavenport commented 1 year ago

This backtrace is not correct as far as I can tell, so I really need the original elf:

x106-elf-gcc/3.0.4-gcc10.3-1757bed/bin/xtensa-lx106-elf-addr2line  build/esp8266.esp8266.d1_mini/OTGW-firmware.ino.elf dump.txt
Exception Cause: 28  [LoadProhibited: A load referenced a page mapped with an attribute that does not permit loads]

0x40241688: _ungetc_r at /workdir/repo/newlib/newlib/libc/stdio/ungetc.c:202
0x4028616c: etharp_output at ??:?
0x402371fb: _ZN12experimentalL11_SPICommandEjjjjjPjjj$constprop$0 at /home/qball/Programming/Other/OTGW-firmware/arduino/packages/esp8266/hardware/esp8266/3.0.2/cores/esp8266/core_esp8266_spi_utils.cpp:89
0x4023722d: _ZN12experimentalL11_SPICommandEjjjjjPjjj$constprop$0 at /home/qball/Programming/Other/OTGW-firmware/arduino/packages/esp8266/hardware/esp8266/3.0.2/cores/esp8266/core_esp8266_spi_utils.cpp:102
0x4020cebb: startTelnet() at /home/qball/Programming/Other/OTGW-firmware/networkStuff.h:167
0x4021ab1e: updateSetting(char const*, char const*) at /home/qball/Programming/Other/OTGW-firmware/settingStuff.ino:268
0x4010158c: pm_rtc_clock_cali_trig at ??:?
0x4021b108: handleMQTTcallback(char*, unsigned char*, unsigned int) at /home/qball/Programming/Other/OTGW-firmware/MQTTstuff.ino:142
0x40221738: DallasTemperature::calculateTemperature(unsigned char const*, unsigned char*) at /home/qball/Programming/Other/OTGW-firmware/libraries/DallasTemperature/DallasTemperature.cpp:638
0x4021d29f: OTGWSerial::processorToString() at /home/qball/Programming/Other/OTGW-firmware/src/libraries/OTGWSerial/OTGWSerial.cpp:945
0x4021dcde: OTGWUpgrade::stateMachine(unsigned char const*, int) at /home/qball/Programming/Other/OTGW-firmware/src/libraries/OTGWSerial/OTGWSerial.cpp:697
0x401004a8: esp_schedule at ??:?
0x40238460: ClientContext::state() const at /home/qball/Programming/Other/OTGW-firmware/arduino/packages/esp8266/hardware/esp8266/3.0.2/libraries/ESP8266WiFi/src/include/ClientContext.h:370
0x401013b1: timer1_isr_handler at /home/qball/Programming/Other/OTGW-firmware/arduino/packages/esp8266/hardware/esp8266/3.0.2/cores/esp8266/core_esp8266_timer.cpp:43

LacsapOV commented 1 year ago

I've got mine up and running, and it is stable. I compiled it myself using the steps in the documentation. It seems the binary linked in the installation documentation is faulty.

I did have an issue with the AceTime version. The documentation states version 1.9.0, but that version is missing the function unixSeconds64, so I updated to the latest release on the 1.x branch.

rvdbreemen commented 1 year ago

Glad to hear that a new build does work. It’s the same conclusion we are reaching on the firmware chat on discord.

What was the issue you ran into, so I can correct it? Also, AceTime needs to be the latest version, and Brian is very actively improving his lib too.

JvHummel commented 1 year ago

I can confirm that building the firmware myself and flashing that build remedies the bootloops.

rvdbreemen commented 1 year ago

@JvHummel thanks for confirming that. I just built a new release, 0.10.1. Would you be so kind as to test it? It's in the beta channel on Discord.

JvHummel commented 1 year ago

Good evening Robert, was just messing with a DS18B20 so had my OTGW out anyway. Good timing. Flashed 0.10.1-beta+7b22d7d and connected it to boiler/thermostat. No bootloops!