rvdbreemen / OTGW-firmware

An ESP8266 devkit firmware for the Nodoshop version of the Opentherm Gateway (OTGW)
MIT License

NodeMCU resets a couple hundred times (for about 15-30 minutes) before starting to work #67

Closed mihsu81 closed 2 years ago

mihsu81 commented 2 years ago

Hi @rvdbreemen,

I recently purchased an OTGW (v2.0) and NodeMCU from Nodo Shop. With firmware v0.4 from https://otgw.tclcode.com/download/otgwmcu.zip it works fine. When I use your firmware v0.8.5+dc3dc14, the NodeMCU resets every 10 seconds for about 15-30 minutes, then eventually starts working and doesn't reset anymore. The PIC is on v5.1. I've tried with the boiler and thermostat connected and disconnected, with the same result. Without the NodeMCU the Gateway works fine and the boiler and thermostat can communicate. When it resets, all 4 LEDs light up on the Gateway board (red/orange/green/green) for about half a second. When the NodeMCU resets, LED1 lights up for 1-3 seconds until WiFi gets connected, while LED2 stays lit until the NodeMCU resets again. If I connect it via USB to my laptop it works fine (disconnected from the gateway :D).

The last reset reason in the Web server is listed as: "External System". It's like a watchdog resets it because of a software fault it detects.

I wasn't able to find any logs or a way to enable them. I connected to it on port 23 but I don't get any output for the few seconds it is connected to WiFi.

The issue seems to be similar to https://github.com/rvdbreemen/OTGW-firmware/issues/56.

Thanks in advance for your help.

rvdbreemen commented 2 years ago

Hi @mihsu81 Thank you for reporting your issue; the details you shared help a lot in figuring out what might be going on. First off, it sounds like it is being reset while waiting for something. The "External System" reason is actually the watchdog reset, and yes, after about 5 seconds of waiting for anything, the MCU gets reset by the watchdog.

So two things happen between the first blue LED turning off (WiFi connect) and the second (end of setup) that can cause the problem you are seeing. At least, that I know of so far:

  1. NTP time sync does not complete
  2. MQTT connection is not connecting

If you have an external-facing firewall that filters out all traffic, then the NTP sync could be the actual cause of this issue, as the DNS is not resolving to the right address, and it will get into a reset loop. If this is the case, then please add a rule allowing the OTGW to go to "time.google.com".
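A quick way to check this from a PC on the same network is to send a plain SNTP request yourself and see whether a reply comes back in time. The following is only a minimal sketch, not code from the firmware: the function names (`build_sntp_request`, `parse_sntp_response`, `probe_ntp`) and the 3 second timeout are assumptions made for illustration.

```python
import socket
import struct

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2208988800


def build_sntp_request() -> bytes:
    # First byte 0x1B = LI 0, version 3, mode 3 (client); the rest stays zero.
    return b"\x1b" + 47 * b"\x00"


def parse_sntp_response(data: bytes) -> int:
    """Return the server's transmit timestamp as Unix seconds."""
    if len(data) < 48:
        raise ValueError("short NTP reply")
    mode = data[0] & 0x07
    if mode != 4:  # 4 = server
        raise ValueError(f"unexpected NTP mode {mode}")
    seconds = struct.unpack("!I", data[40:44])[0]
    return seconds - NTP_EPOCH_OFFSET


def probe_ntp(host="time.google.com", timeout=3.0):
    """Return Unix time from the server, or None if no valid reply arrives."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)  # hypothetical limit, not the firmware's value
        try:
            sock.sendto(build_sntp_request(), (host, 123))
            data, _ = sock.recvfrom(512)
            return parse_sntp_response(data)
        except (OSError, ValueError):
            # DNS failure, timeout, blocked port or malformed reply
            return None
```

If `probe_ntp()` returns `None` from the PC, the OTGW will most likely not get an answer either, and the firewall or router is the place to look.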

Or... the other thing could be that it's never connecting to the MQTT broker in time. It should just time out in three seconds, but I have found this can be an issue in some situations. It is not clear what the conditions actually are.

So, did you configure the MQTT hub? And have you tried connecting from your PC with a tool like MQTT Explorer? Just to make sure you can connect with the userid/password (or no password)?

Hope to hear from you to figure this out. Thanks Robert

mihsu81 commented 2 years ago

Hi Robert,

I can confirm that "time.google.com" is reachable by every device on the network. MQTT is configured on the NodeMCU and can connect just fine. I'm using the Mosquitto MQTT Broker and I can connect to it with MQTT Explorer (with username/password). From the moment the NodeMCU connects to WiFi (within 1-3 seconds) until it resets again, I don't see any MQTT messages posted by it to the Mosquitto MQTT Broker. So it looks like you were right and this is what's causing the reset. Can we increase the time it waits before a reset, just as a test? Would that cause other issues?

Best regards Mihai

mihsu81 commented 2 years ago

Hi Robert,

I did some more troubleshooting regarding NTP and realized the issue was caused by my ASUS router. I had previously enabled redirection of NTP traffic to ntpMerlin, and it looks like either the NodeMCU doesn't like that, there's an issue with ntpMerlin, or the reply didn't come back in time. After disabling the NTP redirection, the NodeMCU no longer restarts, unless it can't connect to WiFi.

Thanks a lot for your help in getting this issue solved and for the awesome job you did with this project. 😊

rvdbreemen commented 2 years ago

@mihsu81 thanks for that feedback. So it was the NTP timeout that was causing your issue in the end. If you have any other issues, just come back and I am happy to try and help fix them.