tbnobody / OpenDTU

Software for ESP32 to talk to Hoymiles/TSUN/Solenso Inverters
GNU General Public License v2.0

MQTT connection lost, laggy web interface, restart required #2172

Closed: broth-itk closed this issue 1 month ago

broth-itk commented 3 months ago

What happened?

Yesterday OpenDTU stopped publishing data to MQTT. The web interface was laggy, but I eventually managed to reboot the unit through it. Afterwards everything worked normally again.

This is the second or third time it has happened. The first two events required a power cycle to get everything back to normal.

To Reproduce Bug

No indication that the issue is reproducible. It looks like a memory leak or similar.

Expected Behavior

Well, the system should work with no outages :)

There is already another case where the implementation of a watchdog is discussed: https://github.com/tbnobody/OpenDTU/issues/693

Although I think the best approach would be to fix the root cause, a watchdog would help to recover from these situations.
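
To make the watchdog idea concrete, here is a minimal sketch of an escalating connectivity watchdog: first try a WiFi reconnect, then restart if the device stays silent. It is not taken from OpenDTU; `ConnectivityWatchdog`, `WatchdogAction` and both thresholds are hypothetical names chosen for illustration, and the class only decides what to do, leaving the actual reconnect or restart to the caller.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical escalation steps; these names are not part of OpenDTU.
enum class WatchdogAction { None, ReconnectWifi, Restart };

class ConnectivityWatchdog {
public:
    ConnectivityWatchdog(uint32_t reconnectAfterMs, uint32_t restartAfterMs)
        : _reconnectAfterMs(reconnectAfterMs)
        , _restartAfterMs(restartAfterMs)
    {
        // A restart is the more drastic step, so it must come later.
        assert(reconnectAfterMs < restartAfterMs);
    }

    // Call whenever a publish or network round trip succeeded.
    void feed(uint32_t nowMs)
    {
        _lastOkMs = nowMs;
        _reconnectTried = false;
    }

    // Call periodically from the main loop with the current tick count.
    WatchdogAction check(uint32_t nowMs)
    {
        // Unsigned subtraction stays correct when the millisecond
        // counter wraps around.
        const uint32_t silentMs = nowMs - _lastOkMs;

        if (silentMs >= _restartAfterMs) {
            return WatchdogAction::Restart;
        }
        if (silentMs >= _reconnectAfterMs && !_reconnectTried) {
            _reconnectTried = true; // request a reconnect only once per outage
            return WatchdogAction::ReconnectWifi;
        }
        return WatchdogAction::None;
    }

private:
    uint32_t _reconnectAfterMs;
    uint32_t _restartAfterMs;
    uint32_t _lastOkMs = 0;
    bool _reconnectTried = false;
};
```

On an ESP32 with the Arduino core, the caller could map `ReconnectWifi` to something like `WiFi.reconnect()` and `Restart` to `ESP.restart()`. That mapping is an assumption, not what OpenDTU does. A software check like this also only helps while the main loop is still running, which the blinking LED reported later in this thread suggests was the case.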

At the same time, remote logging would help to collect valuable system information like memory usage to track leaks, see https://github.com/tbnobody/OpenDTU/issues/1819

Install Method

Pre-Compiled binary from GitHub

What git-hash/version of OpenDTU?

v24.6.29

Relevant log/trace output

No response

Anything else?

No response

broth-itk commented 3 months ago

Happened again.

I saw that the unit was still connected to WiFi. As soon as I forced a "Disconnect" from my wireless infrastructure, it reconnected and was reachable again.

There was no need to reboot or similar.

Have there been any changes to the WiFi code in the last release? I don't remember having this issue before.

broth-itk commented 3 months ago

Let's see what happens with the new 24.8.1 release, I'll let you know. Maybe it's just a bug somewhere in the backend libs.

broth-itk commented 3 months ago

It just happened again.

The IP connection was down and there was no connection to the wireless infrastructure... The red LED blinked every 5 seconds, indicating that OpenDTU was still running somehow.

After a power cycle, everything was back to normal. Strange.

broth-itk commented 3 months ago

Has this been corrected with the latest version (wifi reconnect issue)?

I wonder how I can get the unit back online without being on site... hm

stefan123t commented 3 months ago

I think this might still be related to some MQTT buffer overload / heap fragmentation. Without USB serial logs from around the time the problem occurs, i.e. from shortly before and while it starts to lose the connection, this is hard to debug.

Though the comments in #2185 by @Kroki0815 here https://github.com/tbnobody/OpenDTU/issues/2185#issuecomment-2269008410 and by @jstammi here https://github.com/tbnobody/OpenDTU/issues/2185#issuecomment-2269617579 might shed some light on your issue too.

broth-itk commented 3 months ago

First I'm going to install the latest update to see if it helps. As I'm on vacation right now, this will take 2 weeks, since I need to power cycle the unit. Maybe a short power cut might help ;-)

USB serial debugging is the next step.

Thanks!

stuckinger commented 3 months ago

Have you tried another ESP32? I have experienced similar effects on different projects, even with simple stuff using esphome. The effect showed up on some boards but not on others running the same firmware. Most boards recover when they are soft rebooted remotely once they reappear after a short outage, and then run stable for a while. Some don't and need to be powered off. I think the quality of the chips may vary too much...

trixing commented 2 months ago

fwiw, I experienced the same failure mode, with no MQTT enabled though.

Kicking / blocking it from WiFi allowed it to reconnect and got it unstuck (no reboot required).

v24.8.5 "uptime":965588

stefan123t commented 2 months ago

@broth-itk are you back from your holidays, and have you already had time to upgrade to the latest version and do some serial logging?

Follow the link to the documentation to set up USB / serial logging: https://www.opendtu.solar/firmware/howto/serial_console/

stefan123t commented 1 month ago

@broth-itk hi Bernhard, there is a working PR for remote logging in #1819 / #2292, though you may need to build and flash the image yourself as it is not merged into master yet. Maybe this helps to monitor your OpenDTU and analyse this issue?

ranma commented 1 month ago

Additionally, newer versions export heap statistics under the ${prefix}/dtu/heap/ topic, in case this is a memory issue.
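
One way such heap readings could be checked for a slow leak is a least-squares trend over time. The sketch below is an assumption-laden illustration, not OpenDTU code: `HeapSample`, `heapSlopeBytesPerSecond` and `looksLikeLeak` are hypothetical names, and `freeBytes` stands in for whichever free-heap value is actually published, which this thread does not specify.

```cpp
#include <cstddef>
#include <vector>

// One heap reading; the field names are hypothetical.
struct HeapSample {
    double uptimeSeconds; // when the sample was taken
    double freeBytes;     // free heap reported at that time
};

// Least-squares slope of free heap over time, in bytes per second.
// A negative value means free heap is shrinking. Returns 0 when there
// are fewer than two samples or all samples share the same timestamp.
double heapSlopeBytesPerSecond(const std::vector<HeapSample>& samples)
{
    const std::size_t n = samples.size();
    if (n < 2) {
        return 0.0;
    }

    double meanT = 0.0;
    double meanF = 0.0;
    for (const auto& s : samples) {
        meanT += s.uptimeSeconds;
        meanF += s.freeBytes;
    }
    meanT /= static_cast<double>(n);
    meanF /= static_cast<double>(n);

    double cov = 0.0;
    double var = 0.0;
    for (const auto& s : samples) {
        const double dt = s.uptimeSeconds - meanT;
        cov += dt * (s.freeBytes - meanF);
        var += dt * dt;
    }
    if (var == 0.0) {
        return 0.0;
    }
    return cov / var;
}

// True when the trend loses at least minLossBytesPerHour of free heap.
bool looksLikeLeak(const std::vector<HeapSample>& samples, double minLossBytesPerHour)
{
    return heapSlopeBytesPerSecond(samples) * 3600.0 <= -minLossBytesPerHour;
}
```

A trend over many hours is more telling than a single low reading, because free heap normally fluctuates with load. Note that this only covers a steady loss of free bytes; the heap fragmentation suspected earlier in the thread would not necessarily show up in this number.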

broth-itk commented 1 month ago

@stefan123t @ranma Thanks for the PR and the syslog enhancement! This is much appreciated and will help a lot to gather information from the unit.

I compiled the code & webapp and from what I can tell it looks fine. Tomorrow I am going to see how it behaves when there are more logs generated from the unit.

broth-itk commented 1 month ago

I am going to close this issue, since it has not happened again since one of the recent updates. Maybe it was related to the recent WiFi issue? Heap monitoring is very valuable as well, since it makes it possible to track down a potential memory leak.
