rstrouse / ESPSomfy-RTS

A controller for Somfy RTS shades and blinds
The Unlicense
429 stars, 32 forks

FW crash: Enable the watchdog to self-detect/recover a FW crash #349

Closed Gerd-tech closed 1 month ago

Gerd-tech commented 2 months ago

Hardware

ESP32

Firmware version

v2.4.1

Application version

v2.4.1

What happened? What did you expect to happen?

First: The system works great and I am really impressed by the work that you did and its maturity.

The system has been running for a couple of months now, and I have noticed that every few weeks the firmware crashes (the device's web page is not reachable and the shades are not controllable). After a power cycle it works again. I am not a software engineer and have some difficulty reading the code; however, when I searched the source for "Watchdog" I could not find anything.

I think that it would be good to enable the ESP32 watchdog, so that in case the FW crashes (for any reason, even if not being a bug, like ESD discharge, temperature, or whatever else unexpected), the system will restart. It just makes the system more robust.

I could of course put a timer on the power socket and power cycle the device each night, but I think it would be much more elegant to use the watchdog for this (handling this kind of thing is the purpose of the watchdog).

Thank you

How to reproduce it (step by step)

I have 4 Somfy RTS shades paired. Just let it stay for a few weeks (without any action), then the device is not responding anymore on the web interface, and shades can not be controlled anymore in Home-Assistant (or HomeKit, when using HomeAssistant's HomeKit integration).

Note that the device sits in my veranda, where temperature variations are larger than indoors but not extreme (the min and max I have observed are 5 °C and 35 °C). This should not be a problem, but I still want to mention it.

Logs

No response

rstrouse commented 2 months ago

Please install v2.4.2 firmware. This release contains stability fixes.

magtimmermans commented 2 months ago

Hi, I have the same problem, and more often with the V2.4.2 release. The device stops responding, which means the blinds won't react. I must say that I am very impressed by this software and how it works. Great job! Hopefully, you can also improve the stability.

Gerd-tech commented 2 months ago

Hello,

Improving the firmware robustness is certainly a good thing. However, regardless of how robust a firmware is, there can always be unexpected borderline exceptions, especially for devices running 24/7.

I therefore think it would be good to enable the watchdog (its purpose is to catch unpredictable crashes and reset and recover the system). I always ask the firmware engineers I work with to do this (I am an electrical engineer at Apple). I don't know the details of the ESP32, but nearly every microcontroller has a watchdog function, so I believe the ESP32 should have one too. The tricky part might be finding a good place in the code to service the watchdog regularly (reset the watchdog timer) to show that the system is still alive and prevent the timeout that resets the system, but I am sure you can find the right place.

Another option, much simpler and one that could be implemented right away as a band-aid, would be to add a new function, configurable by the user in the UI, that restarts the system on a schedule defined by the user (for instance every Thursday night at 3am). In that case, however, it would be good to remember the blind status/position from before the restart.

Any chance you can implement one of these recovery mechanisms? Or maybe you can come up with an even smarter one.

Thank you very much

Best regards
Gerd

On 2 May 2024, at 11:30, magtimmermans wrote:

Hi, I have the same and more often with the V2.4.2 release. It is not responding which result that the blinds won't react. I must say that I am very impressed by this software and how it works. Great job! Hopefully, you can also improve the stability.

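For readers unfamiliar with the mechanism described in this comment, here is a minimal sketch of the watchdog idea in Python. It only simulates the logic; it is not ESPSomfy-RTS code and does not use the ESP32 watchdog API. The `SoftwareWatchdog` class, its `feed` and `expired` methods, the `run_loop` helper, and the injectable clock are all hypothetical names chosen for illustration.

```python
import time


class SoftwareWatchdog:
    """Toy watchdog: expires if feed() is not called within timeout_s."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock          # injectable so the logic can be tested
        self._last_fed = clock()

    def feed(self):
        """Called by healthy code to show the main loop is still alive."""
        self._last_fed = self._clock()

    def expired(self):
        """True once more than timeout_s has passed since the last feed."""
        return self._clock() - self._last_fed > self.timeout_s


def run_loop(watchdog, durations, advance_clock):
    """Simulate loop passes; return the index where the watchdog fires, or None.

    Each entry in `durations` is how long (in seconds) one loop pass takes.
    """
    for index, duration in enumerate(durations):
        advance_clock(duration)      # pretend the loop body took this long
        if watchdog.expired():
            return index             # real hardware would reset the chip here
        watchdog.feed()              # loop pass finished in time
    return None
```

The point of the design is that the feed call sits at the end of a healthy loop pass, so any pass that hangs or runs longer than the timeout is the one that trips the reset.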

rstrouse commented 2 months ago

I will be adding wdt to either the next release or the release after. There are some challenges with it at the moment though as I need to get my head around the interrupts that control the radio receiver. For OTA and git requests I will have to disable it I believe.

EDIT: Are you sure you are running the official release? There were issues with v2.4.1 where UDP packet responses were leaking memory. See #273
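The idea of disabling the watchdog during OTA and git requests can be sketched as a scoped "suspend" that always restores the previous state. This is an illustrative Python model under assumed names (`Watchdog`, `watchdog_suspended`), not the approach actually taken in the firmware.

```python
from contextlib import contextmanager


class Watchdog:
    """Minimal stand-in for a hardware watchdog that can be switched off."""

    def __init__(self):
        self.enabled = True


@contextmanager
def watchdog_suspended(watchdog):
    """Disable the watchdog for a long blocking job, then restore it."""
    previous = watchdog.enabled
    watchdog.enabled = False
    try:
        yield watchdog
    finally:
        # Restore even if the job raises, so a failed update
        # cannot leave the device unprotected.
        watchdog.enabled = previous
```

A long download would then run inside `with watchdog_suspended(wd): ...`, and the watchdog is re-armed as soon as the block exits, whether the download succeeded or not.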

geokscott commented 1 month ago

I too have had the device stop responding about 3 times since I set it up, about 3-4 weeks ago. A reboot fixes it. I don't have any absolute way, that I know of, to see what's causing it, but it just happened last night. I also had another device report last night that it lost its internet connection, but it recovered. So I suspect that these random issues might be the device not recovering from a dropped network connection. I have not looked at the code yet, but I have done a bit of ESP32 programming and have worked with lost WiFi connection recovery code. Do you think it's possible the WiFi recovery might need a look see??? Also, is there some way of capturing logs after an event to help troubleshoot? Thanks for the great product, I love it!

rstrouse commented 1 month ago

In the v2.4.3 pre-release there is now a watchdog timer that will reboot the ESP if it encounters any long running processes.

@geokscott which version of the firmware are you running? There was a UDP leak, fixed in v2.4.2, that caused the device to stop responding after several days when SSDP traffic was high. Essentially the datagram response was not being destroyed.

ESPSomfy RTS now does roaming to ensure it has the best possible connection for mesh networks and the connection is checked on each loop cycle.
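The kind of leak described in this comment, where a response object is allocated per request and never destroyed, can be illustrated with a toy model. The Python below is only a sketch of the general failure mode and its usual remedy; `ResponsePool`, `reply_leaky`, and `reply_fixed` are hypothetical names and do not reflect the actual firmware code or the actual fix.

```python
class ResponsePool:
    """Tracks how many response buffers are currently allocated."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_use = 0

    def acquire(self):
        if self.in_use >= self.capacity:
            raise MemoryError("no response buffers left")
        self.in_use += 1

    def release(self):
        self.in_use -= 1


def reply_leaky(pool):
    """Allocates a response but never frees it: one buffer lost per request."""
    pool.acquire()
    # ... send the reply ...


def reply_fixed(pool):
    """Frees the response on every path, including errors while sending."""
    pool.acquire()
    try:
        pass  # ... send the reply ...
    finally:
        pool.release()
```

With the leaky version the device works until the pool is exhausted, which matches the reported pattern of a failure that appears only after days of traffic.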

geokscott commented 1 month ago

Yeah @rstrouse , I forgot to say, I've been running 2.4.2 from the start.

I have tried with and without Roaming. I originally had it turned on, but after a couple of losses of service, I turned it off. I thought maybe that fixed it, but then last night the problem cropped up again.

You didn't say: is there any way of capturing logs? Can I run the device off USB and see logging info? I would be willing to do that and do some testing.

PS: Sorry, I just noticed this thread is for version 2.4.1. I just assumed it was 2.4.2

rstrouse commented 1 month ago

If you connect it with a data serial connection you can watch the logs with esphome.io.

cvhoang commented 1 month ago

@rstrouse version 2.4.2 still crashes every few days. I'm running on ESP32 C3.

rstrouse commented 1 month ago

I did upgrade the core to 2.0.16 on the pre-release. Install v2.4.3 pre version since that core fixes serial interrupt problems. Also if you are using HA you can watch the memory usage. There are memory entities now on the device.

cvhoang commented 1 month ago

Thanks. I've just upgraded the firmware to 2.4.3 pre. Let's see how it goes.

rstrouse commented 1 month ago

@cvhoang, @geokscott, and @cvhoang let me know if the update to the pre-release v2.4.3 solves the issue with your boards.

geokscott commented 1 month ago

@cvhoang, @geokscott, and @cvhoang let me know if the update to the pre-release v2.4.3 solve the issue with your boards.

I also installed pre-2.4.3 and updated my HA component a couple days ago. So far so good. I will let you know if anything changes.

magtimmermans commented 1 month ago

Me too, 2 days ago and still running fine.

magtimmermans commented 1 month ago

This morning the system was not available for a few minutes (2.4.3 pre). I am testing with Uptime Kuma every 2 minutes.

rstrouse commented 1 month ago

If it went off for a few minutes and came back, that means it either lost the connection and had to go through a cycle to come back, or it lost the wifi connection long enough to trigger AP mode for 3 minutes. Do you have HA, and what were the network RSSI and free memory at the time?

I am still getting a handle on how long it needs before the wdt is triggered. Currently, I have it set to 5 seconds but that may be too short.

rstrouse commented 1 month ago

I have increased the wdt timeout to 7 seconds and shortened the internet check. v2.4.3 has been released, so please open a new issue to report any new problems.

cvhoang commented 1 month ago

@rstrouse: I forgot to reply to this thread. V2.4.3 has been working well for me. Thank you for your work. Really fantastic stuff.

rstrouse commented 1 month ago

Thanks for letting me know. I think the stability issues are a thing of the past.

geokscott commented 1 month ago

I meant to respond also. I tried to find this thread last week but didn't see it because it was a closed item. Mine has been stable also. I have noticed only one oddity since the new version, and it happened only once. I have a wind sensor, an anemometer type. It has been working fine, but one day when there was little wind the web interface was reporting that wind had been sensed (the yellow warning icon was showing). It seems it was stuck on from the day before. I did a reboot from the web interface and it cleared. I have been keeping the web interface open in a browser tab and checking it daily since updating the firmware. That's why I even noticed. BUT other than that, it's been rock solid. Good job!

SO, when are you going to make these a for sale product and submit your component to HA and make it official? If you need any help with PCB designs or hardware let me know. Also 3D printing enclosures. I tinker in all of that.

rstrouse commented 1 month ago

Thanks for confirming. I chased this issue for a while and even had several devices that never exhibited the issue.

The wind sensor does not reset until it gets a sensor frame indicating no wind. Since this is not a persisted state it cleared on reboot.

Gerd-tech commented 1 month ago

Hello, thank you very much for all the updates in the past weeks. Version 2.4.3 is working well so far. Great job!

In case you have some time, I would still recommend adding an option in the config page to auto-reboot the device every so often (weekly, monthly, or so, user configurable). Even with the most stable operating systems, it is a good habit to reboot from time to time (auto-reboot can typically be set with the "pmset repeat wakeorpoweron" command line in recent macOS versions, useful in systems running 24/7). It is better to do this reboot at a scheduled non-critical time (like Monday nights at 3am) than to run the system until it reboots on its own on an unknown day/time.
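The scheduling part of such an auto-reboot option comes down to computing the next occurrence of a chosen weekday and hour. The following Python function is a minimal sketch of that calculation only; `next_reboot` and its parameters are hypothetical, it uses naive local datetimes, and it ignores daylight-saving transitions.

```python
from datetime import datetime, timedelta


def next_reboot(now, weekday, hour):
    """Return the next datetime strictly after `now` that falls on
    `weekday` (Monday=0 ... Sunday=6) at `hour`:00."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    # Move forward to the requested weekday (0 to 6 days ahead).
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        # The slot for this week has already passed; use next week's.
        candidate += timedelta(days=7)
    return candidate
```

For the "Monday nights at 3am" example, the firmware would call something like `next_reboot(now, 0, 3)` after each boot and restart once that time is reached.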

Besides that, I want to reiterate what I said in the original post: The system works great and I am really impressed by the work that you did and its maturity! Not only the firmware itself, but also the quality of the documentation!

geokscott commented 1 month ago

Thanks for confirming. I chased this issue for a while and even had several devices that never exhibited the issue.

The wind sensor does not reset until it gets a sensor frame indicating no wind. Since this is not a persisted state it cleared on reboot.

So what you're saying is the wind sensor sends a close command when it is tripped, and is supposed to send a no wind command when the wind dies down?

Well either my wind sensor is NOT sending a message when the wind stops OR the receiver is not picking it up. I cannot tell which for sure....

I have been monitoring the wind sensor warning on the UI over the past 3 days while I've been outside, and it has triggered and closed my awning at least once every day. Each time, the warning never goes away. I don't know how to verify this, but I suspect this wind sensor is NOT sending an all-clear message at all. It only sends a close command when it has a sustained wind speed of X amount.

If that IS the case, could your software be modified to automatically clear the sensor warning icon after a period of time elapses with no wind sensor triggers? Maybe other types of sensors send clear messages, but I'm pretty sure this one does not.
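The requested behaviour can be sketched as a flag that clears either on a "no wind" frame (the behaviour described earlier in the thread) or after a configurable quiet period. This Python model is purely illustrative; `WindFlag`, `on_frame`, `is_active`, and `auto_clear_s` are hypothetical names, and times are plain seconds supplied by the caller.

```python
class WindFlag:
    """Wind warning that clears on a 'no wind' frame or after a quiet period."""

    def __init__(self, auto_clear_s=None):
        # None means: only a 'no wind' frame can clear the warning.
        self.auto_clear_s = auto_clear_s
        self._active = False
        self._last_wind_at = None

    def on_frame(self, windy, now):
        """Process one sensor frame received at time `now` (seconds)."""
        if windy:
            self._active = True
            self._last_wind_at = now
        else:
            self._active = False

    def is_active(self, now):
        """Report the warning state, applying the optional timeout first."""
        if (self._active and self.auto_clear_s is not None
                and now - self._last_wind_at >= self.auto_clear_s):
            self._active = False
        return self._active
```

With `auto_clear_s=None` a missed "no wind" frame leaves the warning stuck, as observed above; with a timeout set, the warning clears once no wind frame has arrived for that long.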

This is the sensor I'm using: https://www.somfysystems.com/en-us/products/9012499/eolis-rts-wind-sensor-24v-dc-kit-includes-sensor-and-transformer

EDIT 6/3/2024

Today is a breezy day. I slightly adjusted the location of my receiver to check for possible reception issues. I opened the awning and waited for the wind to close it. A while after it was closed, the wind warning icon cleared. So, obviously I had a reception issue before! I guess the wind sensor does not have much or very reliable range. It's only maybe 60 feet from my receiver.

Thanks - George