mongoose-os-apps / shelly-homekit

Apple HomeKit firmware for Shelly's

Shelly 2.5 reboots every x-minutes #252

Closed patricks closed 3 years ago

patricks commented 4 years ago

Hi,

I have had 3 Shelly 2.5s installed for a few months now. One of them reboots every x minutes. I noticed this because the light turns off for a few seconds and the uptime in the web UI is reset. Is there a way to figure out what's going wrong? I tried to enable debug logging, but it looks like the log gets cleared after a reboot?

I am always running the latest firmware (currently 2.4.0), but this problem also appeared with older ones.

rojer commented 4 years ago

go to http://DEVICE/debug/core in your browser. if the firmware crashed, you'll get a core dump. please send it to rojer@rojer.me, i'll take a look.
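For anyone who would rather script this than use a browser, here is a minimal Python sketch that downloads whatever `/debug/core` returns and saves it to a file. Only the URL path comes from the comment above; the function names, the output file name, and the assumption that the response can be saved as-is are illustrative. What the device returns when there is no core dump is not stated in this thread, so the sketch does not try to interpret the response.

```python
import urllib.request
from pathlib import Path


def core_dump_url(host: str) -> str:
    """Build the core dump URL from the comment above for a host name or IP."""
    return f"http://{host}/debug/core"


def save_core_dump(host: str, out_dir: str = ".", timeout: float = 10.0) -> Path:
    """Download the response body unchanged and write it to disk.

    Raises urllib.error.URLError (or HTTPError) if the device is
    unreachable or answers with an HTTP error status.
    """
    with urllib.request.urlopen(core_dump_url(host), timeout=timeout) as resp:
        data = resp.read()
    # Hypothetical file name; pick whatever is convenient to attach to an email.
    out_path = Path(out_dir) / f"core-{host}.bin"
    out_path.write_bytes(data)
    return out_path
```

Calling `save_core_dump("192.168.1.50")` (with your device's address) would write `core-192.168.1.50.bin` to the current directory.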

Pixel-Chris commented 4 years ago

@patricks Is that a Shelly 2.5 in garage door mode?

patricks commented 4 years ago

go to http://DEVICE/debug/core in your browser. if the firmware crashed, you'll get a core dump. please send it to rojer@rojer.me, i'll take a look.

It rebooted a few times today (I can see it from the uptime in the web UI), but there is no core dump.
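Spotting reboots from the uptime can be automated once the readings are noted down. The sketch below assumes the uptime is shown as `D:HH:MM:SS` (the form quoted later in this thread, e.g. `0:01:13:07`) and counts a reboot whenever a reading is lower than the previous one. How the readings are collected is left open; the function names are made up for illustration.

```python
def parse_uptime(text: str) -> int:
    """Convert an uptime string of the assumed form 'D:HH:MM:SS' to seconds."""
    days, hours, minutes, seconds = (int(part) for part in text.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def count_reboots(samples: list[str]) -> int:
    """Count how often the uptime went backwards between consecutive readings.

    Reboots that happen between two readings without the uptime dropping
    below the earlier reading are not detected.
    """
    secs = [parse_uptime(sample) for sample in samples]
    return sum(1 for prev, cur in zip(secs, secs[1:]) if cur < prev)
```

For example, the readings `["0:05:00:00", "0:00:00:31", "0:00:10:00", "0:00:00:05"]` contain two drops, so `count_reboots` reports 2.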

patricks commented 4 years ago

@patricks Is that a Shelly 2.5 in garage door mode?

No, it is in switch mode.

rojer commented 4 years ago

a reboot with no crash dump likely points to a hardware issue: voltage instability (sags, spikes) or some such.

andyblac commented 4 years ago

possibly overheating too?

rojer commented 4 years ago

overheating just disables HAP service, it doesn't reboot the device

andyblac commented 4 years ago

overheating just disables HAP service, it doesn't reboot the device

OK, good to know, thanks.

rudyemm commented 4 years ago

Hi guys! I have 4 Shelly 2.5s set up as light switches (each only as a single switch), and I have the exact same problem. I thought I had defective units, which seemed unlikely since multiple units face the same issue. It’s reassuring to know others are seeing the same problem, so hopefully we can solve it.

Some more details about my specific issue. Sometimes, the Shelly will run “stable” for many days. Sometimes, they will restart every few hours. A couple of times, the Shelly seemed to be in a restart loop, rebooting every few seconds. It was bizarre. I rolled back to the stock firmware and set those affected units up on a timer (sunset till 11pm), and the lights seemed to stay on consistently. When the 2.4.0 firmware came out, I thought I would try HomeKit again, but the issue persists.

Separately, I’ve also noticed that an automation in HomeKit to turn off all my lights in the evening had failed on one of the Shellys. When I try to access that Shelly by its (static) IP, the web server is completely unresponsive until I power the Shelly down and back up.

Also, the WiFi signal is not strong to the Shellys that exhibit these issues, but it’s not terrible either (-63 to -72). I’m not sure if that is playing any part in the restart issue.

Please let me know if you’d like any more details, tests, or logs from me. Keep it up 👍 I love your work!

rojer commented 4 years ago

hm. naturally, i'd like to see logs. i've recently made a change to make acquiring the latest logs easier. it hasn't been released yet, so please install a beta from here - http://rojer.me/files/shelly/2.5.0-beta1/ and go to http://DEVICE/debug/log. leave the page open, it will be tailing the log. when the device reboots, please take the last dozen lines or so from that page.
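If keeping a browser tab open is inconvenient, the same idea can be approximated with a polling script that appends new log text to a local file, so the last lines before a reboot survive. This is only a sketch: it assumes `/debug/log` answers a plain GET with a snapshot of the current log as text, which this thread does not confirm, and the function names, polling interval, and file handling are illustrative.

```python
import time
import urllib.request


def log_url(host: str) -> str:
    """Build the log URL from the comment above."""
    return f"http://{host}/debug/log"


def new_text(previous: str, current: str) -> str:
    """Return the part of `current` not already seen in `previous`.

    If `current` does not extend `previous` (for example the log was
    cleared by a reboot), all of `current` is treated as new.
    """
    if current.startswith(previous):
        return current[len(previous):]
    return current


def tail_to_file(host: str, out_path: str, interval: float = 5.0, rounds=None) -> None:
    """Poll the log and append anything new to `out_path`.

    `rounds=None` polls until interrupted; a number limits the polls.
    """
    previous = ""
    done = 0
    with open(out_path, "a", encoding="utf-8") as out:
        while rounds is None or done < rounds:
            try:
                with urllib.request.urlopen(log_url(host), timeout=5) as resp:
                    current = resp.read().decode("utf-8", errors="replace")
            except OSError:
                # Device unreachable (possibly rebooting); try again next round.
                current = None
            if current is not None:
                out.write(new_text(previous, current))
                out.flush()
                previous = current
            done += 1
            time.sleep(interval)
```

Running `tail_to_file("DEVICE", "shelly.log")` and checking the end of `shelly.log` after a reboot would give roughly the "last dozen lines" asked for above, provided the endpoint behaves as assumed.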

rudyemm commented 4 years ago

Hey @rojer, I've emailed you the core dump files. I'm installing the beta FW on 2 of my Shellys (the most problematic ones). I also wanted to point out that the temperatures of the Shellys (as per the UI) are 74°C, 95°C, 92°C, and 86°C. I will also note that my electrical wiring has no N wire at the switch, so my Shellys are installed at the light end (inside the housing, so maybe that's also why it's getting hot in there).

@patricks Can you share more details about your setup? Is it the standard wiring behind the light switch? How's your temp?

rudyemm commented 4 years ago

More notes 😊

Some more bizarre behavior. I check in the UI and see the Shelly has recently rebooted, reporting "Uptime: 0:00:00:31". A few moments later, I wanted to check if the Shelly rebooted again, but now (less than 10 minutes later), it's reporting "Uptime: 0:01:13:07". I recall this weird behavior occurring in the previous firmware as well, before I installed 2.5.0-beta1.

A little while later, one of the Shellys started behaving as if possessed. Within 4 minutes, it rebooted 10 times. I used the stopwatch on my phone, and here are the durations between reboots: 34s, 6s, 12s, 5s, 22s, 1m01s, 18s, 10s, 4s, 1m10s.

I've now switched them off, I'll try them again shortly. But super weird behavior. I'm wondering if anyone else has had this issue. The only other / last time this happened was about 3 weeks ago.
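The stopwatch readings above can be checked with a few lines of Python. The sketch parses the ten durations as written in the comment and sums them; the helper name and the regular expression are illustrative.

```python
import re


def parse_duration(text: str) -> int:
    """Convert a stopwatch reading such as '34s' or '1m01s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + int(match.group(2))


# The ten intervals between reboots, copied from the comment above.
INTERVALS = "34s 6s 12s 5s 22s 1m01s 18s 10s 4s 1m10s"

seconds = [parse_duration(token) for token in INTERVALS.split()]
total = sum(seconds)

print(f"{len(seconds)} reboots in {total // 60}m{total % 60:02d}s")
print(f"shortest {min(seconds)}s, longest {max(seconds)}s")
```

The intervals add up to 242 seconds, which matches the "within 4 minutes" description, with gaps ranging from 4 seconds to 70 seconds.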

rojer commented 4 years ago

@rudyemm i've taken a look at the core dumps - looks like stack is smashed, i'm not getting a meaningful stack trace out of it right away, will need more digging.

rudyemm commented 4 years ago

Sure, let me know if I can provide any other dumps, logs, etc and I’m happy to help you experiment 👍

rojer commented 4 years ago

i've taken a closer look at the dumps. the reason i'm not getting a backtrace is not stack smashing: the crash happens in binary libs supplied by espressif, and those don't have the debug symbols necessary to find function entrypoints. anyway, the firmware enters an endless loop and gets reset by the WDT. as far as i can tell from the disassembly, it prints "mac 985" as the reason:

   0x40102612:  l32r    a2, 0x4010214c
   0x40102615:  l32r    a3, 0x40102150
   0x40102618:  movi    a4, 0x3d9
   0x4010261b:  l32r    a0, 0x4010115c
   0x4010261e:  callx0  a0
=> 0x40102621:  j       0x40102621

0x4010214c is the format string, "%s %u"; 0x40102150 is "mac"; and 0x3d9 is 985.

i see other people reporting it as well, but without any reason or resolution... possibly some device on your network generates some traffic that confuses the esp's wireless stack, this happened before. unfortunately, there's very little i can do, this is all in closed code.
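The reading of the disassembly can be reproduced with a short Python sketch. It parses the listing exactly as posted, picks out the immediate loaded by `movi`, and flags any jump whose target is its own address (the endless loop that the watchdog eventually breaks). The format string `"%s %u"` and the string `"mac"` are taken from the explanation above, not read from the binary; the parser and its names are illustrative.

```python
import re

# The listing as posted above; "=>" marks the instruction the dump stopped at.
LISTING = """\
   0x40102612:  l32r    a2, 0x4010214c
   0x40102615:  l32r    a3, 0x40102150
   0x40102618:  movi    a4, 0x3d9
   0x4010261b:  l32r    a0, 0x4010115c
   0x4010261e:  callx0  a0
=> 0x40102621:  j       0x40102621
"""

LINE_RE = re.compile(r"^(?:=>)?\s*(0x[0-9a-f]+):\s+(\S+)\s+(.*)$")


def parse_listing(listing: str):
    """Return (address, mnemonic, operands) tuples for each listing line."""
    rows = []
    for line in listing.splitlines():
        match = LINE_RE.match(line)
        if match:
            rows.append((int(match.group(1), 16), match.group(2), match.group(3).strip()))
    return rows


rows = parse_listing(LISTING)

# The immediate loaded into a4 is the number passed to the print call.
line_number = next(
    int(ops.split(",")[1].strip(), 16) for _, op, ops in rows if op == "movi"
)

# A jump to its own address never makes progress: an endless loop.
self_loops = [addr for addr, op, ops in rows if op == "j" and int(ops, 16) == addr]

message = "%s %u" % ("mac", line_number)
print(message)
print([hex(addr) for addr in self_loops])
```

This prints `mac 985` and `['0x40102621']`, consistent with the explanation that the code reports "mac 985" and then spins until the WDT resets the device.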

rudyemm commented 4 years ago

Oh nice, thank you so much for your investigation. This makes some sense, and at least I feel confident there is no malfunction with the Shelly. This could be a result of the 2 most problematic Shellys connecting to my WiFi repeater (they’re located in the garden), while the other 2 “stable” Shellys connect straight to my UniFi network inside the house and do not exhibit so much of the restart problem. I restarted my WiFi repeater, and the Shellys connected to it have been stable for a full day without restarting. I will keep testing and perhaps extend my UniFi to these Shellys that keep restarting, and I’ll report back with my findings.

@patricks can you confirm how the Shellys are configured in your home network? Do you have any WiFi repeaters?

Separately @rojer do you think the temperatures I’m seeing with my Shellys of up to 95°C could be causing any stability issues, and more importantly is this normal or is it dangerous to be running that hot?

Thanks again for everyone’s help, and keep up the great work 👍👍

patricks commented 4 years ago

This sounds interesting, @rudyemm. It looks like I have a very similar WiFi setup: UniFi Amplifi router + WiFi repeater. The problematic Shelly is also the only one that is connected via the repeater. I have already tried rebooting my router, and then the Shelly works for a few days, but after that I have the same problems again. The temperature on my Shellys is about 65°C.

patricks commented 4 years ago

@rudyemm are your problems gone since the router restart? I have the same problems again.

konagar commented 4 years ago

Hi, at what voltage do you run your Shellys? There are problems with 24 volts.

rudyemm commented 3 years ago

@konagar mine is on 230 V.

@patricks after a lot of testing, my conclusion is that a weak WiFi signal is causing the FW to crash and reboot, even after completely removing the repeater. My network setup is enterprise UniFi, not the Amplifi. I have moved my access point closer to the Shelly, although the signal is still weak (they are in my garden). The web server is still slow to respond and the Shelly is still exhibiting the reboot behavior.

There is still a scenario in which it falls into a reboot loop, and I’m not sure what causes this. I don't know if the high temperature has any effect on this behavior. I have ordered another access point to provide better WiFi to the Shellys, and will report back with my findings once they have a good WiFi signal.

@rojer perhaps it’s worth testing the reliability/stability of the FW with a weak signal (RSSI: -80 or lower)?

andyjp80 commented 3 years ago

I've been experiencing the same issue with a Shelly 2.5 I had installed. I have over 15 other Shelly 1's and 1PM's running the firmware, and they have been running perfectly for over 6 months now on the same network config. I purchased some more 2.5's and have just done some more testing. The 2.5 I started having this issue with is next to a 1PM behind a switch and is about 3m away from the closest Nano HD access point. I have 2 Nano HD's with a separate 2.4G SSID set up on each one, which makes sure the Shellys only connect to the AP physically closest to them and don't hop between AP's. RSSI is -52, so I don't think signal strength is an issue. I reverted to stock firmware and ran for a week with no issues on the same network config. I set up a second new 2.5 yesterday on the bench and it ran for about 8hrs before dropping off. I've got some other devices on that SSID that I'm going to move off and test again. Failing that, I'm going to try to set up a VLAN on the UniFi Controller with only the 2.5 on it and see if that has any impact.

rojer commented 3 years ago

this is interesting, there must be something to the 2.5 that causes it... but at the moment i have no idea what it could be. if someone could capture serial logs continuously from a running device, that would be great. warning: you HAVE to use an isolated serial to usb adapter, or you will regret it.

andyjp80 commented 3 years ago

If you could point me in the direction of the right kind of serial usb adapter and how to capture the logs I'll have a crack at it. I'd love to get these working as reliably as the 1's and PM's as they're probably more reliable than some of my genuine homekit stuff ha ha.

rojer commented 3 years ago

i wish i could... i have a custom thing i use. but you need an isolated TTL level serial converter to USB

andyblac commented 3 years ago

If you could point me in the direction of the right kind of serial usb adapter and how to capture the logs I'll have a crack at it. I'd love to get these working as reliably as the 1's and PM's as they're probably more reliable than some of my genuine homekit stuff ha ha.

i use this https://www.amazon.co.uk/gp/product/B07BBPX8B8/ref=ppx_yo_dt_b_search_asin_title?ie=UTF8&psc=1

rojer commented 3 years ago

@andyblac it's not isolated, never connect it while the device is connected to mains, you know what happens if you do :)

andyblac commented 3 years ago

ye, i always remove all wires 1st 😄

rudyemm commented 3 years ago

@andyjp80 hi there 👋 great to hear your feedback. Are you facing (a) reboots only, (b) unresponsive web UI, or (c) both? It’s good to have more people involved in this, hopefully we can get to the bottom of what is going on. It’s also reassuring to hear from you that your Shelly 1 devices are operating normally.

andyjp80 commented 3 years ago

@rudyemm the core dump from my unit pointed to the "mac 985" issue. When it happens, it shows "no response" in HomeKit, it disappears from my network, and the web UI isn't accessible. If I power the circuit off at the breaker in the switchboard, it comes back to normal operation. Yesterday I moved the 2.5 onto its own 2.4G SSID and VLAN, on a different IP range from the rest of my network (it's the only device on it), and so far it's been up for 24hrs without issue. I'll report back in a few days to see if it's still going.

andyjp80 commented 3 years ago

@rojer something like this? https://www.amazon.com.au/KNACRO-Isolated-Serial-Module-Fully/dp/B07L2MT5QJ

rojer commented 3 years ago

yes, that's more like it

rojer commented 3 years ago

i plan to update Espressif SDK to latest version in the near future, hopefully this will help or at least give us an opportunity to report issues to Espressif as we'll be running their latest and greatest.

jllavina commented 3 years ago

Hi, I'm adding this comment from #310.

I have installed 3 Shelly 2.5 to control the roller shutters at home. 2 of them are working without problems, but the third one only works for a couple of days and then it loses the connection to the Wi-Fi (and HomeKit of course), with no reboots as far as I know (I will check logs next time). The physical switches are still working without problems, but the only solution to recover the device is to cut off the power to force a restart. Then, it reconnects to the Wi-Fi immediately (I will try to restart the router next time).

All devices are in the same Wi-Fi (no repeaters) and running firmware 2.6.1. Temperature is around 50 ºC in all devices. And Wi-Fi signal is -70, -76 for the devices working fine and -83 for the one with intermittent failures (so, this is the highest difference between them).

andyjp80 commented 3 years ago

Quick update - I have had the 2.5 on its own SSID (tied to a single NanoHD AP) and VLAN for 2 days now and it is still working well. As mentioned above, when I had it on my main network with the other Shellys and a couple of other devices (Ring Doorbell and Garage Door Opener), it would consistently drop out after 3-12hrs and I would need to cut power, then it would become responsive again for another 3-12hrs. Seems to be a viable workaround for now. Now I'm going to add 2 more 2.5's to the VLAN and see if they all stay stable.

andyjp80 commented 3 years ago

Spoke too soon - it just rebooted itself and turned the lights on. Back to the drawing board. At least it's still responsive without a hard power off, I guess.

rudyemm commented 3 years ago

Damn, but at least we’re narrowing it down 🤣 what’s the WiFi signal strength for the Shellys? I’ve anecdotally noticed (roughly) that the worse the signal, the more often the reboot.

rojer commented 3 years ago

thanks everyone for your efforts, i'm watching this closely. i am also working on Espressif SDK update, that will hopefully fix this. i'll let you know when i have something to test.

jllavina commented 3 years ago

It happened again a few days ago. I restarted the router and all the devices reconnected fine except that one. I cut off its power to force the restart and I moved the device a little to improve the Wi-Fi signal (now it is -72 instead of -83). It has been working fine for 5 days, so it seems that my main problem is fixed. If it doesn't lose the connection everything is right...

rudyemm commented 3 years ago

Hi @andyjp80 👋 how was your experience so far? Were you able to get the serial adapter to retrieve logs of the crash/reboot? For me, the 2.7 beta still exhibits the same crash/reboot behavior.

rojer commented 3 years ago

please test 2.7 beta and let me know your experience - https://github.com/mongoose-os-apps/shelly-homekit/issues/330

rojer commented 3 years ago

@rudyemm thanks for providing the dumps. by the looks of it, something related to dns-sd is leaking memory - heap autopsy shows a lot of active allocations with dns-sd advertising data. will keep looking.

rojer commented 3 years ago

some more investigating today: it's a connection leak, related to dns-sd. somehow connections get left behind... will continue.

rojer commented 3 years ago

ok, it's not a connection leak but a leak of pending pbufs when closing UDP connections. i think i've found the reason, @rudyemm please update to beta3 and let me know if it helps.

rudyemm commented 3 years ago

I'm still facing crashes/reboots with beta 3 – how's the rest of your guys' experience?

I've updated to the stable 2.7.0 and will report back any new findings. I assume not much has changed from beta to stable that addresses this topic @rojer ?

rudyemm commented 3 years ago

Maybe if we can perform a full, clean wipe of the device (#308), we can determine whether the issues we're facing are the result of a configuration issue.

rojer commented 3 years ago

@rudyemm 2.7.0 was getting stale on the cooker, and i did fix a couple issues that should improve things, so i decided to push it out. 2.7.0 is just a rebuild of the same code as beta3, so no change is expected for you. i understand that you are still facing issues, and we will continue to investigate them. i think next step is to enable remote logging to my server and see if anything comes along that way. i will give you instructions on how to do it soon.

rojer commented 3 years ago

@rudyemm please use the following url:

http://shelly25-test.local/rpc/Config.Set?config=%7b%22debug%22%3a%7b%22udp_log_addr%22%3a%2235.205.201.239:13001%22%7d%7d&reboot=true

replace shelly25-test.local with the name of each device that experiences issues. this will send logs to my server so i can hopefully see what's wrong. please give me device IDs (available in the system section of the web ui) so i can know which is which.
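The `config` parameter in that URL is just percent-encoded JSON. The Python sketch below decodes the URL as posted and rebuilds an equivalent one for another host name. The helper names are made up for illustration; the endpoint, the `debug.udp_log_addr` key, and the `reboot=true` parameter are taken from the URL itself.

```python
import json
from urllib.parse import parse_qs, quote, urlsplit

# The URL exactly as posted above.
ORIGINAL = (
    "http://shelly25-test.local/rpc/Config.Set"
    "?config=%7b%22debug%22%3a%7b%22udp_log_addr%22%3a%2235.205.201.239:13001%22%7d%7d"
    "&reboot=true"
)


def decode_config(url: str) -> dict:
    """Extract and parse the JSON carried in the `config` query parameter."""
    query = parse_qs(urlsplit(url).query)
    return json.loads(query["config"][0])


def build_url(host: str, udp_log_addr: str) -> str:
    """Build an equivalent Config.Set URL for another device host name."""
    config = json.dumps({"debug": {"udp_log_addr": udp_log_addr}}, separators=(",", ":"))
    return f"http://{host}/rpc/Config.Set?config={quote(config, safe='')}&reboot=true"


print(decode_config(ORIGINAL))
print(build_url("shelly25-garden.local", "35.205.201.239:13001"))
```

Decoding shows the request sets `debug.udp_log_addr` to `35.205.201.239:13001` and asks for a reboot. `shelly25-garden.local` is a placeholder host name; substitute your own device's name.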

rudyemm commented 3 years ago

Happy New Year everyone 🥳

I hope the remote logs have provided valuable info @rojer – were you able to conclude any issues?

andyjp80 commented 3 years ago

2.7 seems to have done the trick for me.. have been online for 9 days now with no issues.. longest it lasted before was a few days. Looking good. Thanks!

rojer commented 3 years ago

@rudyemm i see strange behavior from shellyswitch25-1A4A17. does it have a core dump?