Closed coserotondo closed 5 months ago
Hey there @balloob, @bieniu, @thecode, @chemelli74, @bdraco, mind taking a look at this issue as it has been labeled with an integration (shelly
) you are listed as a code owner for? Thanks!
(message by CodeOwnersMention)
shelly documentation shelly source (message by IssueLinks)
Thanks @coserotondo. We have seen this error but couldn't find why it happen. Maybe with additional info from you we can identify the root cause better.
Does it happen only on Gen1 devices? How often does it happen? (For me it takes hours to reproduce so if someone can reproduce faster maybe we can get more data) Do you have automations related to the problematic devices? Does it happen on different devices each time?
Can you reproduce it with debug logging enabled for Shelly integration?
Please add the following to configuration.yaml
, restart home assistant and if the error occurs attach the log.
logger:
default: info
logs:
aioshelly: debug
homeassistant.components.shelly: debug
Note: it is better to drag the log into the comment (which will add it as an attachment) and not copy paste as it is hard to read logs in GitHub.
Thanks.
It happened 2 times only after HA update and consequent HA restart (first 2024.5.1 and then 2024.5.2).
Not always the same Shelly devices but all Gen1. I've automations attached to some of them.
I put the debug options and tried to manual restart HA. Logs attached but I had no issue this time.
Now I stopped the debug (I've 51 Shellyes...). What I can do is to enable the debug option before the next update to observe what happens after the restart.
Any other suggestions? Maybe the logs contains useful info?
Thanks.
I have the same problem (only Gen1 devices) but want to note that immediately after the HA restart (when it says "Home Assistant has Started!") all the Shelly devices are found and random testing shows that they appear to work as expected when operated.
Within 60 seconds however they are all lost again - so it is not the case that they never get found after a restart.....they are found and then lost, and then found again after reloading the integrations.
In the meantime is there an easy/automated way to cause the Shelly devices to reload? Doing them one at a time is rather painful, especially from mobile.
@ChirpyTurnip please follow https://github.com/home-assistant/core/issues/116975#issuecomment-2098151285 and provide log. Thanks 👍
hi all, exactly the same here. I have had no issues with HA Core 2024.4 and this comes with 2024.5.x...
I have 24 Shelly devices. Here is the same log entry one of them:
2024-05-07 23:07:03.744 ERROR (MainThread) [homeassistant.config_entries] Error setting up entry ZS11-Shutter Bedroom Girls Door for update
Traceback (most recent call last):
File "/usr/src/homeassistant/homeassistant/config_entries.py", line 575, in async_setup
result = await component.async_setup_entry(hass, self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/src/homeassistant/homeassistant/components/update/__init__.py", line 118, in async_setup_entry
return await component.async_setup_entry(entry)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/src/homeassistant/homeassistant/helpers/entity_component.py", line 185, in async_setup_entry
raise ValueError("Config entry has already been setup!")
ValueError: Config entry has already been setup!
The name of the shelly device here is: ZS11-Shutter Bedroom Girls Door
Versions I have (supervisors updated couple of minutes ago): Core: 2024.5.2 Supervisor: 2024.05.1 Operating System: 12.2 Frontend: 20240501.1
Based on the log entry, it seems the issue relates for async_setup_entry
, that might changed in 2024.5.0. I have found this blog post that might helps:
https://developers.home-assistant.io/blog/2024/04/13/deprecate_async_track_state_change/
@zollak please follow https://github.com/home-assistant/core/issues/116975#issuecomment-2098151285 and provide log. Thanks 👍
Please no need to comment that you have it also, it just spams the issue. Follow https://github.com/home-assistant/core/issues/116975#issuecomment-2098151285 and provide logs.
Thanks 👍
@zollak please follow #116975 (comment) and provide log. Thanks 👍
I did the debugging. I don't want to attach here my log, it contains too much sensitive info. I'm trying to filter it somehow...
Me too. Same problem. Just only with Shelly Gen1. I've reloaded manually all devices and rebooted. After reboot only 1 device was offline. After enabled it again and reboot for the second time no more issues. I don't understand. 2024.5.1 no problem 2024.5.2 problem
For me the reproduction of the issue is fast and reliable --> Restart, count to 100, instant chaos! :-)
Here are my logs from running to restart to config reloads.....
A handful of the shelly code owners have reviewed the code and can't find a path to allow this to happen under normal circumstances. Some of us have spent a few hours staring at the code, looking for a problem. At this point, it might be something corrupting the internal asyncio state.
Can you try running in safe mode to see if you can still reproduce the problem?
Additionally, if it is an unsafe thread operation corrupting the asyncio state, https://community.home-assistant.io/t/2024-5-tracking-down-instability-issues-caused-by-integrations/724441 can help you uncover it. Follow the steps in Enabling asyncio debug mode at run time (preferred)
and Using Home Assistant's built-in debug mode in 2024.5.x or later (preferred)
That doesn't sound super ideal - it would be better if it was an easy fix. :-)
The problem does not appear to occur in safe mode....but then in safe mode many things are broken. In normal mode I noted that not all the Shellies were impacted - some (a very small number) continued to operate.
I have attached what is hopefully the information are after: HA_Safe Mode.zip HA_Normal.zip
I hope this helps!
Your callgrind files are impressive assuming they are 60s long. You have on of the busiest systems I have ever seen.
Normal Mode
Safe Mode
It might be that whichever integration is causing the issue is only doing the non-thread-safe operations at startup.
You might have to go the debugpy
route instead to get asyncio debug enabled soon enough to catch it: https://community.home-assistant.io/t/2024-5-tracking-down-instability-issues-caused-by-integrations/724441 Enabling asyncio debug mode before startup (fallback plan)
Your callgrind files are impressive assuming they are 60s long. You have on of the busiest systems I have ever seen.
I can't claim a record...they are 600s long.....I wanted to make sure I captured the process of the Shelly devices coming on line, dropping off, and reloading... :-)
I will have a look at debugpy to see what it gives me.....just need to wrap up work so it will take me a couple of hours to get back to it....
OK....so this is all a little bit weird. It actually behaves better with the debugpy: entry in the configuration.yaml....like basically no issue. At one point one device dropped off (and came back by itself a few seconds later without intervention):
2024-05-09 15:55:20.921 ERROR (MainThread) [homeassistant.components.shelly] Error fetching Shelly-BookcaseLights data: Error fetching data: DeviceConnectionError()
Over several restart tests I still lost the occasional device (2 -3) and these needed to be reloaded, but not what I had previously where almost all of them had tapped out. However, I'm not actually 100% sure I'm doing it right as I'm not getting extra log files or anything. There was a weird console 'thing' that showed on the screen but it didn't linger. I have also (briefly set) this in the config before restarting:
#Logging
logger:
default: info
logs:
homeassistant.core: debug
custom_components.shelly: debug
That caused significantly more logging that just having the recommended entry in the config:
homeassistant:
debug: true
This gave me this log dump, but ironically this also caused all the Shelly devices to fall off again: home-assistant.zip
There is a pattern there.....but what it is a mystery!
Regardless, how do I actually use debugpy:? Adding it by itself doesn't seem to do much....
OK....so this is all a little bit weird. It actually behaves better with the debugpy: entry in the configuration.yaml....like basically no issue.
Once debugpy is enable, it will turn on asyncio debug mode at startup https://docs.python.org/3/library/asyncio-dev.html which will log any unsafe operations and event loop blocks that it can detect. We need the log for when that is enabled. You'll see a lot of asyncio
in the log when its turned on.
I have same behavior with random shellys gen1 after a HA restart. I found that if you directly force a change in the embedded web (IP site) for example a turn on&off, and then reload it in HA usually it becomes available again. Could be related to ECO mode??
OK....I ran it again with HA.core + Shelly in debug mode, and debugpy: in the configuration.yaml. This time I saw the asyncio messages in the log.....so it might have done something useful this time.
About 50% of the Shelly devices were lost on this restart.....
CP.
Hi @ChirpyTurnip,
Thanks for your efforts and help, can you disable or temporarily remove the multiscrape
integration and see if it changes something?
I am not an expert in analyzing asyncio
debug output but the following two lines are suspicious:
2024-05-10 07:30:57.992 INFO (MainThread) [homeassistant.setup] Setup of domain multiscrape took 30.33 seconds
2024-05-10 07:30:58.013 INFO (MainThread) [homeassistant.setup] Setup of domain esphome took 25.36 seconds
I am not sure why ESPHome
is here but maybe it is blocked by the multiscrape
integration.
Thanks 👍
I have created a Shelly custom component from core 2024.4.4 if you want to test, unzip it under Home assistant config directory and restart Home Assistant, it will replace the integrated one, to remove it just delete the Shelly
directory under the custom_components
folder.
Please use only to test this issue. shelly 2024.4.4 custom_component.zip
Thanks 👍
can you disable or temporarily remove the
multiscrape
integration and see if it changes something?
Right - have done this by commenting out the link to multiscrape in the configuration.yaml. First restart everything worked perfectly...high fives all around, but on the second and third restarts things were broken again....so no cigar.
Will now try the custom Shelly integration.....but will leave multiscrape disabled for now.
OK....job done. Three runs completed including a full reboot.
In the first run after enabling the component and restarting HA nothing failed. Again high fives....but..... On the second run we had failures....so still no cigar..... On the third run I did a full platform restart but again had failures....
The logs are here: home-assistant_custom_shelly_logs.zip
From the asyncio logs, solcast_solar looks like its blocking the event loop for more than 20 seconds.
Can you try disable that integration?
2024-05-10 07:30:57.136 WARNING (MainThread) [asyncio] Executing <Task pending name='config entry setup Solcast Solar solcast_solar f932aa4cff866cd64117e83c21dbda17' coro=<ConfigEntry.async_setup_locked() running at /usr/src/homeassistant/homeassistant/config_entries.py:712> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/local/lib/python3.12/asyncio/futures.py:387, Task.task_wakeup()] created at /usr/local/lib/python3.12/asyncio/base_events.py:448> cb=[gather.<locals>._done_callback() at /usr/local/lib/python3.12/asyncio/tasks.py:767] created at /usr/src/homeassistant/homeassistant/util/async_.py:40> took 23.660 seconds
So the usual story - the restarts appear successful, everything comes back...and then some of the Shelly devices get assassinated....
On the first pass it looked OK.....I exported the log, sanitized it, rechecked, everything still ok. So restarted....on the second restart it all came back good as well but then things started to go...I think I lost 3 - 4 shelly devices but after that it stayed like that (i.e. no more were lost).
I went full-on debug with these two runs, so the logs are bigger. Have got both multiscrape and solcast still disabled. home-assistant_no_solcast_success.zip home-assistant_no_solcast_fail.zip
I'm doing some more restarts now to get rid of the debugging again and to see if anything changed. and to look for patterns....it is just really inconsistent as sometimes it comes up almost clean (or even clean) and then it loses everything...other times it comes up and it is a complete mess.
I wish the entity list would remember my set filters but every time I come back I have to re-set them....must have done it 50 times by now :-(
So, random observations:
Still calling it weird.
There is some blocking I/O in that integration. Additionally it's leaking listeners. There are probably more issues but that's what's I saw right away
@ChirpyTurnip can I ask you to follow the following steps to narrow down the issue ?
Thank you very much for your great help!
I'm currently not able to help deeper testing and sort it out. But want to share my expirience.
At my installation mainly the battery operated devices Shelly H&T are affected. Sometimes also Shelly Plus H&T, Shelly 2.5 (everytime the same, while it is reachable by WebUI) and just once a Shelly Plus 2. Therfore it doesn't matter, if the H&T is powered by battery of by USB. Both are affected.
I do not have the custom_component solarcast installed. I just the Forecast.Solar integration from homeassistant itself.
So I have been plugging away at this for a few hours but it is 1am and I need to stop. When you are tired that 'delete' button, which is cleverly placed next to the 'disable' button, is too easy to hit and the confirmation dialogue looks just like the disable confirmation....so let's just say a minor mistake was made. Oops. :-)
Anyway, before I give the results (which you will not like) let me explain the methodology:
So, this process is very time intensive...but it did not come by accident. The crappy result is that the results are inconsistent. I can do this process as follows:
I cannot get a simple answer here....however I have found some things do seem to be more prone to causing problems or causing not just one shelly to fail but multiple or all shelly devices to fail....the worst ones so far seem to be the Sonos, Echonet, Envisalink, Thread, ESPHome, and Solcast integrations.....it might be chance but these happen to be 'chatty' ones that like to communicate a lot...but that said I don't have any Thread devices so it can't be talking too much. The ones that seemed more stable were Samsung Smart, Sony Bravia, Scheduler, nmap, watchman, spotify, electricity maps, met, nodered companion....all very much more 'passive' integrations.
I only got through 16 integrations out of the 40, so have many more to test but the value is dubious since there is no clear distinction between working and not working. And either working or non-working can easily flip the other way with no obvious reason.
I just keep coming back to the same thing - sometimes it fine, sometimes it loses a few, sometimes it loses a lot....but on reload they seem to come back and stay back. This could be a strange coincidental thing where how busy the system is determines whether or not none, some, or all of the shelly devices get killed off?
Whatever is happening they do not recover, and it also doesn't seem to happen again as I'm pretty sure they stay up once reloaded...and it's not a permanent error either as they will just come back if asked via a reload...
I'm off to get some sleep....here's hoping you make some progress while I rest. :-)
I spent a few hours digging through the config entry code to look for a race bug yesterday but unfortunately did not find one.
I opened https://github.com/home-assistant/core/pull/117193 which adds some additional checks to the setup process to raise an exception if it detects that a config entry is trying to be setup again once it's already loaded or setup is in progress. Hopefully that will give us a better error in 2024.6.x if we aren't able to locate the root cause by than
Just confirming that after getting up this morning all of the Shelly devices are still available....so my money is on some sort of startup process glitch that wasn't there in 2024.4.
The other thing I noticed just idly scrolling through the logs is that when I restart after a failure I get lots of these messages:
2024-05-10 22:51:35.044 DEBUG (MainThread) [homeassistant.components.shelly] Stopping block device coordinator for Shelly-LaundryLights
2024-05-10 22:51:35.044 DEBUG (MainThread) [homeassistant.components.shelly] Stopping block device coordinator for Shelly-Dryer
2024-05-10 22:51:35.044 DEBUG (MainThread) [homeassistant.components.shelly] Stopping block device coordinator for Shelly-WashingMachine
But also quite a few of these:
2024-05-10 22:51:35.068 DEBUG (MainThread) [homeassistant.components.shelly] Stopping block device coordinator for Shelly-GarageBay
2024-05-10 22:51:35.068 ERROR (MainThread) [aioshelly.block_device.device] host 192.168.5.120: error during shutdown: KeyError('E5679B')
2024-05-10 22:51:35.068 DEBUG (MainThread) [homeassistant.components.shelly] Stopping block device coordinator for Shelly-HallwayLights
2024-05-10 22:51:35.068 ERROR (MainThread) [aioshelly.block_device.device] host 192.168.5.104: error during shutdown: KeyError('E403ED')
2024-05-10 22:51:35.068 DEBUG (MainThread) [homeassistant.components.shelly] Stopping block device coordinator for Shelly-GuestLights
2024-05-10 22:51:35.068 ERROR (MainThread) [aioshelly.block_device.device] host 192.168.5.117: error during shutdown: KeyError('E31FA6')
Looking closer, some Shelly devices are getting hit twice:
2024-05-10 22:51:35.066 DEBUG (MainThread) [homeassistant.components.shelly] Stopping block device coordinator for Shelly-StairsLights
[snip]
2024-05-10 22:51:35.067 DEBUG (MainThread) [homeassistant.components.shelly] Stopping block device coordinator for Shelly-StairsLights
I don't know if that's normal or not....some are getting hit just once....certainly all the IP addressing of the devices tagged with the 'KeyError' messages map exactly to the devices that are getting stopped twice (by name) - for example StairsLights is on 192.68.5.103:
2024-05-10 22:51:35.067 DEBUG (MainThread) [homeassistant.components.shelly] Stopping block device coordinator for Shelly-StairsLights
2024-05-10 22:51:35.067 ERROR (MainThread) [aioshelly.block_device.device] host 192.168.5.103: error during shutdown: KeyError('A27381')
I don't know if the number of KeyError messages matches the number of failed Shelly devices....but it is entirely possible. Is this perhaps because these are the ones that got set up twice?
Will be back this evening....I'm told I have things to do. :-)
2024-05-10 22:51:35.067 DEBUG (MainThread) [homeassistant.components.shelly] Stopping block device coordinator for Shelly-StairsLights
2024-05-10 22:51:35.067 ERROR (MainThread) [aioshelly.block_device.device] host 192.168.5.103: error during shutdown: KeyError('A27381')
The reason you have KeyError
is because once the first device is stopped the 2nd one fails to remove this key since it was already removed, this is not an exception, it is logged by purpose.
Your observation is correct, there are two coordinators for the same device since it is setup twice, we try to find the reason for that.
Needless to say the problem persists (not expecting it to be any different as nothing has changed). But a semi-off topic question from us sufferers.....
Is there any way to automate the reload of all the Shelly integrations?
Hitting them one at a time is a major drag.....I'm actively configuring things so HA is being restarted 10 - 20 times a day to load new configs....you can do the math on the number of clicks needed to restart 38 shellies manually every time....I'll need a new mouse by the end of this.... :-(
Thanks!
I am running a change that will be released on the next major release (2025.6.0) and it looks good for few days. If you want I can explain how to patch your current version for it, but beware that if it doesn't work it will be hard to get back to an operational state on HA OS if it doesn't work.
I'm happy to wait....June is not too far away and the last thing I need is to completely break everything....it would take me longer to fix it than to wait :-)
I am running a change that will be released on the next major release (2025.6.0) and it looks good for few days. If you want I can explain how to patch your current version for it, but beware that if it doesn't work it will be hard to get back to an operational state on HA OS if it doesn't work.
June 24 or 25 ???
I am running a change that will be released on the next major release (2025.6.0) and it looks good for few days. If you want I can explain how to patch your current version for it, but beware that if it doesn't work it will be hard to get back to an operational state on HA OS if it doesn't work.
June 24 or 25 ???
24 😸
Since the update which added the debugg logging I didn‘t run into this issue again. May something else in this update fixed that issue?
I have the same problem (only Gen1 devices) but want to note that immediately after the HA restart (when it says "Home Assistant has Started!") all the Shelly devices are found and random testing shows that they appear to work as expected when operated.
Within 60 seconds however they are all lost again - so it is not the case that they never get found after a restart.....they are found and then lost, and then found again after reloading the integrations.
In the meantime is there an easy/automated way to cause the Shelly devices to reload? Doing them one at a time is rather painful, especially from mobile.
@ChirpyTurnip I made a Auto Entities card to streamline the reloading of unavailable devices:
type: custom:auto-entities
card:
type: entities
title: Shelly devices unavailable
show_header_toggle: false
filter:
include:
- domain: light
integration: shelly
state: unavailable
options:
tap_action:
action: call-service
service: homeassistant.reload_config_entry
data:
entity_id: this.entity_id
- domain: switch
integration: shelly
state: unavailable
options:
tap_action:
action: call-service
service: homeassistant.reload_config_entry
data:
entity_id: this.entity_id
- domain: cover
integration: shelly
state: unavailable
options:
tap_action:
action: call-service
service: homeassistant.reload_config_entry
data:
entity_id: this.entity_id
exclude: []
2024.5.5 has a stripped down version of the changes coming in 2024.6.x. There is a chance it might improve the situation while we are waiting for 2024.6.x to ship.
Hi @ChirpyTurnip @zollak @met67 can you update to 2024.5.5 and verify if it still happen? it would be good to know before next release.
Thanks 👍
So far all good !
Il giorno sab 25 mag 2024 alle 22:34 Shay Levy @.***> ha scritto:
Hi @ChirpyTurnip https://github.com/ChirpyTurnip @zollak https://github.com/zollak @met67 https://github.com/met67 can you update to 2024.5.5 and verify if it still happen? it would be good to know before next release.
Thanks 👍
— Reply to this email directly, view it on GitHub https://github.com/home-assistant/core/issues/116975#issuecomment-2131430703, or unsubscribe https://github.com/notifications/unsubscribe-auth/A5CAQBAJ2KBOINZKUGO4U6TZEDYTRAVCNFSM6AAAAABHKOEDVKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMZRGQZTANZQGM . You are receiving this because you were mentioned.Message ID: @.***>
@thecode For me it still happens. Today morning a USB powered Shelly H&T is unavailable.
Hi @ChirpyTurnip @zollak @met67 can you update to 2024.5.5 and verify if it still happen? it would be good to know before next release.
Thanks 👍
After a couple of restart, everything’s fine…
Hi!
Sorry for the delay in responding I've not be around much this weekend. I can however (happily) confirm that after about 6 restarts and a couple of full reboots the Shelly devices all come back online and stay online!! Yay!!
@vistalba - I have some H&T units too.....I need to leave it for a bit to see if they phone home as expected. If they do not I will post an update, if you hear nothing than all is well on my end.
Special thanks to @met67 - it's a tidy solution! I have a few other things that sometimes need a kick so I will be keeping your snippet as it works perfectly! :-)
:-)
Slightly updated the above snipped to also include Sensors (H&T, D&W) which also are affected:
type: custom:auto-entities
card:
type: entities
title: Shelly devices unavailable
show_header_toggle: false
filter:
include:
- domain: light
integration: shelly
state: unavailable
options:
tap_action:
action: call-service
service: homeassistant.reload_config_entry
data:
entity_id: this.entity_id
- domain: switch
integration: shelly
state: unavailable
options:
tap_action:
action: call-service
service: homeassistant.reload_config_entry
data:
entity_id: this.entity_id
- domain: cover
integration: shelly
state: unavailable
options:
tap_action:
action: call-service
service: homeassistant.reload_config_entry
data:
entity_id: this.entity_id
- domain: sensor
integration: shelly
state: unavailable
options:
tap_action:
action: call-service
service: homeassistant.reload_config_entry
data:
entity_id: this.entity_id
exclude: []
Hi @ChirpyTurnip @zollak @met67 can you update to 2024.5.5 and verify if it still happen? it would be good to know before next release.
Thanks 👍
Hi, I have updated today my HA to 2024.5.5 and also came updates for Shelly devices v1.3.2 (the previous was v1.3.1). I also updated all my Shelly devices and HA as well.
I made system reboot as well and after a couple of restart, everything’s fine…
I am closing this issue for now, if anyone still experiencing Config entry has already been setup
after 2024.6.0 please create a new issue with logs.
Thanks everyone for helping with logs
❗ Please do not post "me too" or "I have the same issue comments" as it clutters this issue and notifies everyone who is subscribed to it, which will drive people away who could have helped solve the problem. Instead, give this issue a thumbs up, and post logs requested in https://github.com/home-assistant/core/issues/116975#issuecomment-2098151285
The problem
Starting with 2024.5.1 at any HA restart I've
[homeassistant.config_entries] Error
and the entities are related to Shelly Integration. After HA restart, the affected entities are unavailable and I have to manually reload the integration to make them available again. Attached the diagnostics of one Shelly affected devices:Letto Monte Camera Rosa
.What version of Home Assistant Core has the issue?
core-2024.5.2
What was the last working version of Home Assistant Core?
core-2024.4.4
What type of installation are you running?
Home Assistant OS
Integration causing the issue
No response
Link to integration documentation on our website
No response
Diagnostics information
config_entry-shelly-696dfd20befd67dcebd37c4a7aea8378.json
Example YAML snippet
No response
Anything in the logs that might be useful for us?
Additional information
No response