coserotondo commented 6 months ago

116975#issuecomment-2098151285

The problem

Starting with 2024.5.1 at any HA restart I've[homeassistant.config_entries] Error and the entities are related to Shelly Integration. After HA restart, the affected entities are unavailable and I have to manually reload the integration to make them available again. Attached the diagnostics of one Shelly affected devices: Letto Monte Camera Rosa.

What version of Home Assistant Core has the issue?

core-2024.5.2

What was the last working version of Home Assistant Core?

core-2024.4.4

What type of installation are you running?

Home Assistant OS

Integration causing the issue

No response

Link to integration documentation on our website

No response

Diagnostics information

config_entry-shelly-696dfd20befd67dcebd37c4a7aea8378.json

Example YAML snippet

No response

Anything in the logs that might be useful for us?

2024-05-07 09:24:17.075 ERROR (MainThread) [homeassistant.config_entries] Error setting up entry Letto Monte Camera Rosa for binary_sensor
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/config_entries.py", line 575, in async_setup
    result = await component.async_setup_entry(hass, self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/binary_sensor/__init__.py", line 234, in async_setup_entry
    return await component.async_setup_entry(entry)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/helpers/entity_component.py", line 185, in async_setup_entry
    raise ValueError("Config entry has already been setup!")
ValueError: Config entry has already been setup!
2024-05-07 09:24:17.128 ERROR (MainThread) [homeassistant.config_entries] Error setting up entry Letto Monte Camera Rosa for button
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/config_entries.py", line 575, in async_setup
    result = await component.async_setup_entry(hass, self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/button/__init__.py", line 69, in async_setup_entry
    return await component.async_setup_entry(entry)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/helpers/entity_component.py", line 185, in async_setup_entry
    raise ValueError("Config entry has already been setup!")
ValueError: Config entry has already been setup!
2024-05-07 09:24:17.180 ERROR (MainThread) [homeassistant.config_entries] Error setting up entry Letto Monte Camera Rosa for cover
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/config_entries.py", line 575, in async_setup
    result = await component.async_setup_entry(hass, self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/cover/__init__.py", line 241, in async_setup_entry
    return await component.async_setup_entry(entry)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/helpers/entity_component.py", line 185, in async_setup_entry
    raise ValueError("Config entry has already been setup!")
ValueError: Config entry has already been setup!
2024-05-07 09:24:17.190 ERROR (MainThread) [homeassistant.config_entries] Error setting up entry Letto Monte Camera Rosa for event
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/config_entries.py", line 575, in async_setup
    result = await component.async_setup_entry(hass, self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/event/__init__.py", line 67, in async_setup_entry
    return await component.async_setup_entry(entry)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/helpers/entity_component.py", line 185, in async_setup_entry
    raise ValueError("Config entry has already been setup!")
ValueError: Config entry has already been setup!
2024-05-07 09:24:17.241 ERROR (MainThread) [homeassistant.config_entries] Error setting up entry Letto Monte Camera Rosa for light
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/config_entries.py", line 575, in async_setup
    result = await component.async_setup_entry(hass, self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/light/__init__.py", line 675, in async_setup_entry
    return await component.async_setup_entry(entry)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/helpers/entity_component.py", line 185, in async_setup_entry
    raise ValueError("Config entry has already been setup!")
ValueError: Config entry has already been setup!
2024-05-07 09:24:17.260 ERROR (MainThread) [homeassistant.config_entries] Error setting up entry Letto Monte Camera Rosa for sensor
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/config_entries.py", line 575, in async_setup
    result = await component.async_setup_entry(hass, self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/sensor/__init__.py", line 134, in async_setup_entry
    return await component.async_setup_entry(entry)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/helpers/entity_component.py", line 185, in async_setup_entry
    raise ValueError("Config entry has already been setup!")
ValueError: Config entry has already been setup!
2024-05-07 09:24:17.284 ERROR (MainThread) [homeassistant.config_entries] Error setting up entry Letto Monte Camera Rosa for switch
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/config_entries.py", line 575, in async_setup
    result = await component.async_setup_entry(hass, self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/switch/__init__.py", line 94, in async_setup_entry
    return await component.async_setup_entry(entry)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/helpers/entity_component.py", line 185, in async_setup_entry
    raise ValueError("Config entry has already been setup!")
ValueError: Config entry has already been setup!
2024-05-07 09:24:17.296 ERROR (MainThread) [homeassistant.config_entries] Error setting up entry Letto Monte Camera Rosa for update
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/config_entries.py", line 575, in async_setup
    result = await component.async_setup_entry(hass, self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/update/__init__.py", line 118, in async_setup_entry
    return await component.async_setup_entry(entry)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/helpers/entity_component.py", line 185, in async_setup_entry
    raise ValueError("Config entry has already been setup!")
ValueError: Config entry has already been setup!
2024-05-07 09:24:17.313 ERROR (MainThread) [homeassistant.config_entries] Error setting up entry Letto Monte Camera Rosa for valve
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/config_entries.py", line 575, in async_setup
    result = await component.async_setup_entry(hass, self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/valve/__init__.py", line 111, in async_setup_entry
    return await component.async_setup_entry(entry)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/helpers/entity_component.py", line 185, in async_setup_entry
    raise ValueError("Config entry has already been setup!")
ValueError: Config entry has already been setup!

Additional information

No response

home-assistant[bot] commented 6 months ago

Hey there @balloob, @bieniu, @thecode, @chemelli74, @bdraco, mind taking a look at this issue as it has been labeled with an integration (shelly) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of `shelly` can trigger bot actions by commenting: - `@home-assistant close` Closes the issue. - `@home-assistant rename Awesome new title` Renames the issue. - `@home-assistant reopen` Reopen the issue. - `@home-assistant unassign shelly` Removes the current integration label and assignees on the issue, add the integration domain after the command. - `@home-assistant add-label needs-more-information` Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue. - `@home-assistant remove-label needs-more-information` Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

_{^{(message by CodeOwnersMention)}}

shelly documentation shelly source _{^{(message by IssueLinks)}}

thecode commented 6 months ago

Thanks @coserotondo. We have seen this error but couldn't find why it happen. Maybe with additional info from you we can identify the root cause better.

Does it happen only on Gen1 devices? How often does it happen? (For me it takes hours to reproduce so if someone can reproduce faster maybe we can get more data) Do you have automations related to the problematic devices? Does it happen on different devices each time?

Can you reproduce it with debug logging enabled for Shelly integration? Please add the following to configuration.yaml, restart home assistant and if the error occurs attach the log.

logger:
  default: info
  logs:
    aioshelly: debug 
    homeassistant.components.shelly: debug

Note: it is better to drag the log into the comment (which will add it as an attachment) and not copy paste as it is hard to read logs in GitHub.

Thanks.

coserotondo commented 6 months ago

It happened 2 times only after HA update and consequent HA restart (first 2024.5.1 and then 2024.5.2).

Not always the same Shelly devices but all Gen1. I've automations attached to some of them.

I put the debug options and tried to manual restart HA. Logs attached but I had no issue this time.

Now I stopped the debug (I've 51 Shellyes...). What I can do is to enable the debug option before the next update to observe what happens after the restart.

Any other suggestions? Maybe the logs contains useful info?

Thanks.

home-assistant_shelly_2024-05-07T11-47-54.099Z.log

ChirpyTurnip commented 6 months ago

I have the same problem (only Gen1 devices) but want to note that immediately after the HA restart (when it says "Home Assistant has Started!") all the Shelly devices are found and random testing shows that they appear to work as expected when operated.

Within 60 seconds however they are all lost again - so it is not the case that they never get found after a restart.....they are found and then lost, and then found again after reloading the integrations.

In the meantime is there an easy/automated way to cause the Shelly devices to reload? Doing them one at a time is rather painful, especially from mobile.

thecode commented 6 months ago

@ChirpyTurnip please follow https://github.com/home-assistant/core/issues/116975#issuecomment-2098151285 and provide log. Thanks 👍

zollak commented 6 months ago

hi all, exactly the same here. I have had no issues with HA Core 2024.4 and this comes with 2024.5.x...

I have 24 Shelly devices. Here is the same log entry one of them:

2024-05-07 23:07:03.744 ERROR (MainThread) [homeassistant.config_entries] Error setting up entry ZS11-Shutter Bedroom Girls Door for update
Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/config_entries.py", line 575, in async_setup
    result = await component.async_setup_entry(hass, self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/update/__init__.py", line 118, in async_setup_entry
    return await component.async_setup_entry(entry)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/helpers/entity_component.py", line 185, in async_setup_entry
    raise ValueError("Config entry has already been setup!")
ValueError: Config entry has already been setup!

The name of the shelly device here is: ZS11-Shutter Bedroom Girls Door

Versions I have (supervisors updated couple of minutes ago): Core: 2024.5.2 Supervisor: 2024.05.1 Operating System: 12.2 Frontend: 20240501.1

Based on the log entry, it seems the issue relates for async_setup_entry, that might changed in 2024.5.0. I have found this blog post that might helps: https://developers.home-assistant.io/blog/2024/04/13/deprecate_async_track_state_change/

thecode commented 6 months ago

@zollak please follow https://github.com/home-assistant/core/issues/116975#issuecomment-2098151285 and provide log. Thanks 👍

thecode commented 6 months ago

Please no need to comment that you have it also, it just spams the issue. Follow https://github.com/home-assistant/core/issues/116975#issuecomment-2098151285 and provide logs.

Thanks 👍

zollak commented 6 months ago

@zollak please follow #116975 (comment) and provide log. Thanks 👍

I did the debugging. I don't want to attach here my log, it contains too much sensitive info. I'm trying to filter it somehow...

fablav76 commented 6 months ago

Me too. Same problem. Just only with Shelly Gen1. I've reloaded manually all devices and rebooted. After reboot only 1 device was offline. After enabled it again and reboot for the second time no more issues. I don't understand. 2024.5.1 no problem 2024.5.2 problem Screenshot_20240508-210812

ChirpyTurnip commented 6 months ago

For me the reproduction of the issue is fast and reliable --> Restart, count to 100, instant chaos! :-)

Here are my logs from running to restart to config reloads.....

home-assistant_shelly_2024-05-08T01-24-19.494Z_CLEAN.log

bdraco commented 6 months ago

A handful of the shelly code owners have reviewed the code and can't find a path to allow this to happen under normal circumstances. Some of us have spent a few hours staring at the code, looking for a problem. At this point, it might be something corrupting the internal asyncio state.

Can you try running in safe mode to see if you can still reproduce the problem?

Additionally, if it is an unsafe thread operation corrupting the asyncio state, https://community.home-assistant.io/t/2024-5-tracking-down-instability-issues-caused-by-integrations/724441 can help you uncover it. Follow the steps in Enabling asyncio debug mode at run time (preferred) and Using Home Assistant's built-in debug mode in 2024.5.x or later (preferred)

ChirpyTurnip commented 6 months ago

That doesn't sound super ideal - it would be better if it was an easy fix. :-)

The problem does not appear to occur in safe mode....but then in safe mode many things are broken. In normal mode I noted that not all the Shellies were impacted - some (a very small number) continued to operate.

I have attached what is hopefully the information are after: HA_Safe Mode.zip HA_Normal.zip

I hope this helps!

bdraco commented 6 months ago

Your callgrind files are impressive assuming they are 60s long. You have on of the busiest systems I have ever seen.

Normal Mode

solarcastapi is using quite a bit of cpu time
Some integration is encoding lots of json data with the stdlib python encoder
24227x state writes per minute
128859x debug log writes
watchman utils is using quite a bit of cpu time

Safe Mode

22812x state writes per minute
26983x debug log writes

bdraco commented 6 months ago

It might be that whichever integration is causing the issue is only doing the non-thread-safe operations at startup.

You might have to go the debugpy route instead to get asyncio debug enabled soon enough to catch it: https://community.home-assistant.io/t/2024-5-tracking-down-instability-issues-caused-by-integrations/724441 Enabling asyncio debug mode before startup (fallback plan)

ChirpyTurnip commented 6 months ago

Your callgrind files are impressive assuming they are 60s long. You have on of the busiest systems I have ever seen.

I can't claim a record...they are 600s long.....I wanted to make sure I captured the process of the Shelly devices coming on line, dropping off, and reloading... :-)

I will have a look at debugpy to see what it gives me.....just need to wrap up work so it will take me a couple of hours to get back to it....

ChirpyTurnip commented 6 months ago

OK....so this is all a little bit weird. It actually behaves better with the debugpy: entry in the configuration.yaml....like basically no issue. At one point one device dropped off (and came back by itself a few seconds later without intervention):

2024-05-09 15:55:20.921 ERROR (MainThread) [homeassistant.components.shelly] Error fetching Shelly-BookcaseLights data: Error fetching data: DeviceConnectionError()

Over several restart tests I still lost the occasional device (2 -3) and these needed to be reloaded, but not what I had previously where almost all of them had tapped out. However, I'm not actually 100% sure I'm doing it right as I'm not getting extra log files or anything. There was a weird console 'thing' that showed on the screen but it didn't linger. I have also (briefly set) this in the config before restarting:

#Logging
logger:
  default: info
  logs:
    homeassistant.core: debug
    custom_components.shelly: debug

That caused significantly more logging that just having the recommended entry in the config:

homeassistant:
  debug: true

This gave me this log dump, but ironically this also caused all the Shelly devices to fall off again: home-assistant.zip

There is a pattern there.....but what it is a mystery!

Regardless, how do I actually use debugpy:? Adding it by itself doesn't seem to do much....

bdraco commented 6 months ago

OK....so this is all a little bit weird. It actually behaves better with the debugpy: entry in the configuration.yaml....like basically no issue.

Once debugpy is enable, it will turn on asyncio debug mode at startup https://docs.python.org/3/library/asyncio-dev.html which will log any unsafe operations and event loop blocks that it can detect. We need the log for when that is enabled. You'll see a lot of asyncio in the log when its turned on.

tonicb78 commented 6 months ago

I have same behavior with random shellys gen1 after a HA restart. I found that if you directly force a change in the embedded web (IP site) for example a turn on&off, and then reload it in HA usually it becomes available again. Could be related to ECO mode??

ChirpyTurnip commented 6 months ago

OK....I ran it again with HA.core + Shelly in debug mode, and debugpy: in the configuration.yaml. This time I saw the asyncio messages in the log.....so it might have done something useful this time.

home-assistant_asyncio.zip

About 50% of the Shelly devices were lost on this restart.....

CP.

thecode commented 6 months ago

Hi @ChirpyTurnip, Thanks for your efforts and help, can you disable or temporarily remove the multiscrape integration and see if it changes something? I am not an expert in analyzing asyncio debug output but the following two lines are suspicious:

2024-05-10 07:30:57.992 INFO (MainThread) [homeassistant.setup] Setup of domain multiscrape took 30.33 seconds
2024-05-10 07:30:58.013 INFO (MainThread) [homeassistant.setup] Setup of domain esphome took 25.36 seconds

I am not sure why ESPHome is here but maybe it is blocked by the multiscrape integration.

Thanks 👍

thecode commented 6 months ago

I have created a Shelly custom component from core 2024.4.4 if you want to test, unzip it under Home assistant config directory and restart Home Assistant, it will replace the integrated one, to remove it just delete the Shelly directory under the custom_components folder.

Please use only to test this issue. shelly 2024.4.4 custom_component.zip

Thanks 👍

ChirpyTurnip commented 6 months ago

can you disable or temporarily remove the multiscrape integration and see if it changes something?

Right - have done this by commenting out the link to multiscrape in the configuration.yaml. First restart everything worked perfectly...high fives all around, but on the second and third restarts things were broken again....so no cigar.

Will now try the custom Shelly integration.....but will leave multiscrape disabled for now.

ChirpyTurnip commented 6 months ago

OK....job done. Three runs completed including a full reboot.

In the first run after enabling the component and restarting HA nothing failed. Again high fives....but..... On the second run we had failures....so still no cigar..... On the third run I did a full platform restart but again had failures....

The logs are here: home-assistant_custom_shelly_logs.zip

bdraco commented 6 months ago

From the asyncio logs, solcast_solar looks like its blocking the event loop for more than 20 seconds.

Can you try disable that integration?

2024-05-10 07:30:57.136 WARNING (MainThread) [asyncio] Executing <Task pending name='config entry setup Solcast Solar solcast_solar f932aa4cff866cd64117e83c21dbda17' coro=<ConfigEntry.async_setup_locked() running at /usr/src/homeassistant/homeassistant/config_entries.py:712> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/local/lib/python3.12/asyncio/futures.py:387, Task.task_wakeup()] created at /usr/local/lib/python3.12/asyncio/base_events.py:448> cb=[gather.<locals>._done_callback() at /usr/local/lib/python3.12/asyncio/tasks.py:767] created at /usr/src/homeassistant/homeassistant/util/async_.py:40> took 23.660 seconds

ChirpyTurnip commented 6 months ago

So the usual story - the restarts appear successful, everything comes back...and then some of the Shelly devices get assassinated....

On the first pass it looked OK.....I exported the log, sanitized it, rechecked, everything still ok. So restarted....on the second restart it all came back good as well but then things started to go...I think I lost 3 - 4 shelly devices but after that it stayed like that (i.e. no more were lost).

I went full-on debug with these two runs, so the logs are bigger. Have got both multiscrape and solcast still disabled. home-assistant_no_solcast_success.zip home-assistant_no_solcast_fail.zip

I'm doing some more restarts now to get rid of the debugging again and to see if anything changed. and to look for patterns....it is just really inconsistent as sometimes it comes up almost clean (or even clean) and then it loses everything...other times it comes up and it is a complete mess.

I wish the entity list would remember my set filters but every time I come back I have to re-set them....must have done it 50 times by now :-(

ChirpyTurnip commented 6 months ago

So, random observations:

The more I get it to do in terms of debug logging the better it seems to run...I don't know why, but with debug on I lose less Shelly device on start up than otherwise. If I had to guess this increases the run time which gives things slightly longer to happen...
On the other hand, the less I get it to do in terms of running integrations the better it seems to run - with multiscrape and solcast off I have more success that with Shelly devices on. If i had to guess this means there is less stuff happening in parallel freeing up a little resource to make other things work better (so like the one above, only more resource instead of time).
With all the logging plus the minimum number of integrations I have the least amount of trouble....and I get the occasional clean startup (which the best).
I have now noticed that I have some other integrations that are also failing to start reliably - for example my MetService integration is also not firing up every time. So this might not be limited to just Shelly integrations....here too telling it to reload sorts the problem.

Still calling it weird.

bdraco commented 6 months ago

https://github.com/oziee/ha-solcast-solar/blob/1a7f29b5cb7101cdb1e25ba7dcd2b2bf9662fe46/custom_components/solcast_solar/solcastapi.py#L111

There is some blocking I/O in that integration. Additionally it's leaking listeners. There are probably more issues but that's what's I saw right away

chemelli74 commented 6 months ago

@ChirpyTurnip can I ask you to follow the following steps to narrow down the issue ?

disable all custom integrations
restart HA and confirm that all is fine
enable 1 custom integration, restart HA and confirm that all is fine
repeat step 3. for each custom integration you have
once you find one custom integration that create the issue, disable it again and proceed with all the others
confirm that HA with all custom integrations enabled but the 1 or 2 identified in the previous steps works fine
report the names of the identified custom integrations that create the issue

Thank you very much for your great help!

vistalba commented 6 months ago

I'm currently not able to help deeper testing and sort it out. But want to share my expirience.

At my installation mainly the battery operated devices Shelly H&T are affected. Sometimes also Shelly Plus H&T, Shelly 2.5 (everytime the same, while it is reachable by WebUI) and just once a Shelly Plus 2. Therfore it doesn't matter, if the H&T is powered by battery of by USB. Both are affected.

I do not have the custom_component solarcast installed. I just the Forecast.Solar integration from homeassistant itself.

ChirpyTurnip commented 6 months ago

So I have been plugging away at this for a few hours but it is 1am and I need to stop. When you are tired that 'delete' button, which is cleverly placed next to the 'disable' button, is too easy to hit and the confirmation dialogue looks just like the disable confirmation....so let's just say a minor mistake was made. Oops. :-)

Anyway, before I give the results (which you will not like) let me explain the methodology:

I killed off almost all the customs (about 40 in total) except for the ones we already know made no difference in or out.
Restart, watch for errors, give it a good 3 minutes or so to fail.....repeat at least twice....if OK add the next integration
Keep adding until it fails. Repeat to confirm the failure is not chance
Remove the last added custom, restart and confirm system back to happy. Repeat to confirm.

So, this process is very time intensive...but it did not come by accident. The crappy result is that the results are inconsistent. I can do this process as follows:

Add X --> OK --> OK
Add Y --> OK --> OK
Add Z --> OK --> ERROR
Remove Z --> ERROR --> ERROR
Remove Y --> ERROR --> OK --> ERROR
Remove X --> OK --> ERROR --> OK --> OK --> OK
Add Y --> OK --> OK
Add Z --> OK --> OK
Add W -- > OK --> ERROR

I cannot get a simple answer here....however I have found some things do seem to be more prone to causing problems or causing not just one shelly to fail but multiple or all shelly devices to fail....the worst ones so far seem to be the Sonos, Echonet, Envisalink, Thread, ESPHome, and Solcast integrations.....it might be chance but these happen to be 'chatty' ones that like to communicate a lot...but that said I don't have any Thread devices so it can't be talking too much. The ones that seemed more stable were Samsung Smart, Sony Bravia, Scheduler, nmap, watchman, spotify, electricity maps, met, nodered companion....all very much more 'passive' integrations.

I only got through 16 integrations out of the 40, so have many more to test but the value is dubious since there is no clear distinction between working and not working. And either working or non-working can easily flip the other way with no obvious reason.

I just keep coming back to the same thing - sometimes it fine, sometimes it loses a few, sometimes it loses a lot....but on reload they seem to come back and stay back. This could be a strange coincidental thing where how busy the system is determines whether or not none, some, or all of the shelly devices get killed off?

Whatever is happening they do not recover, and it also doesn't seem to happen again as I'm pretty sure they stay up once reloaded...and it's not a permanent error either as they will just come back if asked via a reload...

I'm off to get some sleep....here's hoping you make some progress while I rest. :-)

bdraco commented 6 months ago

I spent a few hours digging through the config entry code to look for a race bug yesterday but unfortunately did not find one.

I opened https://github.com/home-assistant/core/pull/117193 which adds some additional checks to the setup process to raise an exception if it detects that a config entry is trying to be setup again once it's already loaded or setup is in progress. Hopefully that will give us a better error in 2024.6.x if we aren't able to locate the root cause by than

ChirpyTurnip commented 6 months ago

Just confirming that after getting up this morning all of the Shelly devices are still available....so my money is on some sort of startup process glitch that wasn't there in 2024.4.

The other thing I noticed just idly scrolling through the logs is that when I restart after a failure I get lots of these messages:

2024-05-10 22:51:35.044 DEBUG (MainThread) [homeassistant.components.shelly] Stopping block device coordinator for Shelly-LaundryLights
2024-05-10 22:51:35.044 DEBUG (MainThread) [homeassistant.components.shelly] Stopping block device coordinator for Shelly-Dryer
2024-05-10 22:51:35.044 DEBUG (MainThread) [homeassistant.components.shelly] Stopping block device coordinator for Shelly-WashingMachine

But also quite a few of these:

2024-05-10 22:51:35.068 DEBUG (MainThread) [homeassistant.components.shelly] Stopping block device coordinator for Shelly-GarageBay
2024-05-10 22:51:35.068 ERROR (MainThread) [aioshelly.block_device.device] host 192.168.5.120: error during shutdown: KeyError('E5679B')
2024-05-10 22:51:35.068 DEBUG (MainThread) [homeassistant.components.shelly] Stopping block device coordinator for Shelly-HallwayLights
2024-05-10 22:51:35.068 ERROR (MainThread) [aioshelly.block_device.device] host 192.168.5.104: error during shutdown: KeyError('E403ED')
2024-05-10 22:51:35.068 DEBUG (MainThread) [homeassistant.components.shelly] Stopping block device coordinator for Shelly-GuestLights
2024-05-10 22:51:35.068 ERROR (MainThread) [aioshelly.block_device.device] host 192.168.5.117: error during shutdown: KeyError('E31FA6')

Looking closer, some Shelly devices are getting hit twice:

2024-05-10 22:51:35.066 DEBUG (MainThread) [homeassistant.components.shelly] Stopping block device coordinator for Shelly-StairsLights
[snip]
2024-05-10 22:51:35.067 DEBUG (MainThread) [homeassistant.components.shelly] Stopping block device coordinator for Shelly-StairsLights

I don't know if that's normal or not....some are getting hit just once....certainly all the IP addressing of the devices tagged with the 'KeyError' messages map exactly to the devices that are getting stopped twice (by name) - for example StairsLights is on 192.68.5.103:

2024-05-10 22:51:35.067 DEBUG (MainThread) [homeassistant.components.shelly] Stopping block device coordinator for Shelly-StairsLights
2024-05-10 22:51:35.067 ERROR (MainThread) [aioshelly.block_device.device] host 192.168.5.103: error during shutdown: KeyError('A27381')

I don't know if the number of KeyError messages matches the number of failed Shelly devices....but it is entirely possible. Is this perhaps because these are the ones that got set up twice?

Will be back this evening....I'm told I have things to do. :-)

thecode commented 6 months ago

2024-05-10 22:51:35.067 DEBUG (MainThread) [homeassistant.components.shelly] Stopping block device coordinator for Shelly-StairsLights
2024-05-10 22:51:35.067 ERROR (MainThread) [aioshelly.block_device.device] host 192.168.5.103: error during shutdown: KeyError('A27381')

The reason you have KeyError is because once the first device is stopped the 2nd one fails to remove this key since it was already removed, this is not an exception, it is logged by purpose.

Your observation is correct, there are two coordinators for the same device since it is setup twice, we try to find the reason for that.

ChirpyTurnip commented 6 months ago

Needless to say the problem persists (not expecting it to be any different as nothing has changed). But a semi-off topic question from us sufferers.....

Is there any way to automate the reload of all the Shelly integrations?

Hitting them one at a time is a major drag.....I'm actively configuring things so HA is being restarted 10 - 20 times a day to load new configs....you can do the math on the number of clicks needed to restart 38 shellies manually every time....I'll need a new mouse by the end of this.... :-(

Thanks!

thecode commented 6 months ago

I am running a change that will be released on the next major release (2025.6.0) and it looks good for few days. If you want I can explain how to patch your current version for it, but beware that if it doesn't work it will be hard to get back to an operational state on HA OS if it doesn't work.

ChirpyTurnip commented 6 months ago

I'm happy to wait....June is not too far away and the last thing I need is to completely break everything....it would take me longer to fix it than to wait :-)

coserotondo commented 6 months ago

I am running a change that will be released on the next major release (2025.6.0) and it looks good for few days. If you want I can explain how to patch your current version for it, but beware that if it doesn't work it will be hard to get back to an operational state on HA OS if it doesn't work.

June 24 or 25 ???

thecode commented 6 months ago

I am running a change that will be released on the next major release (2025.6.0) and it looks good for few days. If you want I can explain how to patch your current version for it, but beware that if it doesn't work it will be hard to get back to an operational state on HA OS if it doesn't work.

June 24 or 25 ???

24 😸

vistalba commented 6 months ago

Since the update which added the debugg logging I didn‘t run into this issue again. May something else in this update fixed that issue?

met67 commented 6 months ago

I have the same problem (only Gen1 devices) but want to note that immediately after the HA restart (when it says "Home Assistant has Started!") all the Shelly devices are found and random testing shows that they appear to work as expected when operated.

Within 60 seconds however they are all lost again - so it is not the case that they never get found after a restart.....they are found and then lost, and then found again after reloading the integrations.

In the meantime is there an easy/automated way to cause the Shelly devices to reload? Doing them one at a time is rather painful, especially from mobile.

@ChirpyTurnip I made a Auto Entities card to streamline the reloading of unavailable devices:

type: custom:auto-entities
card:
  type: entities
  title: Shelly devices unavailable
  show_header_toggle: false
filter:
  include:
    - domain: light
      integration: shelly
      state: unavailable
      options:
        tap_action:
          action: call-service
          service: homeassistant.reload_config_entry
          data:
            entity_id: this.entity_id
    - domain: switch
      integration: shelly
      state: unavailable
      options:
        tap_action:
          action: call-service
          service: homeassistant.reload_config_entry
          data:
            entity_id: this.entity_id
    - domain: cover
      integration: shelly
      state: unavailable
      options:
        tap_action:
          action: call-service
          service: homeassistant.reload_config_entry
          data:
            entity_id: this.entity_id
  exclude: []

bdraco commented 6 months ago

2024.5.5 has a stripped down version of the changes coming in 2024.6.x. There is a chance it might improve the situation while we are waiting for 2024.6.x to ship.

thecode commented 6 months ago

Hi @ChirpyTurnip @zollak @met67 can you update to 2024.5.5 and verify if it still happen? it would be good to know before next release.

Thanks 👍

coserotondo commented 6 months ago

So far all good !

Il giorno sab 25 mag 2024 alle 22:34 Shay Levy @.***> ha scritto:

Hi @ChirpyTurnip https://github.com/ChirpyTurnip @zollak https://github.com/zollak @met67 https://github.com/met67 can you update to 2024.5.5 and verify if it still happen? it would be good to know before next release.

Thanks 👍

— Reply to this email directly, view it on GitHub https://github.com/home-assistant/core/issues/116975#issuecomment-2131430703, or unsubscribe https://github.com/notifications/unsubscribe-auth/A5CAQBAJ2KBOINZKUGO4U6TZEDYTRAVCNFSM6AAAAABHKOEDVKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMZRGQZTANZQGM . You are receiving this because you were mentioned.Message ID: @.***>

vistalba commented 6 months ago

@thecode For me it still happens. Today morning a USB powered Shelly H&T is unavailable.

met67 commented 6 months ago

Hi @ChirpyTurnip @zollak @met67 can you update to 2024.5.5 and verify if it still happen? it would be good to know before next release.

Thanks 👍

After a couple of restart, everything’s fine…

ChirpyTurnip commented 6 months ago

Hi!

Sorry for the delay in responding I've not be around much this weekend. I can however (happily) confirm that after about 6 restarts and a couple of full reboots the Shelly devices all come back online and stay online!! Yay!!

@vistalba - I have some H&T units too.....I need to leave it for a bit to see if they phone home as expected. If they do not I will post an update, if you hear nothing than all is well on my end.

Special thanks to @met67 - it's a tidy solution! I have a few other things that sometimes need a kick so I will be keeping your snippet as it works perfectly! :-)

:-)

vistalba commented 6 months ago

Slightly updated the above snipped to also include Sensors (H&T, D&W) which also are affected:

type: custom:auto-entities
card:
  type: entities
  title: Shelly devices unavailable
  show_header_toggle: false
filter:
  include:
    - domain: light
      integration: shelly
      state: unavailable
      options:
        tap_action:
          action: call-service
          service: homeassistant.reload_config_entry
          data:
            entity_id: this.entity_id
    - domain: switch
      integration: shelly
      state: unavailable
      options:
        tap_action:
          action: call-service
          service: homeassistant.reload_config_entry
          data:
            entity_id: this.entity_id
    - domain: cover
      integration: shelly
      state: unavailable
      options:
        tap_action:
          action: call-service
          service: homeassistant.reload_config_entry
          data:
            entity_id: this.entity_id
    - domain: sensor
      integration: shelly
      state: unavailable
      options:
        tap_action:
          action: call-service
          service: homeassistant.reload_config_entry
          data:
            entity_id: this.entity_id
  exclude: []

zollak commented 6 months ago

Hi @ChirpyTurnip @zollak @met67 can you update to 2024.5.5 and verify if it still happen? it would be good to know before next release.

Thanks 👍

Hi, I have updated today my HA to 2024.5.5 and also came updates for Shelly devices v1.3.2 (the previous was v1.3.1). I also updated all my Shelly devices and HA as well.

I made system reboot as well and after a couple of restart, everything’s fine…

thecode commented 5 months ago

I am closing this issue for now, if anyone still experiencing Config entry has already been setup after 2024.6.0 please create a new issue with logs.

Thanks everyone for helping with logs

home-assistant / core

Config entry has already been setup for Shelly devices after 2024.5.1 with Gen1 devices #116975

The problem

What version of Home Assistant Core has the issue?

What was the last working version of Home Assistant Core?

What type of installation are you running?

Integration causing the issue

Link to integration documentation on our website

Diagnostics information

Example YAML snippet

Anything in the logs that might be useful for us?

Additional information