home-assistant / core

Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0

Lifx integration with many devices frequently goes unavailable #78876

Closed mspinolo closed 1 year ago

mspinolo commented 2 years ago

The problem

Since the latest integration update I have seen a lot of LIFX lights becoming "not available". This is happening on most (but not all) of them (I have 20+ lights).

The behavior is not consistent during the day, which makes me suspect some relation to the Wi-Fi environment (I have 3 APs broadcasting the same SSID on channels 1, 6 and 11), but according to the AP logs it doesn't seem to be related to LIFX devices disconnecting from one AP and reconnecting to another.

I also see that they usually become unavailable for 10 s and then come back online. I wonder whether this has something to do with the polling cycle of the integration, since I see that the integration's discovery interval is 10 (seconds?):

"""Const for LIFX."""

import logging

DOMAIN = "lifx"

TARGET_ANY = "00:00:00:00:00:00"

DISCOVERY_INTERVAL = 10
MESSAGE_TIMEOUT = 1.65
MESSAGE_RETRIES = 5
OVERALL_TIMEOUT = 9
UNAVAILABLE_GRACE = 90

So could it be that discovery, in my environment, simply can't keep pace and drops connections?
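As a rough check on the numbers above, here is a minimal sketch of the timing budget these constants imply. It assumes (and this is only an assumption) that a request may be retried up to MESSAGE_RETRIES times with MESSAGE_TIMEOUT seconds per attempt, capped by OVERALL_TIMEOUT; how the integration actually combines these values may differ.

```python
# Values copied from the const.py excerpt above.
DISCOVERY_INTERVAL = 10
MESSAGE_TIMEOUT = 1.65
MESSAGE_RETRIES = 5
OVERALL_TIMEOUT = 9

# Assumption: attempts run back to back, each waiting for the full timeout.
worst_case_retries = MESSAGE_TIMEOUT * MESSAGE_RETRIES

# Assumption: the overall timeout caps a single request.
worst_case_request = min(worst_case_retries, OVERALL_TIMEOUT)

# Time left in one 10-second cycle after a single worst-case request.
headroom = DISCOVERY_INTERVAL - worst_case_request

print(
    f"retries: {worst_case_retries:.2f}s, "
    f"request: {worst_case_request:.2f}s, "
    f"headroom: {headroom:.2f}s"
)
```

Under these assumptions a single unresponsive bulb can take about 8.25 of the 10 seconds, which leaves little slack on a busy or weak Wi-Fi link.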

What version of Home Assistant Core has the issue?

2022.9.5

What was the last working version of Home Assistant Core?

the one before LIFX integration update

What type of installation are you running?

Home Assistant OS

Integration causing the issue

LIFX

Link to integration documentation on our website

https://www.home-assistant.io/integrations/lifx/

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

No response

probot-home-assistant[bot] commented 2 years ago

Hey there @bdraco, @djelibeybi, mind taking a look at this issue as it has been labeled with an integration (lifx) you are listed as a code owner for? Thanks! (message by CodeOwnersMention)

Djelibeybi commented 2 years ago

It's more likely that your bulbs have always been doing this, we're just better at reporting it now than before. Does Home Assistant consistently re-establish connectivity to each bulb? Are they responsive to automation and manual control?

mspinolo commented 2 years ago

In general they do re-establish their connection to HA.

In the past I never had automation or responsiveness issues, while now it sometimes happens that an action fires while the state is unavailable.

I'm not sure whether something changed on the HA side (e.g. an increase in broadcast traffic) that made the situation worse recently. I have quite a lot of Wi-Fi devices in a single LAN segment, which could be the issue (I need to segregate them into VLANs at some stage but can't find the time). Airtime shouldn't be an issue as the devices are split across 3 APs (20-25 each).

Djelibeybi commented 2 years ago

There shouldn't have been any significant increase in the amount of traffic, but we are interacting with the bulbs more than before. If you don't use HomeKit, it may be worth integrating your bulbs using Home Assistant's HomeKit Controller integration instead, as that uses local push, instead of polling the bulbs every 10 seconds.

If you do use HomeKit, you still can by connecting them to Home Assistant first, then exporting them to HomeKit from HASS.

mspinolo commented 2 years ago

Yes, I read about it and it is on my to-do list: it should be a much better way of controlling bulbs. Unfortunately I have some Z strips which are not HomeKit compliant, so for those I believe I will have to stick with the LIFX integration.

When you say "we are interacting more with the bulbs", what are you referring to in detail?

Mincka commented 2 years ago

I also see a LOT more of these error messages since I started using the new integration. The LED strip frequently becomes unavailable and I have this in the logs:

2022-10-04 16:40:08.771 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 16:57:58.259 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 17:30:03.269 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 17:43:10.296 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 17:48:25.046 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 17:50:58.265 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 18:06:03.258 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 18:11:40.260 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 18:14:02.258 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 18:26:12.260 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 18:52:03.934 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 18:54:15.258 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 18:57:23.262 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 19:02:22.266 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 19:14:55.286 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data
2022-10-04 19:19:45.260 ERROR (MainThread) [homeassistant.components.lifx] Timeout fetching Tête de lit (10.0.0.3) data

In my case, the Wi-Fi coverage is weak in this room, but that was never an issue for using the LED strip. Before muting the integration in the logs, I am going to try the HomeKit integration. I just need to improve my Bluetooth coverage first. Thanks for the suggestion.

Djelibeybi commented 2 years ago

@bdraco I wonder if we shouldn't make the timeout a little less noisy? Perhaps only report if the device hasn't recovered after some amount of time? Most of my timeouts recover within the next 10 second window, for example. Usually much quicker.

bdraco commented 2 years ago

Does increasing the timeout allow it to go through? If we suppress the log message, the device will still be marked unavailable and lead to questions about why.
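To make the trade-off in the last two comments concrete, here is a minimal sketch of a grace period that delays both the error report and the unavailable state until failures have persisted for a while. `AvailabilityTracker` is a hypothetical helper written for illustration, not code from the integration, and reusing the `UNAVAILABLE_GRACE` value from the const.py excerpt this way is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

UNAVAILABLE_GRACE = 90  # seconds; value taken from the const.py excerpt in the first post


@dataclass
class AvailabilityTracker:
    """Hypothetical helper: treat a device as unavailable only after a grace period."""

    grace: float = UNAVAILABLE_GRACE
    first_failure: Optional[float] = None

    def record(self, success: bool, now: float) -> bool:
        """Record one poll result and return whether the device counts as available."""
        if success:
            self.first_failure = None  # any successful poll resets the failure window
            return True
        if self.first_failure is None:
            self.first_failure = now  # remember when the run of failures started
        return (now - self.first_failure) < self.grace


tracker = AvailabilityTracker()
print(tracker.record(False, now=0.0))   # first timeout, still inside the grace period
print(tracker.record(False, now=95.0))  # still failing 95 s later
print(tracker.record(True, now=100.0))  # a successful poll
```

The cost of this design is that a bulb that really is gone keeps looking available for up to `grace` seconds, so commands sent in that window would fail silently.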

Djelibeybi commented 2 years ago

Let me test and get back to you on that.

Djelibeybi commented 2 years ago

I'm not getting any timeouts when running the latest dev that uses extended multizone messages and considering the issue is with a strip, I'd like to see if this is still an issue once that code is in a stable release.

TL;DR: this may already be fixed in dev via https://github.com/home-assistant/core/pull/79444

alystair commented 2 years ago

I'm having a similar issue... I only have Z / Z2 strips...

They only came back online after an HA hardware reboot; as a novice end user I had no way to force some sort of manual check.

Is there a way to update temporarily to the dev build, or is it maybe not worth the effort? Is there an approximate ETA for when the potential fix will hit stable (are we talking weeks or months)?

alystair commented 2 years ago

One LIFX-Z failed to come back in HA (it still worked without issues via Alexa / the LIFX app). I removed the device from the HA list, then tried looking for it again via discovery. It was discovered under a weird name (serial number/MAC?) and not the one I had set ('Bath' in this case). Even after adding it, not only does it not show up in the device list, but the LIFX integration no longer finds it... a hardware reboot did not bring it back. Advice would be appreciated.

tankdeer commented 2 years ago

I am having the same issue as well. Reloading the integration tends to fix the issue, but it often reoccurs a short time later

mspinolo commented 2 years ago

I ran various tests, changing things on my network to reduce multicast traffic as much as I can, but no luck. My aim was to make polling all my LIFX bulbs/strips feasible within 10 s, which I think is too short a turnaround time.

I will try moving some bulbs to HomeKit to see whether that improves the situation, which is quite annoying at the moment.

Djelibeybi commented 2 years ago

If you haven't disconnected your bulbs from the LIFX Cloud, that's another thing you should do to reduce the CPU load on the bulbs themselves. This assumes you don't use themes or schedules defined in the LIFX app, as those require cloud connectivity.

mspinolo commented 2 years ago

They are disconnected, all of them. Everything worked well with no hiccups for 2 years before the integration update.

Djelibeybi commented 2 years ago

I'm not denying there is something going on with the way the integration currently does discovery, but it's proving extremely difficult to isolate or reproduce in a controlled environment. Especially considering discovery is improved for most folks.

mspinolo commented 2 years ago

I know you are not; please don't take my comment badly. My feeling is that a bad mix of polling frequencies and retries leads to this.

I previously had experience with a (probably) faulty LIFX bulb which disconnected and hung on every HA restart (with the previous version of the integration), I think due to the burst of multicast/polling traffic HA was sending. I believe LIFX bulbs have very poor bandwidth and go bananas when "flooded".

In my environment I don't see the same intermittent disconnections for all bulbs: I see more for the ones with a weaker signal (still decent though, around -70dB). So I think it is a mix of Wi-Fi radio environment, positioning and number of bulbs.

This is likely just showing that polling is not a robust way of communicating. I'm not sure whether anything different can be done within the LIFX integration.

I have now migrated a few lights to HomeKit: let's see if it gets better.

melbs2 commented 2 years ago

I am having the same issue as well. After 2+ years of ~99.9% uptime, I'm now experiencing multiple long dropouts across my 20 bulbs every day.

Djelibeybi commented 2 years ago

Yeah, I have a hypothesis as to the cause of this, I just need some spare time to refactor things to see if it's valid or not. I'm hoping to get to it this weekend.

bdraco commented 1 year ago

There is a thundering herd problem with the coordinators that causes all polling to be aligned at microsecond 0. It is fixed in 2022.12.x, which might help with this issue.
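For readers unfamiliar with the term, here is a minimal sketch of the difference between aligned and jittered polling. It illustrates the general idea only and is not the actual fix; the function names, the one-second window and the seed are assumptions made for the example.

```python
import random
from typing import Optional


def aligned_offsets(device_count: int) -> list[float]:
    """Every coordinator fires at the same instant within the second."""
    return [0.0] * device_count


def jittered_offsets(
    device_count: int, window: float, seed: Optional[int] = None
) -> list[float]:
    """Give each coordinator its own random offset (in seconds) inside the window."""
    rng = random.Random(seed)
    return [rng.uniform(0.0, window) for _ in range(device_count)]


# With aligned polling, 20 devices all receive their request at one instant.
print(len(set(aligned_offsets(20))))

# With jitter, the same 20 requests are spread across the window.
print(sorted(round(offset, 3) for offset in jittered_offsets(20, window=1.0, seed=1)))
```

Spreading the requests means the access points and the bulbs never see 20+ packets in the same instant, which matters on a shared 2.4 GHz channel.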

bdraco commented 1 year ago

The thundering herd fix at 0 microseconds: #82233

Djelibeybi commented 1 year ago

I've been trying to track this down for ages. I'm really glad you found the cause.

mspinolo commented 1 year ago

In case this helps: I migrated all my LIFX lights to the HomeKit Controller integration, and since then (3+ weeks ago) I have had zero disconnections.

melbs2 commented 1 year ago

Thank you all for your input. I will hold out for 2022.12 as this issue is still persisting. I will use @mspinolo's suggestion if the issue remains after the update 🙏

Djelibeybi commented 1 year ago

@melbs2 I have some stuff I'm testing on top of @bdraco's fix for the thundering herd that is showing a lot of promise. There is still an issue with very old devices (like Beams or Tiles) but otherwise, I'm quite happy with the way my flock of 60 devices is behaving.

alystair commented 1 year ago

Did you change something recently? I'm getting a ton of errors for all my LIFX Z v1 strips. I've unpowered and repowered them multiple times to no avail. They work everywhere else (LIFX app, Alexa).

Seems to have happened after 2022.11.5

Logger: homeassistant.config_entries
Source: config_entries.py:1089
First occurred: December 2, 2022 at 7:22:14 PM (462 occurrences)
Last logged: 2:06:59 AM

Config entry 'Lower' for lifx integration not ready yet: No response from LIFX bulb; Retrying in background
Config entry 'Door' for lifx integration not ready yet: No response from LIFX bulb; Retrying in background
Config entry 'Window' for lifx integration not ready yet: No response from LIFX bulb; Retrying in background
Config entry 'Upper' for lifx integration not ready yet: No response from LIFX bulb; Retrying in background

Logger: homeassistant.helpers.service
Source: helpers/service.py:637
First occurred: December 3, 2022 at 12:07:21 AM (13 occurrences)
Last logged: December 3, 2022 at 7:00:00 PM

Unable to find referenced entities light.door or it is/they are currently not available
Unable to find referenced entities light.window or it is/they are currently not available
Unable to find referenced entities light.upper or it is/they are currently not available
Unable to find referenced entities light.lower or it is/they are currently not available

Djelibeybi commented 1 year ago

If you have HACS installed, you could try LIFX Beta component: https://github.com/Djelibeybi/ha-lifx-beta/

x3style commented 1 year ago

Having the same issue.

I'm unable to use automations against the lights, such as a slow dim with a transition. The lights go unavailable after 1-2 minutes, which then causes them to turn off once they are available again. The light does not otherwise go unavailable, but triggering this dimming (turn on from 0 to 50% brightness over 5 minutes), which I believe involves rapid communication with the light, seems to trigger the unavailability consistently. It is as if the light is being flooded by the comms and decides to give up for a few seconds.

Djelibeybi commented 1 year ago

@x3style could you please try https://github.com/Djelibeybi/ha-lifx-beta to see if that resolves the availability issue?

x3style commented 1 year ago

> @x3style could you please try https://github.com/Djelibeybi/ha-lifx-beta to see if that resolves the availability issue?

Sorry, I'm unable to deploy an unstable version; I need the lights to work. I can live without the automation until you guys find a proper fix.

Here are the errors I get, in case they help someone:

Logger: homeassistant.components.lifx
Source: helpers/update_coordinator.py:151
Integration: LIFX (documentation, issues)
First occurred: 11:41:48 PM (3 occurrences)
Last logged: 11:51:21 PM

Timeout fetching Stairs 1 (10.0.53.3) data
Timeout fetching Stairs 2 (10.0.53.23) data

To recreate it, I have the following automation that I trigger manually for testing. It will kill one or both lights during the run.

alias: Stairs:Turn On Stairs @ 5:30
description: ""
trigger:

melbs2 commented 1 year ago

@Djelibeybi 24 hours into installing ha-lifx-beta 2022.11.0-dev with no dropouts at all. It has solved all my issues. TY

MSIMaker commented 1 year ago

I also have the same issue with 28 lights. Is there going to be a fix release for this? I can't install the beta version as I need the lights running or the missus will get gnarly at me.

Djelibeybi commented 1 year ago

The Beta is proving to be more stable than core at the moment. It's worth trying because it's easy enough to remove if it doesn't improve things.

tankdeer commented 1 year ago

Just created this issue for the beta. If you have bulbs that are power cycled at the switch, you may have availability issues until it's resolved. https://github.com/Djelibeybi/ha-lifx-beta/issues/29

mark007 commented 1 year ago

I have really been enjoying the latest LIFX integration: it no longer has the instability or the issue where my lights used to randomly flicker on and off every few hours. However, this issue has also been affecting me for the last few days. I have switched from a Nest Wifi to an Orbi Wi-Fi mesh in my house. While the LIFX app itself has always shown all bulbs as online after the various changes I made to tweak the Wi-Fi mesh, Home Assistant would show various devices as unavailable, even many hours afterwards. An HA restart would fix it. Maybe I'll also try the beta.

Does it look like this could be fixed in the beta? Is there an ETA on when we might see it merged into the master branch? Thanks as always for this fabulous integration and the work you guys put into it.

Djelibeybi commented 1 year ago

Take the Beta for a spin and if you experience the same issue, please grab debug logs and post them to the Beta GitHub repo in an issue. That will give me further insight into the connectivity issues.

Note that the Beta isn't designed for direct inclusion into the core: rather, it's where users like yourself test different implementations to see what works and what doesn't. Because everyone uses their smart devices slightly differently, there are too many edge cases for me to test myself.

mark007 commented 1 year ago

Looking at the details of the beta, it seems like it shouldn't be used in production. I might pass, but if there's anything I can provide to help understand this issue, I can of course post logs here or send them. Is this issue understood well enough, or is it one of those issues that's difficult to diagnose?

Djelibeybi commented 1 year ago

If you're familiar with HACS, it's easy to roll back if it doesn't improve things, but it will require restarting Home Assistant to enable and disable. Having said that, the current beta may not improve things for your particular situation.

On that, I seem to recall there is a setting required to improve local LAN access for LIFX on Orbi networks. You might want to search the LIFX forums for that. The LIFX smartphone app "cheats" by using a bulb's cloud connection for control if it can't be found locally, which is why lights can appear online in the smartphone app but not in Home Assistant.

mark007 commented 1 year ago

Thanks for the reply. I had a look at some of those Orbi-specific settings, which look interesting. However, in my case an HA reboot fixes it, which indicates to me that the bulb itself was OK all along from a Wi-Fi point of view. I hit it once just now, where one light was unavailable for 4 hours. An HA restart brought it back online. If I can provide anything to help with this one, let me know; or perhaps this is a known issue / resolved in the beta.

Edit: Disabling WMM on my Orbi for the 2.4 GHz band (QoS, as far as I can tell) has so far stopped the regular, short periods of unavailability I was seeing. A general question though: can the integration be tweaked to be resilient to these types of networks, where I guess responses from the bulbs may sometimes be slower than normal, perhaps due to the network prioritizing something like users watching Netflix? That'd be super if possible.

Edit2: Wow, disabling WMM / airtime fairness kills performance for the devices using that network. I have some old tablets whose performance was killed by turning off WMM on the 2.4 GHz band. I really think disabling this should be avoided; instead, any software or integrations built on top should be resilient to the fact that these IoT devices will, in many networks, get a small amount of airtime for good reason.

Edit3: Disabling WMM seemingly fixed most of the issue, but not all of it. One or two Z strips in particular go unavailable about once per hour, at random. I am unsure whether it's related to power saving, as I barely use them (I'm not sure the integration can keep them awake, and the Orbi doesn't have an option to disable power saving unless I find a telnet command to do it). I do notice that pings (set up as Home Assistant ping entities towards the LIFX IPs) to some of the bulbs drop, and the bulbs usually become pingable again within a few seconds. Again, I'm not sure whether the knowledge you have built up from the Ubiquiti tweaks you had to make could help improve the integration so that it no longer needs those network tweaks. I'll keep digging to see if I can prevent these strips from going unavailable. They are the furthest devices from the main Orbi router but have 2 out of 5 bars according to the LIFX app. I'd love to know whether the bulbs are perhaps doing their own reconnect/reboot at this low connection strength. I'm clutching at straws at this point :) As an experiment, can any non-beta integration service like lifx.set_state keep a bulb from going to sleep if I were to run it every 5 or 10 seconds? Or is the service smart enough not to talk to the bulb if it's already in that state?

mark007 commented 1 year ago

The new beta 2023.1.0b1 is fantastic, thanks so much: no offline bulbs within the last 1.5 hours, whereas I would usually have a trickle of unavailable bulbs every few minutes.

mark007 commented 1 year ago

Here is what I have noticed throughout the day while monitoring a helper I set up to increment whenever a light becomes unavailable. It stayed at 0 for a very long time, but every now and again various strips would fall into a pattern of becoming unavailable, then off, then unavailable, then off, in a sort of loop. It doesn't seem to resolve itself unless I power off the strip for a few seconds and then power it back on. It has happened to three strips today, so I'll monitor over the next few days to see if it happens to them all. (As a pure guess, could the bulbs be slowly building up some list of old/open connections and then going crazy until they are power cycled?) Is there a way to see, for each bulb or strip, how many open connections it has internally (or thinks it has)? I guess there should only ever be one from the integration towards each bulb at any time, right?

This is with the new beta b2. By the way, the list below comes from a list sensor I set up, which is populated with any unavailable lights and which I can look back on quite quickly in the logbook.

Unavailable Lights changed to [] triggered by state of Hob Floor Strip turned off 23:04:24 - In 3 seconds
Unavailable Lights changed to ['light.hob_floor_strip'] triggered by state of Hob Floor Strip became unavailable 23:03:56 - 1 minute ago
Unavailable Lights changed to [] triggered by state of Hob Floor Strip turned off 22:57:22 - 7 minutes ago
Unavailable Lights changed to ['light.hob_floor_strip'] triggered by state of Hob Floor Strip became unavailable 22:56:54 - 8 minutes ago
Unavailable Lights changed to [] triggered by state of Hob Floor Strip turned off 22:54:21 - 10 minutes ago
Unavailable Lights changed to ['light.hob_floor_strip'] triggered by state of Hob Floor Strip became unavailable 22:51:42 - 13 minutes ago
. . . . .

Djelibeybi commented 1 year ago

There are no connections: all messages are sent via UDP, so it's best effort at best. My current suspicion is actually a CPU bottleneck caused by too much network traffic that each bulb has to evaluate to some degree before it can ignore it.
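Because UDP gives no delivery guarantee, a client has to build its own timeout and retry loop. The following is a minimal sketch of that pattern using only the standard library; it is not the integration's code, the function name is hypothetical, and the default values are simply borrowed from the const.py excerpt in the first post.

```python
import socket
from typing import Optional, Tuple


def request_with_retries(
    payload: bytes,
    address: Tuple[str, int],
    timeout: float = 1.65,
    retries: int = 5,
) -> Optional[bytes]:
    """Send a datagram and wait for a reply, resending when none arrives in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for _ in range(retries):
            sock.sendto(payload, address)
            try:
                data, _sender = sock.recvfrom(2048)
                return data
            except socket.timeout:
                # Either the request or the reply was lost; UDP cannot tell us which.
                continue
    return None  # every attempt timed out
```

A real client would also need to match each reply to the request that caused it, which this sketch leaves out. A `None` result only means that nothing came back in time, not that the device is off.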

mark007 commented 1 year ago

Today I had two strips stay unavailable for over an hour, both were still controllable from the lifx app. A HA restart brought both back online, which to me indicates there's something from the integration logic side that might need changing.

bdraco commented 1 year ago

That probably means that the LIFX device is no longer responding to that specific UDP source port, or the router has dropped traffic for that source and destination port. Reinitializing the socket with a different source port will probably fix it, but it shouldn't be needed, as that's a network-level workaround.
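For anyone wanting to experiment with that workaround, here is a minimal sketch of reopening a UDP socket on a different source port with the standard library. The helper names are hypothetical and this is not how the integration manages its transport; it only shows the mechanism.

```python
import socket


def open_udp_socket() -> socket.socket:
    """Open a UDP socket on an OS-assigned (ephemeral) source port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", 0))  # port 0 asks the OS to pick a free port
    return sock


def reopen_on_new_port(old: socket.socket) -> socket.socket:
    """Replace a socket with a fresh one bound to a different source port."""
    # Open the new socket before closing the old one, so the OS cannot
    # hand the same port straight back.
    new = open_udp_socket()
    old.close()
    return new


sock = open_udp_socket()
old_port = sock.getsockname()[1]
sock = reopen_on_new_port(sock)
print(old_port, "->", sock.getsockname()[1])
sock.close()
```

If a device starts answering again right after such a swap, that would point at the port-level state described above rather than at the bulb's Wi-Fi link.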

mark007 commented 1 year ago

Oh, interesting. Would a debug log, or a code change to switch ports in such cases, be needed to prove this point? The strips seem much less reliable than the bulbs, which seem to be rock solid. As the most drastic workaround, I'm wondering whether a fallback to the LIFX cloud is worth considering, although I'm sure it's a lot of work to implement and shouldn't be required if the local connectivity can be made very reliable and resilient to all of the various quirks.

bdraco commented 1 year ago

Ideally the change should be made by someone who can replicate the issue. Mine are all rock solid, so I have no way to test an implementation change.

Djelibeybi commented 1 year ago

I've seen similar behaviour, so I'll see if I can reproduce it reliably enough to be confident with any potential fix/workaround.

Djelibeybi commented 1 year ago

I've just released 2023.1.0b4 of the LIFX Beta, which should do a few things: first, hopefully make bulbs fall offline less often, and second, provide more information when they do.

(Edited because I just released 2023.1.0b4).