home-assistant / core

🏡 Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0

Lifx integration with many devices frequently goes unavailable #78876

Closed mspinolo closed 1 year ago

mspinolo commented 2 years ago

The problem

Since the latest integration update, my LIFX lights frequently become "not available". This affects most, but not all, of them (I have 20+ lights).

The behavior is not consistent during the day, which makes me suspect it is related to the wifi environment (I have 3 APs broadcasting the same SSID on channels 1, 6 and 11), but judging by the AP logs it doesn't seem to be caused by LIFX bulbs disconnecting from one AP and reconnecting to another.

I also see that they usually become unavailable for 10s and then come back online. I wonder whether this has something to do with the integration's polling cycle, since the integration's discovery interval is 10 (seconds?):

"""Const for LIFX."""

import logging

DOMAIN = "lifx"

TARGET_ANY = "00:00:00:00:00:00"

DISCOVERY_INTERVAL = 10
MESSAGE_TIMEOUT = 1.65
MESSAGE_RETRIES = 5
OVERALL_TIMEOUT = 9
UNAVAILABLE_GRACE = 90

So could it be that discovery, in my environment, simply can't keep pace and drops connections?
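A rough check of the timing budget implied by those constants. This is only a sketch: it assumes each request is attempted up to MESSAGE_RETRIES times with MESSAGE_TIMEOUT seconds per attempt, which is a guess about how the integration uses these values, not something confirmed in this thread. The variable names below are made up for illustration.

```python
# Constants copied from the const.py snippet above.
DISCOVERY_INTERVAL = 10
MESSAGE_TIMEOUT = 1.65
MESSAGE_RETRIES = 5
OVERALL_TIMEOUT = 9

# ASSUMPTION: every attempt waits the full MESSAGE_TIMEOUT before retrying.
worst_case_per_message = MESSAGE_TIMEOUT * MESSAGE_RETRIES

# Time left between the worst-case retries and the overall timeout.
slack_before_overall_timeout = OVERALL_TIMEOUT - worst_case_per_message

# Time left between the overall timeout and the next discovery cycle.
slack_before_next_cycle = DISCOVERY_INTERVAL - OVERALL_TIMEOUT

print(f"worst case per message: {worst_case_per_message:.2f}s")
print(f"slack before overall timeout: {slack_before_overall_timeout:.2f}s")
print(f"slack before next discovery cycle: {slack_before_next_cycle}s")
```

Under that assumption, a bulb that needs every retry leaves well under a second of headroom before the overall timeout, and only about one second separates the overall timeout from the next 10-second cycle.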

What version of Home Assistant Core has the issue?

2022.9.5

What was the last working version of Home Assistant Core?

The one before the LIFX integration update

What type of installation are you running?

Home Assistant OS

Integration causing the issue

LIFX

Link to integration documentation on our website

https://www.home-assistant.io/integrations/lifx/

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

No response

Djelibeybi commented 1 year ago

Yeah, that helps. Just remember to set a fixed IP if you get new bulbs. I always forget, so I still see this occasionally as I reconnect spare bulbs for testing purposes. 😂

MSIMaker commented 1 year ago

Guess I spoke too soon....the 3 lights are going offline for 10 seconds again. Sooooooo annoying.

Djelibeybi commented 1 year ago

Did you set the same IP addresses as they were previously using or new ones? If you changed the IP address when you made them static, delete them from Home Assistant and let it rediscover them. My fleet are mostly stable now with the exception of some Beams and a Z strip that are notoriously flaky anyway.

MSIMaker commented 1 year ago

Now that they are all static, I am going to do that and have them re-discovered.

MSIMaker commented 1 year ago

Update.

Still having issues with this. I found doing a light restart seems to stabilize them for a while. A few hours at least.

MSIMaker commented 1 year ago

Is anyone working on fixing this issue? I have 4 lights constantly going unavailable for no reason.

[image attachment]

Djelibeybi commented 1 year ago

No-one is working on fixing this issue in the Core right now. If you'd like to help me work out the actual cause, feel free to install my LIFX Beta integration using HACS by following the instructions at https://github.com/Djelibeybi/ha-lifx-beta. The current release (2023.2.0b1) is pretty stable. So much so that I'm running it on my production/real instance.

MSIMaker commented 1 year ago

Ok. Well I don't use HACS at all. My system is pretty pure if you know what I mean. I will see if I can install the beta at some stage and see if that helps.

Djelibeybi commented 1 year ago

If you don't use HACS, you should be knowledgeable enough to install the custom integration manually. Unless you don't run any custom integrations at all.

x3style commented 1 year ago

The thundering herd fix at 0 microseconds #82233

We're now running 2023.2.2 and the problems are still there. Any chance you could help us with a fix? Would really appreciate it!

Djelibeybi commented 1 year ago

We're now running 2023.2.2 and the problems are still there. Any chance you could help us with a fix? Would really appreciate it!

Please try my LIFX Beta integration (linked above) and let me know if that fixes or improves things for you. But no-one is looking at fixing this in the core at the moment, because narrowing down the problem is proving extremely difficult.

skynet01 commented 1 year ago

Been running your beta @Djelibeybi for a week now, works great and no more time out issues. :)

Djelibeybi commented 1 year ago

@skynet01 glad to hear that!

msbc42 commented 1 year ago

Found this issue yesterday and installed the Beta integration. Today I'm still seeing the timeouts:

[image attachment]

Djelibeybi commented 1 year ago

Yeah, the beta is also not a solution. It's just my attempts at trying to find one. The current version is close, but obviously not complete.

bdraco commented 1 year ago

During this beta cycle I ended up helping a user who had this issue and it turned out the cause was another integration blocking the event loop which was preventing the update from running in time. Once they removed the other integration, the problem went away.
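To illustrate the mechanism, here is a minimal, self-contained sketch of how one blocking call on the asyncio event loop delays an unrelated periodic task. It is not Home Assistant or LIFX code; `measure_lag` and `blocking_integration` are hypothetical names, and the timings are arbitrary.

```python
import asyncio
import time


async def measure_lag(interval: float, rounds: int) -> list[float]:
    """Sleep `interval` seconds `rounds` times and record how late each wake-up was."""
    lags = []
    for _ in range(rounds):
        start = time.monotonic()
        await asyncio.sleep(interval)
        lags.append(time.monotonic() - start - interval)
    return lags


async def blocking_integration(delay: float) -> None:
    """Hypothetical misbehaving integration that blocks the whole event loop."""
    await asyncio.sleep(0.01)  # let the poller start its first sleep
    time.sleep(delay)  # blocking call: no other task can run until it returns


async def main() -> float:
    poller = asyncio.create_task(measure_lag(0.05, 3))
    await blocking_integration(0.3)
    lags = await poller
    return max(lags)


worst_lag = asyncio.run(main())
print(f"worst wake-up lag: {worst_lag:.3f}s")
```

The poller asked to wake after 0.05s but cannot run until the blocking call returns, so its first wake-up is roughly a quarter of a second late. The same effect at a larger scale would let an update miss its timeout even though the device replied in time.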

bdraco commented 1 year ago

It may be something different in every case, but py-spy (use 0.3.14) https://community.home-assistant.io/t/instructions-to-install-py-spy-on-haos/480473 and the profiler.start service https://www.home-assistant.io/integrations/profiler/#service-profilerstart may be able to reveal performance issues. If that is the root cause of the problem, the more profiles we get, the more likely we will be able to compare them and maybe find the issue.

MSIMaker commented 1 year ago

During this beta cycle I ended up helping a user who had this issue and it turned out the cause was another integration blocking the event loop which was preventing the update from running in time. Once they removed the other integration, the problem went away.

Interesting.....can you tell me the integration that was causing the issue? I may have the same one.

bdraco commented 1 year ago

Interesting.....can you tell me the integration that was causing the issue? I may have the same one.

It's not something that was publicly available (thankfully 😉) so it won't be the same integration.

There was also a case of imap triggering the issue but that's since been fixed.

amelchio commented 1 year ago

It's more likely that your bulbs have always been doing this, we're just better at reporting it now than before.

Yes, they have always been doing this. LIFX hardware is flaky. Reporting intermittent dropouts is not helpful and the previous suppression was intentional.

Can we please reinstate something like UNAVAILABLE_GRACE to fix the 10s flip-flop regression?

bdraco commented 1 year ago

It's more likely that your bulbs have always been doing this, we're just better at reporting it now than before.

Yes, they have always been doing this. LIFX hardware is flaky. Reporting intermittent dropouts is not helpful and the previous suppression was intentional.

Can we please reinstate something like UNAVAILABLE_GRACE to fix the 10s flip-flop regression?

That seems like a good idea as I don't think we are going to be able to come up with a software fix for a hardware issue.
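As a sketch of what a grace period like this could look like (an illustration only, not the code from any Home Assistant PR), the class below keeps a device "available" until it has been silent for longer than UNAVAILABLE_GRACE. `GraceAvailability` and its methods are hypothetical names; timestamps are passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass
from typing import Optional

UNAVAILABLE_GRACE = 90  # seconds; value taken from the constants quoted at the top of this issue


@dataclass
class GraceAvailability:
    """Hypothetical tracker: report unavailable only after `grace` seconds of silence."""

    grace: float = UNAVAILABLE_GRACE
    last_seen: Optional[float] = None

    def record_reply(self, now: float) -> None:
        """Remember when the device last answered."""
        self.last_seen = now

    def is_available(self, now: float) -> bool:
        """A device that has never answered is unavailable; otherwise apply the grace window."""
        if self.last_seen is None:
            return False
        return now - self.last_seen <= self.grace


tracker = GraceAvailability()
tracker.record_reply(now=0.0)
print(tracker.is_available(now=10.0))   # True: one missed 10s poll is inside the grace window
print(tracker.is_available(now=120.0))  # False: silent for longer than the grace window
```

With a window like this, a single missed reply no longer flips the entity to unavailable and back 10 seconds later.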

Djelibeybi commented 1 year ago

LIFX bulbs use UDP exclusively, so there is no concept of a connection and nothing to drop out. Discussion of hardware flakiness aside, UDP has no built-in retransmission, so delivery is best effort.

We just need to stop raising exceptions when a bulb doesn't reply and either retry ourselves or log an "oh, shucks, we tried" message.
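Roughly what I mean, as a minimal sketch in plain asyncio rather than aiolifx. `request_with_retries` and `FakeBulb` are hypothetical names made up for this example, and the default retry count and timeout simply reuse the values from the constants quoted at the top of the issue.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional

_LOGGER = logging.getLogger(__name__)


async def request_with_retries(
    send: Callable[[], Awaitable[bytes]],
    retries: int = 5,
    timeout: float = 1.65,
) -> Optional[bytes]:
    """Retry a best-effort request; return None instead of raising on silence."""
    for attempt in range(1, retries + 1):
        try:
            return await asyncio.wait_for(send(), timeout)
        except asyncio.TimeoutError:
            _LOGGER.debug("No reply on attempt %d of %d", attempt, retries)
    _LOGGER.warning("Oh, shucks, we tried: no reply after %d attempts", retries)
    return None


class FakeBulb:
    """Hypothetical stand-in for a bulb that ignores its first `drops` requests."""

    def __init__(self, drops: int) -> None:
        self.drops = drops
        self.calls = 0

    async def get_state(self) -> bytes:
        self.calls += 1
        if self.calls <= self.drops:
            await asyncio.sleep(3600)  # never answers; wait_for cancels this
        return b"on"


bulb = FakeBulb(drops=2)
result = asyncio.run(request_with_retries(bulb.get_state, retries=5, timeout=0.05))
print(result, bulb.calls)
```

The caller gets a reply on the third attempt here, and gets `None` plus a log line if every attempt times out, so a lost datagram never surfaces as an exception.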

The other other option is replacing aiolifx with Photons which eliminates the issue completely but is way more complex to work with. I'm writing an API shim for the framework just to simplify the implementation.

bdraco commented 1 year ago

It's more likely that your bulbs have always been doing this, we're just better at reporting it now than before.

Yes, they have always been doing this. LIFX hardware is flaky. Reporting intermittent dropouts is not helpful and the previous suppression was intentional.

Can we please reinstate something like UNAVAILABLE_GRACE to fix the 10s flip-flop regression?

https://github.com/home-assistant/core/pull/90872

Needs some tests but out of time to do that right now

amelchio commented 1 year ago

The other other option is replacing aiolifx with Photons which eliminates the issue completely [...]

I believe this is an optimistic view which implies that you have not yet accepted that LIFX hardware is flaky.

Yes, LIFX hardware might work almost fine in perfect conditions. Those conditions include things like abandoning the first few LIFX generations, using an idle wifi, pointing antennas just right, placing bulbs outside of lampshades and keeping neighbors from popping corn.

Take a look at the LIFX firmware release notes: each release features "improved connectivity". I bet the next one will too.

Djelibeybi commented 1 year ago

I believe this is an optimistic view which implies that you have not yet accepted that LIFX hardware is flaky.

No, it takes the view that the framework written by the LIFX employee to power the LIFX Cloud is probably the best thing to use to manage LIFX devices at scale.

amelchio commented 1 year ago

Both of our statements can be true.

Djelibeybi commented 1 year ago

Both of our statements can be true.

Both statements are opinions, so they don't have to be. 😏 But point taken. Photons just takes a whole different approach to almost any other library in any language, which makes it far more robust, but doesn't fit well with Home Assistant's device/entity POV. I've (mostly) created a shim layer that presents an aiolifx-like API to Home Assistant powered by Photons. And then I (mostly) created another one that does the same using Photons Interactor.

Edited to add that "mostly" actually means "got it to work sufficiently to provide a path to MVP" but not actual MVP state.

MSIMaker commented 1 year ago

On a slight tangent here....but for my own interest. What router/modem do you have if you experience this issue?

I have the ASUS AX11000

I can see the lights disappear in my router monitor and then come back. I have some set as static ip and some dhcp....both react the same.....dropped for a few seconds and then come back.

This is without HA even running. I am wondering if router brand or a setting is making any difference here and the issue is not with HA at all.

Djelibeybi commented 1 year ago

I have a Ubiquiti setup and yes, the wifi on LIFX devices is ... interesting. I suspect (though can't prove) that the microcontrollers are quietly rebooting fairly often, which results in the high number of DHCP requests. Certainly on the linear multizone devices (Z, Beam, Lightstrip) they're rebooting without triggering a resync.

alexruffell commented 1 year ago

This may have nothing to do with this issue but a couple of years ago I had highly unstable LIFX lights that kept dropping off HA. It turned out to be a botched mDNS implementation (possibly also related to my bulbs being on a different VLAN) on my Unifi networking gear. Once they fixed that, the instability vanished overnight.

bdraco commented 1 year ago

Can we please reinstate something like UNAVAILABLE_GRACE to fix the 10s flip-flop regression?

https://github.com/home-assistant/core/pull/91157 uses the value for UNAVAILABLE_GRACE

bdraco commented 1 year ago

https://github.com/home-assistant/core/pull/91157 should be ready for testing now

bdraco commented 1 year ago

I'll open PRs to aiolifx to fix some of the underlying issues in the library which should improve reliability:

amelchio commented 1 year ago

Good catch @bdraco!

bdraco commented 1 year ago

2023.4.4 has the new version of aiolifx with fixes in it so it would be nice to know if it improves the situation for anyone.

bdraco commented 1 year ago

Also, https://github.com/home-assistant/core/pull/91157, which is the bigger change, isn't in a public build yet.

MSIMaker commented 1 year ago

Installing 2023.4.4 right now and clearing the logs. We shall see how it goes.

But as an aside, I removed my ASUS GT11000 router and set my Telstra Smart Modem back to router mode to let it manage my home, and the dropouts have almost stopped completely. HA is more stable than it has ever been. So I suspect that there are issues within that router as well as some LIFX issues, and that the two combine here.

The ASUS router is going back to ASUS under RMA and if they replace it, I will try it again. But for now the SM3 is working a treat.