Hyperion integration (with RPi nodes) frequently loses state and cannot recover

home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.

https://www.home-assistant.io

Apache License 2.0

73.58k stars 30.75k forks source link

Hyperion integration (with RPi nodes) frequently loses state and cannot recover #75670

Closed zaneclaes closed 2 years ago

zaneclaes commented 2 years ago

The problem

I have 5 different Raspberry Pis running Hyperion (latest, installed per Hyperion documentation for RPi). Each one of them exhibit the same problems that render them effectively unusable with home assistant. None of these problems are present with the native Hyperion interface. Rather, all of them appear to be issues with the HA/Hyperion communication, as I can try the same operations from the Hyperion web interface and they reliably succeed.

I first noticed that turning on/off a Hyperion instance is incredibly inconsistent. About 1/5 times, my simple light.turn_on or light.turn_off automation fails. There is no error message, but about 1-2 seconds after triggering, it reverts to the old state. Thus, my time-based automations fail silently, and remain in the old state. I can see the same thing happening occasionally even without the automation; simply pressing the toggle (switch) in HA will sometimes inexplicably revert itself. Note that all the hardware is hard-wired (not WiFi), running on a Raspberry Pi 4 — so there's effectively zero chance of resource/bandwidth bottlenecks being the root cause.

Similarly, I cannot get any automation to work for gracefully resuming the light state. For example, I tried adding an automation for when the light leaves the disabled state. However, there are many bugs here:

Home Assistant doesn't always update the devices into the disabled state, instead seeing them as off despite the fact that they are unreachable on the network.
If I trigger an automation to turn the light on when the state changes, it does not work at all. I can run the automation normally and see the light turn on, but if I trigger the automation when the instance leaves the disabled state, it does not turn on the light ever. I have also tried adding a 30 second delay to the automation, in case it has to do with boot up lag.
If I manually trigger the same automation again (light.turn_on with an effect specified while the light is already on) it in fact turns off the light instead of turning it on.

What version of Home Assistant Core has the issue?

2022.7.2

What was the last working version of Home Assistant Core?

N/A (never worked)

What type of installation are you running?

Home Assistant Container

Integration causing the issue

Hyperion

Link to integration documentation on our website

https://www.home-assistant.io/integrations/hyperion/

Diagnostics information

N/A

Example YAML snippet

Configured via IP address in the UI

Anything in the logs that might be useful for us?

Nothing from Hyperion

Additional information

No response

probot-home-assistant[bot] commented 2 years ago

hyperion documentation hyperion source _{^{(message by IssueLinks)}}

probot-home-assistant[bot] commented 2 years ago

Hey there @dermotduffy, mind taking a look at this issue as it has been labeled with an integration (hyperion) you are listed as a code owner for? Thanks! _{^{(message by CodeOwnersMention)}}

dermotduffy commented 2 years ago

(Thanks for moving it over!)

I first noticed that turning on/off a Hyperion instance is incredibly inconsistent. About 1/5 times, my simple light.turn_on or light.turn_off automation fails. There is no error message, but about 1-2 seconds after triggering, it reverts to the old state.

This usually means the backend Hyperion server accepts the command, but is not re-issuing/reflecting the change back to the integration for some reason. The Hyperion API is fairly complicated, so this could be for a variety of reasons.

Home Assistant doesn't always update the devices into the disabled state, instead seeing them as off despite the fact that they are unreachable on the network.

Do you mean 'unavailable'? The HA entities should never be put into/out of disabled state unless a human marks the entity as disabled -- this is never influenced by network connectivity. If you mean unavailable, then this is this bug (https://github.com/home-assistant/core/issues/61230), the underlying library that the integration uses may not notice if its counterpart (the server) gets disconnected.

If I trigger an automation to turn the light on when the state changes, it does not work at all. I can run the automation normally and see the light turn on, but if I trigger the automation when the instance leaves the disabled state

Again the use of "disabled" here is confusing. Can you clarify what you mean? Home assistant entities being enabled or disabled is not related to network connectivity.

If I manually trigger the same automation again (light.turn_on with an effect specified while the light is already on) it in fact turns off the light instead of turning it on.

Lets ignore automations for a moment, can you manually / reliably use the Hyperion entity correctly from HA? i.e. in the entity control ...

Screenshot from 2022-07-23 10-45-20

Can you provide the exact Hyperion version number you are using?

zaneclaes commented 2 years ago

The Hyperion API is fairly complicated, so this could be for a variety of reasons.

Certainly, though I expected some kind of error code in the logs. Does the API not reject in the case of failure, or otherwise have a callback mechanism to report?

Do you mean 'unavailable'?

I suppose I do; I had copy/pasted the state into my automation correctly, but misspoke here.

The linked issue seems to cover one of the symptoms (the unavailable state after losing power, and inability to create an automation which fixes the problem) fairly well. We've had over 30 power outages in 3 months, so this isn't an edge case for me ;)

Reading through the related issue and perusing the Hyperion docs, it seems like a Websocket implementation of the Hyperion client would be much more suited to the HA use-case than a JSON API... though I can imagine that perhaps the resource overhead would not be desirable, it is still the approach used by those integrations which require fast and bidirectional communication like node-red (which have scaled up to a dozen+ web socket connections without issue for me). Note: perhaps the WS API doesn't push updates to the client, but at least the socket disconnect would prevent the need for a polling or keep-alive mechanism, which generally incur even higher performance penalties.

Lets ignore automations for a moment, can you manually / reliably use the Hyperion entity correctly from HA? i.e. in the entity control ...

No. If I simply sit there toggling the switch back and forth (even if I give it a good 5+ seconds each time), eventually the integration breaks. The switch reverts to the old position, or otherwise becomes detached from the actual state of the integration. At that point, I can try toggling it a couple times (it tends to bounce back a few times) and eventually it works again. But the only way I have of knowing if it's working is to observe the actual lights.

Can you provide the exact Hyperion version number you are using?

Each Raspberry Pi 4 instance is connected to a Arduino Uno via USB, which is running adalight...

dermotduffy commented 2 years ago

Certainly, though I expected some kind of error code in the logs. Does the API not reject in the case of failure, or otherwise have a callback mechanism to report?

It's not so much an API error, but rather the resultant state callback that can go missing, i.e. the actual server request can be successful, but that does not imply that the necessary state callback that that API change should trigger will actually happen (e.g. the API call might be to set the light to red at a given Hyperion priority -- that request can succeed, without the callback back to the integration that a red light is now the top priority). I have filed many Hyperion bugs related to these kind of issues (not suggesting this is necessarily an issue with Hyperion in this case).

Reading through the related issue and perusing the Hyperion docs, it seems like a Websocket implementation of the Hyperion client would be much more suited to the HA use-case

Agree. There are some functionality differences between the different implementations (and at the time -- bugs in the websocket implementation I ran into) that made this decision for me. That was quite a while ago and things might be different now but this would be a huge change obviously.

No. If I simply sit there toggling the switch back and forth (even if I give it a good 5+ seconds each time), eventually the integration breaks.

Mind capturing some logs for me? The following should result in the raw traffic going back and forth, showing up in your logs:

logger:
  default: warning
  logs:
    hyperion: debug
    homeassistant.components.hyperion: debug

Also, would you mind seeing if Hyperion 2.0.12 behaves differently (i.e. do a single version downgrade on one devices to see if it's a behavior changes between Hyperion versions).

zaneclaes commented 2 years ago

Also, would you mind seeing if Hyperion 2.0.12 behaves differently (i.e. do a single version downgrade on one devices to see if it's a behavior changes between Hyperion versions).

I've actually been experiencing this bug for about a year (since when I first installed the component). I only had 1 Hyperion instance until now so it wasn't terribly problematic as long as the device stayed on, but now with 5 instances and power outages...

Here are some logs, where I restarted HA and then toggled a light 3 times (every 3 seconds). On the 3rd try, it reverted to the old state (while the lights did not change).

I suppose this could be a race condition (I see you're forced to do many API calls to update the state, which may be conflicting). I can't be sure that this is the same exact symptoms as when it fails due to an automation (automations never run within 12 hours of each other). So I ran the test again, and waited 10 seconds between each toggle... logs of that failure are in the comment. In this second trial, the symptom was different. Instead of the toggle flipping back off despite the light being on, the toggle correctly turned off but the light stayed on. However about a minute later the light turned off. Further attempts to toggle during a state like this tend to lead to indeterminate behavior (sometimes the switch works and suddenly all is well, sometimes it just gets stuck in a "revert loop").

Lest we get lost in the weeds... I would be very happy if there was a sync method of some kind I could run that would check the current state of the lights and re-issue commands if necessary. The real problem for me is this indeterminate state, where HA does not match reality, and I have no way to sync them back up (and subsequent calls to light.turn_on etc. do not behave as expected).

dermotduffy commented 2 years ago

I've actually been experiencing this bug for about a year (since when I first installed the component). I only had 1 Hyperion instance until now so it wasn't terribly problematic as long as the device stayed on, but now with 5 instances and power outages...

Aha! Ok. Not the version then.

The toggle correctly turned off but the light stayed on. However about a minute later the light turned off.

This really feels like it could be a Hyperion issue rather than an integration one, unless you see a subsequent request to the server a minute later. Does this work when you use the "Remote Control" interface on the Hyperion server itself?

Looking at this log snippet the key thing missing is no priorities callback from the server (as predicted earlier in this bug).

Lets entirely take the integration out of the picture. Try this:

Open two separate terminals (In the below, replace $HYPERION with your actual hyperion host)
In one terminal run this and let it sit continually awaiting data from the server:

echo '{"tan": 1, "command": "serverinfo", "subscribe": ["adjustment-update", "components-update", "effects-update", "leds-update", "imageToLedMapping-update", "instance-update", "priorities-update", "sessions-update", "videomode-update"]}' | nc $HYPERION 19444

You'll get a large response like this:

{"command":"serverinfo","info":{"activeEffects":[],"activeLedColor":[],"adjust... [... many lines of output]}

... leave this terminal untouched to see updates from the server.
In a second terminal run this to clear priority 128:

echo '{"command": "clear", "priority": 128}' | nc $HYPERION 19444

I see the following:

Terminal 1:

The terminal awaiting the state update:

{"command":"priorities-update","data":{"priorities":[{"active":true,"componentId":"V4L","origin":"System","owner":"V4L2","priority":240,"visible":true}],"priorities_autoselect":true}}

Terminal 2:

The terminal where you sent the priority command:

{"command":"clear","success":true,"tan":0}

... my guess is you will see nothing on Terminal 1 (and that is a Hyperion server bug IMO, that we should report over there).

zaneclaes commented 2 years ago

You'll get a large response like this

I never even get the response. The top terminal (faded out because not in focus) shows the Hyperion RPi instance (SSH), and the bottom the command. Granted, nc isn't a common tool in my toolbox (and I use Mac+ZSH), so perhaps there's some bashy syntax at fault.

dermotduffy commented 2 years ago

Looks like nc on OSX is behaving differently than on Linux here. Try this instead for the first command (works for me on OSX):

(echo '{"tan": 1, "command": "serverinfo", "subscribe": ["adjustment-update", "components-update", "effects-update", "leds-update", "imageToLedMapping-update", "instance-update", "priorities-update", "sessions-update", "videomode-update"]}' && sleep 9999) | nc $HYPERION 19444

It should send the command, show you the response and hang around for 9999 seconds waiting for more data (more than enough time for you to run the other command in terminal 2).

zaneclaes commented 2 years ago

Thanks for helping untangle that :)

.. my guess is you will see nothing on Terminal 1 (and that is a Hyperion server bug IMO, that we should report over there).

I do see the following response:

{"command":"priorities-update","data":{"priorities":[],"priorities_autoselect":true}}

And watching the Hyperion UI, I see the transition from HA to the empty state:

But then I toggle several times with Home Assistant...

... and end in a mismatched state.

Looking at the terminal, I see the following logs:

{"command":"priorities-update","data":{"priorities":[{"active":true,"componentId":"EFFECT","origin":"Home Assistant@::ffff:10.0.0.3","owner":"Rainbow swirl fast","priority":128,"visible":true}],"priorities_autoselect":true}}
{"command":"components-update","data":{"enabled":false,"name":"LEDDEVICE"}}
{"command":"components-update","data":{"enabled":true,"name":"LEDDEVICE"}}
{"command":"components-update","data":{"enabled":false,"name":"LEDDEVICE"}}
{"command":"components-update","data":{"enabled":true,"name":"LEDDEVICE"}}
{"command":"components-update","data":{"enabled":false,"name":"LEDDEVICE"}}

I note that the logs end with enabled: false... so I suppose we must ultimately still conclude the API is at fault...

FWIW, I have noticed serious lag and/or system reboots at other times from Hyperion. It's surprising — the setup seems straightforward (RPi4 headless with the linux service running, connected to an Arduino Uno running Adalight). I suppose the 300 LED count per instance is a bit high. Still, the RPi isn't driving much else...

zaneclaes commented 2 years ago

Ah. I've figured out at least some more symptoms of this. When turning the LEDs back on via the toggle, in the case that it fails, the effect is set to Solid and white. However, it may not actually illuminate. The Hyperion UI shows the solid white effect as well, but no the lights remain unlit.

In the other failure case, the light remains on despite the toggle being off. I do see the enabled:false in the Hyperion connection, and the Hyperion UI shows the LEDs as off (though they remain lit). Curiously, if there was an effect active, the effect stalls until I turn the light back on (at which point the colors resume).

dermotduffy commented 2 years ago

Ah. I've figured out at least some more symptoms of this. When turning the LEDs back on via the toggle, in the case that it fails, the effect is set to Solid and white. However, it may not actually illuminate. The Hyperion UI shows the solid white effect as well, but no the lights remain unlit.

OK! Progress. This is obviously broken behavior that could explain the desync issues. Can you find a way to reproduce it?

I note that the logs end with enabled: false... so I suppose we must ultimately still conclude the API is at fault...

Do you mean you toggle in HA repeatedly and stop the toggling in the 'on' state, but the last API return is 'false'? If yes, then that is either the integration has failed to send the right 'on' commands (you should be able to see in the logs), or the server has failed to reply appropriately with the 'off' update.

In the other failure case, the light remains on despite the toggle being off. I do see the enabled:false in the Hyperion connection, and the Hyperion UI shows the LEDs as off (though they remain lit). Curiously, if there was an effect active, the effect stalls until I turn the light back on (at which point the colors resume).

Huh, weird. But yes, definitely a Hyperion server bug ...