Possible async issues in pyvera exposed by home assistant

pavoni commented 4 years ago

@vangorra I know you've been looking at this for a while.

My recent experience of going from a working set-up to one that sometimes go wrong when I upgraded HA but left Vera unchanged - suggests it's not a Vera problem - but a pyvera / HA issue

With all the dev work and expertise on HA - I expect async race conditions in HA would get sorted (I suspect the device _2 issue was probably like this something in HA) - which makes me think the most likely issue is something unsafe in pyvera - which is now being exposed by the HA async environment.

As an example I'm suspicious that the following code is unsafe:

https://github.com/pavoni/pyvera/blob/master/pyvera/__init__.py#L791-L792

    def set_switch_state(self, state: int) -> None:
        """Set the switch state, also update local state."""
        self.set_service_value(self.switch_service, "Target", "newTargetValue", state)
        self.set_cache_value("Status", state)

It initiates a vera action - and then sets the internal state of the pyvera switch.

In the new async world the vera action could set the device - and then receive back an async update before the next line of code setting the state runs - overwriting the real vera state with the intended state. This has always been doubtful (the subscription code has always ben multithreaded) - but a more highly async environment would make errors more frequent.

So I think it would be safer to reverse the two lines of code.

There are quite a few examples of this (my pattern seems to have been initiate the action - then change the local state!).

There may also be other similar issues.

With all your investigation do you have a handle on the circumstances where state information is being lost? Is it mainly switches and other active devices - or is it also sensors (I think i've only seen it on switches so far).

vangorra commented 4 years ago

TLDR; Fixed here: https://github.com/home-assistant/core/pull/35703

I noticed a few tickets for HASS which either suggested that the vera hub occasionally returned consistently inaccurate incremental data and/or a race condition with threading was encountered.

I worked around this by giving the HASS vera integration it's own implementation of SubscriptionRegistry. This implementation pulls all statuses from the hub and uses HASS event loops to collect and deliver the data. Reports from test users confirm this PR fixes their issues.

I debated with myself about whether to continue using incremental retrieval of statuses or pulling them all. After some testing locally (on my 2 hubs), I learned that returning all statuses has no significant performance impact on the hub. So it's reliability was the preferred outcome.

pavoni commented 4 years ago

Thanks.

Am I right that the new code polls vera every second via the HA event loop?

I guess there would be two concerns :-

Vera Hub performance which it sounds like you've tested (I do recall having some issues with my Vera Edge where I was first trying this - although it was via the get incremental change call which might be more expensive),
Responsiveness to switch changes - I choose the 200ms for the SUBSCRIPTION_MIN_WAIT after some experimenting to try to get good responsiveness to switches (I have a Vera switch that triggers switching on and off lights via HA). It was never as fast as I would have liked - but less that 200ms seemed to stress vera without helping. How have you found the responsiveness? (Of course right is always better than fast!)

vangorra commented 4 years ago

Yes, the poll interval is set every second. I tried running parallel requests every 100ms and ran into zero issues.

My instances have always been very responsive even with this recent HASS code change. I suspect there will be some unexpected issues that come up but we at least remove the race conditions.

Regarding switch responsiveness. I suspect the responsiveness will be the same or better. After all, the VeraSwitch implementation in HA refreshes the status after sending the change.

pavoni commented 4 years ago

Great thanks. Will look forward to seeing it merged.

May also swap round some of the unsafe code anyway - even though it won't matter with your new approach.

pavoni commented 4 years ago

Just saw the review comments - I guess with a 5s refresh it may be worth trying to combine the methods. 5s certainly wouldn't work for my use case - and perhaps for other users too.

You could still do a 5s refresh - but also trigger your update code in a return from the vera poll/wait call...

vangorra commented 4 years ago

I'm confused. The HASS change uses a 1 second refresh interval in a loop. When entities like a switch are activated through HASS, then they activate a force a refresh too. I'm not aware of a 5s refresh anywhere.

pavoni commented 4 years ago

There is a requested change on your HA pr saying that the minimum poll interval they allow us is 5s, do they won’t allow 1s.

vangorra commented 4 years ago

I somehow missed that feedback. Working on it.

vangorra commented 4 years ago

I'm to see you can reproduce the issue. I've been unable to. At any rate, I've updated the HASS vera component to use long polling. Hopefully this will be the best fix.

pavoni commented 4 years ago

All working perfectly now - closing!

maximvelichko / pyvera

Possible async issues in pyvera exposed by home assistant #134