jimpastos / ha-modernforms

ModernForms Smart Fan Integration for HomeAssistant

[feature request] Handle Dying Fans #15

Open icemarkom opened 3 years ago

icemarkom commented 3 years ago

This is more of a thinking aloud than a well-formed feature request.

Since these fans die when polled "excessively", detecting, and perhaps handling, these situations may be something to add to the integration.

Initial thoughts are to allow for a configuration of "periodic reboot", i.e. reboot the fan after every N polls, where N is some reasonably large number (for example every 100 polls by default).

In addition to the above, blindly polling and expecting a response is not the best approach. Adding logging and state tracking for the status of the fan would be better, i.e. keep state on whether a fan is alive, and simply don't poll it if it's not responding. Also, log when fans stop responding; this may help determine a good number of polls between reboots :-).

Working around 3rd party memory leaks is hard.
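A rough sketch of what the liveness tracking and reboot counting above could look like (all names here are hypothetical and not part of the integration today):

    import logging

    _LOGGER = logging.getLogger(__name__)

    class FanHealth:
        """Track whether a fan is responding and count polls toward a periodic reboot."""

        def __init__(self, reboot_every_n_polls: int = 100):
            self.alive = True
            self.polls_since_reboot = 0
            self.reboot_every_n_polls = reboot_every_n_polls

        def record_success(self) -> None:
            if not self.alive:
                _LOGGER.info("Fan is responding again")
            self.alive = True
            self.polls_since_reboot += 1

        def record_failure(self) -> None:
            if self.alive:
                _LOGGER.warning("Fan stopped responding")
            self.alive = False

        def should_poll(self) -> bool:
            # Skip polls while the fan is marked dead; some other recovery path
            # (or a manual reboot) would flip it back via record_success().
            return self.alive

        def due_for_reboot(self) -> bool:
            return self.polls_since_reboot >= self.reboot_every_n_polls

        def note_reboot(self) -> None:
            self.polls_since_reboot = 0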

faceprint commented 3 years ago

...continuing conversation from #16 ...

10 second interval did not go well. I dug further, and realized that the default keepalive_expiry for httpx is 5 seconds, which explains why dropping the interval that low had the impact it did. For another test, I've commented out the set_session() call, and am building my own httpx.AsyncClient() with an override to set keepalive_expiry=30 with a 10 second interval.
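For reference, building that client would look roughly like this in a recent httpx version (a sketch only; how it gets wired into the integration is omitted):

    import httpx

    # keepalive_expiry defaults to 5 seconds; raising it above the poll interval
    # lets the pooled connection survive between polls.
    client = httpx.AsyncClient(limits=httpx.Limits(keepalive_expiry=30.0))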

icemarkom commented 3 years ago

Now that we've moved to HTTPX, I have observed that the current implementation really doesn't like persistent connections. I.e. HTTPX opens a persistent connection by default, and it stays in place until the next command (either a probe, or another operation) is executed; at that point the session is torn down and a new one established.

This is something we should focus on figuring out. Adding the following code to https://github.com/jimpastos/ha-modernforms/blob/af32f8f487be2866f0fcc0f9cfc8a97993ab1a0f/custom_components/modernforms/__init__.py#L140

        headers={"Connection": "close"})

replicates the "old" behavior, i.e. each session is destroyed immediately after the operation.

I have had this running for a bit now, and I see the expected result:

tcp        0      0 192.168.xx.2:54276     192.168.yy.151:80       TIME_WAIT
tcp        0      0 192.168.xx.2:54302     192.168.yy.151:80       TIME_WAIT
tcp        0      0 192.168.xx.2:54232     192.168.yy.151:80       TIME_WAIT
tcp        0      0 192.168.xx.2:53980     192.168.yy.151:80       TIME_WAIT
tcp        0      0 192.168.xx.2:54018     192.168.yy.151:80       TIME_WAIT
tcp        0      0 192.168.xx.2:54088     192.168.yy.151:80       TIME_WAIT
tcp        0      0 192.168.xx.2:54056     192.168.yy.151:80       TIME_WAIT
tcp        0      0 192.168.xx.2:54204     192.168.yy.151:80       TIME_WAIT
tcp        0      0 192.168.xx.2:53938     192.168.yy.151:80       TIME_WAIT
tcp        0      0 192.168.xx.2:54344     192.168.yy.151:80       TIME_WAIT
tcp        0      0 192.168.xx.2:54130     192.168.yy.151:80       TIME_WAIT
tcp        0      0 192.168.xx.2:54154     192.168.yy.151:80       TIME_WAIT

Several options here, in my opinion:

  1. Figure out how to shorten this TIME_WAIT timeout (which may be prohibitively non-scalable).
  2. Figure out whether we can implement a persistent session that isn't recreated every time a new command comes in.

My preferred option would be (2), but that may not be possible if the fan side doesn't support it.
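For what it's worth, a minimal sketch of option (2), i.e. one client created once and reused for every probe and command, closed only on unload, might look like this (the class and method names are illustrative, not the integration's actual code, and the /mf endpoint is an assumption):

    import httpx

    class FanClient:
        """Hypothetical wrapper holding one persistent httpx client per fan."""

        def __init__(self, host: str):
            self.url = f"http://{host}/mf"  # assumed device endpoint
            self.client = httpx.AsyncClient(limits=httpx.Limits(keepalive_expiry=30.0))

        async def send(self, payload: dict) -> dict:
            # Every probe/command goes through the same client, so the pooled
            # connection is reused instead of being torn down each time.
            resp = await self.client.post(self.url, json=payload)
            resp.raise_for_status()
            return resp.json()

        async def close(self) -> None:
            await self.client.aclose()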

icemarkom commented 3 years ago

> 10 second interval did not go well. I dug further, and realized that the default keepalive_expiry for httpx is 5 seconds, which explains why dropping the interval that low had the impact it did. For another test, I've commented out the set_session() call, and am building my own httpx.AsyncClient() with an override to set keepalive_expiry=30 with a 10 second interval.

Correct, and as a reference: https://www.python-httpx.org/advanced/#timeout-configuration

Now, just adjusting the timeouts may not do the trick (I spent some time today playing around with this, as I had similar thoughts) - see my previous comment about existing sessions being torn down when a new action is taken. I.e. for some reason, either our code or the fan doesn't like reusing the established connection.
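As a concrete example of what adjusting those knobs looks like (values are arbitrary placeholders; note that keepalive_expiry itself lives on httpx.Limits, not httpx.Timeout):

    import httpx

    # Connect/read/write/pool timeouts per the httpx docs linked above.
    timeout = httpx.Timeout(5.0, read=10.0, pool=5.0)
    client = httpx.AsyncClient(
        timeout=timeout,
        limits=httpx.Limits(keepalive_expiry=30.0),
    )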

faceprint commented 3 years ago

Interesting, I'm not seeing that behavior. I've had a single connection to the fan since I started this test, nothing torn down as best I can tell. Took a look at tcpdump, and I don't see anything suspicious there. I'll let this run the rest of the weekend and see if it maintains this behavior.

icemarkom commented 3 years ago

That's good news. I'd love to try the same, as my attempted solution was slightly different. Do you have a commit somewhere so I can try it out as well?

faceprint commented 3 years ago

I just pushed what I'm running to https://github.com/faceprint/ha-modernforms for you to try out

icemarkom commented 3 years ago

Perfect. I just started running this, and I can definitely see only one session per fan, and that seems to be staying on, i.e. there are no dead sessions all over the place as before.

That change you made, though, is in fact two changes. One is the timeout change, and the other is the async client creation.

I am going to disable my reboot automation, and let this run for a while. Even if this doesn't address the issue of fans locking up, it appears to be a lot cleaner approach than what we had until now, so - nice work there!

faceprint commented 3 years ago

I'm seeing occasional ReadTimeout exceptions, and when I do, it ends up opening another connection, while maintaining the old one. Over the course of 20ish hours I got up to 30 established connections.

Maybe stupid, but as a trial, I'm re-implementing with aiohttp in place of httpx to see if there's a different behavior there (aiohttp seems to be used more in HA, I just happened to pick an example that used httpx when I started). I'll report back with any behavior change.
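For context, the aiohttp flavor of the polling call would look roughly like this (a sketch only; the endpoint path and payload are assumptions, and the actual fork may structure things differently):

    import asyncio

    import aiohttp

    async def poll_fan(session: aiohttp.ClientSession, host: str) -> dict:
        # The real integration posts its own JSON payload; "/mf" and the body
        # here are placeholders.
        async with session.post(f"http://{host}/mf", json={"queryDynamicShadowData": 1}) as resp:
            resp.raise_for_status()
            return await resp.json()

    async def poll_loop(host: str, interval: float = 10.0) -> None:
        # One long-lived ClientSession, reused for every poll.
        async with aiohttp.ClientSession() as session:
            while True:
                print(await poll_fan(session, host))
                await asyncio.sleep(interval)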

icemarkom commented 3 years ago

On my end, I've been running this since the morning. The same two TCP sessions have remained ESTABLISHED at 11:24, 11:54, and 13:58, i.e. rock solid for at least a few hours.

icemarkom commented 3 years ago

> Maybe stupid, but as a trial, I'm re-implementing with aiohttp in place of httpx to see if there's a different behavior there (aiohttp seems to be used more in HA, I just happened to pick an example that used httpx when I started). I'll report back with any behavior change.

I suggest we track this as a separate issue/feature request. I don't have strong opinions, but it appears to me that moving from many short-lived connections to more persistent ones is the right direction for addressing dying fans. How we're going to do that, and support it in the long run, is something to be addressed.

I've noticed aiohttp elsewhere too, so it may be OK to do. I have no strong opinions, either way :-).

icemarkom commented 3 years ago

Unsurprisingly, one of the fans quickly deteriorated and died. So, I think this alone does not solve our problem. Let's see if aiohttp makes any difference, but I am having strong doubts about this being in any way, shape, or form a client-side solvable issue.

faceprint commented 3 years ago

I've pushed to my faceprint fork the switch to aiohttp (#18), which has been solid from a connectivity standpoint. I've seen some hiccups from the fan, but there has been some "interference" so I don't have a clean test run I'm happy with yet. Would be interested in your experience if you feel up for trying it. If you have success I can create a pull request.

icemarkom commented 3 years ago

For sure I will! Sadly, I won't be at home over the weekend, but I will switch to that version shortly.

icemarkom commented 3 years ago

So, after a few days, I have weirdness to report back... :-)

TL;DR: Fans are still dying. A lot less frequently, but death is still a thing. Connections seem to be rock solid with the latest aiohttp based solution, so I think we can go with that... understanding that we have not solved the problem of dying fans.

I have just re-enabled reboot automation on my end, because "unhappy wife == unhappy life", and solving that trumps troubleshooting software and hardware :-).

I do believe that adding reboots every N polls will sweep this problem under the carpet and make it effectively invisible. Anyone willing to take a stab at that? I suggest rebooting every N poll intervals, where N * poll_interval < 4 hours.
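For a concrete bound (assuming, purely as an example, a 10-second poll interval):

    # Pick N so that N * poll_interval stays under 4 hours.
    MAX_UPTIME_SECONDS = 4 * 60 * 60

    def max_polls_between_reboots(poll_interval_seconds: float) -> int:
        return int(MAX_UPTIME_SECONDS // poll_interval_seconds)

    # With a 10-second poll interval this allows at most 1440 polls between reboots,
    # so rebooting every ~1000 polls would comfortably satisfy N * poll_interval < 4 hours.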

faceprint commented 3 years ago

Agree wholeheartedly about the "WAF" but unfortunately, reboot == light blink, so it's not a cure-all. I pushed another change that resets the connection (not the fan) every 100 messages (so about 16 minutes). I'm in the middle of testing it now.
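Presumably something along these lines (a guess at the approach, not the actual commit; the helper name is made up):

    from typing import Optional

    import aiohttp

    class RecyclingSession:
        """Hypothetical helper that drops and recreates the aiohttp session every N requests."""

        def __init__(self, reset_every: int = 100):
            self.reset_every = reset_every
            self.count = 0
            self.session: Optional[aiohttp.ClientSession] = None

        async def get(self) -> aiohttp.ClientSession:
            # Close the old session and open a fresh one once the counter hits the limit;
            # this resets the connection, not the fan.
            if self.session is None or self.count >= self.reset_every:
                if self.session is not None:
                    await self.session.close()
                self.session = aiohttp.ClientSession()
                self.count = 0
            self.count += 1
            return self.session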

Has anyone actually reached out to Modern Forms about this? Do they care?

barlock commented 3 years ago

I was about to take another stab at fixing this when I found that there's now an "official" integration. https://www.home-assistant.io/integrations/modern_forms/

I started reading through their code to try and figure out how/if they solved this issue. I wasn't able to find anything.

I did notice that they don't have a reboot service, and their scan interval is 5 seconds, which seems very fast to me. I'll report back if it seems to work better; recently my fan has been dying at least every 24 hours despite a reboot automation.

djj211 commented 3 years ago

I was living with this problem for a few months. I did, however, recently upgrade the firmware on my MF device (to 2.00.0023) and since then haven't had any issues. It's been about 2 weeks without issue. Will likely switch over to the official integration now as well. Hopefully, all is well now...this was frustrating and seemed like an MF issue. The code looked fine as far as I could see.

barlock commented 3 years ago

Amazing! After multiple firmware upgrades I wasn't sure if that was the fix. I'm glad they finally fixed the underlying issue (🤞). Clearly my bad for making two changes at once.

Just to confirm, though: has the "official" one also been working with no issues for two weeks?

djj211 commented 3 years ago

I'm still using this custom component, so that has been working for 2 weeks since the firmware upgrade.

I have not switched to the official integration yet. I'll probably make the move some time this week, if I get the time.