home-assistant / core

Risco alarm system integration aggressive memory leak #109164

Closed: bbs2web closed this issue 9 months ago

bbs2web commented 9 months ago

The problem

I found a related bug report, which perfectly matches my symptoms, where the report was unfortunately closed due to it relating to an unusual integration.

https://github.com/home-assistant/core/issues/91921

Our home is situated on an elevated ridge and, although we have proper lightning protection, strikes that hit the earthing conductors still induce surges in copper wires longer than 10 metres within 200 metres of the strike. The alarm is a Risco LightSYS II and is now completely wireless (passives, contacts and internet). We've had two occasions where the alarm panel stopped answering connection requests on port 1000 until I restarted it (by disconnecting both the battery and power and restoring the connections thereafter).

When this happens the HASS VM chews through 8GB of memory within half an hour and the OOM (out of memory) task killer steps in, further breaking automations and integrations.
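To correlate the memory growth with the moment the panel stops answering, it can help to record available memory at regular intervals. The following is a minimal Python sketch and is not part of Home Assistant or the Risco integration. It assumes a Linux shell with Python 3 on the affected machine, and `parse_meminfo` and `log_available_memory` are hypothetical helper names introduced here.

```python
import time
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: number} (kB for most fields)."""
    fields: dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def log_available_memory(interval_s: float = 60.0, samples: int = 30) -> None:
    """Print a timestamped MemAvailable reading every interval_s seconds."""
    for _ in range(samples):
        info = parse_meminfo(Path("/proc/meminfo").read_text())
        stamp = time.strftime("%H:%M:%S")
        print(f"{stamp} MemAvailable={info.get('MemAvailable', 0) // 1024} MiB")
        time.sleep(interval_s)
```

Calling `log_available_memory()` with the defaults samples once a minute for half an hour, which roughly matches the window described above. The timestamps can then be lined up with the Home Assistant logs.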

Please may I ask for a reconsideration? I really enjoy having lights turn on & off automatically at dusk & dawn when there is/isn't movement in parts of the home.

PS: The old panel only had a wired ethernet port and was fried by a surge delivered together with the IP packets. I'm trying to engage with Risco regarding the bug where the Configuration Software interface stops responding, but it's really frustrating that everything stops working when HA can't talk to the Risco panel via the local integration.

What version of Home Assistant Core has the issue?

core-2024.1.5

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

Risco

Link to integration documentation on our website

https://www.home-assistant.io/integrations/risco/

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

I'm a relative HA newbie, but comfortable with SSH and Linux. I can gladly dissect logs but would appreciate some pointers on where to look.

Additional information

No response

home-assistant[bot] commented 9 months ago

Hey there @onfreund, mind taking a look at this issue as it has been labeled with an integration (risco) you are listed as a code owner for? Thanks!

codyc1515 commented 9 months ago

Are you able to simulate this (maybe by disconnecting the network on the device) and post logs? Without logs, this is a very specific scenario and on an uncommon integration, as you said.

OnFreund commented 9 months ago

First of all, I'd like to clarify that https://github.com/home-assistant/core/issues/91921 was not closed due to an "unusual integration" or an "uncommon" one. It was closed because there was another component between the integration and the panel, and that component seemed to have caused the problem. This component was in the user's control. Had that tunnel accurately reflected the panel's behavior, there would be no issue.

As for this one, I'm a bit confused by the description, so a few questions:

  1. Are you using the local variant of the integration?
  2. How can the problem be reproduced?
  3. Where are the logs?
  4. Are copper wires and surges related to the problem?

We'll need to get this reproduced and see the panel behavior in order to investigate.

bbs2web commented 9 months ago

Hi,

PS: Thanks for the clarification, my understanding of the query was not that the tunnel/firewalls were blocking the connection. Completely understand your view.

I'm away from home at the moment but should be able to reproduce the issue by changing the CS (configuration software) port number on the panel to simulate the scenario where the panel stops responding to connections.

I'm comfortable with Linux but not familiar with the log location. Please allow me a day's grace to get this done...

Answers in the meantime:

  1. Running with the local integration, LightSYS II accepts connections on TCP port 1000 whilst remaining connected to the cloud (using the Risco app as a backup to home assistant).
  2. Pretty certain, not by simply disconnecting the alarm from the network though. By this I mean that the memory leak only appears to occur when the configured local integration IP is reachable and connections to TCP port 1000 are rejected. Home assistant is in the same subnet as the panel and there is no filtering between the two devices.
  3. Would appreciate any guidance or pointer to documentation that possibly details which file I could tail. Will use my Google-fu though...
  4. No, just trying to explain that the environment makes it very unattractive to use fixed copper wiring instead of WiFi. Home Assistant runs as a VM on a 1 litre PC, running Proxmox, where it's bridged into the same VLAN as the wireless SSID the panel connects to.

OnFreund commented 9 months ago

> I'm comfortable with Linux but not familiar with the log location
>
> Would appreciate any guidance or pointer to documentation that possibly details which file I could tail.

No need for any Linux foo - Home Assistant has a Logs screen.

> By this I mean that the memory leak only appears to occur when the configured local integration IP is reachable and connections to TCP port 1000 are rejected.

I tried testing rejected connections when trying to reproduce https://github.com/home-assistant/core/issues/91921 and it didn't reproduce the problem. The main question is how to reproduce this, and how likely it is to happen naturally. Next time this happens, you can try using pyrisco directly and see what error you're getting when trying to connect.
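If pyrisco is not at hand, a plain TCP probe can at least show how the connection attempt itself fails. The sketch below uses only the Python standard library and does not call pyrisco, so it says nothing about the Risco protocol. `probe` is a hypothetical helper name, and the host argument has to be replaced with the panel's address (port 1000 is the one mentioned in this thread).

```python
import asyncio
import errno


async def probe(host: str, port: int, timeout_s: float = 5.0) -> str:
    """Attempt one TCP connection and describe the outcome as text."""
    try:
        reader, writer = await asyncio.wait_for(
            asyncio.open_connection(host, port), timeout_s
        )
    except asyncio.TimeoutError:
        # Nothing came back at all: typical of silently dropped packets.
        return "timeout: no reply within the deadline"
    except OSError as err:
        # Refused, unreachable, and similar failures arrive as OSError subclasses.
        code = errno.errorcode.get(err.errno, "unknown") if err.errno else "unknown"
        return f"{type(err).__name__} ({code}): {err}"
    writer.close()
    await writer.wait_closed()
    return "connected"
```

For example, `print(asyncio.run(probe("<panel-ip>", 1000)))` prints either `connected` or the exception class and errno name. That helps tell a refused connection apart from a timeout before digging into the integration's logs.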

bbs2web commented 9 months ago

Thank you for your time on this, the integration is really perfect in every regard!

I'm fortunately not able to reproduce the issue. If it does recur, I'll take packet captures so that I can try to reproduce it at will.

So changing the port from 1000 to something else results in the Risco integration continuing to work uninterrupted. I wasted a bit of time trying to get packets forwarded by the bridge sent to the CPU, so that I could selectively reject only connections on TCP port 1000, but eventually resorted to changing the DHCP reservation for the alarm and kicking the WiFi client out of the registration table. When I then assigned the alarm system's usual IP to the router and set up a simple firewall rule to reject connections on the port, the memory leak and CPU load increase didn't occur. I tried rejecting the connection with the default ICMP network unreachable response, with ICMP port unreachable responses, and by resetting the TCP connection.

PS: The two endpoints are in the same broadcast domain, so they normally don't send their packets via any gateway.

For what it's worth, and for anyone else stumbling into this:

HAOS runs as a VM on Proxmox with Open vSwitch (OvS), so I could subsequently obtain visibility of the traffic by using tcpdump on the VM's virtual NIC port. In the captures below, HAOS is 10.239.240.100 and the Risco alarm panel is 10.239.240.254.

When I moved the alarm system's IP to a router and set up a standard reject rule with the default ICMP response (network unreachable):

22:38:57.683306 IP 10.239.240.100.47596 > 10.239.240.254.1000: Flags [S], seq 890650932, win 64240, options [mss 1460,sackOK,TS val 2011065383 ecr 0,nop,wscale 7], length 0
22:38:57.683596 IP 10.239.240.254 > 10.239.240.100: ICMP net 10.239.240.254 unreachable, length 68
22:39:03.040301 ARP, Request who-has 10.239.240.254 tell 10.239.240.100, length 28
22:39:03.040876 ARP, Reply 10.239.240.254 is-at 78:9a:de:ad:be:ef, length 42

When I set the reject rule to generate an ICMP port unreachable response:

22:39:42.760440 IP 10.239.240.100.41394 > 10.239.240.254.1000: Flags [S], seq 1324388464, win 64240, options [mss 1460,sackOK,TS val 2011110461 ecr 0,nop,wscale 7], length 0
22:39:42.760886 IP 10.239.240.254 > 10.239.240.100: ICMP 10.239.240.254 tcp port 1000 unreachable, length 68
22:39:48.096283 ARP, Request who-has 10.239.240.254 tell 10.239.240.100, length 28
22:39:48.096911 ARP, Reply 10.239.240.254 is-at 78:9a:de:ad:be:ef, length 42

When I set the reject rule to reset the TCP session:

22:48:15.243033 IP 10.239.240.100.54862 > 10.239.240.254.1000: Flags [S], seq 779133466, win 64240, options [mss 1460,sackOK,TS val 2011622943 ecr 0,nop,wscale 7], length 0
22:48:15.243353 IP 10.239.240.254.1000 > 10.239.240.100.54862: Flags [R.], seq 0, ack 779133467, win 0, length 0
22:48:20.608550 ARP, Request who-has 10.239.240.254 tell 10.239.240.100, length 28
22:48:20.609105 ARP, Reply 10.239.240.254 is-at 78:9a:de:ad:be:ef, length 42
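For anyone repeating these tests from the client side, the three rejection styles above would usually surface in Python as `OSError` with a particular errno. The mapping below is a rough sketch based on my understanding of typical Linux behaviour and is an assumption, not something verified against Home Assistant or pyrisco. `EXPECTED_ERRNO` and `rejection_styles` are hypothetical names.

```python
import errno

# Assumed Linux behaviour: the errno a connecting socket usually reports
# for each rejection style seen in the captures above.
EXPECTED_ERRNO = {
    "icmp net unreachable": errno.ENETUNREACH,
    "icmp port unreachable": errno.ECONNREFUSED,
    "tcp reset": errno.ECONNREFUSED,
}


def rejection_styles(err: OSError) -> list[str]:
    """Return the rejection styles consistent with a connect() error."""
    return sorted(
        style for style, code in EXPECTED_ERRNO.items() if code == err.errno
    )
```

Under this assumption a TCP reset and an ICMP port unreachable response look the same to the client (both `ECONNREFUSED`), so only two distinct failure paths would have been exercised by the three firewall rules.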

PS: My previous panel didn't accept local connections whilst being connected to the cloud. The lag on the cloud calls was annoying and most motion sensor events were subsequently not seen by Home Assistant. The silver lining of our most recent lightning strike has definitely been the newer panel which now allows me to use the local integration with instant updates.

OnFreund commented 9 months ago

Thanks for the update!