Closed bbs2web closed 9 months ago
Hey there @onfreund, mind taking a look at this issue as it has been labeled with an integration (risco) you are listed as a code owner for? Thanks!
(message by CodeOwnersMention)
risco documentation risco source (message by IssueLinks)
Are you able to simulate this (maybe by disconnecting the network on the device) and post logs? Without logs, this is a very specific scenario and on an uncommon integration, as you said.
First of all, I'd like to clarify that https://github.com/home-assistant/core/issues/91921 was not closed due to an "unusual integration" or an "uncommon" one. It was closed because there was another component between the integration and the panel, and that component seemed to have caused the problem. This component was in the user's control. Had that tunnel accurately reflected the panel's behavior, there would be no issue.
As for this one, I'm a bit confused by the description, so a few questions:
We'll need to get this reproduced and see the panel behavior in order to investigate.
Hi,
PS: Thanks for the clarification, my understanding of the query was not that the tunnel/firewalls were blocking the connection. Completely understand your view.
I'm away from home at the moment but should be able to reproduce the issue by changing the CS (configuration software) port number on the panel to simulate the scenario where the panel stops responding to connections.
I'm comfortable with Linux but not familiar with the log location. Please allow me a day's grace to get this done...
Answers in the meantime:
I'm comfortable with Linux but not familiar with the log location
Would appreciate any guidance or pointer to documentation that possibly details which file I could tail.
No need for any Linux foo - Home Assistant has a Logs screen.
By this I mean that the memory leak only appears to occur when the configured local integration IP is reachable and connections to TCP port 1000 are rejected.
I tried testing rejected connections when trying to reproduce https://github.com/home-assistant/core/issues/91921 - it didn't reproduce the problem. The main question is how to reproduce this, and how likely it is to happen naturally. Next time this happens, you can try using pyrisco directly and see what error you're getting when trying to connect.
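For anyone wanting a quick first check before involving pyrisco, here is a minimal sketch that makes a single TCP connection attempt and reports how it ended. It uses only Python's standard library and does not speak the Risco protocol; `probe_tcp` is a hypothetical helper name, and the default port of 1000 is simply the one mentioned in this thread.

```python
import errno
import socket


def probe_tcp(host: str, port: int = 1000, timeout: float = 5.0) -> str:
    """Attempt one TCP connection and describe how it ended.

    Hypothetical helper for illustration only; it is not part of pyrisco
    or Home Assistant.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        # Nothing answered the SYN within the timeout.
        return "timeout"
    except ConnectionRefusedError:
        # The connection attempt was actively rejected.
        return "refused"
    except OSError as exc:
        # Any other OS-level failure: report the symbolic errno name.
        name = errno.errorcode.get(exc.errno, "unknown")
        return f"error: {name} ({exc})"
```

Calling `probe_tcp("10.239.240.254")` from a machine on the same network as the panel (substitute your own panel's address) would show whether the panel accepts, rejects, or silently ignores the connection at the moment the problem occurs.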
Thank you for your time on this, the integration is really perfect in every regard!
I'm fortunately not able to reproduce the issue. I'll take packet captures if it happens again, though, so that I can try to reproduce it at will.
So changing the port from 1000 to something else results in the Risco integration continuing to work uninterrupted. I wasted a bit of time trying to get packets forwarded by the bridge sent to the CPU, so that I could selectively reject only connections on TCP port 1000, but eventually resorted to changing the DHCP reservation for the alarm and kicking the WiFi client out of the registration table. When I then assigned the alarm system's usual IP to the router and set up a simple firewall rule to reject connections on the port, the memory leak and CPU load increase didn't occur. I tried rejecting the connection with the default ICMP network unreachable response, with ICMP port unreachable responses, and by resetting the TCP connection.
PS: The two endpoints are in the same broadcast domain, so they normally don't send their packets via any gateway.
For what it's worth, and for anyone else stumbling onto this:
HAOS runs as a VM on Proxmox with Open vSwitch (OvS). I could subsequently obtain visibility of the traffic by using tcpdump on the VM's virtual NIC port. In the captures below, HAOS is 10.239.240.100 and the Risco alarm panel is 10.239.240.254.
When I moved the alarm system's IP to a router and set up a standard reject rule with the default ICMP response (network unreachable):
22:38:57.683306 IP 10.239.240.100.47596 > 10.239.240.254.1000: Flags [S], seq 890650932, win 64240, options [mss 1460,sackOK,TS val 2011065383 ecr 0,nop,wscale 7], length 0
22:38:57.683596 IP 10.239.240.254 > 10.239.240.100: ICMP net 10.239.240.254 unreachable, length 68
22:39:03.040301 ARP, Request who-has 10.239.240.254 tell 10.239.240.100, length 28
22:39:03.040876 ARP, Reply 10.239.240.254 is-at 78:9a:de:ad:be:ef, length 42
When I set the reject rule to generate an ICMP port unreachable response:
22:39:42.760440 IP 10.239.240.100.41394 > 10.239.240.254.1000: Flags [S], seq 1324388464, win 64240, options [mss 1460,sackOK,TS val 2011110461 ecr 0,nop,wscale 7], length 0
22:39:42.760886 IP 10.239.240.254 > 10.239.240.100: ICMP 10.239.240.254 tcp port 1000 unreachable, length 68
22:39:48.096283 ARP, Request who-has 10.239.240.254 tell 10.239.240.100, length 28
22:39:48.096911 ARP, Reply 10.239.240.254 is-at 78:9a:de:ad:be:ef, length 42
When I set the reject rule to reset the TCP session:
22:48:15.243033 IP 10.239.240.100.54862 > 10.239.240.254.1000: Flags [S], seq 779133466, win 64240, options [mss 1460,sackOK,TS val 2011622943 ecr 0,nop,wscale 7], length 0
22:48:15.243353 IP 10.239.240.254.1000 > 10.239.240.100.54862: Flags [R.], seq 0, ack 779133467, win 0, length 0
22:48:20.608550 ARP, Request who-has 10.239.240.254 tell 10.239.240.100, length 28
22:48:20.609105 ARP, Reply 10.239.240.254 is-at 78:9a:de:ad:be:ef, length 42
PS: My previous panel didn't accept local connections whilst being connected to the cloud. The lag on the cloud calls was annoying and most motion sensor events were subsequently not seen by Home Assistant. The silver lining of our most recent lightning strike has definitely been the newer panel which now allows me to use the local integration with instant updates.
Thanks for the update!
The problem
I found a related bug report, which perfectly matches my symptoms, where the report was unfortunately closed due to it relating to an unusual integration.
https://github.com/home-assistant/core/issues/91921
Our home is situated on an elevated ridge and although we have proper lightning protection, those that hit earthing conductors still generate inductive surges in copper wires longer than 10 metres within 200 metres of the strike. The alarm is a Risco LightSYS II and now completely wireless (passives, contacts and internet). We've had two occasions where the alarm panel stopped answering connection requests on port 1000 until I restart it (by disconnecting both the battery and power and restoring the connections thereafter).
When this happens, the HASS VM chews through 8GB of memory within half an hour and the OOM (out of memory) task killer steps in, further breaking automations and integrations.
Please may I ask for a reconsideration? I really enjoy having lights turn on & off automatically at dusk & dawn when there is/isn't movement in parts of the home.
PS: The old panel only had a wired ethernet port and was fried by a surge delivered together with the IP packets. I'm trying to engage with Risco regarding the bug where the Configuration Software interface stops responding, but it's really frustrating that everything stops working when HA can't talk to the Risco panel via the local integration.
What version of Home Assistant Core has the issue?
core-2024.1.5
What was the last working version of Home Assistant Core?
No response
What type of installation are you running?
Home Assistant OS
Integration causing the issue
Risco
Link to integration documentation on our website
https://www.home-assistant.io/integrations/risco/
Diagnostics information
No response
Example YAML snippet
No response
Anything in the logs that might be useful for us?
Additional information
No response