home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0
72.39k stars 30.29k forks source link

Out of Memory error on Risco Integration #119138

Closed IlPicasso closed 3 months ago

IlPicasso commented 4 months ago

The problem

I found my HASS instance crashed many times after having electricity cuts . I cant replicate that, but by a Mistake I discovered Risco was crashing my HASS instance.

By mistake was that yesterday I set the same IP of risco and another device (in this case my tp-link switch), when I did that HASS would crash.

My guess is that when Risco cant connect it crashes

This is normal behaviour

https://imgur.com/a/c4xWboA

And this is virtualization behaviour when I did this mistake, and I can replicate by doing the same

https://imgur.com/a/PU7mGCe

Memory and CPU usage keeps increasing until it is not accesible anymore and finally it crashes (this may take 1-2 minutes)

I can guess too, when my electricity was cutted and since my HASS hardware is connected to a UPS , it has been trying to connect to risco and crashed in the trial.

Note please that I am using Risco TCP connnection method and is configured to have https://imgur.com/a/XpFJXui

What version of Home Assistant Core has the issue?

It is not version dependant, been having this issue for months

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

Risco

Link to integration documentation on our website

https://www.home-assistant.io/integrations/risco

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

No response

home-assistant[bot] commented 4 months ago

Hey there @onfreund, mind taking a look at this issue as it has been labeled with an integration (risco) you are listed as a code owner for? Thanks!

Code owner commands Code owners of `risco` can trigger bot actions by commenting: - `@home-assistant close` Closes the issue. - `@home-assistant rename Awesome new title` Renames the issue. - `@home-assistant reopen` Reopen the issue. - `@home-assistant unassign risco` Removes the current integration label and assignees on the issue, add the integration domain after the command. - `@home-assistant add-label needs-more-information` Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue. - `@home-assistant remove-label needs-more-information` Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


risco documentation risco source (message by IssueLinks)

OnFreund commented 4 months ago

This was reported in the past but I couldn't reproduce it. One of the suspicions backs then was that connections were dropped rather than rejected. Could that be the case with your setup?

IlPicasso commented 4 months ago

This was reported in the past but I couldn't reproduce it. One of the suspicions backs then was that connections were dropped rather than rejected. Could that be the case with your setup?

I couldnt say ,

But I can replicate It easily by creating an IP conflict

Restarting hass After hass is initiated AND risco loaded, put another device (in my case my tplink switch) the same IP than risco

OnFreund commented 4 months ago

It really depends on how the other device is handling the connection request - you won't get this with any conflict. Can you find any error in the logs? Are you able to try reproducing this with a python console so we can get the exact error?

IlPicasso commented 4 months ago

It really depends on how the other device is handling the connection request - you won't get this with any conflict. Can you find any error in the logs? Are you able to try reproducing this with a python console so we can get the exact error?

I didnt find any error on HASS log

If you guide me on how to open python console, I can try replicating it with python console opened

OnFreund commented 4 months ago

You can do it on your regular computer. Install Python (instructions depend on your operating system), and then run

pip3 install aioconsole pyrisco
apython

You'll get a python console, and you can follow the instructions in the readme to try to connect to the tp-link switch and see what error you get.

IlPicasso commented 4 months ago

You can do it on your regular computer. Install Python (instructions depend on your operating system), and then run

pip3 install aioconsole pyrisco
apython

You'll get a python console, and you can follow the instructions in the readme to try to connect to the tp-link switch and see what error you get.

I dont see any output, or I am doing something wrong

image

Maybe you have discord?

OnFreund commented 4 months ago

You're not in an async context. Did you run apython?

IlPicasso commented 4 months ago

You're not in an async context. Did you run apython?

I managed to do it, and managed to replicate the memory issue sadly I didnt saw any error

image

What I noticed is that to replicate this

  1. First, TCP connection must be succesful
  2. After pyrsico succesfully connects to risco panel, create the conflict

Here is a video https://www.loom.com/share/de6d0af39acb431bb3e1e1f76d757133

OnFreund commented 4 months ago

Does it only start after you try to get the zone name? That's really strange and it definitely shouldn't hang, because the zones are cached in memory and shouldn't trigger any network request. It should return the name immediately.

My only suspicion could be the keep alive pings somehow causing this, but going over the code I don't see any reason they would. I'll need to figure out a way to reproduce this myself without changing the IP address.

IlPicasso commented 4 months ago

Does it only start after you try to get the zone name? That's really strange and it definitely shouldn't hang, because the zones are cached in memory and shouldn't trigger any network request. It should return the name immediately.

My only suspicion could be the keep alive pings somehow causing this, but going over the code I don't see any reason they would. I'll need to figure out a way to reproduce this myself without changing the IP address.

I did more tests and replicated

No need to change IP or create conflict, neither try to get zone name

Simply,

  1. Do a succesfull connection with PyRisco,
  2. Disconnect Risco Panel from Local Network (Remove LAN Cable from panel or shutdown panel)
IlPicasso commented 4 months ago

Does it only start after you try to get the zone name? That's really strange and it definitely shouldn't hang, because the zones are cached in memory and shouldn't trigger any network request. It should return the name immediately. My only suspicion could be the keep alive pings somehow causing this, but going over the code I don't see any reason they would. I'll need to figure out a way to reproduce this myself without changing the IP address.

I did more tests and replicated

No need to change IP or create conflict, neither try to get zone name

Simply,

  1. Do a succesfully connection with PyRisco,
  2. Disconnect Risco Panel from Local Network (Remove LAN Cable from panel or shutdown panel)

After this pyrisco Will start eating all ram, looks like pyrisco is trying to reconnect? Infinite times with no waiting Time I can create a video if you want

IlPicasso commented 4 months ago

Does it only start after you try to get the zone name? That's really strange and it definitely shouldn't hang, because the zones are cached in memory and shouldn't trigger any network request. It should return the name immediately.

My only suspicion could be the keep alive pings somehow causing this, but going over the code I don't see any reason they would. I'll need to figure out a way to reproduce this myself without changing the IP address.

Let me know if I can help with anything else

OnFreund commented 3 months ago

It's going to take me a while to reproduce this. In the meantime, you can try this:

And let me know what happens

IlPicasso commented 3 months ago

Following that procedure the next happens: image

If I connect back the cable I get output of the last command and I can run commands normally

image

Running those commands seems to work normally , there is not ram leak

EDIT: seems last commands are ran from cache because I tried running them with cable disconnected too and they output information

IlPicasso commented 3 months ago

I left it running more time before connecting back the cable and I got a different output

image

OnFreund commented 3 months ago

That last one looks like a valid response. Are you sure the cable was disconnected? Can you run the last command again and see what happens next?

IlPicasso commented 3 months ago

That last one looks like a valid response. Are you sure the cable was disconnected? Can you run the last command again and see what happens next?

It stays like this, until I connect back the cable

image

When I connect the cable again, the command responds

OnFreund commented 3 months ago

OK, let's try something else. Instead of canceling all 3 tasks, just cancel this one:

r._rs._keep_alive_task.cancel()

and then disconnect the cable. Is there still a memory leak?

IlPicasso commented 3 months ago

Ok No memory leak here

image image

IlPicasso commented 3 months ago

I try running print(r.zones[1].name) when cable disconnected, it seems it outputs info from cache there.

IlPicasso commented 3 months ago

Ok, I connected back

ran again r._rs._keep_alive_task.cancel() and got the next and got memory leak

image

OnFreund commented 3 months ago

I try running print(r.zones[1].name) when cable disconnected, it seems it outputs info from cache there.

Zone and partition information is saved in memory, there's no need to test it.

ran again r._rs._keep_alive_task.cancel() and got the next and got memory leak

You can't cancel the task twice. It also seems like it wasn't canceled at all or was already cancled when you tried (returned False). Please repeat the experiments exactly as I say - the added commands just make it more difficult to see what's going on.

Right now what I'd like to see is:

  1. Connect
  2. r._rs._keep_alive_task.cancel()
  3. Unplug the cable.

That's it - no extra commands.

Are you seeing a memory leak? If not, try running:

await r._rs.send_result_command("CLOCK")

What do you see?

IlPicasso commented 3 months ago

This is the result of last test image

In summary

  • Connect
  • r._rs._keep_alive_task.cancel()
  • Unplug the cable.

This doesnt cause any memory leak.

Memory leak starts after running

> await r._rs.send_result_command("CLOCK")
OnFreund commented 3 months ago

What do you mean that the memory leak starts after running this command? Was there a single increase in memory or is it continuous?

IlPicasso commented 3 months ago

What do you mean that the memory leak starts after running this command? Was there a single increase in memory or is it continuous?

Continuos, in a moment I send a video

OnFreund commented 3 months ago

OK, let's try this:

IlPicasso commented 3 months ago

What do you mean that the memory leak starts after running this command? Was there a single increase in memory or is it continuous?

This is the video of leak memory https://www.loom.com/share/6b389b9ec4b3454fad88256f9500e53a

OK, let's try this:

  • Connect
  • Cancel all 3 tasks
  • Unplug cable
  • Send CLOCK command
  • (There shouldn't be any leak at this point)
  • Call await r._rs._read_command()

This is the video from this test https://www.loom.com/share/ae64b97a15324dcf830f269fcc750af1

As you said, no leak on this process this is the log image

OnFreund commented 3 months ago

OK, I have an idea. Will get back to you in the next few days.

OnFreund commented 3 months ago

I released version 0.6.3 of pyrisco (it's untested). Please install it and verify that there's no leak when disconnecting the cable (no need to cancel any task or run any commands - just the regular connect and unplug cable). After you test, I'll update HA with the new pyrisco version.

IlPicasso commented 3 months ago

I released version 0.6.3 of pyrisco (it's untested). Please install it and verify that there's no leak when disconnecting the cable (no need to cancel any task or run any commands - just the regular connect and unplug cable). After you test, I'll update HA with the new pyrisco version.

I just tested it, didnt experienced any leak here.

Thank you 💯

IlPicasso commented 3 months ago

Hello, I am visiting back here.

I tested the integration on 2024.7.0b2, if I am not wrong changes should be already applied there.

Issue now is not on disconnect, but on reconnect. I made a video https://www.loom.com/share/1359a4493feb4666b4e0a57999fdf8f9

These are the errors I find

This one seems to be on start, before doing anything https://hastebin.com/share/uzinikibod.python

This one is after disconnecting risco from network (found twice) https://hastebin.com/share/inulomedux.python

Please note that I cant replicate this error directly in pyrisco, and to replicate in homeassistant I had to do the same procedure twice

OnFreund commented 3 months ago

I'm sorry but I don't understand what's happening in this video, and large parts of what's on screen isn't in English. Please write a clear description of what you're doing and what's happening. As for timeouts in CLOCK commands, you can usually ignore those (an exception would be if they happen on every command, i.e. every 5 seconds).

IlPicasso commented 3 months ago

I'm sorry but I don't understand what's happening in this video, and large parts of what's on screen isn't in English. Please write a clear description of what you're doing and what's happening. As for timeouts in CLOCK commands, you can usually ignore those (an exception would be if they happen on every command, i.e. every 5 seconds).

My bad I am sorry.

What you Saw in the video was me disconnecting and connecting again risco network and checking memory usage

In summary.

Disconnect risco network Connect it again Leak memory on Homeassistant which ends crashing It

As I said, in this case I could only replicate It on Homeassistant, It worked correctly when I tested It on aioconsole

OnFreund commented 3 months ago

What does the integration window show after you unplug and before you plug again? And can you link to full logs?

IlPicasso commented 3 months ago

home-assistant (2).log

Sorry, its filled with spam because a modbus integration I am working on

I will record when I get home the integration window

jakob1911 commented 3 months ago

Hello OnFreund, hello IIPicasso,

maybe something of my pullrequest is helpfull to solve the problem. OnFreund/pyrisco/pull#22

From what i understand "ConnectionResetError" will be raised earliest by reconnecting the cable. But in my test i also crashed imedeatly (will never be raised) -- correct me if i´m wrong.

I investigated a lot in where the problem is exactly is comming from (python reference - Stream API.....), but couldn´t find suitable answers. So i did what i explained in the pull request 22 - pyrisco

BR Jakob

OnFreund commented 3 months ago

@IlPicasso I don't see a connection reset error in these logs....

@jakob1911 I agree with the approach in general, but since all errors are propagated to HA, I think we should just handle it there instead of passing configuration. Can you create an HA PR and I'll review it?

IlPicasso commented 3 months ago

Let me know if I can do anything else

jakob1911 commented 3 months ago

@OnFreund yes good point with avoid passing configuration. I will create a pull request the next days.

IlPicasso commented 2 months ago

Hi @jakob1911 anything I can help on?

jakob1911 commented 2 months ago

Hello @IlPicasso thanks for the reminder. I was busy the last days. I will create the pr by end of this week.

BR Jakob