home-assistant / core

Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0

Home Assistant Memory Leak #42752

Closed McGiverGim closed 3 years ago

McGiverGim commented 3 years ago

The problem

EDIT: I edited this to remove the ONVIF integration as the suspected cause of the problem.

There is a memory leak in ONVIF, as stated here: https://github.com/home-assistant/core/issues/42390 It was supposedly fixed in 0.117.2, but at least in my case it does not seem to be. I was asked to open a new issue, so here it is.

Several hours after removing ONVIF, the issue is still not fixed, so it is clear that in my case the leak is somewhere else.

Here is a sample after upgrading to 0.117.2 from 0.116.: image

Environment


Traceback/Error logs

Nothing.

Additional information

probot-home-assistant[bot] commented 3 years ago

Hey there @hunterjm, mind taking a look at this issue as it's been labeled with an integration (onvif) you are listed as a codeowner for? Thanks! (message by CodeOwnersMention)

hunterjm commented 3 years ago

@McGiverGim - can you set log level to debug for ONVIF and include the resulting log here after you notice memory utilization rise again?

McGiverGim commented 3 years ago

@hunterjm I will do that. My tests suggested it was the ONVIF integration, but now I'm not sure. I will confirm it and post the logs...

hunterjm commented 3 years ago

It's possible that it is unrelated to ONVIF. I told this person (who doesn't use ONVIF) to open an issue here, but they opted not to: https://community.home-assistant.io/t/0-117-0-continual-memory-increase/240665/15

Edit: My recommendation would be to remove ONVIF devices and see if the issue still occurs for you. If it doesn't (removing restores memory utilization) then the logs would definitely be helpful. Otherwise, it may be an unrelated bug.

McGiverGim commented 3 years ago

Several hours after removing ONVIF, the issue is still not fixed, so it is clear that in my case the leak is somewhere else.

I have gone back to 0.116.4; otherwise my system crashes every 8 hours because of the memory leak. But if I need to run some tests, I can install version 0.117.x again.

Would it be better to rename this issue or to open another one? Without any suspicious log it will be difficult to track down.

Gunth commented 3 years ago

Indeed, I'm no longer sure this issue is caused by the ONVIF integration, because I don't use it and I see the same kind of RAM chart: image But I'm not sure where to start looking for the issue ...

hunterjm commented 3 years ago

@Gunth - Easiest way to start is working with @McGiverGim to narrow down a list of integrations you both use, then start disabling them one by one until you find the culprit.
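
One way to do the narrowing is a set intersection over each user's integration domains. A minimal sketch follows; the user names and lists are hypothetical placeholders, not the actual lists from this thread.

```python
# Hypothetical integration lists; replace them with the domains
# from each affected user's own integration list.
installed = {
    "user_a": {"brother", "upnp", "ssdp", "zeroconf", "onvif"},
    "user_b": {"ring", "ffmpeg", "ssdp", "zeroconf", "upnp"},
    "user_c": {"deconz", "sonos", "ssdp", "zeroconf"},
}


def shared_integrations(lists):
    """Return the domains that every affected user has installed."""
    sets = list(lists.values())
    if not sets:
        return set()
    return set.intersection(*sets)


# The shared domains are the candidates to disable one by one.
suspects = sorted(shared_integrations(installed))
print(suspects)
```

The intersection only shrinks the candidate list; each remaining domain still has to be disabled and observed individually.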

DanskerUS commented 3 years ago

My memory usage for CORE jumped to 41% after 12 hours; it is typically just a few percentage points. After a reboot it was back to about 2% (8 GB RAM on a NUC), and it is now at about 5.4% after a few hours. Not using ONVIF.

tr1plus commented 3 years ago

Same issue here. For me it takes about 24 hours to fill up 2 GB of RAM and my full swap file, and then processor usage climbs to 100%, making the system less and less responsive. I'm also not using ONVIF.

I'm running the supervisor in an ESXi container and I still have the snapshot from before any upgrade, so I can go back if needed.

image image image image image

EDIT: The integrations/add-ons that I use:

image image HACS: image image

McGiverGim commented 3 years ago

I suppose the most likely culprit is one of the included integrations, not an unofficial (HACS) one. I don't have any of your HACS integrations. Looking at your integrations, I also have Brother and UPnP (I have more in common: mobile, Cast, and weather, but I suppose those are used by almost everybody, so if they had a problem the reports would be massive): image

Other integrations are configured using YAML or come by default. This is the list of all of them:

  Alarm Control Panel: alarm_control_panel
  Amazon Alexa: alexa
  Home Assistant API: api
  Auth: auth
  Automation: automation
  Bad Nest (A hack around the Nest component to pull from their internal api): badnest
  Binary Sensor: binary_sensor
  Brother Printer: brother
  Camera: camera
  Google Cast: cast
  Climate: climate
  Home Assistant Cloud: cloud
  Configuration: config
  Default Config: default_config
  Device Automation: device_automation
  Device Tracker: device_tracker
  FFmpeg: ffmpeg
  Home Assistant Frontend: frontend
  Google Assistant: google_assistant
  Group: group
  HACS: hacs
  Logitech Harmony Hub: harmony
  Hass.io: hassio
  History: history
  Home Assistant: homeassistant
  HTTP: http
  Image: image
  Image Processing: image_processing
  Input Boolean: input_boolean
  Input Datetime: input_datetime
  Input Number: input_number
  Input Select: input_select
  Input Text: input_text
  Internet Printing Protocol (IPP): ipp
  Light: light
  Logbook: logbook
  Lovelace: lovelace
  Map: map
  Media Player: media_player
  Media Source: media_source
  Meteorologisk institutt (Met.no): met
  Mobile App: mobile_app
  MQTT: mqtt
  Node-RED: nodered
  Notifications: notify
  Home Assistant Onboarding: onboarding
  ONVIF: onvif
  Persistent Notification: persistent_notification
  Person: person
  Plant: plant
  Spain electricity hourly pricing (PVPC): pvpc_hourly_pricing
  Python Scripts: python_script
  Recorder: recorder
  Remote: remote
  Raspberry Pi Power Supply Checker: rpi_power
  Scene: scene
  Script: script
  Search: search
  Sensor: sensor
  Simple Service Discovery Protocol (SSDP): ssdp
  Stream: stream
  Sun: sun
  Switch: switch
  System Health: system_health
  System Log: system_log
  Tag: tag
  Timer: timer
  Text-to-Speech (TTS): tts
  Updater: updater
  UPnP: upnp
  Utility Meter: utility_meter
  Weather: weather
  Webhook: webhook
  Home Assistant WebSocket API: websocket_api
  Zero-configuration networking (zeroconf): zeroconf
  Zone: zone

llevering commented 3 years ago

I am experiencing the issues as well. I have the Brother integration as well. I have deleted it for now to see if there is something there.

tr1plus commented 3 years ago

I'll take another ESXi snapshot and remove Brother and UPnP. I'll keep you posted in (hopefully) 24h :)

DanskerUS commented 3 years ago

I have this problem, but no Brother integration. Sorry.

ramyi commented 3 years ago

I have the same issue. I don't use Brother or ONVIF. I do have Ring cameras, which use FFmpeg. Below is a snapshot of my add-ons: image I'm downgrading to 0.116.4 to see if it makes a difference.

hmmbob commented 3 years ago

> i have the same issue. dont use brother or onvif. do have ring cameras which use ffmpeg. below snapshot of addons image downgrading to .116.4 to see if it makes a difference

I think it was established that it isn't in the addons - what integrations are you running?

ramyi commented 3 years ago

> > i have the same issue. dont use brother or onvif. do have ring cameras which use ffmpeg. below snapshot of addons image downgrading to .116.4 to see if it makes a difference
>
> I think it was established that it isn't in the addons - what integrations are you running?

image

ramyi commented 3 years ago

> > i have the same issue. dont use brother or onvif. do have ring cameras which use ffmpeg. below snapshot of addons image downgrading to .116.4 to see if it makes a difference
>
> I think it was established that it isn't in the addons - what integrations are you running?

image

There should be more. How do I get the simple table like McGiverGim's?

McGiverGim commented 3 years ago

> there should be more. how do i get the simple table like McGiverGim

It's in the Configuration, Information.

ramyi commented 3 years ago

  AdGuard Home: adguard
  Alarm Control Panel: alarm_control_panel
  Home Assistant API: api
  Auth: auth
  Automation: automation
  Binary Sensor: binary_sensor
  Sony Bravia TV: braviatv
  Camera: camera
  Google Cast: cast
  Climate: climate
  Configuration: config
  Conversation: conversation
  Device Automation: device_automation
  Device Tracker: device_tracker
  Discovery: discovery
  ESPHome: esphome
  FFmpeg: ffmpeg
  Home Assistant Frontend: frontend
  Google Assistant: google_assistant
  Hass.io: hassio
  History: history
  Home Assistant Core Integration: homeassistant
  HomeKit: homekit
  HTTP: http
  Apple iCloud: icloud
  Image: image
  Input Boolean: input_boolean
  Home Assistant iOS: ios
  Konnected.io: konnected
  Light: light
  Logbook: logbook
  Lovelace: lovelace
  Map: map
  Media Player: media_player
  Meteorologisk institutt (Met.no): met
  Mobile App: mobile_app
  Notifications: notify
  Network UPS Tools (NUT): nut
  Home Assistant Onboarding: onboarding
  Persistent Notification: persistent_notification
  Person: person
  Recorder: recorder
  Ring: ring
  Scripts: script
  Search: search
  Sensor: sensor
  Simple Service Discovery Protocol (SSDP): ssdp
  Stream: stream
  Sun: sun
  Switch: switch
  System Health: system_health
  System Log: system_log
  Tags: tag
  TP-Link Kasa Smart: tplink
  Text-to-Speech (TTS): tts
  Updater: updater
  UPnP: upnp
  Wake on LAN: wake_on_lan
  Weather: weather
  Webhook: webhook
  Home Assistant WebSocket API: websocket_api
  Zero-configuration networking (zeroconf): zeroconf
  Zone: zone

tr1plus commented 3 years ago

Adding my full integration list to the mix too:

  Amazon Alexa: alexa
  Home Assistant API: api
  Auth: auth
  Automation: automation
  Binary sensor: binary_sensor
  Camera: camera
  Google Cast: cast
  Climate: climate
  Home Assistant Cloud: cloud
  Configuration: config
  Counter: counter
  Cover: cover
  deCONZ: deconz
  Default Config: default_config
  Device Automation: device_automation
  Device tracker: device_tracker
  ESPHome: esphome
  Fan: fan
  Home Assistant Frontend: frontend
  Group: group
  HACS: hacs
  Hass.io: hassio
  History: history
  Home Assistant: homeassistant
  HTTP: http
  Image: image
  Input boolean: input_boolean
  Input datetime: input_datetime
  Input number: input_number
  Input select: input_select
  Input text: input_text
  Internet Printing Protocol (IPP): ipp
  Light: light
  Lock: lock
  Logbook: logbook
  Lovelace: lovelace
  Map: map
  Media player: media_player
  Media Source: media_source
  Meteorologisk institutt (Met.no): met
  Mobile App: mobile_app
  Notifications: notify
  Home Assistant Onboarding: onboarding
  Persistent Notification: persistent_notification
  Person: person
  Pi-hole: pi_hole
  Recorder: recorder
  Scene: scene
  Script: script
  Search: search
  Sensor: sensor
  Sonos: sonos
  Simple Service Discovery Protocol (SSDP): ssdp
  Stream: stream
  Sun: sun
  Switch: switch
  System Health: system_health
  System Log: system_log
  Tag: tag
  Timer: timer
  IKEA TRÅDFRI: tradfri
  Transmission: transmission
  Text-to-Speech (TTS): tts
  Updater: updater
  Weather: weather
  Weatherbit: weatherbit
  Webhook: webhook
  Home Assistant WebSocket API: websocket_api
  Zero-configuration networking (zeroconf): zeroconf
  Zone: zone

ramyi commented 3 years ago

This is what I know. The symptoms are that, after any heavy task, swap reaches 100% and then RAM starts to climb to 80-90%, but never 100%. I can replicate it with a full snapshot, which often results in a reboot after a while. If I reboot the host (Supervisor --> System --> Reboot under host), everything goes back to normal for hours until the next heavy task. Then swap spikes and RAM increases until it causes a reboot of the server (not the host), which then reboots on a whim until I properly reboot the host again to make it stable. Downgrading to 0.116.4 seems to make it more stable, but it still reboots. I suspect it's Supervisor/HassOS related.

If I don't reboot the host, RAM and swap are still high after a crash and restart of the Home Assistant server.

McGiverGim commented 3 years ago

@ramyi In my case 0.116.4 is totally stable, so maybe my issue is different from yours. This started to happen in 0.117.0 and has not been fixed in 0.117.2.

Is there a way to dump the memory and profile it? Maybe not for me, since I use HA OS, which I suppose is more limited, but maybe some of you are able to do that.

hmmbob commented 3 years ago

Are you all using Supervisor/HA OS?

McGiverGim commented 3 years ago

I am, on a Raspberry Pi 4 with HA OS 5.4 64-bit, booting from SSD. I don't know about the others.

tr1plus commented 3 years ago

I am indeed using the supervisor Version 2020.10.1

What I do notice is that there is a ram difference between the System monitor sensor and the supervisor screen:

image image

McGiverGim commented 3 years ago

The supervisor only shows the memory used by the Home Assistant Docker container. The other is the total for the system, if I'm not wrong.

ramyi commented 3 years ago

I'm using an RPi 3B+. I tried with an SD card and an SSD and the issue remains. I just moved to Hyper-V to see if I can keep it stable for now.

Gunth commented 3 years ago

I also use the Brother integration, on an RPi 3 with a memory card. Same issue as all of you: swap goes to 100%, RAM keeps growing, then it reboots automatically (it can reboot 2 or 3 times a day). Reverting to 0.116.4 works correctly again.

McGiverGim commented 3 years ago

I've seen that 0.117.0 has a new profiler integration: https://www.home-assistant.io/integrations/profiler/ and four days ago an option for memory profiling was added: https://github.com/home-assistant/core/pull/42435

I suppose the memory profiling is for version 0.118.0, but maybe we can add it as a custom_component and it will work? Would it help with this issue?

McGiverGim commented 3 years ago

> I suppose that the memory profiling is for version 0.118.0 but maybe we can add it as a custom_component and it will work? It will help with this issue?

It seems that yes, it can help when added as a custom_component. Here is the post about it in the original ONVIF memory leak issue: https://github.com/home-assistant/core/issues/42390#issuecomment-717532291

I don't know if someone is able to test it. If not, I will try tomorrow, when I have time to let the memory grow for several hours before running the profiler...

hunterjm commented 3 years ago

The profiling I added currently only profiles a 60 second window. If the leak is slow (MB over hours) it won't help as is. If you want to run it as a custom component, it might be better to comment out lines 101 and 102 in __init__.py to get a full memory snapshot when run.

McGiverGim commented 3 years ago

Thanks for the info @hunterjm. I will try it tomorrow if nobody does it before me. As a suggestion, maybe this could be an option in the data passed to the service.

EDIT: I see in the docs that we can choose the number of seconds, but not a full memory snapshot.

hunterjm commented 3 years ago

Hmm, it might actually be better to just set seconds: 3600 in the service call after HA starts. Then we won't get all the memory consumption from that, just what gets added over an hour.
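
To illustrate the difference between a full snapshot and a windowed profile, here is a minimal sketch using Python's standard-library tracemalloc. It is not the profiler integration's implementation; profile_window, leaky, and the 0.1-second window are names and values assumed for illustration. Only allocations made after the first snapshot show up in the diff, which is the effect described above for seconds: 3600.

```python
import time
import tracemalloc


def profile_window(seconds, workload=None, top=5):
    """Return the largest allocation changes seen during a time window."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()  # reference point: earlier memory is excluded
    if workload is not None:
        workload()
    time.sleep(seconds)  # blocking wait, acceptable only in a standalone script
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # compare_to sorts entries by the absolute size difference, largest first.
    return after.compare_to(before, "lineno")[:top]


leaked = []


def leaky():
    """Hypothetical leak: keep a 1 MB buffer alive forever."""
    leaked.append(bytearray(1_000_000))


stats = profile_window(0.1, workload=leaky)
for stat in stats:
    print(stat)
```

Inside Home Assistant's event loop the wait would need to be a non-blocking asyncio.sleep rather than time.sleep.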

hunterjm commented 3 years ago

For the rest of the group, Integrations are not just on the Integrations page in the UI. Anything you have in configuration.yaml is also relevant.

hmmbob commented 3 years ago

But all should be showing if you go to "settings" -> "info" -> "integrations", right?

hunterjm commented 3 years ago

@hmmbob - Yes

Stimpy68 commented 3 years ago

Having the same issue. I first noticed it after I upgraded to 0.117.1: HA restarted randomly (so it seemed). After 1 day and 4 restarts I went back to 0.116.4, with no problems. I read that the ONVIF integration could cause a memory leak, so I removed it yesterday. This morning I upgraded to 0.117.2 again, installed Glances, and put that data in InfluxDB/Grafana to get a historical picture. Since restarting and installing Glances, it looks like memory usage is rising steadily.

Schermafbeelding 2020-11-03 183635

I'm running an ESXi VM with HA OS and increased the memory from 2 to 3 GB. Schermafbeelding 2020-11-03 184002

Here you can see that at about 9:30 I updated to 0.117.2 and memory usage started increasing until about 15:00, when I saw that the HA container was using more than 800 MB and Glances started warning that memory usage was above 70% (and rising). I stopped the VM, increased the memory to 3 GB, and started it again.

Schermafbeelding 2020-11-03 184138

McGiverGim commented 3 years ago

> Hmm, it might actually be better to just set seconds: 3600 in the service call after HA starts. Then we won't get all the memory consumption from that, just what gets added over an hour.

True, that will be easier. I will do it this way.

elupus commented 3 years ago

Guys, can you try disabling ssdp component?

DanskerUS commented 3 years ago

Sure, but how? Thanks.

cobirnm commented 3 years ago

I have the same issue. I'm using the Docker version of Home Assistant on a Raspberry Pi 3B+. I have enlarged my swap file to 4 GB, and from what I can see the memory used by the container climbs to 530 MB. After that the system starts swapping until it becomes unresponsive after 30 hours.

Regarding integrations, the only things I have in common are deCONZ and Brother.

ramyi commented 3 years ago

> Guys, can you try disabling ssdp component?

I've removed ssdp and will upgrade back to 0.117.2. Should I remove upnp as well? For others who want to try: you can remove ssdp: and make sure you don't have default_config: in there.
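
For anyone unsure whether ssdp is still enabled, a rough check of configuration.yaml might look like the sketch below. It assumes, per the comment above, that only a top-level ssdp: or default_config: entry matters; it is a plain text scan rather than a YAML parse, does not follow !include files, and find_ssdp_enablers is a hypothetical helper name.

```python
import re

# Top-level keys that, per the comment above, keep ssdp enabled.
SSDP_ENABLERS = ("ssdp", "default_config")


def find_ssdp_enablers(config_text):
    """Return the top-level keys in the given YAML text that still enable ssdp."""
    found = []
    for key in SSDP_ENABLERS:
        # A top-level key starts at column 0 and is followed by a colon,
        # so commented-out or indented entries are ignored.
        if re.search(rf"^{re.escape(key)}\s*:", config_text, flags=re.MULTILINE):
            found.append(key)
    return found


# Sample configuration text, made up for illustration.
sample = """\
# default_config:
frontend:
ssdp:
"""
print(find_ssdp_enablers(sample))
```

An empty result only means neither key is present at the top level of that one file; other integrations may still pull ssdp in.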

bdraco commented 3 years ago
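
    import logging
    from datetime import timedelta
    import objgraph
    from homeassistant.core import callback
    from homeassistant.helpers.event import async_track_time_interval
    _LOGGER = logging.getLogger(__name__)

    @callback  # assumes `hass` is in scope, e.g. inside the integration's setup
    def _log_objects(*_):
        _LOGGER.debug("Most common types: %s", objgraph.most_common_types(limit=100))
        _LOGGER.debug("Growth: %s", objgraph.growth(limit=100))

    async_track_time_interval(hass, _log_objects, timedelta(seconds=30))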

Maybe make a custom integration that uses objgraph to log the above. Then you can watch Growth over time to see what gets added.
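
A standalone sketch of the same idea, watching which object types grow between two samples. It uses only the standard-library gc module as a stand-in for objgraph, so type_counts, growth, and the Session class are illustrative names, not objgraph APIs.

```python
import gc
from collections import Counter


def type_counts():
    """Count live, gc-tracked objects by type name."""
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def growth(before, after, limit=10):
    """Return (type name, current count, increase) for types that grew."""
    deltas = [
        (name, after[name], after[name] - before[name])
        for name in after
        if after[name] > before[name]
    ]
    # Largest increase first, so the most likely leak is at the top.
    return sorted(deltas, key=lambda item: item[2], reverse=True)[:limit]


class Session:
    """Hypothetical object type that is never released."""


leaked = []
baseline = type_counts()
leaked.extend(Session() for _ in range(500))  # simulate a leak
report = growth(baseline, type_counts())
print(report[0])
```

Sampling on a fixed interval and comparing against the previous sample, as in the snippet above, shows which types keep growing rather than which are merely numerous.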

tr1plus commented 3 years ago

It is almost 24h (about 21, I think) since my last message, where I disabled upnp and brother. My RAM growth has been a bit less aggressive (still climbing, but much slower): image

Stimpy68 commented 3 years ago

@tr1plus I also used the Brother integration and removed it yesterday in 0.117.2, but saw no real difference. I then downgraded to 0.116.4, and you can clearly see when I did that. I also noticed that the CPU is less choppy in 0.116.4. So maybe it's upnp?

Knipsel

bieniu commented 3 years ago

Do you turn off your Brother printer at night or when you don't use it? If yes, please test this https://github.com/home-assistant/core/issues/42749#issuecomment-721214380

Stimpy68 commented 3 years ago

@bieniu I only turn it on when I'm using it, and that's just a few times a week (it's a Brother DCP-9020CDW laser printer/scanner). When done, I turn it off. But as I said, removing the Brother integration didn't make a significant difference in memory usage; it was still climbing.

McGiverGim commented 3 years ago

I have started a 60-minute memory profile. I will let you know when it finishes. I ran one for 60 seconds to test that it works, but I suppose it will show almost nothing. Here is the result (I don't know anything about Python or the HA code, so I can't help with this): image

And yes, I usually keep my Brother powered off too, but I can't test your changes until I finish with the memory dump.

llevering commented 3 years ago

> We are almost 24h later (like 21 I think) from my last message where I disabled upnp and brother. My ram has been a bit less aggressive(still climbing but much slower):

I can confirm this as well. It might be not so much the Brother integration as an underlying library, of course. But for me, disabling the Brother integration is a workaround.

@bieniu I can confirm that my Brother is mostly turned off. I will try your version; however, I am at work now, so I can only test it tonight. But it seems that you're definitely on to something :)

bieniu commented 3 years ago

@llevering The Brother integration probably leaks memory when the device is turned off and HA is restarted. If this scenario fits your usage, this test version may solve the problem.