home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0

Homeassistant container memory leak - Out of Memory #93713

Closed litinoveweedle closed 1 year ago

litinoveweedle commented 1 year ago

Read before posting

https://github.com/home-assistant/core/issues/95386

The problem

The Home Assistant container is leaking memory at about 1 GB/day. I tried disabling integrations that are not strictly necessary, but could not find the culprit. I am not exactly sure when the problem first appeared, or whether it is caused by a new HA version or by a new integration. I disabled the ONVIF integration, which was mentioned in a separate OOM issue, but the HA container kept growing anyway.

A profiler PROFILER.START_LOG_OBJECT_SOURCES log is attached. I saw many messages like "Failed to serialize <class 'list'>", but I am not sure whether that is a real issue or not.
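For reference, a log like this is produced with the profiler integration's object-source logging service; a minimal sketch of such a call is below (the scan_interval value is only illustrative, not necessarily what was used here):

service: profiler.start_log_object_sources
data:
  scan_interval: 30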

Could be related to #93556 or #92026

What version of Home Assistant Core has the issue?

core-2023.5.4

What was the last working version of Home Assistant Core?

core-2023.4.x?

What type of installation are you running?

Home Assistant Supervised

Integration causing the issue

unknown

Link to integration documentation on our website

No response

Diagnostics information

home-assistant.log.zip

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

No response

litinoveweedle commented 1 year ago

After approx. 18 hours my memory consumption looks like:

CONTAINERS 20 sorted by memory_percent
Engine  Pod Name                Status  Uptime      CPU%    MEM /MAX    IOR/s   IOW/s   RX/s    TX/s    Command
docker  -   addon_a0d7b954_motioneye    running yesterday   21.4    1.92G   7.61G   0b  0b  NaNb    NaNb    /init
docker  -   homeassistant           running 18 hours    27.1    1.74G   7.61G   0b  0b  NaNb    NaNb    /init
docker  -   addon_core_whisper      running yesterday   0   908M    7.61G   0b  0b  0b  0b  /init
docker  -   addon_a0d7b954_nodered      running yesterday   4.3 262M    7.61G   0b  0b  NaNb    NaNb    /init
docker  -   addon_4a36bbd1_signal       running yesterday   0.1 213M    7.61G   0b  0b  248b    288b    /run.sh
docker  -   hassio_supervisor       running yesterday   1.7 207M    7.61G   0b  0b  0b  0b  /init
docker  -   addon_a0d7b954_adguard      running yesterday   0   147M    7.61G   0b  0b  NaNb    NaNb    /init
docker  -   addon_core_piper        running yesterday   0   111M    7.61G   0b  0b  0b  0b  /init
docker  -   addon_a0d7b954_glances      running yesterday   0.9 65.3M   7.61G   0b  0b  NaNb    NaNb    /run.sh
docker  -   addon_a0d7b954_logviewer    running yesterday   0   32.1M   7.61G   0b  0b  0b  0b  /init
docker  -   addon_a0d7b954_sqlite-web   running yesterday   0   31.8M   7.61G   0b  0b  0b  0b  /init
docker  -   hassio_dns          running yesterday   0.2 23.7M   7.61G   0b  0b  0b  0b  /init
docker  -   addon_core_configurator     running yesterday   0   22.5M   7.61G   0b  0b  0b  0b  /init
docker  -   hassio_audio            running yesterday   0   21.9M   7.61G   0b  0b  0b  0b  /init
docker  -   addon_core_ssh          running yesterday   0   20.7M   7.61G   0b  0b  0b  0b  /init
docker  -   hassio_observer         running yesterday   0   17.6M   7.61G   0b  0b  0b  0b  /init
docker  -   addon_45c828c3_snapcast-server  running yesterday   1.4 17.1M   7.61G   0b  0b  NaNb    NaNb    /init
docker  -   addon_a0d7b954_sonweb       running yesterday   0   13.2M   7.61G   0b  0b  0b  0b  /init
docker  -   hassio_cli          running yesterday   0   2.47M   7.61G   0b  0b  0b  0b  /init
docker  -   hassio_multicast        running yesterday   0   1.05M   7.61G   0b  0b  NaNb    NaNb    /init
litinoveweedle commented 1 year ago

After approx. 30 hours my memory consumption looks like:

CONTAINERS 20 sorted by memory_percent
Engine  Pod Name                Status  Uptime  CPU%    MEM /MAX    IOR/s   IOW/s   RX/s    TX/s    Command
docker  -   homeassistant           running yester  1.7 2.55G   7.61G   44Kb    0b  NaNb    NaNb    /init
docker  -   addon_a0d7b954_motioneye    running 2 days  12.2    1.44G   7.61G   0b  0b  NaNb    NaNb    /init
docker  -   addon_core_whisper      running 2 days  0   818M    7.61G   0b  0b  1Kb 0b  /init
docker  -   addon_a0d7b954_nodered      running 2 days  2.4 270M    7.61G   0b  0b  NaNb    NaNb    /init
docker  -   addon_4a36bbd1_signal       running 2 days  0.2 215M    7.61G   0b  0b  1Kb 240b    /run.sh
docker  -   hassio_supervisor       running 2 days  0.5 210M    7.61G   0b  0b  0b  0b  /init
docker  -   addon_a0d7b954_adguard      running 2 days  0   142M    7.61G   0b  0b  NaNb    NaNb    /init
docker  -   addon_core_piper        running 2 days  0   98.7M   7.61G   0b  0b  1Kb 0b  /init
docker  -   addon_a0d7b954_glances      running 2 days  1.2 51.1M   7.61G   0b  0b  NaNb    NaNb    /run.sh
docker  -   addon_a0d7b954_sqlite-web   running 2 days  0   30.3M   7.61G   0b  0b  1Kb 0b  /init
docker  -   addon_a0d7b954_logviewer    running 2 days  0   27.7M   7.61G   0b  0b  1Kb 0b  /init
docker  -   addon_core_configurator     running 2 days  0   23.2M   7.61G   0b  0b  1Kb 0b  /init
docker  -   hassio_dns          running 2 days  0.1 22.5M   7.61G   0b  0b  2Kb 808b    /init
docker  -   addon_core_ssh          running 2 days  0   19.7M   7.61G   0b  0b  1Kb 0b  /init
docker  -   hassio_audio            running 2 days  0   19.2M   7.61G   0b  0b  1Kb 0b  /init
docker  -   addon_45c828c3_snapcast-server  running 2 days  1.4 14.0M   7.61G   0b  0b  NaNb    NaNb    /init
docker  -   addon_a0d7b954_sonweb       running 2 days  0   13.0M   7.61G   0b  0b  1Kb 0b  /init
docker  -   hassio_observer         running 2 days  0   12.4M   7.61G   0b  0b  1Kb 0b  /init
docker  -   hassio_cli          running 2 days  0   1.96M   7.61G   0b  0b  1Kb 0b  /init
docker  -   hassio_multicast        running 2 days  0   1.04M   7.61G   0b  0b  NaNb    NaNb    /init
bdraco commented 1 year ago

I'd start with start_log_objects, as START_LOG_OBJECT_SOURCES is a lot more useful when we already know what type of object we are looking for.
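A minimal sketch of that service call (assuming the profiler integration is already set up; the scan_interval shown is an illustrative default, not a required value):

service: profiler.start_log_objects
data:
  scan_interval: 30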

litinoveweedle commented 1 year ago

OK, thank you. I did as you suggested; the log is attached. Thank you for any help. home-assistant.zip

bdraco commented 1 year ago

Something is creating a massive amount of tasks:

2023-05-31 13:41:50.794 CRITICAL (SyncWorker_4) [homeassistant.components.profiler] Memory Growth: [('coroutine', 202609, 339), ('frame', 202592, 339), ('builtin_function_or_method', 110601, 169), ('ReferenceType', 80809, 104), ('method', 56223, 90), ('Task', 50646, 84), ('Future', 50673, 84), ('FutureIter', 50627, 84), ('Context', 103933, 15)]

bdraco commented 1 year ago

That's good that it's asyncio, since it means a callgrind might show it.

Can you generate one with the profiler.start service?
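For reference, such a call would look roughly like this (a minimal sketch; the 180-second duration matches the run posted below and is not a required value):

service: profiler.start
data:
  seconds: 180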

litinoveweedle commented 1 year ago

Thank you for the help, please find it attached - it ran for 180s; if needed I can produce a longer one. callgrind.zip

bdraco commented 1 year ago

It's not immediately obvious where the task leak is coming from, which means it's likely a "slow" leak. Leaks are always easier to find when they run the system out of memory quickly, since you can see the abnormal growth right away.

The next step would be to disable the custom components below and see if the problem goes away.

If that turns out to be the cause, we would then need to dig through the code of the one causing it to look for the leak.

litinoveweedle commented 1 year ago

It's not immediately obvious where the task leak is coming from, which means it's likely a "slow" leak.

It is not that slow of a leak... I am running HA on a server with 16GB RAM, and HA was able to grow from ~300MB to 4GB in ~two days. Should I keep running the profiler for longer?

Does your reply mean that there was nothing obvious in the callgrind? I am doing what you wrote anyway (testing integrations one by one), but I need at least half of the integrations for the basic functioning of my household, so I can't simply switch them all off. Therefore it would be really nice if there is something in the logs to at least narrow down my search. Thank you.

bdraco commented 1 year ago

It likely is in the profile, but I can't find it because it's so small.

It's slow in the sense that in the time of the profile it only grows a little. When I say fast leak I mean something that will crash in a few minutes. Those are easy to trace, because the source shows up in the profile just by sorting.

You could try running the profile for 30m to see if it's large enough to correlate the source.

Keep in mind the profile has all the calls, so it's looking for a needle in a haystack. Anything we can do to make the leak faster or bigger will make it easier to find.

litinoveweedle commented 1 year ago

I did run the profiler for 1800s, but strangely the size of the files is similar. Anyway, I hope it will be helpful. Thank you.

callgrind.zip

bdraco commented 1 year ago

[image: callgrind graph of create_task callers]

Here is what is calling create_task.

bdraco commented 1 year ago

[image: callgrind graph showing messagemanager]

I don't know what messagemanager is, but it's creating tasks (not saying it's leaking). I can't find it in the HA code, so it must be a custom component or lib.

litinoveweedle commented 1 year ago

Hello, thank you very much for the support. It is strange, as this also brings up nothing:

root@smart:/usr/share/hassio/homeassistant/custom_components# grep -Ri messagemanager *
root@smart:/usr/share/hassio/homeassistant/custom_components# grep -Ri message_manager *

So the messagemanager method does not seem to be in the custom integrations either. I even grepped in the Home Assistant and Hassio root directories, but nothing was found. Are you sure about the name (i.e. messagemanager)? I did not find it in the diagrams you attached either.

BTW: is there any HOWTO one can follow to get similar output from the profiler? It would be helpful for users to try to identify the faulty code themselves.

litinoveweedle commented 1 year ago

@bdraco Can you please check the "messagemanager" method/class name you mentioned? Maybe I am doing it wrong, but I was not able to find it in the custom integrations either.

sknsean commented 1 year ago

Just to chime in here, I'm not using any of these integrations:

* cz_energy_spot_prices

* smartir

* floureon

* solax_modbus

* programmable_thermostat

* signalmessenger

* variable_history

I'm also experiencing a memory leak that results in the home-assistant container getting killed (by the OOM killer) every 5-6 days. Quite annoying :)

One of the pictures shows when the container got killed, and the other shows that free memory keeps getting lower.

[images: IMG_1265, IMG_1262]

litinoveweedle commented 1 year ago

@bdraco I am really sorry, but I have politely asked you a simple question several times during the past weeks regarding your original findings - i.e. where exactly did you find "messagemanager" in the trace, and why can't I grep anything with such a name in the HA source code, including custom integrations. I need this information to continue with my troubleshooting.

Do I assume correctly from your silence that I should not expect an answer at all?


bdraco commented 1 year ago

Day job has me traveling quite a bit. I’m about 150 GitHub notifications behind

backcountrymountains commented 1 year ago

So, I just put in an issue with HAOS because I'm also experiencing OOM reboots. My memory allocation graph is even steeper and my machine reboots more often. #2595.

Is this more likely to be an HA than an HAOS issue? I'm going to try to downgrade to HAOS 9.5 and see if I still have leaks.

VinceRMP commented 1 year ago

I see the same on my instance. It seemed to start after updating HAOS or the 2023.6 beta (coming from a dev build in the early stage of 2023.5, so it could also be related to a specific 2023.5 version). What I have noticed is that my memory increase is caused by my tablet, which shows the dashboard 24/7. When I disable the tablet the memory stops increasing, but doesn't decrease. When I open the display on the tablet the memory starts to increase again. Is this something you use as well?

litinoveweedle commented 1 year ago

I did observe the memory leak on 2023.5. There was a question why my graph of memory usage is not as steep as for others. Well, I run HA on a server and currently have 64GB of RAM, so it takes longer to consume it. (I extended my memory as an intermediate workaround, as my original 16GB of RAM was exhausted in just 2-3 days.)

bdraco commented 1 year ago

If you have a memory leak, please post a start_log_objects log (see above) and a callgrind. "Me too" comments are not helpful, as they don't add any information that gets us closer to tracking down a leak.

VinceRMP commented 1 year ago

callgrind_start_log_objects.zip

See the attached file. If you need more or longer logs let me know.

bdraco commented 1 year ago

@VinceRMP Can you run start_log_objects for 30 minutes? The way it works is that it keeps looking at what's in memory and sees what grows over time. Without a longer run it won't show what is leaking, as it doesn't have enough time to figure out what is normal memory churn and what is not going away.
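A sketch of such a 30-minute run (assuming profiler.stop_log_objects is the matching stop service; the scan_interval value is illustrative):

service: profiler.start_log_objects
data:
  scan_interval: 30
---
# after roughly 30 minutes, stop the collection so the accumulated growth is logged
service: profiler.stop_log_objects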

VinceRMP commented 1 year ago

At the moment I’m not able to do that. I will do it tomorrow and post the full log.

backcountrymountains commented 1 year ago

So, I've noticed a correlation between a CPU spike and the incrementally increasing memory use:

[image: CPU and memory usage graph]

Is there a way to monitor what causes the CPU spike to maybe get insight into what's using and not releasing memory?

VinceRMP commented 1 year ago

After some testing, I can reproduce the start of the memory leak and stop it as well. I'm not sure if this is the same issue as described here, and I don't want to hijack this issue, so if I need to create a separate one please let me know.

[image: memory usage graph]

At the first drop in memory I restarted Home Assistant, but didn't refresh the web app running on my tablet. This causes the memory to increase, as you can see. Around 08:00 I restarted the web app on the tablet. The memory stays the same, so no memory leak is visible. Around 09:00 I did another restart of Home Assistant without restarting the web app on the tablet, and there is the memory leak again.

This is also reported in the logs (I somehow didn't notice this before):

[139928754006096] Tablet from (Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.43): Client unable to keep up with pending messages. Reached 4096 pending messages. The system's load is too high or an integration is misbehaving; Last message was: {"id":22632,"type":"result","success":true,"result":null}
[139928813456528] Tablet from (Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.43): Client unable to keep up with pending messages. Reached 4096 pending messages. The system's load is too high or an integration is misbehaving; Last message was: {"id":20399,"type":"result","success":true,"result":null}
[139928532840656] Tablet from (Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.43): Client unable to keep up with pending messages. Reached 4096 pending messages. The system's load is too high or an integration is misbehaving; Last message was: {"id":20538,"type":"result","success":true,"result":null}
[139928688859152] Tablet from (Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.43): Client unable to keep up with pending messages. Reached 4096 pending messages. The system's load is too high or an integration is misbehaving; Last message was: {"id":70314,"type":"result","success":true,"result":null}
[139928776142096] Tablet from (Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.43): Client unable to keep up with pending messages. Reached 4096 pending messages. The system's load is too high or an integration is misbehaving; Last message was: {"id":30737,"type":"result","success":true,"result":null}

This is the log, which has now run for around 30 minutes: home-assistant.log

bdraco commented 1 year ago

@VinceRMP that definitely looks like a different problem as there are no obvious task leaks in the log. Please continue in a new issue.

VinceRMP commented 1 year ago

Thanks for looking into it! I will create an issue in the frontend section.

bdraco commented 1 year ago

I think it’s still should go to core as we need to figure out why it isn’t releasing memory when the client disconnects

litinoveweedle commented 1 year ago

@bdraco how about the original issue I logged? Did you get time to find out how I should identify the "messagemanager" you mentioned?

bdraco commented 1 year ago

BTW: is there any HOWTO one can follow to get similar output from the profiler? It would be helpful for users to try to identify the faulty code themselves.

https://www.home-assistant.io/integrations/profiler/

Additionally, the profiler will generate a callgrind.out file that can be viewed with:

* KCachegrind or QCachegrind
* Gprof2dot

bdraco commented 1 year ago

@bdraco how about the original issue I logged? Did you get time to find out how I should identify the "messagemanager" you mentioned?

I wasn't able to find it. I don't think it's going to help anyway.

bdraco commented 1 year ago

@litinoveweedle I think you might have some luck figuring out the source if you do dump_log_objects on Task

bdraco commented 1 year ago
service: profiler.dump_log_objects
data:
  type: Task

You'll get something like

2023-06-16 09:17:13.870 CRITICAL (SyncWorker_2) [homeassistant.components.profiler] Task object in memory: <Task pending name='Task-1215' coro=<RequestHandler.start() running at /Users/bdraco/home-assistant/venv/lib/python3.11/site-packages/aiohttp/web_protocol.py:486> wait_for=<Future pending cb=[Task.task_wakeup()]>>
2023-06-16 09:17:13.870 CRITICAL (SyncWorker_2) [homeassistant.components.profiler] Task object in memory: <Task pending name='Task-1216' coro=<RequestHandler.start() running at /Users/bdraco/home-assistant/venv/lib/python3.11/site-packages/aiohttp/web_protocol.py:486> wait_for=<Future pending cb=[Task.task_wakeup()]>>
2023-06-16 09:17:13.870 CRITICAL (SyncWorker_2) [homeassistant.components.profiler] Task object in memory: <Task pending name='Task-1222' coro=<RequestHandler.start() running at /Users/bdraco/home-assistant/venv/lib/python3.11/site-packages/aiohttp/web_protocol.py:486> wait_for=<Future pending cb=[Task.task_wakeup()]>>
2023-06-16 09:17:13.871 CRITICAL (SyncWorker_2) [homeassistant.components.profiler] Task object in memory: <Task pending name='Task-1223' coro=<RequestHandler.start() running at /Users/bdraco/home-assistant/venv/lib/python3.11/site-packages/aiohttp/web_protocol.py:486> wait_for=<Future pending cb=[Task.task_wakeup()]>>
2023-06-16 09:17:13.871 CRITICAL (SyncWorker_2) [homeassistant.components.profiler] Task object in memory: <Task pending name='Task-1229' coro=<RequestHandler.start() running at /Users/bdraco/home-assistant/venv/lib/python3.11/site-packages/aiohttp/web_protocol.py:512> wait_for=<Task pending name='Task-1230' coro=<RequestHandler._handle_request() running at /Users/bdraco/home-assistant/venv/lib/python3.11/site-packages/aiohttp/web_protocol.py:433> wait_for=<Future pending cb=[Task.task_wakeup()]> cb=[Task.task_wakeup()]>>
2023-06-16 09:17:13.871 CRITICAL (SyncWorker_2) [homeassistant.components.profiler] Task object in memory: <Task pending name='Task-1230' coro=<RequestHandler._handle_request() running at /Users/bdraco/home-assistant/venv/lib/python3.11/site-packages/aiohttp/web_protocol.py:433> wait_for=<Future pending cb=[Task.task_wakeup()]> cb=[Task.task_wakeup()]>
2023-06-16 09:17:13.872 CRITICAL (SyncWorker_2) [homeassistant.components.profiler] Task object in memory: <Task pending name='Task-1231' coro=<WebSocketHandler._writer() running at /Users/bdraco/home-assistant/homeassistant/components/websocket_api/http.py:109> wait_for=<Future pending cb=[Task.task_wakeup()]>>
2023-06-16 09:17:13.872 CRITICAL (SyncWorker_2) [homeassistant.components.profiler] Task object in memory: <Task pending name='Task-1208' coro=<RequestHandler.start() running at /Users/bdraco/home-assistant/venv/lib/python3.11/site-packages/aiohttp/web_protocol.py:486> wait_for=<Future pending cb=[Task.task_wakeup()]>>

Keep in mind you have 50000 in your dump above, so it may run out of memory trying to list them all. You might want to do the dump before too much memory has leaked, but not so soon that there aren't enough objects to identify.

litinoveweedle commented 1 year ago

@litinoveweedle I think you might have some luck figuring out the source if you do dump_log_objects on Task

Sorry, I don't understand (the "on Task" part). I will be back home on Sunday; I will try to see how this could be done.

litinoveweedle commented 1 year ago

OK, thank you for the explanation, I will try to do that after an HA restart.

NicoIIT commented 1 year ago

Not sure if it can help or even if it is relevant, but while investigating an unrelated memory leak myself, I found a potentially interesting library for your case: aiocoap, which defines a messagemanager and functions called fill_or_recognize_remote, as they appear in the callgrind graphs.

This Python library is used by another library, pytradfri, which is itself used by the IKEA Tradfri HA component: https://github.com/home-assistant/core/blob/dev/homeassistant/components/tradfri/manifest.json

@litinoveweedle if you use this component, it may be worth deactivating it to check whether it could be the culprit.

bdraco commented 1 year ago

Pretty sure I just stumbled on the source of this leak https://github.com/home-assistant/core/pull/96981

litinoveweedle commented 1 year ago

Thank you for your effort and for not forgetting this issue. Yesterday I upgraded to 2023.7.3, which contains your fix, and I can confirm that so far there is no memory leak anymore!

As additional info, three days ago I discovered that my official IKEA integration had a new 'unknown' and 'unconfigured' device. I was able to remove this device via the integration configuration GUI, and it slowed down the leak significantly (but not completely). As to the origin of this mysterious device - I remember I had some issues with IKEA battery-powered blinds (Fyrtur): after recharging their batteries, some were not detected and I had to pair them with the IKEA gateway again. This was at about the same time I discovered the leak. I was going to report this finding, but you were faster with the fix. ;-)

Thank you again.

bdraco commented 1 year ago

Thanks for confirming