After approx. 18 hours my memory consumption looks like:
CONTAINERS 20 sorted by memory_percent
Engine Pod Name Status Uptime CPU% MEM /MAX IOR/s IOW/s RX/s TX/s Command
docker - addon_a0d7b954_motioneye running yesterday 21.4 1.92G 7.61G 0b 0b NaNb NaNb /init
docker - homeassistant running 18 hours 27.1 1.74G 7.61G 0b 0b NaNb NaNb /init
docker - addon_core_whisper running yesterday 0 908M 7.61G 0b 0b 0b 0b /init
docker - addon_a0d7b954_nodered running yesterday 4.3 262M 7.61G 0b 0b NaNb NaNb /init
docker - addon_4a36bbd1_signal running yesterday 0.1 213M 7.61G 0b 0b 248b 288b /run.sh
docker - hassio_supervisor running yesterday 1.7 207M 7.61G 0b 0b 0b 0b /init
docker - addon_a0d7b954_adguard running yesterday 0 147M 7.61G 0b 0b NaNb NaNb /init
docker - addon_core_piper running yesterday 0 111M 7.61G 0b 0b 0b 0b /init
docker - addon_a0d7b954_glances running yesterday 0.9 65.3M 7.61G 0b 0b NaNb NaNb /run.sh
docker - addon_a0d7b954_logviewer running yesterday 0 32.1M 7.61G 0b 0b 0b 0b /init
docker - addon_a0d7b954_sqlite-web running yesterday 0 31.8M 7.61G 0b 0b 0b 0b /init
docker - hassio_dns running yesterday 0.2 23.7M 7.61G 0b 0b 0b 0b /init
docker - addon_core_configurator running yesterday 0 22.5M 7.61G 0b 0b 0b 0b /init
docker - hassio_audio running yesterday 0 21.9M 7.61G 0b 0b 0b 0b /init
docker - addon_core_ssh running yesterday 0 20.7M 7.61G 0b 0b 0b 0b /init
docker - hassio_observer running yesterday 0 17.6M 7.61G 0b 0b 0b 0b /init
docker - addon_45c828c3_snapcast-server running yesterday 1.4 17.1M 7.61G 0b 0b NaNb NaNb /init
docker - addon_a0d7b954_sonweb running yesterday 0 13.2M 7.61G 0b 0b 0b 0b /init
docker - hassio_cli running yesterday 0 2.47M 7.61G 0b 0b 0b 0b /init
docker - hassio_multicast running yesterday 0 1.05M 7.61G 0b 0b NaNb NaNb /init
After approx. 30 hours my memory consumption looks like:
CONTAINERS 20 sorted by memory_percent
Engine Pod Name Status Uptime CPU% MEM /MAX IOR/s IOW/s RX/s TX/s Command
docker - homeassistant running yester 1.7 2.55G 7.61G 44Kb 0b NaNb NaNb /init
docker - addon_a0d7b954_motioneye running 2 days 12.2 1.44G 7.61G 0b 0b NaNb NaNb /init
docker - addon_core_whisper running 2 days 0 818M 7.61G 0b 0b 1Kb 0b /init
docker - addon_a0d7b954_nodered running 2 days 2.4 270M 7.61G 0b 0b NaNb NaNb /init
docker - addon_4a36bbd1_signal running 2 days 0.2 215M 7.61G 0b 0b 1Kb 240b /run.sh
docker - hassio_supervisor running 2 days 0.5 210M 7.61G 0b 0b 0b 0b /init
docker - addon_a0d7b954_adguard running 2 days 0 142M 7.61G 0b 0b NaNb NaNb /init
docker - addon_core_piper running 2 days 0 98.7M 7.61G 0b 0b 1Kb 0b /init
docker - addon_a0d7b954_glances running 2 days 1.2 51.1M 7.61G 0b 0b NaNb NaNb /run.sh
docker - addon_a0d7b954_sqlite-web running 2 days 0 30.3M 7.61G 0b 0b 1Kb 0b /init
docker - addon_a0d7b954_logviewer running 2 days 0 27.7M 7.61G 0b 0b 1Kb 0b /init
docker - addon_core_configurator running 2 days 0 23.2M 7.61G 0b 0b 1Kb 0b /init
docker - hassio_dns running 2 days 0.1 22.5M 7.61G 0b 0b 2Kb 808b /init
docker - addon_core_ssh running 2 days 0 19.7M 7.61G 0b 0b 1Kb 0b /init
docker - hassio_audio running 2 days 0 19.2M 7.61G 0b 0b 1Kb 0b /init
docker - addon_45c828c3_snapcast-server running 2 days 1.4 14.0M 7.61G 0b 0b NaNb NaNb /init
docker - addon_a0d7b954_sonweb running 2 days 0 13.0M 7.61G 0b 0b 1Kb 0b /init
docker - hassio_observer running 2 days 0 12.4M 7.61G 0b 0b 1Kb 0b /init
docker - hassio_cli running 2 days 0 1.96M 7.61G 0b 0b 1Kb 0b /init
docker - hassio_multicast running 2 days 0 1.04M 7.61G 0b 0b NaNb NaNb /init
I'd start with start_log_objects, as START_LOG_OBJECT_SOURCES is a lot more useful when we already know what type of object we are looking for.
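For reference, a minimal sketch of how this can be wired up in YAML (the script name and the 30-minute window are placeholders, not something from this thread); it starts the periodic object-growth logging and stops it again once enough scans have been recorded:

# Hypothetical helper script; both services are provided by the profiler integration.
script:
  log_object_growth:
    sequence:
      - service: profiler.start_log_objects   # log which object types grow between scans
      - delay: "00:30:00"                     # give it time to separate normal churn from a real leak
      - service: profiler.stop_log_objects    # stop the periodic scans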
OK, thank you. I did as you suggested, the log is attached. Thank you for any help. home-assistant.zip
Something is creating a massive amount of tasks
2023-05-31 13:41:50.794 CRITICAL (SyncWorker_4) [homeassistant.components.profiler] Memory Growth: [('coroutine', 202609, 339), ('frame', 202592, 339), ('builtin_function_or_method', 110601, 169), ('ReferenceType', 80809, 104), ('method', 56223, 90), ('Task', 50646, 84), ('Future', 50673, 84), ('FutureIter', 50627, 84), ('Context', 103933, 15)]
That's good that it's asyncio, since it means a callgrind might show it. Can you generate one with the profiler.start service?
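A minimal sketch of that service call (the duration is just an example, matching the 180s run mentioned in the next comment):

service: profiler.start
data:
  seconds: 180   # length of the profiling run; the callgrind output file is written when it finishes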
Thank you for the help, please find it attached - it ran for 180s; if needed I can produce a longer one. callgrind.zip
It's not immediately obvious where the task leak is coming from, which means it's likely a "slow" leak. Leaks are always easier to find when they run the system out of memory quickly, as you can see the abnormal growth right away.
The next step would be to disable the custom components below and see if the problem goes away.
If that turns out to be the cause, we would then need to dig through the code of the one causing it to look for the leak.
It's not immediately obvious where the task leak is coming from, which means it's likely a "slow" leak.
It is not that slow a leak... I am running HA on a server with 16GB RAM, and HA was able to grow from ~300MB to 4GB in about two days. Should I keep running the profiler for longer?
Does your reply mean that there was nothing obvious in the callgrind? I am doing what you wrote anyway (testing integrations one by one), but I need at least half of the integrations for the basic functions of my household, so I can't simply switch them all off. Therefore it would be really nice if there is something in the logs to at least narrow my search. Thank you.
It is likely in the profile, but I can't find it because it's so small.
It's slow in the sense that in the time of the profile it only grows a little. When I say fast leak I mean something that will crash in a few minutes. Those are easy to trace to their source because they show up in the profile just by sorting.
You could try running the profile for 30m to see if it’s large enough to correlate the source.
Keep in mind the profile has all the calls, so it's looking for a needle in a haystack. Anything we can do to make the leak faster or bigger will make it easier to find.
I did run the profiler for 1800s, but strangely the size of the files is similar. Anyway, I hope it will be helpful. Thank you.
Here is what is calling create_task
I don't know what messagemanager is, but it's creating tasks (not saying it's leaking). I can't find it in the HA code, so it must be a custom component or lib.
Hello, thank you very much for the support. It is strange, as this also returns nothing:
root@smart:/usr/share/hassio/homeassistant/custom_components# grep -Ri messagemanager *
root@smart:/usr/share/hassio/homeassistant/custom_components# grep -Ri message_manager *
So the messagemanager method seems not to be in the custom integrations either. I even grepped the Home Assistant and Hassio root directories, but nothing was found. Are you sure about the name (i.e. messagemanager)? I did not find it in the diagrams you attached either.
BTW: is there any HOWTO one can follow to get similar output from the profiler? It would be helpful for users trying to identify faulty code themselves.
@bdraco Can you please check the "messagemanager" method/class name you mentioned? Maybe I am doing it wrong, but I was not able to find it in the custom integrations either.
Just to chime in here, I'm not using any of these integrations:
* cz_energy_spot_prices
* smartir
* floureon
* solax_modbus
* programmable_thermostat
* signalmessenger
* variable_history
I'm also experiencing a memory leak that results in the home-assistant container getting killed (by the OOM killer) every 5-6 days. Quite annoying :)
One of the pictures shows when the container got killed, and the other shows that free memory keeps getting lower.
@bdraco I am really sorry, but I have politely asked you several times during the past weeks a simple question regarding your original findings - i.e. where exactly did you find "messagemanager" in the trace, and why can't I grep anything with such a name in the HA source code, including custom integrations? I need this information to continue with my troubleshooting.
Do I assume correctly from your silence that I should not expect an answer at all?
Day job has me traveling quite a bit. I’m about 150 GitHub notifications behind
So, I just put in an issue with HAOS because I'm also experiencing OOM reboots. My memory allocation graph is even steeper and my machine reboots more often. #2595.
Is this more likely to be an HA than an HAOS issue? I'm going to try to downgrade to HAOS 9.5 and see if I still have leaks.
I see the same on my instance. It seemed to start after updating HAOS or the 2023.6 beta (coming from a dev build in the early stage of 2023.5, so it could also be related to a specific 2023.5 version). What I have noticed is that my memory increase is caused by my tablet, which shows the dashboard 24/7. When I disable the tablet the memory stops increasing, but it doesn't decrease. When I open the display on the tablet, the memory starts to increase again. Is this something you use as well?
I did observe the memory leak on 2023.5. There was a question about why my graph of memory usage is not as steep as others'. Well, I run HA on a server and currently have 64GB of RAM, so it takes longer to consume it. (I extended my memory as an intermediate workaround, as my original 16GB of RAM was exhausted in just 2-3 days.)
If you have a memory leak, please post a start log objects (see above) and a callgrind. "Me too" comments are not helpful, as they don't add any information that gets us closer to tracking down the leak.
callgrind_start_log_objects.zip
See the attached file. If you need more or longer logs let me know.
@VinceRMP Can you run the start log objects for 30 minutes? The way it works is that it keeps looking at what's in memory and sees what grows over time. Without a longer run it won't show what is leaking, as it doesn't have enough time to figure out what is normal memory churn and what is not going away.
At the moment I’m not able to do that. I will do it tomorrow and post the full log.
So, I've noticed a correlation with a CPU spike and the incrementally increasing memory use:
Is there a way to monitor what causes the CPU spike to maybe get insight into what's using and not releasing memory?
After some testing I can reproduce the start of the memory leak, and I can stop it as well. I'm not sure if this is the same issue as described here and I don't want to hijack this issue, so if I need to create a separate one please let me know.
At the first drop in memory I did a restart of Home Assistant, but didn't refresh the webapp running on my tablet. This causes the memory to increase as you can see. Around 08:00 I restarted the webapp on the tablet. The memory stays the same, so no memory leak visible. Around 09:00 I did another restart of Home Assistant without restarting the webapp on the tablet and there is the memory leak again.
This is also reported in the logs (I somehow didn't notice this before):
[139928754006096] Tablet from (Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.43): Client unable to keep up with pending messages. Reached 4096 pending messages. The system's load is too high or an integration is misbehaving; Last message was: {"id":22632,"type":"result","success":true,"result":null}
[139928813456528] Tablet from (Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.43): Client unable to keep up with pending messages. Reached 4096 pending messages. The system's load is too high or an integration is misbehaving; Last message was: {"id":20399,"type":"result","success":true,"result":null}
[139928532840656] Tablet from (Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.43): Client unable to keep up with pending messages. Reached 4096 pending messages. The system's load is too high or an integration is misbehaving; Last message was: {"id":20538,"type":"result","success":true,"result":null}
[139928688859152] Tablet from (Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.43): Client unable to keep up with pending messages. Reached 4096 pending messages. The system's load is too high or an integration is misbehaving; Last message was: {"id":70314,"type":"result","success":true,"result":null}
[139928776142096] Tablet from (Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36 Edg/114.0.1823.43): Client unable to keep up with pending messages. Reached 4096 pending messages. The system's load is too high or an integration is misbehaving; Last message was: {"id":30737,"type":"result","success":true,"result":null}
This is the log which now has run for around 30 minutes: home-assistant.log
@VinceRMP that definitely looks like a different problem as there are no obvious task leaks in the log. Please continue in a new issue.
Thanks for looking into it! I will create an issue in the frontend section.
I think it should still go to core, as we need to figure out why it isn't releasing memory when the client disconnects.
@bdraco how about the original issue I logged? Did you get time to find out how I should identify the "messagemanager" you mentioned?
BTW: is there any HOWTO one can follow to get similar output from the profiler? It would be helpful for users trying to identify faulty code themselves.
https://www.home-assistant.io/integrations/profiler/
Additionally, the profiler will generate a callgrind.out file that can be viewed with:
@bdraco how about the original issue I logged? Did you get time to find out how I should identify the "messagemanager" you mentioned?
I wasn't able to find it. I don't think it's going to help anyway.
@litinoveweedle I think you might have some luck figuring out the source if you do dump_log_objects on Task.
service: profiler.dump_log_objects
data:
  type: Task
You'll get something like
2023-06-16 09:17:13.870 CRITICAL (SyncWorker_2) [homeassistant.components.profiler] Task object in memory: <Task pending name='Task-1215' coro=<RequestHandler.start() running at /Users/bdraco/home-assistant/venv/lib/python3.11/site-packages/aiohttp/web_protocol.py:486> wait_for=<Future pending cb=[Task.task_wakeup()]>>
2023-06-16 09:17:13.870 CRITICAL (SyncWorker_2) [homeassistant.components.profiler] Task object in memory: <Task pending name='Task-1216' coro=<RequestHandler.start() running at /Users/bdraco/home-assistant/venv/lib/python3.11/site-packages/aiohttp/web_protocol.py:486> wait_for=<Future pending cb=[Task.task_wakeup()]>>
2023-06-16 09:17:13.870 CRITICAL (SyncWorker_2) [homeassistant.components.profiler] Task object in memory: <Task pending name='Task-1222' coro=<RequestHandler.start() running at /Users/bdraco/home-assistant/venv/lib/python3.11/site-packages/aiohttp/web_protocol.py:486> wait_for=<Future pending cb=[Task.task_wakeup()]>>
2023-06-16 09:17:13.871 CRITICAL (SyncWorker_2) [homeassistant.components.profiler] Task object in memory: <Task pending name='Task-1223' coro=<RequestHandler.start() running at /Users/bdraco/home-assistant/venv/lib/python3.11/site-packages/aiohttp/web_protocol.py:486> wait_for=<Future pending cb=[Task.task_wakeup()]>>
2023-06-16 09:17:13.871 CRITICAL (SyncWorker_2) [homeassistant.components.profiler] Task object in memory: <Task pending name='Task-1229' coro=<RequestHandler.start() running at /Users/bdraco/home-assistant/venv/lib/python3.11/site-packages/aiohttp/web_protocol.py:512> wait_for=<Task pending name='Task-1230' coro=<RequestHandler._handle_request() running at /Users/bdraco/home-assistant/venv/lib/python3.11/site-packages/aiohttp/web_protocol.py:433> wait_for=<Future pending cb=[Task.task_wakeup()]> cb=[Task.task_wakeup()]>>
2023-06-16 09:17:13.871 CRITICAL (SyncWorker_2) [homeassistant.components.profiler] Task object in memory: <Task pending name='Task-1230' coro=<RequestHandler._handle_request() running at /Users/bdraco/home-assistant/venv/lib/python3.11/site-packages/aiohttp/web_protocol.py:433> wait_for=<Future pending cb=[Task.task_wakeup()]> cb=[Task.task_wakeup()]>
2023-06-16 09:17:13.872 CRITICAL (SyncWorker_2) [homeassistant.components.profiler] Task object in memory: <Task pending name='Task-1231' coro=<WebSocketHandler._writer() running at /Users/bdraco/home-assistant/homeassistant/components/websocket_api/http.py:109> wait_for=<Future pending cb=[Task.task_wakeup()]>>
2023-06-16 09:17:13.872 CRITICAL (SyncWorker_2) [homeassistant.components.profiler] Task object in memory: <Task pending name='Task-1208' coro=<RequestHandler.start() running at /Users/bdraco/home-assistant/venv/lib/python3.11/site-packages/aiohttp/web_protocol.py:486> wait_for=<Future pending cb=[Task.task_wakeup()]>>
Keep in mind you have 50000 in your dump above, so it may run out of memory trying to list them all. You might want to do the dump before too much memory has leaked, but not so soon that there aren't enough objects to identify.
@litinoveweedle I think you might have some luck figuring out the source if you do dump_log_objects on Task.
Sorry, I don't understand (the "on Task" part). I will be back home on Sunday; I will try to see how this can be done.
OK, thank you for the explanation. I will try to do that after an HA restart.
Not sure if it can help or even if it is relevant, but while investigating an unrelated memory leak myself, I found a potentially interesting library for your case: aiocoap, which defines a messagemanager and a function called fill_or_recognize_remote, as appears in the callgrind graphs.
This Python library is used by another library, pytradfri, itself used by the IKEA Tradfri HA component: https://github.com/home-assistant/core/blob/dev/homeassistant/components/tradfri/manifest.json
@litinoveweedle if you use this component, it may be worth deactivating it to check whether it is the culprit.
Pretty sure I just stumbled on the source of this leak https://github.com/home-assistant/core/pull/96981
Thank you for your effort and for not forgetting this issue. I upgraded yesterday to 2023.7.3, which contains your fix, and I can confirm that so far there is no memory leak anymore!
As additional info, three days ago I discovered that my official IKEA integration had a new 'unknown' and 'unconfigured' device. I was able to remove this device via the integration configuration GUI, and it slowed the leak down significantly (but not completely). As to the origin of this mysterious device - I remember I had some issues with IKEA battery-powered blinds (Fyrtur): after recharging their batteries, some were not detected and I had to pair them to the IKEA gateway again. This was around the same time I discovered the leak. I was going to report this finding, but you were faster with the fix. ;-)
Thank you again.
Thanks for confirming
https://github.com/home-assistant/core/issues/95386
The problem
The Home Assistant container is leaking memory at about 1GB/day. I tried disabling integrations that are not strictly necessary, but could not find the culprit. I am not exactly sure when the problem first appeared, or whether it is caused by a new HA version or by a new integration. I did disable the ONVIF integration, which was mentioned in a separate OOM issue, but the HA container kept growing anyway.
A profiler PROFILER.START_LOG_OBJECT_SOURCES log is attached. I saw many messages like "Failed to serialize <class 'list'>", but I am not sure if that is a real issue or not.
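For reference, a minimal sketch of the call used to produce that log (invoked with defaults here; any scan options would go under data:):

service: profiler.start_log_object_sources   # logs where growing objects are being created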
Could be related to #93556 or #92026
What version of Home Assistant Core has the issue?
core-2023.5.4
What was the last working version of Home Assistant Core?
core-2023-4.X?
What type of installation are you running?
Home Assistant Supervised
Integration causing the issue
unknown
Link to integration documentation on our website
No response
Diagnostics information
home-assistant.log.zip
Example YAML snippet
No response
Anything in the logs that might be useful for us?
No response
Additional information
No response