davinkevin opened this issue 3 months ago
Hey there @home-assistant/core, mind taking a look at this issue as it has been labeled with an integration (recorder) you are listed as a code owner for? Thanks!
(message by CodeOwnersMention)

recorder documentation, recorder source
(message by IssueLinks)
The first thing to look at is your Postgres config that controls memory usage. Those settings may not be sized correctly for the amount of RAM being allocated, or the config may not be getting picked up at all. In an ideal world, OOM kills should not happen if the Postgres memory settings are set correctly. Look at the Postgres forums on this topic.
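For example, with the cnpg operator the memory-related settings can be passed through the Cluster resource. A minimal sketch, assuming a single-instance cluster; the names and values below are placeholders that must be sized against the pod's actual memory limit, not recommendations:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: home-assistant-db        # hypothetical cluster name
spec:
  instances: 1
  storage:
    size: 200Gi                  # placeholder
  resources:
    limits:
      memory: 10Gi               # the purge has to fit inside this limit
  postgresql:
    parameters:
      # Placeholder values: tune them against the 10Gi limit above.
      shared_buffers: "2GB"
      work_mem: "32MB"           # allocated per sort/hash node, so it multiplies under load
      maintenance_work_mem: "512MB"
      max_connections: "30"
```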
Second, try the purge operation with a smaller set of entities, or a shorter timespan that results in fewer entities being purged. This allows you to chunk the work into smaller transactions.
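For example, stepping keep_days down gradually, or targeting one subset of entities at a time (a sketch of the service calls only; the entity and domain names are placeholders):

```yaml
# Purge in smaller slices by lowering keep_days gradually
service: recorder.purge
data:
  keep_days: 300      # then 250, 200, … down to the desired retention
  repack: false       # repack once, only after the final step
---
# Or: purge only a noisy subset of entities in one go
service: recorder.purge_entities
data:
  entity_id:
    - sensor.power_meter         # hypothetical entity
  domains:
    - media_player
```

Each call is queued on the recorder, so give one step time to finish before issuing the next.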
Hey 👋
Thank you for your answer. I've opened a question with the cnpg maintainers; they might help with that.
However, I feel the Home Assistant code may not be optimal here: a single round of deletion takes ~7 min, which is a smell to me (even setting aside the long-transaction issue). Maybe the recent schema modifications introduced a regression in this area?
For the moment, I've hacked the code a bit and replaced those values with 1, so I'm sure the CPU/RAM is not consumed for nothing, because right now… the db always crashes before the system finishes. Do you see any problem this modification could trigger? It achieves the "chunkification", because I see no other solution (the "action" only accepts days as a value, and a day is too big and triggers OOM).
Bumping this in case some help can be provided.
I've adjusted some of the queries in queries.py to be faster (to improve cache and index hits). It's not perfect: I went from ~9 min with all purge options to ~6 min now.
I still have to spend some time on a few other SELECTs that are very slow (~20 seconds).
I'll provide a PR for those changes if you are interested.
The best solution for me would be to run multiple purges in parallel on different date ranges. However, the recorder in HA seems to be single-threaded. Do you know of a potential solution for this?
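For what it's worth, the closest YAML-only approximation seems to be chunking the purge sequentially rather than in parallel. A hedged sketch (the script name, domain list, and delay are placeholders, and it assumes recorder.purge_entities accepts keep_days in this release):

```yaml
script:
  chunked_recorder_purge:            # hypothetical script name
    alias: "Purge recorder one domain at a time"
    sequence:
      - repeat:
          for_each:
            - sensor
            - binary_sensor
            - media_player
          sequence:
            - service: recorder.purge_entities
              data:
                keep_days: 365       # assumed to be supported by purge_entities here
                domains:
                  - "{{ repeat.item }}"
            - delay: "01:00:00"      # crude pacing; the purge itself runs on the recorder queue
```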
The problem
I have a relatively big database (recorder retention set to 365 d), and I'm trying to purge it using the purge service.
The main issue is that this doesn't work as expected: the purge puts a heavy load on the database, and even with 10 GB of memory the system reaches the limit and gets killed by the orchestrator.
Under those conditions I can't purge my db, it keeps growing… and it becomes hard to manage (180 GB).
Is there a process I can follow, even offline, to purge the DB?
NOTE: From my analysis, it seems a process stays "up" for a long time, and instead of doing a lot of "small queries/transactions" we have one very long-running transaction that the DB can't get through. I might be wrong, but I can't otherwise explain the behavior and issue I see on the PG side.
What version of Home Assistant Core has the issue?
core-2024.8.0
What was the last working version of Home Assistant Core?
No response
What type of installation are you running?
Home Assistant Container
Integration causing the issue
recorder
Link to integration documentation on our website
https://www.home-assistant.io/integrations/recorder/
Diagnostics information
See CPU & memory consumption. Spikes happen after I launch the purge process. It consumes memory until it reaches the limit defined in k8s… and then the db pod is restarted. Because the connection is lost, the purge process is killed/restarted.

Some queries I ran during my investigation:
NOTE: It may be a red herring, ~~but those transactions seem suspicious and could be the reason why the system memory keeps growing and growing during the process~~.
After deeper investigation, it's the query with ID 130 that consumes all the RAM. I followed this blog post and ran this command:

`cat /proc/130/smaps | grep ^Pss: | awk '{A+=$2} END{print A}'`

and it confirmed that this process keeps eating all the memory.

My database configuration is done using the cnpg operator in k8s; you can consult the current configuration in my GitOps repository. The current setup can be summarized like this:

NOTE: I tried to do the same with the recorder off, without any change 🤷

If I turn on logging for the recorder, we can see this during the first 10 minutes:
One loop takes ~7 min and has consumed ~1 GB of RAM on the PG side:
Example YAML snippet
YAML configuration that led to this situation:
Current configuration, which doesn't seem to help:
Anything in the logs that might be useful for us?
Additional information
No response