home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0

`purge` not working and causing OOM error with PG15 #124994

Open davinkevin opened 3 months ago

davinkevin commented 3 months ago

The problem

I have a relatively big database (recorder retention set to 365 days), and I'm trying to purge it using the service.

The main issue is that this doesn't work as expected: the purge puts significant load on the database, and even with 10 GB of memory, the system reaches the limit and gets killed by the orchestrator.

Under these conditions, I can't purge my DB; it keeps growing… and it's becoming hard to manage (180 GB).

Is there a process I can follow, even offline, to purge the DB?

NOTE: From my analysis, it seems a process stays "up" for a long time, and instead of doing a lot of small queries/transactions we end up with a single very long-running transaction that the DB can't complete. I might be wrong, but I can't otherwise explain the behavior and issues I'm seeing on the PG side.
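
What I have in mind is something along these lines: a manual, batched cleanup where each batch is its own short transaction. This is only a sketch; the table and column names (`states`, `state_id`, `last_updated_ts`, `old_state_id`) are assumptions based on my reading of the recorder schema, and the cutoff is illustrative.

```sql
-- Illustrative only: delete old rows from `states` in small batches so that
-- each transaction stays short. Repeat until 0 rows are affected.
-- Rows still referenced through the self-FK may need old_state_id cleared first:
--   UPDATE states SET old_state_id = NULL WHERE old_state_id IN (...);
DELETE FROM states
WHERE state_id IN (
    SELECT state_id
    FROM states
    WHERE last_updated_ts < EXTRACT(EPOCH FROM now() - INTERVAL '365 days')
    ORDER BY state_id
    LIMIT 10000
);
-- Once done, VACUUM (or pg_repack) to actually reclaim disk space.
```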

What version of Home Assistant Core has the issue?

core-2024.8.0

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant Container

Integration causing the issue

recorder

Link to integration documentation on our website

https://www.home-assistant.io/integrations/recorder/

Diagnostics information

See the CPU & memory consumption graphs below. Spikes happen after I launch the purge process. It consumes memory until it reaches the limit defined in k8s… and then the DB pod is restarted. Because the connection is lost, the purge process is killed/restarted.

(screenshots: CPU and memory consumption graphs)

Some queries I ran during my investigation:

select
    table_name,
    pg_size_pretty(pg_total_relation_size(quote_ident(table_name))),
    pg_total_relation_size(quote_ident(table_name))
from information_schema.tables
where table_schema = 'public'
order by 3 desc;
(screenshot: table sizes, largest first)
SELECT
    pid,
    now() - pg_stat_activity.query_start AS duration,
    query,
    state
FROM pg_stat_activity
WHERE 1=1
    and datname is not null
;
(screenshot: running queries and their durations)

NOTE: It may be a red herring, ~~but those transactions seem suspicious and could be the reason the system memory keeps growing during the process~~.

After deeper investigation, it's the backend with PID 130 that consumes all the RAM. I followed this blog post and ran `cat /proc/130/smaps | grep ^Pss: | awk '{A+=$2} END{print A}'`, which confirmed that this process keeps eating all the memory.
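
The same check as a small reusable snippet (replace the PID with the backend you suspect; 130 was the offending one in my case):

```bash
# Sum the proportional set size (Pss) of one PostgreSQL backend, in kB.
PID=130
grep ^Pss: /proc/$PID/smaps | awk '{A+=$2} END{print A}'
```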

My database is configured with the cnpg operator in k8s; you can consult the current configuration in my GitOps repository. The current setup can be summarized like this:

NOTE: I tried the same with the recorder turned off, without any change 🤷

If I turn on logging for the recorder, we can see this during the first 10 minutes:

2024-09-01 11:22:47.361 DEBUG (Recorder) [homeassistant.components.recorder.purge] Purging states and events before target 2024-06-03 09:22:47+00:00
2024-09-01 11:22:47.361 DEBUG (Recorder) [homeassistant.components.recorder.purge] Purge running in new format as there are NO states with event_id remaining
2024-09-01 11:22:59.062 DEBUG (Recorder) [homeassistant.components.recorder.purge] Selected 4000 state ids and 3955 attributes_ids to remove
2024-09-01 11:23:47.656 DEBUG (Recorder) [homeassistant.components.recorder.purge] Updated <sqlalchemy.engine.cursor.CursorResult object at 0x7ff5a0495be0> states to remove old_state_id
2024-09-01 11:27:00.454 DEBUG (MainThread) [homeassistant.components.recorder.core] Recorder queue size is: 8434
2024-09-01 11:29:28.364 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted <sqlalchemy.engine.cursor.CursorResult object at 0x7ff5a92b7690> states
2024-09-01 11:29:28.375 DEBUG (Recorder) [homeassistant.components.recorder.purge] Selected 4000 state ids and 3979 attributes_ids to remove
2024-09-01 11:30:16.054 DEBUG (Recorder) [homeassistant.components.recorder.purge] Updated <sqlalchemy.engine.cursor.CursorResult object at 0x7ff58ee6c360> states to remove old_state_id
2024-09-01 11:32:00.455 DEBUG (MainThread) [homeassistant.components.recorder.core] Recorder queue size is: 18488
2024-09-01 11:35:59.679 DEBUG (Recorder) [homeassistant.components.recorder.purge] Deleted <sqlalchemy.engine.cursor.CursorResult object at 0x7ff58ef42eb0> states
2024-09-01 11:35:59.694 DEBUG (Recorder) [homeassistant.components.recorder.purge] Selected 4000 state ids and 3975 attributes_ids to remove
2024-09-01 11:36:51.005 DEBUG (Recorder) [homeassistant.components.recorder.purge] Updated <sqlalchemy.engine.cursor.CursorResult object at 0x7ff5a0610de0> states to remove old_state_id
2024-09-01 11:37:00.456 DEBUG (MainThread) [homeassistant.components.recorder.core] Recorder queue size is: 28533

One loop takes ~7 min and has consumed ~1 GB of RAM on the PG side:

(screenshot: PG memory usage after one purge loop)

Example YAML snippet

YAML configuration that led to this situation:

recorder:
  db_url: postgresql://…:…@database-rw:5432/hass
  purge_keep_days: 365

Current configuration, which doesn't seem to help:

recorder:
  db_url: postgresql://…:…@database-rw:5432/hass
  purge_keep_days: 30
  include:
    domains:
      - climate
      - sensor

Anything in the logs that might be useful for us?

Nothing useful IMO, please ask if you want me to extract something.

Additional information

No response

home-assistant[bot] commented 3 months ago

Hey there @home-assistant/core, mind taking a look at this issue as it has been labeled with an integration (recorder) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of `recorder` can trigger bot actions by commenting:

- `@home-assistant close` Closes the issue.
- `@home-assistant rename Awesome new title` Renames the issue.
- `@home-assistant reopen` Reopen the issue.
- `@home-assistant unassign recorder` Removes the current integration label and assignees on the issue, add the integration domain after the command.
- `@home-assistant add-label needs-more-information` Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue.
- `@home-assistant remove-label needs-more-information` Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


recorder documentation · recorder source (message by IssueLinks)

PeteRager commented 3 months ago

The first thing to look at is your Postgres config that controls memory usage. It may not be configured correctly for the amount of RAM being allocated, or the config may not be getting picked up. In an ideal world, OOM kills should not happen if the Postgres memory settings are set correctly. Look at the Postgres forums on this topic.
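
For example, something along these lines if you are on a cnpg Cluster (the values are placeholders to illustrate which knobs matter, not a recommendation; tune them to the memory actually granted to the pod):

```yaml
# Illustrative cnpg Cluster excerpt: keep shared_buffers plus the sum of
# per-backend allocations (work_mem per sort/hash node) below the pod limit.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: database
spec:
  instances: 1
  resources:
    limits:
      memory: 10Gi
  postgresql:
    parameters:
      shared_buffers: "2GB"          # shared cache, counted once
      work_mem: "16MB"               # per sort/hash operation, per backend
      maintenance_work_mem: "256MB"  # vacuum / index maintenance
```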

Second, try the purge operation with a smaller set of entities, or a shorter timespan that results in fewer entities being purged. This lets you chunk the work into smaller transactions; see the sketch below.
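
A sketch of what that could look like as service calls (field names are assumed from the current recorder services, and `sensor.example_power` is a hypothetical entity):

```yaml
# Illustrative script: purge a heavy entity first, then step keep_days down
# over several runs (e.g. 365 → 200 → 100 → 30) so each run deletes less data.
script:
  staged_purge:
    sequence:
      - service: recorder.purge_entities
        target:
          entity_id: sensor.example_power   # hypothetical entity
        data:
          keep_days: 30
      - service: recorder.purge
        data:
          keep_days: 200
          repack: false   # leave the repack for the final pass
```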

davinkevin commented 3 months ago

Hey 👋

Thank you for your answer. I've opened a question with the cnpg maintainers; they might help on that.

However, I feel the Home Assistant code may not be optimized: a single round of deletion takes ~7 min, which is a smell to me (even setting aside the long-transaction issue). Maybe the recent modifications to the schema introduced a regression in this area?

For the moment, I've hacked the code a bit and replaced those values with 1, so I'm sure the CPU/RAM isn't consumed for nothing, because right now… the DB always crashes before the purge finishes. Do you see any problem this modification could trigger? It achieves the "chunkification" (see the sketch below), because I see no other solution (the action only accepts days as a value, and one day is too big and triggers OOM).
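
To make the idea concrete, this is roughly the loop I'm aiming for: many short transactions instead of one long one. It is a pure illustration, not the actual recorder code; the connection string, column names and batch size are assumptions.

```python
# Sketch of the "chunkification": one short transaction per small batch.
from datetime import datetime, timedelta, timezone

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@database-rw:5432/hass")

BATCH = 1_000  # small batches keep each transaction short
cutoff = (datetime.now(timezone.utc) - timedelta(days=30)).timestamp()

while True:
    with engine.begin() as conn:  # commits (or rolls back) per batch
        deleted = conn.execute(
            text(
                "DELETE FROM states WHERE state_id IN ("
                "  SELECT state_id FROM states"
                "  WHERE last_updated_ts < :cutoff"
                "  ORDER BY state_id LIMIT :batch)"
            ),
            {"cutoff": cutoff, "batch": BATCH},
        ).rowcount
    if deleted == 0:
        break
```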

davinkevin commented 2 months ago

Bumping this, in case some help can be provided.

I've adjusted some of the queries in queries.py to be faster (to improve cache and index hits). It's not perfect: I went from ~9 min with all purge options enabled to ~6 min now. I still have to spend some time on a few other selects that are very slow (~20 seconds).
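
For reference, this is how I checked index and buffer usage; the query shape below is a simplified stand-in for the purge selects, not the exact query from queries.py:

```sql
-- Show whether the lookup uses an index and how many buffers it touches.
EXPLAIN (ANALYZE, BUFFERS)
SELECT state_id
FROM states
WHERE last_updated_ts < EXTRACT(EPOCH FROM now() - INTERVAL '30 days')
ORDER BY state_id
LIMIT 4000;
```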

I'll provide a PR for those changes if you are interested.

The best solution for me would be to run multiple purges in parallel on different date ranges. However, the recorder in HA seems to be single-threaded. Do you know of a potential solution for this?