
Purge causes recorder to stop writing to the DB until HA is restarted (Auto purge happens at 4:12am) #117263

Closed HertogArjan closed 4 months ago

HertogArjan commented 6 months ago

This problem is solved in 2024.7.2. If the system ran out of disk space and the table rebuild failed, the rebuild will be retried in 2024.8.1+; see issue https://github.com/home-assistant/core/issues/123348 and the fix in https://github.com/home-assistant/core/pull/123388

Workaround: Disabling the nightly auto purge will prevent the issue from occurring (this is not a long-term solution).

# Example configuration.yaml entry
recorder:
  auto_purge: false

Be sure to re-enable auto-purge after installing 2024.7.2 or your database will grow without bounds, and your system will eventually run out of disk space or become sluggish.

Cause: https://github.com/home-assistant/core/issues/117263#issuecomment-2197311144
Solution: https://github.com/home-assistant/core/pull/120779

The problem

Every night at around 4:10 the histories for all entities stop. This has been happening since at least April 9th. I updated Home Assistant to 2024.4.1 on April 5th, but I can't say for sure whether this issue started directly afterwards. A restart of Home Assistant allows recording again but does not restore the history missed since ~4:10. I suspect it has something to do with the recorder auto purge at 4:12, because the same symptoms occur when the purge is run manually.

I don't think the manual or automatic purge is currently able to finish because the (SQLite) database seems way too large (>6 GB) for my configured purge_keep_days of 7.

If I run recorder.purge from the web UI, the same symptoms occur as during the night. Judging by the mtime, it is clear home-assistant_v2.db no longer gets written to. htop shows HA using 100% of one CPU core continuously, and iotop shows HA reading from disk at ~400 MB/s continuously. This went on for at least 25 minutes before I stopped the process.

The logs show nothing unusual happening around 4:12. When I run recorder.purge from the web UI with verbose logging enabled, the logs just show:

2024-05-11 15:16:28.560 INFO (MainThread) [homeassistant.helpers.script.websocket_api_script] websocket_api script: Running websocket_api script
2024-05-11 15:16:28.560 INFO (MainThread) [homeassistant.helpers.script.websocket_api_script] websocket_api script: Executing step call service

When HA is stopped using SIGTERM, the shutdown takes a long time, and it is clear from the logs that it is waiting for a Recorder task:

2024-05-11 15:20:00.573 WARNING (MainThread) [homeassistant.core] Shutdown stage 'final write': still running: <Task pending name='Task-2684' coro=<Recorder._async_shutdown() running at /srv/homeassistant/lib/python3.12/site-packages/homeassistant/components/recorder/core.py:475> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/lib/python3.12/asyncio/futures.py:387, <1 more>, Task.task_wakeup()] created at /usr/lib/python3.12/asyncio/base_events.py:449> cb=[set.remove()] created at /srv/homeassistant/lib/python3.12/site-packages/homeassistant/util/async_.py:40>

See the rest of the relevant messages during shutdown below.

What version of Home Assistant Core has the issue?

core-2024.5.2

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant Core

Integration causing the issue

Recorder

Link to integration documentation on our website

https://www.home-assistant.io/integrations/recorder/#service-purge

Diagnostics information

No response

Example YAML snippet

recorder:
  # keep 7 days of history for all states by default
  purge_keep_days: 7
  exclude:
    domains:
      - weather
    entities:
      - sun.sun
    entity_globs:
      - 'automation.abrp_live_data_*'
      - 'timer.abrp_live_data_*'
      - 'automation.pvoutput_*'
      - 'timer.pvoutput_*'
      - 'sensor.sampled_stroomsterkte_fase_l?'
      - 'sensor.stroomsterkte_fase_l?_*_sec_gem'

Anything in the logs that might be useful for us?

2024-05-11 15:20:00.573 WARNING (MainThread) [homeassistant.core] Timed out waiting for final writes to complete, the shutdown will continue
2024-05-11 15:20:00.573 WARNING (MainThread) [homeassistant.core] Shutdown stage 'final write': still running: <Task pending name='Task-2684' coro=<Recorder._async_shutdown() running at /srv/homeassistant/lib/python3.12/site-packages/homeassistant/components/recorder/core.py:475> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/lib/python3.12/asyncio/futures.py:387, <1 more>, Task.task_wakeup()] created at /usr/lib/python3.12/asyncio/base_events.py:449> cb=[set.remove()] created at /srv/homeassistant/lib/python3.12/site-packages/homeassistant/util/async_.py:40>
2024-05-11 15:20:30.580 WARNING (MainThread) [homeassistant.core] Timed out waiting for close event to be processed, the shutdown will continue
2024-05-11 15:20:30.580 WARNING (MainThread) [homeassistant.core] Shutdown stage 'close': still running: <Task pending name='Task-2684' coro=<Recorder._async_shutdown() running at /srv/homeassistant/lib/python3.12/site-packages/homeassistant/components/recorder/core.py:475> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/lib/python3.12/asyncio/futures.py:387, <1 more>, Task.task_wakeup()] created at /usr/lib/python3.12/asyncio/base_events.py:449> cb=[set.remove()] created at /srv/homeassistant/lib/python3.12/site-packages/homeassistant/util/async_.py:40>
2024-05-11 15:20:30.580 WARNING (MainThread) [homeassistant.core] Shutdown stage 'close': still running: <Task pending name='Task-2714' coro=<Recorder._async_close() running at /srv/homeassistant/lib/python3.12/site-packages/homeassistant/components/recorder/core.py:467> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/lib/python3.12/asyncio/futures.py:387, <1 more>, Task.task_wakeup()] created at /usr/lib/python3.12/asyncio/base_events.py:449> cb=[set.remove()] created at /srv/homeassistant/lib/python3.12/site-packages/homeassistant/util/async_.py:40>
2024-05-11 15:20:30.752 WARNING (Thread-4 (_do_shutdown)) [homeassistant.util.executor] Thread[SyncWorker_2] is still running at shutdown: File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3.12/threading.py", line 1147, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.12/threading.py", line 1167, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
2024-05-11 15:20:30.919 WARNING (Thread-4 (_do_shutdown)) [homeassistant.util.executor] Thread[SyncWorker_4] is still running at shutdown: File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3.12/threading.py", line 1147, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.12/threading.py", line 1167, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
2024-05-11 15:20:31.403 WARNING (Thread-4 (_do_shutdown)) [homeassistant.util.executor] Thread[SyncWorker_2] is still running at shutdown: File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3.12/threading.py", line 1147, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.12/threading.py", line 1167, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
2024-05-11 15:20:31.887 WARNING (Thread-4 (_do_shutdown)) [homeassistant.util.executor] Thread[SyncWorker_4] is still running at shutdown: File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3.12/threading.py", line 1147, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.12/threading.py", line 1167, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
2024-05-11 15:20:40.751 WARNING (MainThread) [homeassistant.util.executor] Thread[SyncWorker_2] is still running at shutdown: File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3.12/threading.py", line 1147, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.12/threading.py", line 1167, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
2024-05-11 15:20:40.918 WARNING (MainThread) [homeassistant.util.executor] Thread[SyncWorker_4] is still running at shutdown: File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3.12/threading.py", line 1147, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.12/threading.py", line 1167, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
2024-05-11 15:20:41.402 WARNING (MainThread) [homeassistant.util.executor] Thread[SyncWorker_2] is still running at shutdown: File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3.12/threading.py", line 1147, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.12/threading.py", line 1167, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
2024-05-11 15:20:41.886 WARNING (MainThread) [homeassistant.util.executor] Thread[SyncWorker_4] is still running at shutdown: File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3.12/threading.py", line 1147, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.12/threading.py", line 1167, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):

Additional information

I thought maybe my database could be corrupted, so with HA shut down I ran:

mv home-assistant_v2.db home-assistant_v2_old.db
sqlite3 home-assistant_v2_old.db ".recover" | sqlite3 home-assistant_v2.db

and then tried to run a purge again. Unfortunately the problem was not resolved, though my database did shrink by about 1.5 GB.
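
As a quicker first check than a full .recover, SQLite's built-in integrity check can be pointed at the stopped database; a minimal sketch, assuming the sqlite3 CLI is available on the host:

# With Home Assistant stopped, check the database file for corruption.
# An output of "ok" means the structure is sound; anything else lists problems.
sqlite3 home-assistant_v2.db "PRAGMA integrity_check;"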

bdraco commented 4 months ago

@bdraco tested your add-on. Unfortunately I don't have enough space on the device to make it rebuild the entire DB. Is there any way we can do this "offline"? I'm more than willing to just shoot a bunch of SQL statements into sqlite on another box and give it a try.

Here are the SQL statements for an offline migration. Be sure to make a backup first. This code is only safe to run with schema 43 or 44. Do not run it on older databases (from before May 2024).

BEGIN;
PRAGMA foreign_keys=OFF;
COMMIT;
BEGIN;
CREATE TABLE states_temp_1720542205 (
    state_id INTEGER NOT NULL, 
    entity_id CHAR(0), 
    state VARCHAR(255), 
    attributes CHAR(0), 
    event_id SMALLINT, 
    last_changed CHAR(0), 
    last_changed_ts FLOAT, 
    last_reported_ts FLOAT, 
    last_updated CHAR(0), 
    last_updated_ts FLOAT, 
    old_state_id INTEGER, 
    attributes_id INTEGER, 
    context_id CHAR(0), 
    context_user_id CHAR(0), 
    context_parent_id CHAR(0), 
    origin_idx SMALLINT, 
    context_id_bin BLOB, 
    context_user_id_bin BLOB, 
    context_parent_id_bin BLOB, 
    metadata_id INTEGER, 
    PRIMARY KEY (state_id), 
    FOREIGN KEY(old_state_id) REFERENCES states (state_id), 
    FOREIGN KEY(attributes_id) REFERENCES state_attributes (attributes_id), 
    FOREIGN KEY(metadata_id) REFERENCES states_meta (metadata_id)
);
INSERT INTO states_temp_1720542205 SELECT state_id,entity_id,state,attributes,event_id,last_changed,last_changed_ts,last_reported_ts,last_updated,last_updated_ts,old_state_id,attributes_id,context_id,context_user_id,context_parent_id,origin_idx,context_id_bin,context_user_id_bin,context_parent_id_bin,metadata_id FROM states;
DROP TABLE states;
ALTER TABLE states_temp_1720542205 RENAME TO states;
CREATE INDEX ix_states_metadata_id_last_updated_ts ON states (metadata_id, last_updated_ts);
CREATE INDEX ix_states_last_updated_ts ON states (last_updated_ts);
CREATE INDEX ix_states_context_id_bin ON states (context_id_bin);
CREATE INDEX ix_states_attributes_id ON states (attributes_id);
CREATE INDEX ix_states_old_state_id ON states (old_state_id);
PRAGMA foreign_key_check;
COMMIT;
BEGIN;
PRAGMA foreign_keys=ON;
COMMIT;
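
A sketch of one way to apply these statements offline, assuming they are saved to a file (the name fix_states_fk.sql is illustrative) and Home Assistant is stopped first:

# Work on a copy so the original database survives a failed run.
cp home-assistant_v2.db home-assistant_v2.db.bak
# Apply the migration statements from the comment above.
sqlite3 home-assistant_v2.db < fix_states_fk.sql
# Verify: the rebuilt states table should no longer reference the events table.
sqlite3 home-assistant_v2.db ".schema states"
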
Write commented 4 months ago

Could not get it to work; saw nothing in the logs, but the DB still does not record data. I see the sensors update, but as soon as I restart or reboot all that data is lost; also, after a period of time the sensors' history stops updating.

Same here, very weird. I do have issue117263 in my configuration.yaml and did run the command previously.

The only thing in the log is:

cat home-assistant.log | grep issue117263

2024-07-09 18:34:16.064 WARNING (SyncWorker_1) [homeassistant.loader] We found a custom integration issue117263 which has not been tested by Home Assistant. This component might cause stability problems, be sure to disable it if you experience issues with Home Assistant

For now I have no idea how to properly apply the patch.

I'm guessing the DB is already "patched", but then manually purging locks up the CPU at 100%, so I'm really not sure.

andyblac commented 4 months ago

@bdraco I seem to have a different issue. Mine is not to do with the purge: on 2024.7.x it does not record ANY history at all. It seems to store states in memory for a while, but as soon as I restart HA all the current history is lost.

Is this a known issue with older databases (I started using HA around September last year)? Should I open a new issue?

I tried running the integration you created, but I did not see anything other than:

config # cat home-assistant.log | grep issue117263
2024-07-09 18:18:06.211 WARNING (SyncWorker_0) [homeassistant.loader] We found a custom integration issue117263 which has not been tested by Home Assistant. This component might cause stability problems, be sure to disable it if you experience issues with Home Assistant

So I take it I do not have the foreign_keys issue?

bdraco commented 4 months ago

@bdraco I seem to have a different issue

Please open a fresh issue report. Thanks.

andyblac commented 4 months ago

@bdraco I seem to have a different issue

Please open a fresh issue report. Thanks.

Done: #121614

fuslwusl commented 4 months ago

I have the same problem. Is there a way to save the history values for the energy dashboard? I could then delete the Home Assistant DB, let it create a new one by doing a restart, and afterwards import the energy data into this new empty DB. But how? I have never tried that.
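
One possible approach, as an untested sketch assuming the default SQLite database: the energy dashboard reads from the long-term statistics tables, so those could be copied from the old database into the freshly created one while Home Assistant is stopped (start HA once so the new DB has its schema, then stop it again before its statistics tables fill up):

# Untested sketch: copy long-term statistics from the old DB into the new one.
sqlite3 home-assistant_v2.db <<'SQL'
ATTACH DATABASE 'home-assistant_v2_old.db' AS old;
INSERT INTO statistics_meta SELECT * FROM old.statistics_meta;
INSERT INTO statistics SELECT * FROM old.statistics;
INSERT INTO statistics_short_term SELECT * FROM old.statistics_short_term;
DETACH DATABASE old;
SQL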

steerage250 commented 4 months ago

With all these database and SQL solutions to the problem of the recorder not purging and stopping recording, can I just disable purging, wait for the August version, re-enable purging, and it will all sort itself out?

(I have oodles of disc space)

vogtmh commented 4 months ago

With all these database and SQL solutions to the problem of the recorder not purging and stopping recording, can I just disable purging, wait for the August version, re-enable purging, and it will all sort itself out?

(I have oodles of disc space)

I guess you can; I will do the same. Just make sure you have enough space inside Home Assistant. I run HAOS on KVM, and the image is set to 32 GB disk size by default, and my DB is around 10 GB now. So I just extended the disk by 30 GB.
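
For reference, growing the disk of a KVM guest is a quick host-side step; a sketch assuming a qcow2 image (the path is illustrative), after which HAOS expands its data partition on the next boot:

# With the VM shut down, grow the disk image by 30 GB (path is illustrative).
qemu-img resize /var/lib/libvirt/images/haos.qcow2 +30G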

davidsonimagerygmailcom commented 4 months ago

Yeah that's what I was hoping too. As much as I'd love to get hands dirty, I'm beyond time poor (yes this is an oxymoron for HA lovers but nevertheless .... :))

steerage250 commented 4 months ago

Maybe HA is one of the reasons you are so time poor :-)

ChristophCaina commented 4 months ago

With all these database and SQL solutions to the problem of the recorder not purging and stopping recording, can I just disable purging, wait for the August version, re-enable purging, and it will all sort itself out?

(I have oodles of disc space)

Yes, you can add the following to your configuration.yaml to disable the automatic purge:

recorder:
  auto_purge: false

But don't forget to remove this again later ;)

Stratagm commented 4 months ago

Yeah that's what I was hoping too. As much as I'd love to get hands dirty, I'm beyond time poor (yes this is an oxymoron for HA lovers but nevertheless .... :))

That's not an oxymoron, that's cause and effect 😏

parautenbach commented 4 months ago

And keep in mind that the next purge may then take quite a while to complete.

corvy commented 4 months ago

See PR #120779; the fix has been moved up to 2024.7.2.

choenig commented 4 months ago

Just to give my two cents regarding disk usage, now that it will come into 2024.7.2 ...

I converted using the custom_component from above. It took ~2 hours. The ~28 GB database grew to ~72 GB during the conversion:

Before (~28.5 GB):

-rw-r--r--  1 root root 28531658752 Jul  9 08:31 home-assistant_v2.db
-rw-r--r--  1 root root      163840 Jul  9 08:31 home-assistant_v2.db-shm
-rw-r--r--  1 root root     9076392 Jul  9 08:31 home-assistant_v2.db-wal

Right after the conversion (~72.8 GB):

-rw-r--r--  1 root root 38595018752 Jul  9 12:17 home-assistant_v2.db
-rw-r--r--  1 root root    66453504 Jul  9 12:45 home-assistant_v2.db-shm
-rw-r--r--  1 root root 34212661312 Jul  9 12:45 home-assistant_v2.db-wal

After a manual purge with repack (25.8 GB):

-rw-r--r--  1 root root 25794158592 Jul  9 12:59 home-assistant_v2.db
-rw-r--r--  1 root root    66453504 Jul  9 12:59 home-assistant_v2.db-shm
-rw-r--r--  1 root root    22289232 Jul  9 13:00 home-assistant_v2.db-wal

This is on a 10 year old Core i5 with SATA-SSD.

br Christian

sonar98 commented 4 months ago

I tried using the custom module, but after a while my Home Assistant stops responding. Then I tried using the queries on a standalone machine, but when running the insert-into-temp-table query I get the following error: Runtime error near line 1: UNIQUE constraint failed: states_temp_1720542205.state_id (19)

frenck commented 4 months ago

ℹ️ Home Assistant 2024.7.2

https://github.com/home-assistant/core/releases/tag/2024.7.2

Which addresses this issue.

bdraco commented 4 months ago

I tried using the custom module, but after a while my Home Assistant stops responding. Then I tried using the queries on a standalone machine, but when running the insert-into-temp-table query I get the following error: Runtime error near line 1: UNIQUE constraint failed: states_temp_1720542205.state_id (19)

That looks like you have some corruption in your states table. You'll need to manually remove the duplicate state_id rows.
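
A sketch for locating the duplicates first (table and column names as in the migration SQL earlier in this thread); which of the duplicate rows to delete depends on the data:

# List any state_id values that appear more than once in the states table.
sqlite3 home-assistant_v2.db "SELECT state_id, COUNT(*) FROM states GROUP BY state_id HAVING COUNT(*) > 1;"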

sonar98 commented 4 months ago

That looks like you have some corruption in your states table. You'll need to manually remove the duplicate state_id rows.

What is the easiest way to check/fix this?

bdraco commented 4 months ago

That looks like you have some corruption in your states table. You'll need to manually remove the duplicate state_id rows.

What is the easiest way to check/fix this?

That’s not something that can be solved in this issue. Please jump on discord and ask for help in #general as it will require a much more interactive process and something everyone subscribed to this issue would likely not want to see unfold.

davidsonimagerygmailcom commented 4 months ago

2024.7.2 resolved issue for me.

GigiPompieru commented 4 months ago

2024.7.2 resolved issue for me.

Same here. I saw that last night there was no gap in my history.

swainstm commented 4 months ago

For what it is worth, 2024.7.2 seems to be working now and fixed the issue (unless it was something else I did).

I was one of the people who had the issue without any of the known integrations that cause it. I worked around it by rolling back to 2024.6.4, as my database would have grown too big otherwise.

While I normally have the recorder set to keep 4 days of records, I did a manual recorder.purge to keep only 1 day and repacked the DB, which brought its size down from 8 GB to about 3 GB (in fact the recorder was probably only storing about 3 days of data at this time). This took a bit less than half an hour. Of course, this was all done on 2024.6.4.

Then I upgraded to 2024.7.2 and again manually ran recorder.purge keeping only 1 day, which seemed to complete without issues and without hanging the recorder. So I assume the issue is fixed. It finished very quickly, presumably because not much needed purging since I had only just done it. In a few days, when it has run overnight and purged more, I will know for sure, but it looks good.

Hope that is useful.

bobvandevijver commented 4 months ago

For me this issue started with 2024.7.2. Now the recorder has hung at 4:12 two nights in a row. No useful information in the logs, except that when I restart HA I see the following:

2024-07-11 07:17:07.219 WARNING (Recorder) [homeassistant.components.recorder.util] The system could not validate that the sqlite3 database at //config/home-assistant_v2.db was shutdown cleanly
2024-07-11 07:17:07.226 WARNING (Recorder) [homeassistant.components.recorder.util] Ended unfinished session (id=232 from 2024-07-10 13:04:46.547373)

Anything in particular I can take a look at?


Looks like the database states table rebuild failed (which might have been due to not having enough disk space for it), and it was never restarted. I executed the migration manually and ran the recorder purge service afterwards, which did complete correctly now 🎉 Now running a repack, as the database file grew by a couple of GB because of the rebuild 😄


And the repack removed those additional GB and then some 👍🏻

fuslwusl commented 4 months ago

Great work. Updating to Home Assistant 2024.7.2 fixed the purge problem. Now I am happy with a tiny DB of 300 MB instead of a 3 GB monster. This also reduced CPU load from 10% to 5% 👍

kchiem commented 4 months ago

(screenshot)

This is on 2024.7.2; it still seems to be getting stuck. I manually restarted at 4:32 to get it to resume.

tackin commented 4 months ago

Confirmed: the update from 2024.7.1 to 2024.7.2 did fix the issue for me. With the config setting recorder: auto_purge: true, recording did not break anymore at about 4:00. An additional service call to recorder.purge finished correctly this morning and shrank my DB by more than 60%. Thanks!

bdraco commented 4 months ago

This is on 2024.7.2--seems to be getting stuck still. I manually restarted at 4:32 to get it to resume.

If you still have a problem, please verify you have enough disk space (you need ~2x the size of the database free -- maybe a bit less but 2x is safest), and restart again.

If that doesn't solve the problem, please download and post your logs https://my.home-assistant.io/redirect/logs
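
A quick way to check both numbers from a shell on the host (adjust /config to wherever home-assistant_v2.db actually lives):

# Free space on the filesystem holding the database...
df -h /config
# ...and the size of the database plus its WAL/SHM sidecar files.
ls -lh /config/home-assistant_v2.db*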

kchiem commented 4 months ago

If you still have a problem, please verify you have enough disk space (you need ~2x the size of the database free -- maybe a bit less but 2x is safest), and restart again.

So... my DB file is 16.0 GB, and even after removing all the local backup files I only have 24 GB free. This doesn't seem to have been a problem with previous monthly purges though, so why is it suddenly a problem now?

(screenshot)

I manually ran:

service: recorder.purge
data:
  repack: true

And it seems to be the cause of the graph flatlining. It is doing a lot of disk reads:

# iostat -dm sda 60
Linux 6.6.33-haos (homeassistant)       07/11/24        _x86_64_        (8 CPU)

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda             181.56         1.17         0.23    2112071     418545

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda            8176.10        60.04         0.01       3602          0

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda            7823.22        56.10         0.01       3366          0

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda            7287.07        45.77         0.01       2746          0

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda            8298.15        63.96         0.01       3837          0

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda            7228.28        44.36         0.01       2661          0

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda            7961.37        57.30         0.01       3437          0

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda            7914.95        57.49         0.01       3449          0

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda            7458.41        45.69         0.01       2741          0

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda            8111.10        63.14         0.01       3788          0

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda            7194.98        44.78         0.02       2686          1

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda            8069.58        57.09         0.01       3425          0

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda            7678.67        56.21         0.01       3372          0

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda            7325.13        44.97         0.01       2698          0

The average appears to be over 50 MB/s, which should take less than 6 minutes to go through 16 GB, but it's been doing this for over 5.5 hours now. Any idea what's going on?

If that doesn't solve the problem, please download and post your logs https://my.home-assistant.io/redirect/logs

I'm not seeing anything in the logs related to the recorder, but here it is:

home-assistant_2024-07-11T20-56-12.969Z.log

bdraco commented 4 months ago

So.. my db file is 16.0 gb and even after removing all the local backup files, I only have 24 gb free. This doesn't seem to have been a problem with previous monthly purges though, so why is it suddenly a problem now?

24 GB is likely enough. It should not be a problem, but it is always better to have 2x the database size free in case a table rebuild fails.

I don't see any table rebuilds in your logs so either the table rebuild already happened, or there is something wrong with your database that prevents it from getting to that point.

You can check whether the table rebuild already happened by manually running the following:

% sqlite3 home-assistant_v2.db
SQLite version 3.43.2 2023-10-10 13:08:14
Enter ".help" for usage hints.
sqlite> .schema states
CREATE TABLE IF NOT EXISTS "states" (
    "state_id" INTEGER NOT NULL  ,
    "entity_id" CHARACTER(0) NULL  ,
    "state" VARCHAR(255) NULL  ,
    "attributes" CHARACTER(0) NULL  ,
    "event_id" SMALLINT NULL  ,
    "last_changed" CHARACTER(0) NULL  ,
    "last_changed_ts" DOUBLE NULL  ,
    "last_updated" CHARACTER(0) NULL  ,
    "last_updated_ts" DOUBLE NULL  ,
    "old_state_id" INTEGER NULL  ,
    "attributes_id" INTEGER NULL  ,
    "context_id" CHARACTER(0) NULL  ,
    "context_user_id" CHARACTER(0) NULL  ,
    "context_parent_id" CHARACTER(0) NULL  ,
    "origin_idx" SMALLINT NULL  ,
    "context_id_bin" BLOB NULL  ,
    "context_user_id_bin" BLOB NULL  ,
    "context_parent_id_bin" BLOB NULL  ,
    "metadata_id" INTEGER NULL  , last_reported_ts FLOAT,
    PRIMARY KEY ("state_id"),
    FOREIGN KEY("old_state_id") REFERENCES "states" ("state_id") ON UPDATE RESTRICT ON DELETE RESTRICT,
    FOREIGN KEY("attributes_id") REFERENCES "state_attributes" ("attributes_id") ON UPDATE RESTRICT ON DELETE RESTRICT,
    FOREIGN KEY("metadata_id") REFERENCES "states_meta" ("metadata_id") ON UPDATE RESTRICT ON DELETE RESTRICT
);
CREATE INDEX "ix_states_attributes_id" ON "states" ("attributes_id");
CREATE INDEX "ix_states_context_id_bin" ON "states" ("context_id_bin");
CREATE INDEX "ix_states_last_updated_ts" ON "states" ("last_updated_ts");
CREATE INDEX "ix_states_metadata_id_last_updated_ts" ON "states" ("metadata_id", "last_updated_ts");
CREATE INDEX "ix_states_old_state_id" ON "states" ("old_state_id");
sqlite> 

If you only see the 3 FOREIGN KEYs, the problem one has been removed and the rebuild is complete.

If it's still there, something else is wrong, and you'll need to enable debug logging for sqlalchemy and restart to get an idea of what is going on.

kchiem commented 4 months ago

I only see 2 FOREIGN KEYs:

# sqlite3 home-assistant_v2.db
SQLite version 3.44.2 2023-11-24 11:41:44
Enter ".help" for usage hints.
sqlite> .schema states
CREATE TABLE states (
        state_id INTEGER NOT NULL,
        domain VARCHAR(64),
        entity_id VARCHAR(255),
        state VARCHAR(255),
        attributes TEXT,
        event_id INTEGER,
        last_changed DATETIME,
        last_updated DATETIME,
        created DATETIME,
        old_state_id INTEGER, attributes_id INTEGER, origin_idx INTEGER, context_id VARCHAR(36), context_user_id VARCHAR(36), context_parent_id VARCHAR(36), last_updated_ts FLOAT, last_changed_ts FLOAT, context_id_bin BLOB, context_user_id_bin BLOB, context_parent_id_bin BLOB, metadata_id INTEGER, last_reported_ts FLOAT,
        PRIMARY KEY (state_id),
        FOREIGN KEY(event_id) REFERENCES events (event_id) ON DELETE CASCADE,
        FOREIGN KEY(old_state_id) REFERENCES states (state_id)
);
CREATE INDEX ix_states_old_state_id ON states (old_state_id);
CREATE INDEX ix_states_attributes_id ON states (attributes_id);
CREATE INDEX ix_states_last_updated_ts ON states (last_updated_ts);
CREATE INDEX ix_states_context_id_bin ON states (context_id_bin);
CREATE INDEX ix_states_metadata_id_last_updated_ts ON states (metadata_id, last_updated_ts);

bdraco commented 4 months ago

I only see 2 FOREIGN KEYs:

It looks like you don't have this issue. There should be a FK on metadata_id, but it shouldn't cause any issues if it's missing.

kchiem commented 4 months ago

It looks like you don't have this issue. There should be a FK on metadata_id but it shouldn't cause any issues if its missing.

Then why does the recorder purge cause my graphs to flatline?

bdraco commented 4 months ago

It looks like you don't have this issue. There should be a FK on metadata_id but it shouldn't cause any issues if its missing.

Then why does the recorder purge cause my graphs to flatline?

While the symptoms are the same, since the problematic foreign key is not there, it's a different problem which will require a different solution. Please start a fresh issue report with debug logs for sqlalchemy turned on.

# Example configuration.yaml entry
logger:
  default: info
  logs:
    sqlalchemy: debug

steerage250 commented 4 months ago

2024.7.2 has certainly solved the "recorder locking-up" problem, but something is still not right in my system.

The home-assistant_v2.db file grew from 3.2 GB to 3.8 GB after the upgrade, but has been static in size over the last 5 days. Despite having purge_keep_days: 10, I can select data that is 20 days old.

I have the following in configuration.yaml:

recorder:
  purge_keep_days: 10
  auto_purge: false
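
One way to check what the recorder actually still holds (a sketch assuming the default SQLite database). Note that charts older than purge_keep_days can also be served from long-term statistics, which are downsampled and never purged, so seeing 20-day-old data is not by itself proof that the purge failed:

# Oldest raw state row still in the database (UTC).
sqlite3 home-assistant_v2.db "SELECT datetime(MIN(last_updated_ts), 'unixepoch') FROM states;"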

jasonwfrancis1 commented 3 months ago

Hey, I don't really know what to do; I'm having the same problem as above. The recorder stops every day at 4:30 and I have to reboot. I updated to 2024.7.2 and it's the same. When I look in SQLite, the temp table for events exists, so it seems the migration did not complete. When I save the DB and try to do the migration externally, it fails copying the data from the old table to the new one because of "unique constraints". I really do not want to lose all of my energy data. I'm super annoyed that it was all working fine until a recent update and now I'm spending hours trying to fix it. Any ideas of where I might go for help? I tried the custom component but it didn't do anything (I assume it thinks the migration was completed).

7w1z4rd commented 3 months ago

I also have the problem, just like jasonwfrancis1.

disconn3ct commented 3 months ago

As an added data point, I have been having this issue on MariaDB. Upgrading didn't fix it.

jasonwfrancis1 commented 3 months ago

I managed to fix it myself: I migrated to MariaDB and used a good tutorial to migrate all of the data. There were a few gaps while I was doing the whole process, but I think I'm back up and running properly. Just kicked off a purge, and it's running... fingers crossed.

matomatusov commented 3 months ago

Hi @bdraco

After some time, my database keeps increasing and I don't know how to reduce it. I don't know if your fixes are working correctly.


Sebazzz commented 3 months ago

Hey, I don't really know what to do; I'm having the same problem as above. The recorder stops every day at 4:30 and I have to reboot. I updated to 2024.7.2 and it's the same. When I look in SQLite, the temp table for events exists, so it seems the migration did not complete. When I save the DB and try to do the migration externally, it fails copying the data from the old table to the new one because of "unique constraints". I really do not want to lose all of my energy data. I'm super annoyed that it was all working fine until a recent update and now I'm spending hours trying to fix it. Any ideas of where I might go for help? I tried the custom component but it didn't do anything (I assume it thinks the migration was completed).

Also having this problem, even with 2024.7.2:

# sqlite3 home-assistant_v2.db
SQLite version 3.44.2 2023-11-24 11:41:44
Enter ".help" for usage hints.
sqlite> .schema event_data
CREATE TABLE event_data (
        data_id INTEGER NOT NULL,
        hash BIGINT,
        shared_data TEXT,
        PRIMARY KEY (data_id)
);
CREATE INDEX ix_event_data_hash ON event_data (hash);
sqlite> .schema events
CREATE TABLE events (
        event_id INTEGER NOT NULL,
        event_type VARCHAR(64),
        event_data TEXT,
        origin VARCHAR(32),
        time_fired DATETIME,
        created DATETIME,
        context_id VARCHAR(36),
        context_user_id VARCHAR(36),
        context_parent_id VARCHAR(36), data_id INTEGER, origin_idx INTEGER, time_fired_ts FLOAT, context_id_bin BLOB, context_user_id_bin BLOB, context_parent_id_bin BLOB, event_type_id INTEGER,
        PRIMARY KEY (event_id)
);
CREATE INDEX ix_events_data_id ON events (data_id);
CREATE INDEX ix_events_time_fired_ts ON events (time_fired_ts);
CREATE INDEX ix_events_context_id_bin ON events (context_id_bin);
CREATE INDEX ix_events_event_type_id_time_fired_ts ON events (event_type_id, time_fired_ts);
sqlite> .tables
event_data              schema_changes          states_temp_1720627841
event_types             state_attributes        statistics
events                  states                  statistics_meta
migration_changes       states_meta             statistics_runs
recorder_runs           states_temp_1720625711  statistics_short_term

It happens some time after the 4 AM purge: events are no longer recorded after 4 AM.

2024-07-20 08:15:58.622 ERROR (MainThread) [homeassistant.components.recorder.core] The recorder backlog queue reached the maximum size of 255442 events; usually, the system is CPU bound, I/O bound, or the database is corrupt due to a disk problem; The recorder will stop recording events to avoid running out of memory

Until 04:12:xx there are log messages like this:

2024-07-20 04:10:10.268 INFO (Recorder) [sqlalchemy.engine.Engine] [insertmanyvalues 9/11 (ordered; batch not supported)] (None, '90.9', None, None, None, None, None, None, 1721441409.5084128, 1225118537, 12273833, None, None, None, 0, <memory at 0x7f942b3d71c0>, None, None, 185170)
2024-07-20 04:10:10.269 INFO (Recorder) [sqlalchemy.engine.Engine] INSERT INTO states (entity_id, state, attributes, event_id, last_changed, last_changed_ts, last_reported_ts, last_updated, last_updated_ts, old_state_id, attributes_id, context_id, context_user_id, context_parent_id, origin_idx, context_id_bin, context_user_id_bin, context_parent_id_bin, metadata_id) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) RETURNING state_id
2024-07-20 04:10:10.269 INFO (Recorder) [sqlalchemy.engine.Engine] [insertmanyvalues 10/11 (ordered; batch not supported)] (None, '5.309', None, None, None, None, None, None, 1721441410.0638864, 1225118451, 338851, None, None, None, 0, <memory at 0x7f942b3d5600>, None, None, 4)
2024-07-20 04:10:10.269 INFO (Recorder) [sqlalchemy.engine.Engine] INSERT INTO states (entity_id, state, attributes, event_id, last_changed, last_changed_ts, last_reported_ts, last_updated, last_updated_ts, old_state_id, attributes_id, context_id, context_user_id, context_parent_id, origin_idx, context_id_bin, context_user_id_bin, context_parent_id_bin, metadata_id) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) RETURNING state_id
2024-07-20 04:10:10.270 INFO (Recorder) [sqlalchemy.engine.Engine] [insertmanyvalues 11/11 (ordered; batch not supported)] (None, '5.309', None, None, None, None, None, None, 1721441410.0642517, 1225118452, 338852, None, None, None, 0, <memory at 0x7f942b3d7640>, None, None, 39)
2024-07-20 04:10:10.271 INFO (Recorder) [sqlalchemy.engine.Engine] COMMIT

Then:

2024-07-20 04:12:00.160 INFO (Recorder) [sqlalchemy.engine.Engine] BEGIN (implicit)
2024-07-20 04:12:00.162 INFO (Recorder) [sqlalchemy.engine.Engine] SELECT states.state_id, states.attributes_id 
FROM states 
WHERE states.last_updated_ts < ?
 LIMIT ? OFFSET ?
2024-07-20 04:12:00.162 INFO (Recorder) [sqlalchemy.engine.Engine] [generated in 0.00029s] (1718849520.132695, 4000, 0)
2024-07-20 04:12:00.285 INFO (Recorder) [sqlalchemy.engine.Engine] UPDATE states SET old_state_id=? WHERE states.old_state_id IN ([many])
2024-07-20 04:12:00.285 INFO (Recorder) [sqlalchemy.engine.Engine] [generated in 0.01305s] (None, 1139875840, 1139875841, 1139875842, 1139875843, 1139875844, 1139875845, 1139875846, 1139875847, 1139875848, 1139875849, 1139875850, 1139875851, 1139875852, 1139875853, 1139875854, 1139875855, 1139875856, 1139875857, 1139875858, 1139875859, 1139875860, 1139875861, 1139875862, 1139875863, 1139875864, 1139875865, 1139875866, 1139875867, 1139875868, 1139875869, 1139875870, 1139875871, 1139875872, 1139875873, 1139875874, 1139875875, 1139875876, 1139875877, 1139875878, 1139875879, 1139875880, 1139875881, 1139875882, 1139875883, 1139875884, 1139875885, 1139875886, 1139875887, 1139875888 ... 3901 parameters truncated ... 1139875790, 1139875791, 1139875792, 1139875793, 1139875794, 1139875795, 1139875796, 1139875797, 1139875798, 1139875799, 1139875800, 1139875801, 1139875802, 1139875803, 1139875804, 1139875805, 1139875806, 1139875807, 1139875808, 1139875809, 1139875810, 1139875811, 1139875812, 1139875813, 1139875814, 1139875815, 1139875816, 1139875817, 1139875818, 1139875819, 1139875820, 1139875821, 1139875822, 1139875823, 1139875824, 1139875825, 1139875826, 1139875827, 1139875828, 1139875829, 1139875830, 1139875831, 1139875832, 1139875833, 1139875834, 1139875835, 1139875836, 1139875837, 1139875838, 1139875839)
2024-07-20 04:12:00.402 INFO (Recorder) [sqlalchemy.engine.Engine] DELETE FROM states WHERE states.state_id IN ([many])
2024-07-20 04:12:00.402 INFO (Recorder) [sqlalchemy.engine.Engine] [generated in 0.00377s] (1139875840, 1139875841, 1139875842, 1139875843, 1139875844, 1139875845, 1139875846, 1139875847, 1139875848, 1139875849, 1139875850, 1139875851, 1139875852, 1139875853, 1139875854, 1139875855, 1139875856, 1139875857, 1139875858, 1139875859, 1139875860, 1139875861, 1139875862, 1139875863, 1139875864, 1139875865, 1139875866, 1139875867, 1139875868, 1139875869, 1139875870, 1139875871, 1139875872, 1139875873, 1139875874, 1139875875, 1139875876, 1139875877, 1139875878, 1139875879, 1139875880, 1139875881, 1139875882, 1139875883, 1139875884, 1139875885, 1139875886, 1139875887, 1139875888, 1139875889 ... 3900 parameters truncated ... 1139875790, 1139875791, 1139875792, 1139875793, 1139875794, 1139875795, 1139875796, 1139875797, 1139875798, 1139875799, 1139875800, 1139875801, 1139875802, 1139875803, 1139875804, 1139875805, 1139875806, 1139875807, 1139875808, 1139875809, 1139875810, 1139875811, 1139875812, 1139875813, 1139875814, 1139875815, 1139875816, 1139875817, 1139875818, 1139875819, 1139875820, 1139875821, 1139875822, 1139875823, 1139875824, 1139875825, 1139875826, 1139875827, 1139875828, 1139875829, 1139875830, 1139875831, 1139875832, 1139875833, 1139875834, 1139875835, 1139875836, 1139875837, 1139875838, 1139875839)
2024-07-20 04:12:00.433 INFO (Recorder) [sqlalchemy.engine.Engine] SELECT states.state_id, states.attributes_id 
FROM states 
WHERE states.last_updated_ts < ?

And then nothing from the recorder anymore.
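
The states_temp_* tables in the .tables output above look like leftovers from rebuild attempts that never finished. A cautious sketch for cleaning them up, assuming a verified backup exists and the live states table has been confirmed intact (table names taken from the output above):

# Only after a verified backup: drop the orphaned temp tables left by the
# failed rebuild attempts, then reclaim the space they occupied.
sqlite3 home-assistant_v2.db <<'SQL'
DROP TABLE IF EXISTS states_temp_1720625711;
DROP TABLE IF EXISTS states_temp_1720627841;
VACUUM;
SQL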

PatrikNorrgard commented 3 months ago

I am also still having this issue with 2024.7.4: the recorder hangs after 4 in the morning and a reboot gets it going again. I do, however, have a longer keep value since some months back:

recorder:
  purge_keep_days: 30
  commit_interval: 0

It was still working fine until I updated to 2024.7.

alexwhittemore commented 3 months ago

Same over here. I'm on 2024.7.4, but the recorder reliably dies at 4 am every day until a restart. What debug info can I provide to the effort?

Sinandgrin commented 3 months ago

Still having this issue here as well, running 2024.8. I do have the recorder set to keep data for 1 year, since I wanted the history, but as mentioned above (https://github.com/home-assistant/core/issues/117263#issuecomment-2264623334) this was working fine up until last month (07/19 for me, but I can't recall when or what version I may have updated to at that time).

My recorder settings in configuration.yaml:

recorder:
  purge_keep_days: 365

alexwhittemore commented 3 months ago

I have the same configuration, with 365 days of retention, but after upgrading to 2024.8 yesterday it seems my recorder didn't die last night at 4am!

mobot95 commented 3 months ago

Same issue with 2024.8

tomas-dorda commented 3 months ago

I have the default configuration and still have the issue on 2024.8.

hapklaar commented 3 months ago

After upgrading to 2024.8 and re-enabling auto_purge, today I see my graphs have stopped updating at 4 am again. The SSD is at 100% and crunching away. After restarting HA, the graphs are back, but with no data from 4 am till the restart.

My recorder settings:

recorder:
  purge_keep_days: 90
  auto_purge: true
  exclude:
    domains: 
      - updater
      - media_player
      - device_tracker
      - weather
      - sun

@bdraco, tagging you as you provided the fix and closed the issue. Could it be that the issue was not fixed in certain cases, and can it be reopened?

EDIT: added some info

disconn3ct commented 3 months ago

The SSD is at 100% and crunching away

Run a manual purge with your retention time (1 year) and wait for it to stop crunching before restarting. Mine has made it 3 days since doing that, still running 2024.7.

Sebazzz commented 3 months ago

I wish there were a way to do a manual migration. That would be fine for me.