home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io

Purge causes recorder to stop writing to the DB until HA is restarted (Auto purge happens at 4:12am) #117263

Closed HertogArjan closed 2 months ago

HertogArjan commented 4 months ago

This problem is solved in 2024.7.2. If the system ran out of disk space and the table rebuild failed, it will be retried in 2024.8.1+; see issue https://github.com/home-assistant/core/issues/123348 and the fix in https://github.com/home-assistant/core/pull/123388

Workaround: disabling the nightly auto purge will prevent the issue from occurring (this is not a long-term solution):

# Example configuration.yaml entry
recorder:
  auto_purge: false

Be sure to re-enable auto purge after installing 2024.7.2, or your database will grow without bounds and your system will eventually run out of disk space or become sluggish.

Cause: https://github.com/home-assistant/core/issues/117263#issuecomment-2197311144
Solution: https://github.com/home-assistant/core/pull/120779

The problem

Every night at around 4:10 the histories for all entities stop. This has been happening since at least April 9th. I updated Home Assistant to 2024.4.1 on April 5th, but I can't say for sure whether this issue started directly afterwards. A restart of Home Assistant allows recording again but does not restore the history missed since 4:10. I suspect it has something to do with the recorder auto purge at 4:12, because the same symptoms appear when the purge is run manually.

I don't think the manual or automatic purge is currently able to finish, because the (SQLite) database seems way too large (>6 GB) for my configured purge_keep_days of 7.

If I run recorder.purge from the web UI, the same symptoms appear as during the night. Looking at the mtime, it is clear that home-assistant_v2.db no longer gets written to. htop shows HA using 100% of one CPU core continuously, and iotop shows HA reading from disk at ~400 MB/s continuously. This went on for at least 25 minutes before I stopped the process.
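For reference, a manual purge can be triggered from Developer Tools > Services with a YAML call like the one below (keep_days and repack are documented fields of the recorder.purge service):

service: recorder.purge
data:
  keep_days: 7
  repack: false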

The logs show nothing unusual happening around 4:12. When I run recorder.purge from the web UI with verbose logging enabled, the logs just show:

2024-05-11 15:16:28.560 INFO (MainThread) [homeassistant.helpers.script.websocket_api_script] websocket_api script: Running websocket_api script
2024-05-11 15:16:28.560 INFO (MainThread) [homeassistant.helpers.script.websocket_api_script] websocket_api script: Executing step call service

When HA is stopped using SIGTERM, the shutdown takes a long time, and it is clear from the logs that it is waiting for a Recorder task:

2024-05-11 15:20:00.573 WARNING (MainThread) [homeassistant.core] Shutdown stage 'final write': still running: <Task pending name='Task-2684' coro=<Recorder._async_shutdown() running at /srv/homeassistant/lib/python3.12/site-packages/homeassistant/components/recorder/core.py:475> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/lib/python3.12/asyncio/futures.py:387, <1 more>, Task.task_wakeup()] created at /usr/lib/python3.12/asyncio/base_events.py:449> cb=[set.remove()] created at /srv/homeassistant/lib/python3.12/site-packages/homeassistant/util/async_.py:40>

See the rest of the relevant messages during shutdown below.

What version of Home Assistant Core has the issue?

core-2024.5.2

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant Core

Integration causing the issue

Recorder

Link to integration documentation on our website

https://www.home-assistant.io/integrations/recorder/#service-purge

Diagnostics information

No response

Example YAML snippet

recorder:
  # keep 7 days of history for all states by default
  purge_keep_days: 7
  exclude:
    domains:
      - weather
    entities:
      - sun.sun
    entity_globs:
      - 'automation.abrp_live_data_*'
      - 'timer.abrp_live_data_*'
      - 'automation.pvoutput_*'
      - 'timer.pvoutput_*'
      - 'sensor.sampled_stroomsterkte_fase_l?'
      - 'sensor.stroomsterkte_fase_l?_*_sec_gem'

Anything in the logs that might be useful for us?

2024-05-11 15:20:00.573 WARNING (MainThread) [homeassistant.core] Timed out waiting for final writes to complete, the shutdown will continue
2024-05-11 15:20:00.573 WARNING (MainThread) [homeassistant.core] Shutdown stage 'final write': still running: <Task pending name='Task-2684' coro=<Recorder._async_shutdown() running at /srv/homeassistant/lib/python3.12/site-packages/homeassistant/components/recorder/core.py:475> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/lib/python3.12/asyncio/futures.py:387, <1 more>, Task.task_wakeup()] created at /usr/lib/python3.12/asyncio/base_events.py:449> cb=[set.remove()] created at /srv/homeassistant/lib/python3.12/site-packages/homeassistant/util/async_.py:40>
2024-05-11 15:20:30.580 WARNING (MainThread) [homeassistant.core] Timed out waiting for close event to be processed, the shutdown will continue
2024-05-11 15:20:30.580 WARNING (MainThread) [homeassistant.core] Shutdown stage 'close': still running: <Task pending name='Task-2684' coro=<Recorder._async_shutdown() running at /srv/homeassistant/lib/python3.12/site-packages/homeassistant/components/recorder/core.py:475> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/lib/python3.12/asyncio/futures.py:387, <1 more>, Task.task_wakeup()] created at /usr/lib/python3.12/asyncio/base_events.py:449> cb=[set.remove()] created at /srv/homeassistant/lib/python3.12/site-packages/homeassistant/util/async_.py:40>
2024-05-11 15:20:30.580 WARNING (MainThread) [homeassistant.core] Shutdown stage 'close': still running: <Task pending name='Task-2714' coro=<Recorder._async_close() running at /srv/homeassistant/lib/python3.12/site-packages/homeassistant/components/recorder/core.py:467> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/lib/python3.12/asyncio/futures.py:387, <1 more>, Task.task_wakeup()] created at /usr/lib/python3.12/asyncio/base_events.py:449> cb=[set.remove()] created at /srv/homeassistant/lib/python3.12/site-packages/homeassistant/util/async_.py:40>
2024-05-11 15:20:30.752 WARNING (Thread-4 (_do_shutdown)) [homeassistant.util.executor] Thread[SyncWorker_2] is still running at shutdown: File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3.12/threading.py", line 1147, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.12/threading.py", line 1167, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
2024-05-11 15:20:30.919 WARNING (Thread-4 (_do_shutdown)) [homeassistant.util.executor] Thread[SyncWorker_4] is still running at shutdown: File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3.12/threading.py", line 1147, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.12/threading.py", line 1167, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
2024-05-11 15:20:31.403 WARNING (Thread-4 (_do_shutdown)) [homeassistant.util.executor] Thread[SyncWorker_2] is still running at shutdown: File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3.12/threading.py", line 1147, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.12/threading.py", line 1167, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
2024-05-11 15:20:31.887 WARNING (Thread-4 (_do_shutdown)) [homeassistant.util.executor] Thread[SyncWorker_4] is still running at shutdown: File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3.12/threading.py", line 1147, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.12/threading.py", line 1167, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
2024-05-11 15:20:40.751 WARNING (MainThread) [homeassistant.util.executor] Thread[SyncWorker_2] is still running at shutdown: File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3.12/threading.py", line 1147, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.12/threading.py", line 1167, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
2024-05-11 15:20:40.918 WARNING (MainThread) [homeassistant.util.executor] Thread[SyncWorker_4] is still running at shutdown: File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3.12/threading.py", line 1147, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.12/threading.py", line 1167, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
2024-05-11 15:20:41.402 WARNING (MainThread) [homeassistant.util.executor] Thread[SyncWorker_2] is still running at shutdown: File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3.12/threading.py", line 1147, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.12/threading.py", line 1167, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
2024-05-11 15:20:41.886 WARNING (MainThread) [homeassistant.util.executor] Thread[SyncWorker_4] is still running at shutdown: File "/usr/lib/python3.12/threading.py", line 1030, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 92, in _worker
    work_item.run()
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3.12/threading.py", line 1147, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.12/threading.py", line 1167, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):

Additional information

I thought maybe my database could be corrupted, so with HA shut down I ran

mv home-assistant_v2.db home-assistant_v2_old.db
sqlite3 home-assistant_v2_old.db ".recover" | sqlite3 home-assistant_v2.db

and then tried to run a purge again. Unfortunately the problem was not resolved. My database did shrink by about 1.5 GB.
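A quicker first check for corruption is SQLite's built-in integrity check; this is the standard sqlite3 CLI, nothing HA-specific, and it prints "ok" when the file is structurally sound:

sqlite3 home-assistant_v2.db "PRAGMA integrity_check;"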

swainstm commented 2 months ago

I fit into the category of users who have none of the known problem integrations but hit the problem after the 2024.7.1 upgrade. I upgraded to 2024.7.2, and it is still not working.

So I rolled back to 2024.6.4 and the problem was fixed.

I will keep watching this space and not update until there is some evidence that 2024.8 (or something else) fixes the issue. For me it was not an option to stay with 2024.7: by then, with lots of entities, the DB would have grown and filled the disk.

Pel1can111 commented 2 months ago

I fit into the category of users who have none of the known problem integrations but hit the problem after the 2024.7.1 upgrade. I upgraded to 2024.7.2, and it is still not working.

So I rolled back to 2024.6.4 and the problem was fixed.

I will keep watching this space and not update until there is some evidence that 2024.8 (or something else) fixes the issue. For me it was not an option to stay with 2024.7: by then, with lots of entities, the DB would have grown and filled the disk.

yep, same here. will wait for a month before updating to the next version.

ScratMan commented 2 months ago

Facing the issue with 2024.7.1, with the purge failing; the database was about 4 GB. I updated my docker container to 2024.8.0.dev202407040219, booted HA, and everything went well: it started to clean up the database, removing invalid entries. After a few hours, once I was sure the DB maintenance had finished (DB size was 3.7 GB), I reverted my container to 2024.7.1. I could then manually start the recorder.purge service with the repacking option enabled, and the database shrank to 2 GB. Will see if automatic purging at 4 am works tomorrow.

steerage250 commented 2 months ago

Adding “auto_purge: false” did not fix the issue for me; I think I'll restore an old version from a backup.

ChristophCaina commented 2 months ago

Adding “auto_purge: false” did not fix the issue for me; I think I'll restore an old version from a backup.

Do you have one of the following integrations (custom_components)? https://community.home-assistant.io/t/psa-2024-7-recorder-problems/746428

ScratMan commented 2 months ago

Facing the issue with 2024.7.1, with the purge failing; the database was about 4 GB. I updated my docker container to 2024.8.0.dev202407040219, booted HA, and everything went well: it started to clean up the database, removing invalid entries. After a few hours, once I was sure the DB maintenance had finished (DB size was 3.7 GB), I reverted my container to 2024.7.1. I could then manually start the recorder.purge service with the repacking option enabled, and the database shrank to 2 GB. Will see if automatic purging at 4 am works tomorrow.

I confirm that auto purge with auto repack worked fine last night on 2024.7.1 after using the "upgrade and downgrade" method described.

steerage250 commented 2 months ago

ChristophCaina: yes, I have a custom component called Variables (which I presume is hass_variables). It is a very old version, and I'm not sure I need it anymore anyway. I'll update it and see what happens.

simoneluconi commented 2 months ago

I'm having this problem after updating to 2024.7.1. As you can see in the attached screenshot, data logging stops from 4 am until I reboot HA. (Screenshot 2024-07-08 100125)

This is not a problem that I had before the update, and I don't have any of the problematic integrations. I think it is unacceptable to wait a month for an update that should fix the problem. If it is not possible to ship the fix sooner, maybe a workaround should be published.

I have HA Supervised, so I don't know how to do the "upgrade and downgrade" of the container. And unfortunately I've made some changes to the configuration of HA (I added some new devices and created some new automations), so if I restore an old backup I lose these configuration changes. I tried mixing the folders from the different backups, but I was unsuccessful.

I think that for now I'll stick with some auto-reboot method, as I don't want to disable the auto purge and risk filling the disk.

I'll wait for some news.

bdraco commented 2 months ago

Since there are potentially thousands (probably tens of thousands) of installs affected by this, and the fix is coming in 2024.8.x, I reopened and pinned this so we don't keep getting duplicate issues in the meantime.

ohkaja commented 2 months ago

@simoneluconi the workaround is to disable the automatic recorder purge:

recorder:
  auto_purge: false
wuppiwuppi commented 2 months ago

@simoneluconi the workaround is to disable the automatic recorder purge:

recorder:
  auto_purge: false

Be aware that this might bloat your database, and a purge after re-enabling it might take VERY long. Or people may even forget to re-enable purging, which ends up even worse. It took me a long time to clean up my database after it got too big.

simoneluconi commented 2 months ago

As I said, I didn't want to disable the auto purge, to avoid filling the disk. For now this is my workaround automation (testing it tomorrow):

alias: Reboot home assistant
description: ""
trigger:
  - platform: time
    at: "04:30:00"
condition: []
action:
  - service: homeassistant.restart
    data: {}
mode: single
vogtmh commented 2 months ago

As I said, I didn't want to disable the auto purge, to avoid filling the disk. For now this is my workaround automation (testing it tomorrow):

alias: Reboot home assistant
description: ""
trigger:
  - platform: time
    at: "04:30:00"
condition: []
action:
  - service: homeassistant.restart
    data: {}
mode: single

As far as I understand, the purge cannot complete anyway and locks the DB. So a restart won't fix the problem; it will just unlock your database.

simoneluconi commented 2 months ago

As I said, I didn't want to disable the auto purge, to avoid filling the disk. For now this is my workaround automation (testing it tomorrow):

alias: Reboot home assistant
description: ""
trigger:
  - platform: time
    at: "04:30:00"
condition: []
action:
  - service: homeassistant.restart
    data: {}
mode: single

As far as I understand, the purge cannot complete anyway and locks the DB. So a restart won't fix the problem; it will just unlock your database.

Hmm, I'll check, because after I rebooted HA the free storage space reported in Settings -> System -> Memory increased, so I thought it completed, or at least completed some part of it.

zigomatichub commented 2 months ago

@simoneluconi Doing this automation or setting auto_purge: false amounts to the same thing: your purge is not executed. Better to also limit what gets recorded with an exclude list, like in this example:

recorder:
  auto_purge: false
  exclude:
    domains:
      - automation
      - update
      - media_player
      - binary_sensor
      - scene
      - input_boolean
      - input_button
      - button
      - input_number
      - number
      - input_select
      - select
      - text
      - camera
    event_types:
      - call_service

And you can use this card to see what you have in your system:

type: markdown
content: |-
  Domain | Count
  {% for d in states | groupby('domain') %} {{ d[0].replace('_', ' ') | title }} | {{ states[d[0]] | count }}
  {% endfor %}
RoboMagus commented 2 months ago

Since there are potentially thousands (probably tens of thousands) of installs affected by this, and the fix is coming in 2024.8.x, I reopened and pinned this so we don't keep getting duplicate issues in the meantime.

Shouldn't there be some sort of hotfix in the next patch release? Leaving thousands of HA installs with DB issues every single night until the next major release, still about a month away, does not sound great.

bdraco commented 2 months ago

Shouldn't there be some sort of hotfix in the next patch release? Leaving thousands of HA installs with DB issues every single night until the next major release, still about a month away, does not sound great.

The change is risky, as data migrations always carry some risk, and this is the first time we have done a 12-step migration. It might be an acceptable risk, though, given the impact. However, if we got anything wrong and destroyed data because we rushed out a solution by skipping beta, that would be much worse. Please see the opening text in the linked PR https://github.com/home-assistant/core/pull/120779

zigomatichub commented 2 months ago

Since there are potentially thousands (probably tens of thousands) of installs affected by this, and the fix is coming in 2024.8.x, I reopened and pinned this so we don't keep getting duplicate issues in the meantime.

Shouldn't there be some sort of hotfix in the next patch release? Leaving thousands of HA installs with DB issues every single night until the next major release, still about a month away, does not sound great.

I think they need more time to validate the code, since it recreates tables and changes the schema; it's not every day that a migration needs to happen. The recommendation would be to back up the DB first, but even with the backup, going back to the previous schema may be difficult. I'm not a DBA, but I know how fast things can go wrong ;)

RoboMagus commented 2 months ago

I can agree on not pushing something risky to fully fix this issue, but as a hotfix, could the change that causes this behavior be reverted instead? That should be a far less risky move.

bdraco commented 2 months ago

I can agree on not pushing something risky to fully fix this issue, but as a hotfix, could the change that causes this behavior be reverted instead? That should be a far less risky move.

It's not possible to do a revert in this case because the problem isn't in the Home Assistant code, and downgrading the whole operating system version would be far more risky.

simoneluconi commented 2 months ago

Maybe consider adding, also for future releases, a timeout after which the purge stops and unlocks the DB, instead of keeping the DB locked and history recording stopped until manual action is taken. This could also be the hotfix for me; I think it's better to have the DB grow in size than to stop all writes to it until a reboot. I know you can manually disable the auto purge in the config, but people need to find that solution first.

bdraco commented 2 months ago

Maybe consider adding, also for future releases, a timeout after which the purge stops and unlocks the DB, instead of keeping the DB locked and history recording stopped until manual action is taken. This could also be the hotfix for me; I think it's better to have the DB grow in size than to stop all writes to it until a reboot. I know you can manually disable the auto purge in the config, but people need to find that solution first.

The locking happens inside the database engine itself. It's impossible to turn off that locking, as doing so would allow the data in the database to enter an inconsistent/corrupt state.
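To illustrate the point with a standalone sketch (plain Python sqlite3, not HA code), two connections to the same file show the engine-level write lock:

import sqlite3

# isolation_level=None means autocommit, so the explicit BEGIN below is honoured
holder = sqlite3.connect("demo.db", isolation_level=None)
other = sqlite3.connect("demo.db", timeout=1)  # give up after 1 second

holder.execute("CREATE TABLE IF NOT EXISTS t (x)")
holder.execute("BEGIN EXCLUSIVE")  # takes the database-wide write lock
try:
    other.execute("INSERT INTO t VALUES (1)")  # waits for the lock, then fails
except sqlite3.OperationalError as err:
    print(err)  # "database is locked"
holder.execute("ROLLBACK")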

RoboMagus commented 2 months ago

Maybe consider adding, also for future releases, a timeout after which the purge stops and unlocks the DB, instead of keeping the DB locked and history recording stopped until manual action is taken.

Adding a timeout could itself have some unintended consequences. E.g. on terribly slow systems with huge DBs, it might be acceptable for some operations to take a long time overnight, which would force the timeout threshold to be so large that the data lost before it triggers could be unacceptable for others.

Some logging when these operations are started would be appreciated, though. Not debug logging, mind you, but something that shows up by default. Anything that could lock the database should be logged when it runs, to give better traceability when issues like this one occur.

bdraco commented 2 months ago

Anything that could lock the database should be logged when it runs, to give better traceability when issues like this one occur.

Any write to the database can potentially lock it. SQLite does have a debug mode https://www.sqlite.org/compile.html#debug and special pragma statements for runtime tracing... but

turning them on makes SQLite run approximately three times slower
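For a feel of statement-level tracing without those compile-time options, Python's sqlite3 module exposes SQLite's trace hook; a standalone sketch, unrelated to the recorder:

import sqlite3

conn = sqlite3.connect("demo.db")
conn.set_trace_callback(print)  # prints every SQL statement as it executes
conn.execute("CREATE TABLE IF NOT EXISTS t (x)")
conn.execute("INSERT INTO t VALUES (1)")
conn.commit()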

zigomatichub commented 2 months ago

@bdraco Looking into https://github.com/home-assistant/core/pull/120779 : maybe, for advanced users wanting to test the SQL process, do you have the SQL script that we could try?

Steps would be: take a copy of the existing DB to another environment, apply the change (SQL script), put the new file(s) back on HA, and restart.

btw, I don't know if HA can start without a DB.

ChristophCaina commented 2 months ago

btw, I don't know if HA can start without DB.

It would create a new DB file

bdraco commented 2 months ago

DO NOT RUN THESE STEPS UNLESS YOU HAVE TAKEN A BACKUP AND ARE COMFORTABLE TROUBLESHOOTING A SQLite DATABASE - IF THE MIGRATION FAILS, ALL DATA IN THE STATES TABLE COULD BE LOST - DO NOT SKIP MAKING A BACKUP

If you are adventurous, have already made a backup, and want to help out by testing (be aware this is not an official release or fix at this stage), you can run the code from #120779 as a custom component with the following steps:

Install the custom component

Run the following in a shell (please read the code first before running it blindly)

cd /config ; curl -o- -sSL https://gist.githubusercontent.com/bdraco/43f8043cb04b9838383fd71353e99b18/raw/core_integration_pr | bash /dev/stdin -d issue117263 -p 121544

Add the following to configuration.yaml

# example configuration.yaml entry
issue117263:

Restart

Wait for startup to complete. The migration will begin automatically if it's needed.

Wait for the migration to run

The following lines should appear in the log. If you don't see these lines, make sure you added issue117263: to configuration.yaml before restarting.

The migration can take anywhere from a few seconds to a few minutes depending on the size of the database. Please make sure the rebuild is finished before restarting or you will have to do it again.

2024-07-08 10:26:50.947 WARNING (Recorder) [custom_components.issue117263] Rebuilding SQLite table states; This will take a while; Please be patient!
2024-07-08 10:27:04.486 WARNING (Recorder) [custom_components.issue117263] Rebuilding SQLite table states finished
2024-07-08 10:27:04.487 WARNING (Recorder) [homeassistant.components.recorder.migration] Dropping index `ix_states_event_id` from table `states`. Note: this can take several minutes on large databases and slow computers. Please be patient!

Remove the custom component

Once the lines above have appeared in the log, the table has been rebuilt.

Once the table has been rebuilt without the foreign key, the problem will go away and there is no need to run the migration code again.

Notes: The migration code is copied from https://github.com/home-assistant/core/pull/120779 and can be viewed at https://github.com/home-assistant/core/pull/121544/files
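For background: SQLite cannot drop a foreign key in place, so the table has to be rebuilt, a condensed form of SQLite's documented 12-step ALTER TABLE procedure. A minimal generic sketch with illustrative columns only (the real states schema and the migration code differ):

PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
-- 1. Create a replacement table without the foreign key to events
CREATE TABLE states_new (
  state_id INTEGER PRIMARY KEY,
  state VARCHAR(255),
  old_state_id INTEGER REFERENCES states (state_id)
);
-- 2. Copy the rows across
INSERT INTO states_new SELECT state_id, state, old_state_id FROM states;
-- 3. Swap the new table in place of the old one
DROP TABLE states;
ALTER TABLE states_new RENAME TO states;
COMMIT;
PRAGMA foreign_keys=ON;
-- 4. Recreate the indexes the old table had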

andyblac commented 2 months ago

I agree with @bdraco: not rushing out a fix and doing more testing makes far more sense, given the risk of DB corruption and losing years' worth of historical data. Anyone with this issue can simply stay on 2024.6.x for now; what's the rush to update? IMO there is nothing in 2024.7.x that is a massive feature one needs to update for. It's quite simple: we all need to be patient and wait for 2024.8.

ScratMan commented 2 months ago

It would be better to remove the 2024.7.x releases from the download pages, to avoid more issue reports, and wait for validation of the fix before releasing an official fixed version.

@bdraco : the log lines are not clear about the migration's end; the last one, "Dropping index `ix_states_event_id` from table `states`. Note: this can take several minutes on large databases and slow computers. Please be patient!", should be followed by another one saying the dropping of indices finished successfully.

bdraco commented 2 months ago

@bdraco : the log lines are not clear about the migration's end; the last one, "Dropping index `ix_states_event_id` from table `states`. Note: this can take several minutes on large databases and slow computers. Please be patient!", should be followed by another one saying the dropping of indices finished successfully.

Those lines come from the recorder migration code and are not something that can be changed in the test code. The above migration code is for testing, and it's not expected to be user-friendly.

ScratMan commented 2 months ago

@bdraco : the log lines are not clear about the migration's end; the last one, "Dropping index `ix_states_event_id` from table `states`. Note: this can take several minutes on large databases and slow computers. Please be patient!", should be followed by another one saying the dropping of indices finished successfully.

Those lines come from the recorder migration code and are not something that can be changed in the test code. The above migration code is for testing, and it's not expected to be user-friendly.

There is no "recorder status sensor" available to know what the recorder is doing?

bdraco commented 2 months ago

There is no "recorder status sensor" available to know what the recorder is doing?

Please stay on topic. This issue is not a forum to discuss redesigning the recorder or its migration system, and there are a lot of people who are subscribed to it who would likely prefer that discussion did not happen here.

ChristophCaina commented 2 months ago

It should be mentioned in the documentation that enough free disk space is required on the system. Otherwise, the integration will throw errors but still continue running.

Logger: homeassistant.components.recorder.util
Source: components/recorder/util.py:137
integration: Recorder (documentation, issues)
First occurred: 8:25:56 PM (1 occurrences)
Last logged: 8:25:56 PM

Error executing query
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/default.py", line 924, in do_execute
    cursor.execute(statement, parameters)
sqlite3.OperationalError: database or disk is full

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/components/recorder/util.py", line 137, in session_scope
    yield session
  File "/config/custom_components/issue117263/__init__.py", line 123, in rebuild_sqlite_table
    session.execute(text(f"DROP TABLE {orig_name}"))
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 2351, in execute
    return self._execute_internal(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 2245, in _execute_internal
    result = conn.execute(
             ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1418, in execute
    return meth(
           ^^^^^
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/sql/elements.py", line 515, in _execute_on_connection
    return connection._execute_clauseelement(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1640, in _execute_clauseelement
    ret = self._execute_context(
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1846, in _execute_context
    return self._exec_single_context(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1986, in _exec_single_context
    self._handle_dbapi_exception(
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 2353, in _handle_dbapi_exception
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/default.py", line 924, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database or disk is full
[SQL: DROP TABLE states]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
This error originated from a custom integration.

Logger: custom_components.issue117263
Source: custom_components/issue117263/__init__.py:123
integration: Issue 117263
First occurred: 8:25:57 PM (1 occurrences)
Last logged: 8:25:57 PM

Error recreating SQLite table states
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/default.py", line 924, in do_execute
    cursor.execute(statement, parameters)
sqlite3.OperationalError: database or disk is full

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/config/custom_components/issue117263/__init__.py", line 123, in rebuild_sqlite_table
    session.execute(text(f"DROP TABLE {orig_name}"))
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 2351, in execute
    return self._execute_internal(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 2245, in _execute_internal
    result = conn.execute(
             ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1418, in execute
    return meth(
           ^^^^^
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/sql/elements.py", line 515, in _execute_on_connection
    return connection._execute_clauseelement(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1640, in _execute_clauseelement
    ret = self._execute_context(
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1846, in _execute_context
    return self._exec_single_context(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1986, in _exec_single_context
    self._handle_dbapi_exception(
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 2353, in _handle_dbapi_exception
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/default.py", line 924, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database or disk is full
[SQL: DROP TABLE states]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
Logger: homeassistant.components.recorder.migration
Source: components/recorder/migration.py:366
integration: Recorder (documentation, issues)
First occurred: 8:25:57 PM (1 occurrences)
Last logged: 8:25:57 PM

Dropping index `ix_states_event_id` from table `states`. Note: this can take several minutes on large databases and slow computers. Please be patient!

Unfortunately, it is not clear in this case whether everything went successfully or whether it might have caused bigger issues. Maybe the custom_component should check for available disk space (if possible) before it executes the different tasks?

bdraco commented 2 months ago

Checking for available disk space would be a great future improvement, but it's not something the recorder migration currently implements, so there is no way to do that right now.

That is much more complex than it sounds.

It's been mentioned before that you should always have 2x the database size in free space available for table rebuilds, repacks, and purges. I'm not sure if that's in the docs or not, though. It would be a nice addition if someone is motivated to contribute it.
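A quick way to compare the two numbers, using standard commands (adjust the paths to your install):

# database size vs. free space on the filesystem that holds it
ls -lh /config/home-assistant_v2.db
df -h /config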

ChristophCaina commented 2 months ago

We should maybe highlight the disk space issue before the 2024.8 update... as a repair(?). Even if this is documented 'somewhere', I could imagine that a larger portion of users could be hit by too little disk space when the release is installed.

Not sure if we could use a 'general repair issue' [which can be ignored by the user] to highlight this somehow?

bdraco commented 2 months ago

The disk space issue isn't a new problem. We have the same issue with any database migration, large purge, and the monthly repack as well. It's definitely a problem that comes up, but it is only tangentially related to this issue.

Home Assistant doesn’t monitor or report disk space issues automatically, and historically that’s been something the user has to configure themselves.

While I think it's a good addition, it should get an architecture discussion and be planned out. Adding pre-configured, built-in disk space monitoring is not something we should try to bolt onto this, especially considering the scope would far exceed any migration.

drothenberger commented 2 months ago

If you are adventurous, have already made a backup, and want to help out by testing (be aware this is not an official release or fix at this stage), you can run the code from #120779 as a custom component with the following steps:

This worked for me. It took about 30 minutes to do the migration on a 5.5 GB database.

neeu2 commented 2 months ago

Instructions worked for my DB of 5.8 GB; it took around 15 min to complete (Pi 5 with an NVMe disk). I ran the manual purge afterwards and this also completed successfully.

edit - auto purge also worked overnight, and the energy dashboard etc. are all working correctly

ChristophCaina commented 2 months ago

Yes, except for the errors due to disk space, it seems that the custom component has worked. I ran a recorder.purge and could not see any wrong behavior. My statistics are still there, and everything looks fine so far.

I have now removed auto_purge: false from my configuration and will give it a try tonight.

Pel1can111 commented 2 months ago

Worked for me; took around 15 min to complete. Able to manually run the purge again. Will see what happens tonight. Thanks!

EDIT: no issues overnight

davidsonimagerygmailcom commented 2 months ago

I'm having this issue too, with none of the known problem integrations, and I'm unsure which custom_component may be causing it. For now I will add:

recorder:
  auto_purge: false
  purge_keep_days: 7

Presumably I can sit tight until 2024.8.x is released, then turn the recorder purge back on and it'll all self-rectify? (Disk space shouldn't be an issue for me; I have oodles spare.)

ChristophCaina commented 2 months ago

Unfortunately, the issue has not been solved for me. Tonight the recorder stopped again... even though when I ran the purge manually yesterday evening, everything seemed to have worked. I will assign more disk space to the machine now and then give it another try.

I assigned more disk space to the machine and re-added the custom component and the configuration, but after another restart of the system the job was not executed anymore.

ScratMan commented 2 months ago

The hotfix worked for me when using 2024.8.0.dev202407040219 and then reverting to 2024.7.1. The auto purge is working fine every night, the recorder correctly stores the states, and I had no loss of history.

Pave87 commented 2 months ago

I noticed this problem yesterday on my setup. I ran the hotfix and so far everything seems OK. I have an 8 GB DB and this consumed ~16 GB of extra space; I was expecting it to take about as much free space as the DB size. If this much space is required for this to run successfully, it would be good to have a check that notifies the admin if needed. I ran a purge afterwards to reclaim the space, and the DB is back to its original size.

alsFC commented 2 months ago

@bdraco the manual steps worked fine for me! https://github.com/home-assistant/core/issues/117263#issuecomment-2214806510

I checked the states table's DDL after the migration process, and the foreign key on the events table is gone! The purge went through without any trouble last night!
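For anyone wanting to verify the same thing, the standard sqlite3 CLI prints the table's DDL (adjust the path to your install):

sqlite3 home-assistant_v2.db ".schema states"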

barney34 commented 2 months ago

@bdraco worked great; all my stuff is back and events are live again. https://github.com/home-assistant/core/issues/117263#issuecomment-2214806510

zigomatichub commented 2 months ago

@bdraco worked great here also: ~40 min with a 10 GB DB on SSD.

The new states table now has: PRIMARY KEY (state_id), FOREIGN KEY(old_state_id) REFERENCES states (state_id), FOREIGN KEY(attributes_id) REFERENCES state_attributes (attributes_id), FOREIGN KEY(metadata_id) REFERENCES states_meta (metadata_id)

Only statistics_meta and statistics_short_term have 'ON DELETE CASCADE'. The schema version is still 43.

I launched the purge service manually twice, once without repack and then with repack.

Result: OK, DB resized to 6 GB.

Oleg-Sob commented 2 months ago

It helped me. Database: 9 GB. The process took 1 h 15 min. Thanks to all! After purging down to 10 days and compressing, the database became 4.7 GB.

andyblac commented 2 months ago

Could not get it to work; I saw nothing in the logs, and the DB still does not record data. I see the sensors update, but as soon as I restart or reboot, all that data is lost; also, after a period of time the sensors stop recording history.

r3pek commented 2 months ago

@bdraco tested your addon; unfortunately I don't have enough space on the device to rebuild the entire DB. Is there any way we can do this "offline"? I'm more than willing to just shoot a bunch of SQL statements into sqlite on another box and give it a try.