home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0
73.43k stars 30.67k forks source link

get_last_state_changes recorder history query performs poorly with a large number of states in the database #90113

Closed julianpas closed 1 year ago

julianpas commented 1 year ago

The problem

Hi,

HA version: 2023.3.4

I have observed that the states lookup for enttities that I added to the ecluded list for the recorder are very slow on MySQL 8.0.29 running in a docker container. I traced the issue down to the issue to queries like this one:

SELECT states.entity_id, states.state, states.last_changed_ts, states.last_updated_ts, NULL AS attributes, NULL AS shared_attrs FROM states WHERE states.entity_id = 'sensor.some_sensor' AND states.last_updated_ts < 1679395218.224e0 ORDER BY states.last_updated_ts DESC LIMIT 1

which don't use the index ix_states_entity_id_last_updated_ts due to fact that the index is created with last_update_ts being in default ASC order and the query having an ORDER BY clause specifying DESC order. After changing the index to also be in DESC order the query goes down from ~4m to 0.13ms (states table has 42mil rows for me as I like to keep longer history).

I tried to submit a pull request to fix that but I can't find where this index is defined in the source code.

What version of Home Assistant Core has the issue?

2023.3.4

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant Supervised

Integration causing the issue

No response

Link to integration documentation on our website

No response

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

No response

julianpas commented 1 year ago

To add to that the problem seems to have been made more visible by the fact that the old_state_id FK was preventing purging of the DB so I was actually left with way more than the 90 days I wanted. Either way I think the optimization above is a valid one.

home-assistant[bot] commented 1 year ago

Hey there @home-assistant/core, mind taking a look at this issue as it has been labeled with an integration (recorder) you are listed as a code owner for? Thanks!

Code owner commands Code owners of `recorder` can trigger bot actions by commenting: - `@home-assistant close` Closes the issue. - `@home-assistant rename Awesome new title` Renames the issue. - `@home-assistant reopen` Reopen the issue. - `@home-assistant unassign recorder` Removes the current integration label and assignees on the issue, add the integration domain after the command.

(message by CodeOwnersMention)


recorder documentation recorder source (message by IssueLinks)

bdraco commented 1 year ago

This is probably the change you want, but most of the queries are the other direction so I'm not sure we should do it

diff --git a/homeassistant/components/recorder/db_schema.py b/homeassistant/components/recorder/db_schema.py
index cf4c0543c1..021e2d7504 100644
--- a/homeassistant/components/recorder/db_schema.py
+++ b/homeassistant/components/recorder/db_schema.py
@@ -370,18 +370,6 @@ class EventTypes(Base):
 class States(Base):
     """State change history."""

-    __table_args__ = (
-        # Used for fetching the state of entities at a specific time
-        # (get_states in history.py)
-        Index(METADATA_ID_LAST_UPDATED_INDEX_TS, "metadata_id", "last_updated_ts"),
-        Index(
-            STATES_CONTEXT_ID_BIN_INDEX,
-            "context_id_bin",
-            mysql_length=CONTEXT_ID_BIN_MAX_LENGTH,
-            mariadb_length=CONTEXT_ID_BIN_MAX_LENGTH,
-        ),
-        _DEFAULT_TABLE_ARGS,
-    )
     __tablename__ = TABLE_STATES
     state_id: Mapped[int] = mapped_column(Integer, Identity(), primary_key=True)
     entity_id: Mapped[str | None] = mapped_column(
@@ -436,6 +424,19 @@ class States(Base):
     )
     states_meta_rel: Mapped[StatesMeta | None] = relationship("StatesMeta")

+    __table_args__ = (
+        # Used for fetching the state of entities at a specific time
+        # (get_states in history.py)
+        Index(METADATA_ID_LAST_UPDATED_INDEX_TS, "metadata_id", last_updated_ts.desc()),
+        Index(
+            STATES_CONTEXT_ID_BIN_INDEX,
+            "context_id_bin",
+            mysql_length=CONTEXT_ID_BIN_MAX_LENGTH,
+            mariadb_length=CONTEXT_ID_BIN_MAX_LENGTH,
+        ),
+        _DEFAULT_TABLE_ARGS,
+    )
+
     def __repr__(self) -> str:
         """Return string representation of instance for debugging."""
         return (
bdraco commented 1 year ago

Still testing this

It makes the max queries faster and the ranged query slower because it has to resort

bdraco commented 1 year ago

reverse sort order

QUERY PLAN
|--SEARCH states USING INDEX ix_states_metadata_id_last_updated_ts (metadata_id=? AND last_updated_ts>? AND last_updated_ts<?)
|--LIST SUBQUERY 2
|  |--CO-ROUTINE SUBQUERY 1
|  |  `--SCAN CONSTANT ROW
|  `--SCAN SUBQUERY 1
`--USE TEMP B-TREE FOR RIGHT PART OF ORDER BY
sqlite> explain query plan SELECT states.metadata_id, states.state, NULL AS last_changed_ts, states.last_updated_ts 
FROM states 
WHERE (states.metadata_id IN (SELECT 1 FROM (SELECT 1) WHERE 1!=1) OR states.last_changed_ts = states.last_updated_ts OR states.last_changed_ts IS NULL) AND states.metadata_id IN (1787, 1789, 1790, 1788) AND states.last_updated_ts > 1679617621.539766 AND states.last_updated_ts < 1679617668.940382 ORDER BY states.metadata_id, states.last_updated_ts;
QUERY PLAN
|--SEARCH states USING INDEX ix_states_metadata_id_last_updated_ts (metadata_id=? AND last_updated_ts>? AND last_updated_ts<?)
|--LIST SUBQUERY 2
|  |--CO-ROUTINE SUBQUERY 1
|  |  `--SCAN CONSTANT ROW
|  `--SCAN SUBQUERY 1
`--USE TEMP B-TREE FOR RIGHT PART OF ORDER BY

existing order

sqlite> drop index ix_states_metadata_id_last_updated_ts;           
sqlite> create index ix_states_metadata_id_last_updated_ts on states (metadata_id, last_updated_ts);           
sqlite> explain query plan SELECT states.metadata_id, states.state, NULL AS last_changed_ts, states.last_updated_ts 
FROM states 
WHERE (states.metadata_id IN (SELECT 1 FROM (SELECT 1) WHERE 1!=1) OR states.last_changed_ts = states.last_updated_ts OR states.last_changed_ts IS NULL) AND states.metadata_id IN (1787, 1789, 1790, 1788) AND states.last_updated_ts > 1679617621.539766 AND states.last_updated_ts < 1679617668.940382 ORDER BY states.metadata_id, states.last_updated_ts;
QUERY PLAN
|--SEARCH states USING INDEX ix_states_metadata_id_last_updated_ts (metadata_id=? AND last_updated_ts>? AND last_updated_ts<?)
`--LIST SUBQUERY 2
   |--CO-ROUTINE SUBQUERY 1
   |  `--SCAN CONSTANT ROW
   `--SCAN SUBQUERY 1
sqlite> explain query plan SELECT states.metadata_id, states.state, NULL AS last_changed_ts, states.last_updated_ts 
FROM states 
WHERE (states.metadata_id IN (SELECT 1 FROM (SELECT 1) WHERE 1!=1) OR states.last_changed_ts = states.last_updated_ts OR states.last_changed_ts IS NULL) AND states.metadata_id IN (1787, 1789, 1790, 1788) AND states.last_updated_ts > 1677664800.0 AND states.last_updated_ts < 1679617668.940382 ORDER BY states.metadata_id, states.last_updated_ts;
QUERY PLAN
|--SEARCH states USING INDEX ix_states_metadata_id_last_updated_ts (metadata_id=? AND last_updated_ts>? AND last_updated_ts<?)
`--LIST SUBQUERY 2
   |--CO-ROUTINE SUBQUERY 1
   |  `--SCAN CONSTANT ROW
   `--SCAN SUBQUERY 1
sqlite> 
bdraco commented 1 year ago

With a long history the distance to the last updated for a specific entity could be quite long without a descending index

But if we switch it its going to make all the forward queries slower

bdraco commented 1 year ago

We really can't have it either way. Both need to be fast in both directions so I think we are stuck with another index

bdraco commented 1 year ago

@julianpas AFAICT this query is only used by the filter integration

homeassistant/components/filter/sensor.py: history.get_last_state_changes,

I'm assuming you are using that?

home-assistant[bot] commented 1 year ago

Hey there @dgomes, mind taking a look at this issue as it has been labeled with an integration (filter) you are listed as a code owner for? Thanks!

Code owner commands Code owners of `filter` can trigger bot actions by commenting: - `@home-assistant close` Closes the issue. - `@home-assistant rename Awesome new title` Renames the issue. - `@home-assistant reopen` Reopen the issue. - `@home-assistant unassign filter` Removes the current integration label and assignees on the issue, add the integration domain after the command.

(message by CodeOwnersMention)


filter documentation filter source (message by IssueLinks)

bdraco commented 1 year ago

https://github.com/home-assistant/core/pull/90211 should improve the query for the single state case.

Currently the only integration using this query is filter. I'm inclined to deprecate calling this api with a request for more than one in a future PR (which happens to be the default for the filter integration) since we cannot deliver it performantly without https://github.com/home-assistant/core/pull/90209 which would force everyone to have to carry another large index which isn't something we should do for a single integration.

This would be some bad news for some filter configurations since they could no longer be restored at startup unless there was some refactoring to the filter integration:

Ideally filter would switch to using something like https://github.com/home-assistant/core/blob/f1ec77b8e07eb9c8fd140a32327f6085040ad355/homeassistant/components/utility_meter/sensor.py#L616 and storing the previous X states via extra_restore_state_data so when it restarted it wouldn't have to ask the database unless the window_size is 1 (or limit the restore query size to 1).

bdraco commented 1 year ago

If one of the filter code owners wants help with the above, I'm happy to participate.

Since I don't use the filter integration myself I'm not a good candidate to make the actual change or handle any followup as a result of the change as I would not be able to adequately qa a change from a user perspective

julianpas commented 1 year ago

@julianpas AFAICT this query is only used by the filter integration

homeassistant/components/filter/sensor.py: history.get_last_state_changes,

I'm assuming you are using that?

I think in my case it was the sensor card configured to show 48h of recent graph that caused this. Eventually it was the perfect storm of situations that made me aware of the issue:

  1. The EdgeOS integration for my ER-4 which had a bug and was writing multiple updates per second and accumulated a few milion records before I filtered it out from recorder.
  2. The desire to use the sensor card to see the recent history of those states e.g. bandwidth per port, cpu/mem usage etc.
  3. The fact that I also suffered from the issue where the foreign key contraint on old_state_id <-> state_id prevented the proper purging of the DB or else those old states should have been gone since it is more than 90 days that I disabled the history logging for those .

Now I used the advice from another bug to change the old_state_id foreign key contraint to set to NULL instead of prevent the delete op. My DB has been purged and compacted.

I tend to agree that maybe instead of changing the index if it is used elsewhere as is or create both which is expensive (my indeces already take more space than the data on that table) it is worth fixing the places where the DESC query is used and to replace with post retrieval order reversal if possible.

bdraco commented 1 year ago

The fact that I also suffered from the issue where the foreign key contraint on old_state_id <-> state_id prevented the proper purging of the DB or else those old states should have been gone since it is more than 90 days that I disabled the history logging for those .

Is there an issue open on this or was it bad manual schema ?

julianpas commented 1 year ago

The fact that I also suffered from the issue where the foreign key contraint on old_state_id <-> state_id prevented the proper purging of the DB or else those old states should have been gone since it is more than 90 days that I disabled the history logging for those .

Is there an issue open on this or was it bad manual schema ?

Yes, I found that others had problems with this too when I was searching for the foreign index name and stumbled upon this - https://github.com/home-assistant/core/issues/42402#issuecomment-729327643

I haven't messed up with the db schema until yesterday when I finally decided to tackle the problem that when I visited the panel with the aforementioned slow graphs my HA would "hang" showing history for any other graph for up to a minute because all it's recorder threads were busy waiting for the response on the slow queries. The db has been created about 1 year ago.

bdraco commented 1 year ago

Can you post the schema and the log messages you were seeing when trying to purge?

There is logic to disconnect the rows from the old state id before purging added from the the original issue you referenced. That's from two years ago though so I'm surprised to hear it's still an issue today.

bdraco commented 1 year ago

Also if you a backup from before you altered the schema, it would be great to get a copy as I can load it in my lab and try to purge. I can than adjust the migration and purge code to handle that case.

My Dropbox is nick@koston.org and drive is bdraco@gmail.com

julianpas commented 1 year ago

I am sorry I don't have a backup prior to altering the schema or doing a successful purge. I am attaching the current schema (renamed from sql to txt to make github happy) of the DM to this message and I put two comments on the two lines that were altered by me from the original schema.

Given the sensitivity of the data in the db and it's sheer volume 15GB data+index I would be hesitant to share it but if you wish me to run certain queries to check the integrity of the still present data I would be more than happy to do that.

hassio.txt

I checked the logs but there isn't any mentioning there of purging attempts since 3.15. which is the oldest date in the logs. Around the time when I was makng changes to the schema I have a lot of those errors though. And I saw some of those "too many locks" errors cropping up for me too when I was trying to clear myself some old records from the states table.

2023-03-23 14:22:38.301 ERROR (Recorder) [homeassistant.components.recorder.core] Error in database connectivity during commit: Error executing query: (MySQLdb.OperationalError) (1206, 'The total number of locks exceeds the lock table size') [SQL: INSERT INTO states (entity_id, state, attributes, event_id, last_changed, last_changed_ts, last_updated, last_updated_ts, old_state_id, attributes_id, context_id, context_user_id, context_parent_id, origin_idx) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)] [parameters: ('sensor.pws_wind_gust', '0.0', None, None, None, None, None, 1679577752.305077, 134360597, 417632, 'redacted', None, None, 0)] (Background on this error at: https://sqlalche.me/e/20/e3q8). (retrying in 3 seconds) 2023-03-23 14:22:41.334 ERROR (Recorder) [homeassistant.components.recorder.core] SQLAlchemyError error processing task CommitTask(): This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (MySQLdb.OperationalError) (1206, 'The total number of locks exceeds the lock table size') [SQL: INSERT INTO states (entity_id, state, attributes, event_id, last_changed, last_changed_ts, last_updated, last_updated_ts, old_state_id, attributes_id, context_id, context_user_id, context_parent_id, origin_idx) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)] [parameters: ('sensor.pws_wind_gust', '0.0', None, None, None, None, None, 1679577752.305077, 134360597, 417632, 'redacted', None, None, 0)] (Background on this error at: https://sqlalche.me/e/20/e3q8) (Background on this error at: https://sqlalche.me/e/20/7s2a) Traceback (most recent call last): File "/usr/src/homeassistant/homeassistant/components/recorder/core.py", line 786, in _process_one_task_or_recover return task.run(self) File "/usr/src/homeassistant/homeassistant/components/recorder/tasks.py", line 271, in run instance._commit_event_session_or_retry() File "/usr/src/homeassistant/homeassistant/components/recorder/core.py", line 1057, in _commit_event_session_or_retry self._commit_event_session() File "/usr/src/homeassistant/homeassistant/components/recorder/core.py", line 1076, in _commit_event_session self.event_session.commit() File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1903, in commit trans.commit(_to_root=True) File "", line 2, in commit File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/state_changes.py", line 103, in _go self._raise_for_prerequisite_state(fn.name, current_state) File "/usr/local/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 901, in _raise_for_prerequisite_state raise sa_exc.PendingRollbackError( sqlalchemy.exc.PendingRollbackError: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (MySQLdb.OperationalError) (1206, 'The total number of locks exceeds the lock table size')

I apologize for not keeping enough logs of the problem before forcing some fix. I totally understand how this is not helpful when trying to fix a problem being often on the other side of this communication myself :) In the future I will try ot gather intel before attempting a fix if something like this shows up.

bdraco commented 1 year ago

2023-03-23 14:22:38.301 ERROR (Recorder) [homeassistant.components.recorder.core] Error in database connectivity during commit: Error executing query: (MySQLdb.OperationalError) (1206, 'The total number of locks exceeds the lock table size')

That generally means its been searching too many rows. If it happens again you can do SHOW ENGINE INNODB STATUS to see which query is causing the issue

I apologize for not keeping enough logs of the problem before forcing some fix. I totally understand how this is not helpful when trying to fix a problem being often on the other side of this communication myself :) In the future I will try ot gather intel before attempting a fix if something like this shows up.

No worries. As we have fixed the LIMIT 1 case I'm going to close this issue out as the data we need to do any more improvements is no longer available.

If you do find another query that has an issue after 2023.4.x please don't hesitate to start a new issue.