Optimize SQLite schema and queries

kellnerd commented 4 months ago

With lots of different URIs being stored, the underlying SQLite database is becoming significantly slower.

My knowledge about DB optimization is limited, but I feel like defining an index column for the uri.value column would be a good first step. Database schema and queries are defined here: https://github.com/kellnerd/snap_storage/blob/653a8c9e8123d6c25ae3f11832bf318266b19f88/src/snap_storage.ts#L65-L90

kellnerd commented 4 months ago

EXPLAIN QUERY PLAN
SELECT timestamp, content_hash FROM snap
JOIN uri ON snap.uri_id = uri.id
WHERE uri.value = 'https://api.deezer.com/track/2729047191' AND timestamp <= 1720349182
ORDER BY timestamp DESC;

id  parent  notused detail
5   0   0   SEARCH uri USING COVERING INDEX sqlite_autoindex_uri_1 (value=?)
9   0   0   SCAN snap
26  0   0   USE TEMP B-TREE FOR ORDER BY

If I interpret this query plan correctly, the uri.value column is unproblematic as it is already indexed (because of its unique constraint). Scanning all snaps for snap.uri_id is the expensive part which has to be optimized.

phw commented 4 months ago

The single query that seems to be done is:

SELECT timestamp, content_hash FROM snap
JOIN uri ON snap.uri_id = uri.id
WHERE uri.value = ? AND timestamp <= ?
ORDER BY timestamp DESC
LIMIT 1;

I looked at the execution plan with:

EXPLAIN SELECT timestamp, content_hash FROM snap
JOIN uri ON snap.uri_id = uri.id
WHERE uri.value = '' AND timestamp <= 1
ORDER BY timestamp DESC
LIMIT 1;

The following index seems to result in an optimized query:

CREATE INDEX IF NOT EXISTS idx_snap_uri_id_timestamp ON snap (uri_id, timestamp DESC);

This improves the request on the timestamp and also seems to help with the join. Creating the index on uri.value instead seems not to get used by sqlite, it resulted in the same execution plan.

But needs testing and benchmarking with a full database, I only checked quickly with a nearly empty one.

Other candidates would be timestamp and uri_id but in reversed order, but this seems to be a bit less efficient.

phw commented 4 months ago

CREATE INDEX IF NOT EXISTS idx_snap_uri_id_timestamp ON snap (uri_id, timestamp DESC);

EXPLAIN QUERY PLAN SELECT timestamp, content_hash FROM snap
JOIN uri ON snap.uri_id = uri.id
WHERE uri.value = '' AND timestamp <= 1
ORDER BY timestamp DESC
LIMIT 1;

Gives:

id	parent	notused	detail
6	0	0	SEARCH uri USING COVERING INDEX sqlite_autoindex_uri_1 (value=?)
10	0	0	SEARCH snap USING INDEX idx_snap_uri_id_timestamp (uri_id=? AND timestamp<?)

atj commented 4 months ago

I took a copy of the current database and used the SQLite .scanstats profiling option to get an idea of the impact of adding the index.

sqlite> .scanstats on
sqlite> SELECT timestamp, content_hash FROM snap JOIN uri ON snap.uri_id = uri.id WHERE uri.value = 'https://api.deezer.com/track/2729047191' AND timestamp <= 1720349182 ORDER BY timestamp DESC;
1714994159|29e386ac6aa0b3f5b4a84236285841557f96f005e19a9e8a02d9e110d2e62dd6
1713720848|af7d68d74ac058d604a1b7be9437f6540d8d7bb5ecbba1b3f26a53f01df9d1d9
QUERY PLAN (cycles=18817920 [100%])
|--SEARCH uri USING COVERING INDEX sqlite_autoindex_uri_1 (value=?)     (cycles=3748792 [20%] loops=1 rows=1)
|--SCAN snap                                                            (cycles=11484542 [61%] loops=1 rows=81782)
`--USE TEMP B-TREE FOR ORDER BY                                         (cycles=46030 [0%] loops=1 rows=2)

sqlite> CREATE INDEX IF NOT EXISTS idx_snap_uri_id_timestamp ON snap (uri_id, timestamp DESC);

sqlite> SELECT timestamp, content_hash FROM snap JOIN uri ON snap.uri_id = uri.id WHERE uri.value = 'https://api.deezer.com/track/2729047191' AND timestamp <= 1720349182 ORDER BY timestamp DESC;
1714994159|29e386ac6aa0b3f5b4a84236285841557f96f005e19a9e8a02d9e110d2e62dd6
1713720848|af7d68d74ac058d604a1b7be9437f6540d8d7bb5ecbba1b3f26a53f01df9d1d9
QUERY PLAN (cycles=239668 [100%])
|--SEARCH uri USING COVERING INDEX sqlite_autoindex_uri_1 (value=?)                 (cycles=15648 [7%] loops=1 rows=1)
`--SEARCH snap USING INDEX idx_snap_uri_id_timestamp (uri_id=? AND timestamp<?)     (cycles=38892 [16%] loops=1 rows=2)

Cycles (which is apparently a good proxy for wall clock time) was reduced by ~77x after the index was added.

kellnerd commented 4 months ago

Thank you both, I have now added this index in https://github.com/kellnerd/snap_storage/commit/161f02201d2e4ec613d17e07a86d431acf0416ac and released v0.6.2.

kellnerd / snap_storage

Optimize SQLite schema and queries #2