lemon24 / reader

A Python feed reader library.
https://reader.readthedocs.io
BSD 3-Clause "New" or "Revised" License
456 stars 38 forks source link

Feed and entry tags should be indexed by key #359

Closed lemon24 closed 2 weeks ago

lemon24 commented 3 weeks ago

Follow-up to https://github.com/lemon24/reader/issues/358.

Currently, the query for get_feeds(tags=['one']) results in a scan:

WITH
    __feed_tags AS (
        SELECT key
        FROM feed_tags 
        WHERE feed = feeds.url
    )
SELECT url
FROM feeds
WHERE '.plugin.slugs.one' IN __feed_tags
ORDER BY url
QUERY PLAN
|--SCAN TABLE feeds USING COVERING INDEX sqlite_autoindex_feeds_1
`--CORRELATED LIST SUBQUERY 2
   `--SEARCH TABLE feed_tags USING COVERING INDEX sqlite_autoindex_feed_tags_1 (feed=?)

With create index feed_tags_by_key on feed_tags(key), there are at least two ways to rewrite that query to only use searches:

WITH
    __tag_feeds AS (
        SELECT feed 
        FROM feed_tags
        WHERE key = '.plugin.slugs.one'
    )
SELECT url
FROM feeds
WHERE url in __tag_feeds
ORDER BY url
QUERY PLAN
|--SEARCH TABLE feeds USING COVERING INDEX sqlite_autoindex_feeds_1 (url=?)
`--LIST SUBQUERY 2
   `--SEARCH TABLE feed_tags USING INDEX feed_tags_by_key (key=?)

...and:

SELECT DISTINCT url
FROM feeds
JOIN feed_tags ON feeds.url = feed_tags.feed
WHERE feed_tags.key = '.plugin.slugs.one'
ORDER BY url
QUERY PLAN
|--SEARCH TABLE feed_tags USING INDEX feed_tags_by_key (key=?)
|--SEARCH TABLE feeds USING COVERING INDEX sqlite_autoindex_feeds_1 (url=?)
`--USE TEMP B-TREE FOR DISTINCT

This is not an issue for global tags, since the tag key is the primary key.

lemon24 commented 3 weeks ago

Repeating for entries:

Current query:

WITH
    __entry_tags AS (
        SELECT key FROM entry_tags
        WHERE (id, feed) = (entries.id, entries.feed)
    )
SELECT entries.feed, entries.id, feeds.title
FROM entries
JOIN feeds ON feeds.url = entries.feed
WHERE 'tweeted' IN __entry_tags;
QUERY PLAN
|--SCAN TABLE feeds
|--SEARCH TABLE entries USING INDEX entries_by_feed (feed=?)
`--CORRELATED LIST SUBQUERY 2
   `--SEARCH TABLE entry_tags USING COVERING INDEX sqlite_autoindex_entry_tags_1 (id=? AND feed=?)
Run Time: real 0.087 user 0.062563 sys 0.024012

After create index entry_tags_by_key on entry_tags(key):

WITH
    __tag_entries AS (
        SELECT id, feed 
        FROM entry_tags
        WHERE key = 'tweeted'
    )
SELECT entries.feed, entries.id, feeds.title
FROM entries
JOIN feeds ON feeds.url = entries.feed
WHERE (id, feed) in __tag_entries;
QUERY PLAN
|--SEARCH TABLE entries USING COVERING INDEX sqlite_autoindex_entries_1 (id=?)
|--LIST SUBQUERY 2
|  `--SEARCH TABLE entry_tags USING INDEX entry_tags_by_key (key=?)
|--LIST SUBQUERY 2
|  `--SEARCH TABLE entry_tags USING INDEX entry_tags_by_key (key=?)
`--SEARCH TABLE feeds USING INDEX sqlite_autoindex_feeds_1 (url=?)
Run Time: real 0.000 user 0.000412 sys 0.000169

...and:

SELECT DISTINCT entries.feed, entries.id, feeds.title
FROM entries
JOIN feeds ON feeds.url = entries.feed
JOIN entry_tags ON (entries.feed, entries.id) = (entry_tags.feed, entry_tags.id)
WHERE entry_tags.key = 'tweeted';
QUERY PLAN
|--SEARCH TABLE entry_tags USING INDEX entry_tags_by_key (key=?)
|--SEARCH TABLE entries USING COVERING INDEX sqlite_autoindex_entries_1 (id=? AND feed=?)
|--SEARCH TABLE feeds USING INDEX sqlite_autoindex_feeds_1 (url=?)
`--USE TEMP B-TREE FOR DISTINCT
Run Time: real 0.000 user 0.000352 sys 0.000158
lemon24 commented 3 weeks ago

So, bit of a complication here.

As I see it now, there are two main ways of writing this query:

  1. for each feed in the feeds table, get its tags and see if they match (the current, WITH __feed_tags query)
    • this is, by definition, a scan on feeds – I don't see how it's possible to have an index on WHERE tag in CTE
  2. for each feed in those whose tags match, get the feed from the feeds table (the WITH __tag_feeds, JOIN feed_tags queries)
    • this only works with a single CTE for one OR two (tags=[['one', 'two']]); for one AND two (tags=[['one'], ['two']]), you have to intersect (join) two CTEs
    • this only works for positive cases (has tag, has any tag), but not for negative cases (does not have tag, has no tags), since for the the latter the feeds won't appear in the tags table (it may be possible to write a more complicated subquery that uses feeds too, but it's ... too complicated)

Additionally, I don't think there's a way to write a query that allows the query planner to decide between the two (especially not the SQLite one).

(I vaguely remember going through this thought process when implementing tags initially, I now wish I'd had bothered to write the conclusions down.)

So, I think it's useful to have #​2 as an optimization. Also, we can make a recommendation to storage implementation to optimize at least the "has single tag X" filters.