Feed and entry tags should be indexed by key

Follow-up to https://github.com/lemon24/reader/issues/358.

Currently, the query for get_feeds(tags=['one']) results in a scan:

WITH
    __feed_tags AS (
        SELECT key
        FROM feed_tags 
        WHERE feed = feeds.url
    )
SELECT url
FROM feeds
WHERE '.plugin.slugs.one' IN __feed_tags
ORDER BY url

QUERY PLAN
|--SCAN TABLE feeds USING COVERING INDEX sqlite_autoindex_feeds_1
`--CORRELATED LIST SUBQUERY 2
   `--SEARCH TABLE feed_tags USING COVERING INDEX sqlite_autoindex_feed_tags_1 (feed=?)

With create index feed_tags_by_key on feed_tags(key), there are at least two ways to rewrite that query to only use searches:

WITH
    __tag_feeds AS (
        SELECT feed 
        FROM feed_tags
        WHERE key = '.plugin.slugs.one'
    )
SELECT url
FROM feeds
WHERE url in __tag_feeds
ORDER BY url

QUERY PLAN
|--SEARCH TABLE feeds USING COVERING INDEX sqlite_autoindex_feeds_1 (url=?)
`--LIST SUBQUERY 2
   `--SEARCH TABLE feed_tags USING INDEX feed_tags_by_key (key=?)

...and:

SELECT DISTINCT url
FROM feeds
JOIN feed_tags ON feeds.url = feed_tags.feed
WHERE feed_tags.key = '.plugin.slugs.one'
ORDER BY url

QUERY PLAN
|--SEARCH TABLE feed_tags USING INDEX feed_tags_by_key (key=?)
|--SEARCH TABLE feeds USING COVERING INDEX sqlite_autoindex_feeds_1 (url=?)
`--USE TEMP B-TREE FOR DISTINCT

This is not an issue for global tags, since the tag key is the primary key.

Repeating for entries:

exactly one entry has the tweeted tag; there's 1 tag per entry on average
also joining feeds, since that's what get_entries() does; not joining gives similar results
I ran each query a bunch of times and picked the one with the smallest run time

Current query:

WITH
    __entry_tags AS (
        SELECT key FROM entry_tags
        WHERE (id, feed) = (entries.id, entries.feed)
    )
SELECT entries.feed, entries.id, feeds.title
FROM entries
JOIN feeds ON feeds.url = entries.feed
WHERE 'tweeted' IN __entry_tags;

QUERY PLAN
|--SCAN TABLE feeds
|--SEARCH TABLE entries USING INDEX entries_by_feed (feed=?)
`--CORRELATED LIST SUBQUERY 2
   `--SEARCH TABLE entry_tags USING COVERING INDEX sqlite_autoindex_entry_tags_1 (id=? AND feed=?)
Run Time: real 0.087 user 0.062563 sys 0.024012

After create index entry_tags_by_key on entry_tags(key):

WITH
    __tag_entries AS (
        SELECT id, feed 
        FROM entry_tags
        WHERE key = 'tweeted'
    )
SELECT entries.feed, entries.id, feeds.title
FROM entries
JOIN feeds ON feeds.url = entries.feed
WHERE (id, feed) in __tag_entries;

QUERY PLAN
|--SEARCH TABLE entries USING COVERING INDEX sqlite_autoindex_entries_1 (id=?)
|--LIST SUBQUERY 2
|  `--SEARCH TABLE entry_tags USING INDEX entry_tags_by_key (key=?)
|--LIST SUBQUERY 2
|  `--SEARCH TABLE entry_tags USING INDEX entry_tags_by_key (key=?)
`--SEARCH TABLE feeds USING INDEX sqlite_autoindex_feeds_1 (url=?)
Run Time: real 0.000 user 0.000412 sys 0.000169

...and:

SELECT DISTINCT entries.feed, entries.id, feeds.title
FROM entries
JOIN feeds ON feeds.url = entries.feed
JOIN entry_tags ON (entries.feed, entries.id) = (entry_tags.feed, entry_tags.id)
WHERE entry_tags.key = 'tweeted';

QUERY PLAN
|--SEARCH TABLE entry_tags USING INDEX entry_tags_by_key (key=?)
|--SEARCH TABLE entries USING COVERING INDEX sqlite_autoindex_entries_1 (id=? AND feed=?)
|--SEARCH TABLE feeds USING INDEX sqlite_autoindex_feeds_1 (url=?)
`--USE TEMP B-TREE FOR DISTINCT
Run Time: real 0.000 user 0.000352 sys 0.000158

So, bit of a complication here.

As I see it now, there are two main ways of writing this query:

for each feed in the feeds table, get its tags and see if they match (the current, WITH __feed_tags query)
- this is, by definition, a scan on feeds – I don't see how it's possible to have an index on WHERE tag in CTE
for each feed in those whose tags match, get the feed from the feeds table (the WITH __tag_feeds, JOIN feed_tags queries)
- this only works with a single CTE for one OR two (tags=[['one', 'two']]); for one AND two (tags=[['one'], ['two']]), you have to intersect (join) two CTEs
- this only works for positive cases (has tag, has any tag), but not for negative cases (does not have tag, has no tags), since for the the latter the feeds won't appear in the tags table (it may be possible to write a more complicated subquery that uses feeds too, but it's ... too complicated)

Additionally, I don't think there's a way to write a query that allows the query planner to decide between the two (especially not the SQLite one).

(I vaguely remember going through this thought process when implementing tags initially, I now wish I'd had bothered to write the conclusions down.)

So, I think it's useful to have #2 as an optimization. Also, we can make a recommendation to storage implementation to optimize at least the "has single tag X" filters.

lemon24 / reader

Feed and entry tags should be indexed by key #359