get_entries(has_enclosures=...) should be a plugin(?)

lemon24 commented 1 year ago

The has_enclosures filter predates entry tags, and was meant as a proxy for "is a podcast item" (which works fine, at least with the feeds I'm subscribed to).

The same functionality can be obtained with a plugin that sets a tag, then used as a filter with get_entries(tags=['.has-enclosures']).

Some arguments for this:

Makes the core simpler (and lowers the burden on storage implementations).
Allows for more specific logic, e.g. .has-audio-enclosures.
Provides a reason to implement filtering by entry tags. Update: done in #328.
- Update: ...and a reason for get_entry_counts() to be able to count tags! (kinda need to think how this would work)

Removing the has_enclosures argument is a compatibility break, so it needs to be done in 4.0, #291.

lemon24 commented 11 months ago

What about get_feeds(broken=..., updates_enabled=..., new=...)?

Where do we draw the line? Is this turning into #253? (DynamoDB has rotted my brain.)

lemon24 commented 11 months ago

What would reader look like if you could only filter and sort by tags?

Both feed (user) title and entry recent sort are derived values, so we wouldn't have an issue here.
Ensuring consistency would be left to reader code (e.g. required "tag" attributes, updating computed values, modeling tristate attributes like important).
Idem for a lot of migrations (which may be better for databases that do not support schema transactions).
What would indices look like? (use cases: filter by [tag1, tag2, ...], sort by tag1 value, count by tag1)

lemon24 commented 11 months ago

So, based on various SQLite forum threads, the general conclusion seems to be "don't bother – design your schema as you normally would, and add indexes as needed later on"; in fairness, this is something I already knew, but as I said, DynamoDB has rotted my brain.

I also tentatively removed has_enclosures, and it didn't remove all that much code.

So:

has_enclosure may become a tag, but that would likely lock in a performance penalty (right now, we don't have indexes on it, but if we make it a tag it won't be possible to add one)
- it might be interesting to see what query performance looks like with tags, though (update: slightly worse, see https://github.com/lemon24/reader/issues/327#issuecomment-1861744145)
- the more specific enclosures filtering (e.g. .has-audio-enclosures) can still be achieved via the plugin
read and important are integral to the data model / filtering, so we still want them as regular columns
- related: https://github.com/lemon24/reader/issues/253
feed filtering attributes do not matter all that much, since feeds are both much fewer and smaller than entries (a feeds full table scan is likely negligible)
- on one hand, this is an argument for "do nothing", since the code is already there
- on the other hand, it may be an argument for "use tags" (since we can afford the performance penalty)
- we may do this once we can set tags in a transaction, and get tags in a single query

lemon24 commented 11 months ago

Ran some benchmarks, here's a summary:

With only the has-enclosures entry tag, there seems to be almost no difference between using has_enclosures or the tag.
Adding a 1-2 more tags to each entry seems to make using tags only a bit worse.
Adding 20 more tags to each entry seems to make using tags ~1.5x worse.

Single entry tag results.

Given a `has-enclosures` entry tag set like this: ```sh $ python -c ' from reader import make_reader reader = make_reader("db.sqlite") for e in reader.get_entries(has_enclosures=True): reader.set_tag(e, "has-enclosures") print(reader.get_entry_counts()) ' EntryCounts(total=21609, read=15614, important=222, has_enclosures=3978, averages=(0.0, 6.868131868131868, 10.117808219178082)) ``` ...and this benchmark script: ```sh export BENCH_TIME_STAT='avg min' lines='for _ in reader.get_entries(has_enclosures=True): pass for _ in reader.get_entries(tags=["has-enclosures"]): pass for _ in reader.get_entries(has_enclosures=True, limit=100): pass for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass for _ in reader.search_entries("python", has_enclosures=True): pass for _ in reader.search_entries("python", tags=["has-enclosures"]): pass for _ in reader.search_entries("python", has_enclosures=True, limit=20): pass for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass' while IFS= read -r line; do echo "# $line" sync && sudo purge python scripts/bench.py time snippet -r10 --snippet "$line" done <<< "$lines" ``` The output is: ``` # for _ in reader.get_entries(has_enclosures=True): pass stat number repeat snippet avg 1 10 0.702 min 1 10 0.374 # for _ in reader.get_entries(tags=["has-enclosures"]): pass stat number repeat snippet avg 1 10 0.571 min 1 10 0.393 # for _ in reader.get_entries(has_enclosures=True, limit=100): pass stat number repeat snippet avg 1 10 0.022 min 1 10 0.010 # for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass stat number repeat snippet avg 1 10 0.020 min 1 10 0.010 # for _ in reader.search_entries("python", has_enclosures=True): pass stat number repeat snippet avg 1 10 0.538 min 1 10 0.384 # for _ in reader.search_entries("python", tags=["has-enclosures"]): pass stat number repeat snippet avg 1 10 0.514 min 1 10 0.395 # for _ in reader.search_entries("python", has_enclosures=True, limit=20): pass stat number repeat snippet avg 1 10 0.250 min 1 10 0.110 # for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass stat number repeat snippet avg 1 10 0.226 min 1 10 0.112 ```

1-2 entry tags results.

Extra tags were set for read and (un)important like so: ```sh $ python -c ' from reader import make_reader reader = make_reader("db.sqlite") for e in reader.get_entries(): if e.read: reader.set_tag(e, "read") if e.important is True: reader.set_tag(e, "important") if e.important is False: reader.set_tag(e, "unimportant") ' ``` Output (same script as before, but only for the tags snippets): ``` # for _ in reader.get_entries(tags=["has-enclosures"]): pass stat number repeat snippet avg 1 10 0.592 min 1 10 0.408 # for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass stat number repeat snippet avg 1 10 0.022 min 1 10 0.011 # for _ in reader.search_entries("python", tags=["has-enclosures"]): pass stat number repeat snippet avg 1 10 0.536 min 1 10 0.408 # for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass stat number repeat snippet avg 1 10 0.245 min 1 10 0.115 ```

20+ entry tags results.

Extra tags were set for read and (un)important like so: ```sh $ python -c ' from reader import make_reader reader = make_reader("db.sqlite") tags = "one two three four five six seven eight nine ten eleven twelve thirteen fourteen sixteen seventeen eighteen nineteen twenty".split() for e in reader.get_entries(): for tag in tags: reader.set_tag(e, tag) ' ``` Output (same script as before, but only for the tags snippets): ``` # for _ in reader.get_entries(tags=["has-enclosures"]): pass stat number repeat snippet avg 1 10 1.170 min 1 10 0.613 # for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass stat number repeat snippet avg 1 10 0.042 min 1 10 0.016 # for _ in reader.search_entries("python", tags=["has-enclosures"]): pass stat number repeat snippet avg 1 10 0.789 min 1 10 0.548 # for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass stat number repeat snippet avg 1 10 0.342 min 1 10 0.174 ```

lemon24 / reader

get_entries(has_enclosures=...) should be a plugin(?) #327