lemon24 / reader

A Python feed reader library.
https://reader.readthedocs.io
BSD 3-Clause "New" or "Revised" License

get_entries(has_enclosures=...) should be a plugin(?) #327

Closed lemon24 closed 7 months ago

lemon24 commented 8 months ago

The has_enclosures filter predates entry tags, and was meant as a proxy for "is a podcast item" (which works fine, at least with the feeds I'm subscribed to).

The same functionality can be obtained with a plugin that sets a tag, then used as a filter with get_entries(tags=['.has-enclosures']).
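A minimal sketch of what such a plugin might look like. It assumes reader's `after_entry_update_hooks` list (hooks called as `hook(reader, entry, status)`) and that `set_tag()` accepts the entry object passed to the hook; the function names are hypothetical, and the hook only ever adds the tag (it does not remove it if enclosures disappear later).

```python
TAG = ".has-enclosures"


def tag_entries_with_enclosures(reader, entry, status):
    """Entry update hook: tag entries that have at least one enclosure."""
    # Assumption: `entry.enclosures` is a (possibly empty) sequence.
    if entry.enclosures:
        reader.set_tag(entry, TAG)


def init_reader(reader):
    """Plugin entry point: register the hook on a reader instance."""
    # Assumption: hooks in this list run after each entry is added or updated.
    reader.after_entry_update_hooks.append(tag_entries_with_enclosures)
```

If `make_reader()` is given the plugin (for example `make_reader("db.sqlite", plugins=[init_reader])`, assuming the `plugins` argument), newly updated entries with enclosures would then show up in `get_entries(tags=['.has-enclosures'])`; existing entries would need a one-off backfill.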

Some arguments for this:

Removing the has_enclosures argument is a compatibility break, so it needs to be done in 4.0, #291.

lemon24 commented 7 months ago

What about get_feeds(broken=..., updates_enabled=..., new=...)?

Where do we draw the line? Is this turning into #253? (DynamoDB has rotted my brain.)

lemon24 commented 7 months ago

Related: http://howto.philippkeller.com/2005/04/24/Tags-Database-schemas/, vaguely reminiscent of https://en.wikipedia.org/wiki/Star_schema; also see https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80%93value_model

What would reader look like if you could only filter and sort by tags?

lemon24 commented 7 months ago

So, based on various SQLite forum threads, the general conclusion seems to be "don't bother – design your schema as you normally would, and add indexes as needed later on"; in fairness, this is something I already knew, but as I said, DynamoDB has rotted my brain.
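To make the two designs concrete, here is a small sketch using only the standard library's `sqlite3`. The table and index names are made up for illustration and are not reader's actual schema; it only shows that a dedicated column and a tag table can answer the same question, with the index being something that can be added after the fact.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Hypothetical schema, not reader's real one.
    CREATE TABLE entries (
        id TEXT PRIMARY KEY,
        has_enclosures INTEGER NOT NULL
    );
    CREATE TABLE entry_tags (
        id TEXT NOT NULL REFERENCES entries(id),
        key TEXT NOT NULL,
        PRIMARY KEY (id, key)
    );
    -- Added "later on", once filtering by tag turns out to matter.
    CREATE INDEX entry_tags_by_key ON entry_tags (key, id);
""")

rows = [("a", 1), ("b", 0), ("c", 1)]
db.executemany("INSERT INTO entries VALUES (?, ?)", rows)
db.executemany(
    "INSERT INTO entry_tags VALUES (?, 'has-enclosures')",
    [(entry_id,) for entry_id, has in rows if has],
)

# Filter on the dedicated column.
by_column = db.execute(
    "SELECT id FROM entries WHERE has_enclosures ORDER BY id"
).fetchall()

# Filter on the tag table instead.
by_tag = db.execute("""
    SELECT entries.id FROM entries
    JOIN entry_tags ON entry_tags.id = entries.id
    WHERE entry_tags.key = 'has-enclosures'
    ORDER BY entries.id
""").fetchall()

print(by_column == by_tag)
```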

I also tentatively removed has_enclosures, and it didn't remove all that much code.

So:

lemon24 commented 7 months ago

Ran some benchmarks, here's a summary:

Single entry tag results.

Given a `has-enclosures` entry tag set like this:

```sh
$ python -c '
from reader import make_reader
reader = make_reader("db.sqlite")
for e in reader.get_entries(has_enclosures=True):
    reader.set_tag(e, "has-enclosures")
print(reader.get_entry_counts())
'
EntryCounts(total=21609, read=15614, important=222, has_enclosures=3978, averages=(0.0, 6.868131868131868, 10.117808219178082))
```

...and this benchmark script:

```sh
export BENCH_TIME_STAT='avg min'

lines='for _ in reader.get_entries(has_enclosures=True): pass
for _ in reader.get_entries(tags=["has-enclosures"]): pass
for _ in reader.get_entries(has_enclosures=True, limit=100): pass
for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass
for _ in reader.search_entries("python", has_enclosures=True): pass
for _ in reader.search_entries("python", tags=["has-enclosures"]): pass
for _ in reader.search_entries("python", has_enclosures=True, limit=20): pass
for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass'

while IFS= read -r line; do
    echo "# $line"
    sync && sudo purge
    python scripts/bench.py time snippet -r10 --snippet "$line"
done <<< "$lines"
```

The output is:

```
# for _ in reader.get_entries(has_enclosures=True): pass
stat number repeat snippet
 avg      1     10   0.702
 min      1     10   0.374
# for _ in reader.get_entries(tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   0.571
 min      1     10   0.393
# for _ in reader.get_entries(has_enclosures=True, limit=100): pass
stat number repeat snippet
 avg      1     10   0.022
 min      1     10   0.010
# for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass
stat number repeat snippet
 avg      1     10   0.020
 min      1     10   0.010
# for _ in reader.search_entries("python", has_enclosures=True): pass
stat number repeat snippet
 avg      1     10   0.538
 min      1     10   0.384
# for _ in reader.search_entries("python", tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   0.514
 min      1     10   0.395
# for _ in reader.search_entries("python", has_enclosures=True, limit=20): pass
stat number repeat snippet
 avg      1     10   0.250
 min      1     10   0.110
# for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass
stat number repeat snippet
 avg      1     10   0.226
 min      1     10   0.112
```

1-2 entry tags results.

Extra tags were set for read and (un)important like so:

```sh
$ python -c '
from reader import make_reader
reader = make_reader("db.sqlite")
for e in reader.get_entries():
    if e.read:
        reader.set_tag(e, "read")
    if e.important is True:
        reader.set_tag(e, "important")
    if e.important is False:
        reader.set_tag(e, "unimportant")
'
```

Output (same script as before, but only for the tags snippets):

```
# for _ in reader.get_entries(tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   0.592
 min      1     10   0.408
# for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass
stat number repeat snippet
 avg      1     10   0.022
 min      1     10   0.011
# for _ in reader.search_entries("python", tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   0.536
 min      1     10   0.408
# for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass
stat number repeat snippet
 avg      1     10   0.245
 min      1     10   0.115
```

20+ entry tags results.

Extra tags were set on every entry like so:

```sh
$ python -c '
from reader import make_reader
reader = make_reader("db.sqlite")
tags = "one two three four five six seven eight nine ten eleven twelve thirteen fourteen sixteen seventeen eighteen nineteen twenty".split()
for e in reader.get_entries():
    for tag in tags:
        reader.set_tag(e, tag)
'
```

Output (same script as before, but only for the tags snippets):

```
# for _ in reader.get_entries(tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   1.170
 min      1     10   0.613
# for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass
stat number repeat snippet
 avg      1     10   0.042
 min      1     10   0.016
# for _ in reader.search_entries("python", tags=["has-enclosures"]): pass
stat number repeat snippet
 avg      1     10   0.789
 min      1     10   0.548
# for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass
stat number repeat snippet
 avg      1     10   0.342
 min      1     10   0.174
```