lemon24 / reader

A Python feed reader library.
https://reader.readthedocs.io
BSD 3-Clause "New" or "Revised" License
456 stars 38 forks

get_entries(has_enclosures=...) should be a plugin(?) #327

Closed lemon24 closed 11 months ago

lemon24 commented 1 year ago

The has_enclosures filter predates entry tags, and was meant as a proxy for "is a podcast item" (which works fine, at least with the feeds I'm subscribed to).

The same functionality can be obtained with a plugin that sets a tag, then used as a filter with get_entries(tags=['.has-enclosures']).
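A minimal sketch of what such a plugin might look like. It assumes reader exposes an `after_entry_update_hooks` list whose callables receive `(reader, entry, status)`, that the entry has `feed_url`, `id` and `enclosures` attributes, and that `set_tag()` / `delete_tag()` accept a `(feed_url, entry_id)` tuple; check these against the reader documentation before relying on them. The names `tag_has_enclosures` and `init_reader` are illustrative only.

```python
# Hypothetical plugin sketch; the hook name and signature are assumptions
# about the reader API, not confirmed by this issue.
TAG = ".has-enclosures"


def tag_has_enclosures(reader, entry, status):
    """Set or clear the tag depending on whether the entry has enclosures."""
    resource = (entry.feed_url, entry.id)
    if entry.enclosures:
        reader.set_tag(resource, TAG)
    else:
        # The entry may have lost its enclosures in an update.
        reader.delete_tag(resource, TAG, missing_ok=True)


def init_reader(reader):
    """Register the hook so it runs for every updated entry."""
    reader.after_entry_update_hooks.append(tag_has_enclosures)
```

With the hook registered and the feeds updated, the filter would then be `reader.get_entries(tags=['.has-enclosures'])` instead of `reader.get_entries(has_enclosures=True)`.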

Some arguments for this:

Removing the has_enclosures argument is a compatibility break, so it needs to be done in 4.0, #291.

lemon24 commented 11 months ago

What about get_feeds(broken=..., updates_enabled=..., new=...)?

Where do we draw the line? Is this turning into #253? (DynamoDB has rotted my brain.)

lemon24 commented 11 months ago

Related: http://howto.philippkeller.com/2005/04/24/Tags-Database-schemas/, vaguely reminiscent of https://en.wikipedia.org/wiki/Star_schema; also see https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80%93value_model

What would reader look like if you could only filter and sort by tags?
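To make the question concrete, here is a small sketch of the two filtering styles side by side, using the standard library `sqlite3` module. The schema is hypothetical (loosely following the tag-table layout from the linked article) and is not reader's actual schema; table, column, and index names are made up for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Hypothetical schema, not reader's real one.
    CREATE TABLE entries (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        has_enclosures INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE entry_tags (
        entry_id INTEGER NOT NULL REFERENCES entries(id),
        key TEXT NOT NULL,
        PRIMARY KEY (entry_id, key)
    );
    -- Index that lets "all entries with tag X" avoid a full scan.
    CREATE INDEX entry_tags_by_key ON entry_tags (key, entry_id);
""")

rows = [(1, "episode 1", 1), (2, "blog post", 0), (3, "episode 2", 1)]
db.executemany("INSERT INTO entries VALUES (?, ?, ?)", rows)

# Mirror the dedicated column into a tag, as a plugin would.
db.execute("""
    INSERT INTO entry_tags
    SELECT id, 'has-enclosures' FROM entries WHERE has_enclosures
""")

# Style 1: filter on a dedicated column.
by_column = [
    r[0] for r in db.execute(
        "SELECT id FROM entries WHERE has_enclosures ORDER BY id"
    )
]

# Style 2: filter through the generic tag table.
by_tag = [
    r[0] for r in db.execute("""
        SELECT e.id FROM entries e
        JOIN entry_tags t ON t.entry_id = e.id
        WHERE t.key = 'has-enclosures'
        ORDER BY e.id
    """)
]

print(by_column, by_tag)
```

Both queries select the same entries; the difference is that every new filter in the tag style is a row in `entry_tags` rather than a new column and argument.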

lemon24 commented 11 months ago

So, based on various SQLite forum threads, the general conclusion seems to be "don't bother – design your schema as you normally would, and add indexes as needed later on"; in fairness, this is something I already knew, but as I said, DynamoDB has rotted my brain.

I also tentatively removed has_enclosures, and it didn't remove all that much code.

So:

lemon24 commented 11 months ago

Ran some benchmarks, here's a summary:

Single entry tag results.

Given a `has-enclosures` entry tag set like this:

```sh
$ python -c '
from reader import make_reader
reader = make_reader("db.sqlite")
for e in reader.get_entries(has_enclosures=True):
    reader.set_tag(e, "has-enclosures")
print(reader.get_entry_counts())
'
EntryCounts(total=21609, read=15614, important=222, has_enclosures=3978, averages=(0.0, 6.868131868131868, 10.117808219178082))
```

...and this benchmark script:

```sh
export BENCH_TIME_STAT='avg min'

lines='for _ in reader.get_entries(has_enclosures=True): pass
for _ in reader.get_entries(tags=["has-enclosures"]): pass
for _ in reader.get_entries(has_enclosures=True, limit=100): pass
for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass
for _ in reader.search_entries("python", has_enclosures=True): pass
for _ in reader.search_entries("python", tags=["has-enclosures"]): pass
for _ in reader.search_entries("python", has_enclosures=True, limit=20): pass
for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass'

while IFS= read -r line; do
    echo "# $line"
    sync && sudo purge
    python scripts/bench.py time snippet -r10 --snippet "$line"
done <<< "$lines"
```

The output is:

```
# for _ in reader.get_entries(has_enclosures=True): pass
stat  number  repeat  snippet
avg        1      10    0.702
min        1      10    0.374

# for _ in reader.get_entries(tags=["has-enclosures"]): pass
stat  number  repeat  snippet
avg        1      10    0.571
min        1      10    0.393

# for _ in reader.get_entries(has_enclosures=True, limit=100): pass
stat  number  repeat  snippet
avg        1      10    0.022
min        1      10    0.010

# for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass
stat  number  repeat  snippet
avg        1      10    0.020
min        1      10    0.010

# for _ in reader.search_entries("python", has_enclosures=True): pass
stat  number  repeat  snippet
avg        1      10    0.538
min        1      10    0.384

# for _ in reader.search_entries("python", tags=["has-enclosures"]): pass
stat  number  repeat  snippet
avg        1      10    0.514
min        1      10    0.395

# for _ in reader.search_entries("python", has_enclosures=True, limit=20): pass
stat  number  repeat  snippet
avg        1      10    0.250
min        1      10    0.110

# for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass
stat  number  repeat  snippet
avg        1      10    0.226
min        1      10    0.112
```
1-2 entry tags results.

Extra tags were set for read and (un)important like so:

```sh
$ python -c '
from reader import make_reader
reader = make_reader("db.sqlite")
for e in reader.get_entries():
    if e.read:
        reader.set_tag(e, "read")
    if e.important is True:
        reader.set_tag(e, "important")
    if e.important is False:
        reader.set_tag(e, "unimportant")
'
```

Output (same script as before, but only for the tags snippets):

```
# for _ in reader.get_entries(tags=["has-enclosures"]): pass
stat  number  repeat  snippet
avg        1      10    0.592
min        1      10    0.408

# for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass
stat  number  repeat  snippet
avg        1      10    0.022
min        1      10    0.011

# for _ in reader.search_entries("python", tags=["has-enclosures"]): pass
stat  number  repeat  snippet
avg        1      10    0.536
min        1      10    0.408

# for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass
stat  number  repeat  snippet
avg        1      10    0.245
min        1      10    0.115
```
20+ entry tags results.

Extra tags were set on every entry like so:

```sh
$ python -c '
from reader import make_reader
reader = make_reader("db.sqlite")
tags = "one two three four five six seven eight nine ten eleven twelve thirteen fourteen sixteen seventeen eighteen nineteen twenty".split()
for e in reader.get_entries():
    for tag in tags:
        reader.set_tag(e, tag)
'
```

Output (same script as before, but only for the tags snippets):

```
# for _ in reader.get_entries(tags=["has-enclosures"]): pass
stat  number  repeat  snippet
avg        1      10    1.170
min        1      10    0.613

# for _ in reader.get_entries(tags=["has-enclosures"], limit=100): pass
stat  number  repeat  snippet
avg        1      10    0.042
min        1      10    0.016

# for _ in reader.search_entries("python", tags=["has-enclosures"]): pass
stat  number  repeat  snippet
avg        1      10    0.789
min        1      10    0.548

# for _ in reader.search_entries("python", tags=["has-enclosures"], limit=20): pass
stat  number  repeat  snippet
avg        1      10    0.342
min        1      10    0.174
```
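For anyone wanting to reproduce the `avg` / `min` figures without `scripts/bench.py`, a rough stand-in can be built on the standard library `timeit` module. This is only a sketch: the `bench` helper is hypothetical, it does not purge the disk cache the way the shell script does, and it is not how `scripts/bench.py` is implemented.

```python
import statistics
import timeit


def bench(func, repeat=10, number=1):
    """Time func() `repeat` times and report the average and minimum.

    Roughly mirrors BENCH_TIME_STAT='avg min' with -r10; each timing
    covers `number` calls of func.
    """
    times = timeit.repeat(func, repeat=repeat, number=number)
    return {"avg": statistics.mean(times), "min": min(times)}


def consume(iterable):
    """Exhaust an iterable, like `for _ in ...: pass` in the snippets."""
    for _ in iterable:
        pass


# Stand-in workload; with a real database this could be, for example,
# lambda: consume(reader.get_entries(tags=["has-enclosures"])).
stats = bench(lambda: consume(range(100_000)))
print(f"avg {stats['avg']:.3f}  min {stats['min']:.3f}")
```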