lemon24 / reader

A Python feed reader library.
https://reader.readthedocs.io
BSD 3-Clause "New" or "Revised" License
456 stars 38 forks source link

Get feed by user-defined slug #358

Closed nobrowser closed 2 weeks ago

nobrowser commented 4 weeks ago

Hello, I'm trying to write my dream feed reader with your library, and one obstacle I'm facing now is that I'm used to giving short unique IDs (or names) to feeds and accessing them with that ID. That's because URLs are often long and messy, and author titles can be missing or inconvenient (with spaces, among other things). Initially it looked as if the user_title attribute might work for that, but alas, the database schema has no index on the column, so that would be a linear search each time.

Then I thought I could add the index myself in my code -- but the Reader class provides no direct access to the database connection, so I can't do it with a plugin. I could simply open the database after it's created and add the index behind Reader's back, but that would be really ugly. So, what approach do you think I should take? Should I ask you to add the index :grin: , or is there something else I can do now?

Thanks,

-- Ian

lemon24 commented 3 weeks ago

tl;dr: I think you can use feed tags for this, details below.

my dream feed reader

Love this phrase :)

it looked as if the user_title attribute might work for that

Indeed, it would not be a good idea to use user_title, since it is in some ways "special"; among others it may be used (together with the feed's original title) to implement title sort (e.g. dropping "The" etc.); this is one of the main reasons I did not deprecate it in favor of tags.

URLs are often long and messy, and author titles can be missing or inconvenient (with spaces, among other things)

This is probably another reason for not using user_title like this, since you may still want "human readable" titles that are different from the original ones, but are also not slugs.

So, if reader were to have an API for exactly this use case, it'd be something separate, like Feed.slug, set_feed_slug(url, slug), get_feed(slug=slug).


That said, I think you can achieve this by using feed tags, since they are indexed.

In the example below (click to expand), for my database with 177 feeds, get_feed_by_slug(slug) is only twice as slow as get_feed(url). ```python from __future__ import annotations def _make_slug_tag(reader, slug): return reader.make_plugin_reserved_name('slugs', slug) def _get_slug_tags(reader, resource): prefix = _make_slug_tag(reader, '') # filter tags by prefix would make this faster # https://github.com/lemon24/reader/issues/309 rv = list() for tag in reader.get_tag_keys(resource): if not tag.startswith(prefix): continue rv.append(tag) return prefix, rv def get_feed_slug(reader, feed) -> str | None: prefix, tags = _get_slug_tags(reader, feed) if not tags: return None assert len(tags) == 1, (feed, tags) return tags[0].removeprefix(prefix) def get_feed_by_slug(reader, slug) -> Feed | None: tag = _make_slug_tag(reader, slug) feeds = list(reader.get_feeds(tags=[tag])) if not feeds: return None assert len(feeds) == 1, (slug, [f.url for f in feeds]) return feeds[0] def set_feed_slug(reader, feed, slug: str | None): feed = reader.get_feed(feed) tag = _make_slug_tag(reader, slug) if slug: already_has_slug = False for other_feed in reader.get_feeds(tags=[tag]): if feed.url == other_feed.url: already_has_slug = True continue raise ValueError(f"slug {slug!r} already in use by {other_feed.url!r}") # NOTE: technically a race condition / TOCTOU bug if already_has_slug: return _, other_tags = _get_slug_tags(reader, feed) for other_tag in other_tags: reader.delete_tag(feed, other_tag, missing_ok=True) if slug: reader.set_tag(feed, tag) # "handle" race condition; ensures(?) at most one feed has the slug, # with the risk that none of them has it in case of race condition. for other_feed in reader.get_feeds(tags=[tag]): if feed.url == other_feed.url: continue reader.delete_tag(other_feed, tag, missing_ok=True) # an alternative way of dealing with the race condition is # to use global tags, but then we'd have to "garbage-collect" # slugs of feeds that have been deleted (in an after_feeds_update_hook), # although that may lead to a different sort of race condition. if __name__ == '__main__': import timeit from reader import make_reader reader = make_reader('db.sqlite') url = 'https://death.andgravity.com/_feed/index.xml' set_feed_slug(reader, url, 'one') print( get_feed_slug(reader, url), getattr(get_feed_by_slug(reader, 'one'), 'url', None), ) try: set_feed_slug(reader, 'https://xkcd.com/atom.xml', 'one') raise Exception("shouldn't get here") except Exception as e: print('error:', e) set_feed_slug(reader, url, 'two') print( get_feed_slug(reader, url), getattr(get_feed_by_slug(reader, 'two'), 'url', None), ) print() def repeat(stmt): times = timeit.repeat(stmt, repeat=1000, number=1, globals=globals()) print(f"{stmt}: {min(times):.6f}s min, {sum(times)/len(times):.6f}s avg") print(reader.get_feed_counts().total, 'feeds') repeat("reader.get_feed(url)") repeat("get_feed_by_slug(reader, 'two')") ```

(For reference, this is somewhat related to https://github.com/lemon24/reader/issues/159#issuecomment-612914956 that discusses having a surrogate key, except I didn't think it would be user-defined there.)

nobrowser commented 3 weeks ago

Thanks for the suggestion about tags. I didn't think of using them somehow because of the fact that there can be multiple ones, I'll have to look at the schema again to convince myself I'm not doing something abominable.

Good that we're talking.

-- Ian

lemon24 commented 3 weeks ago

that there can be multiple [tags]

The only issue I see with using tags is the race condition mentioned in the code comment, but I think the approach of (accidentally, rarely) deleting both should be acceptable, especially for a single-user system.

BTW (and no pressure! :), if the approach above ends up being OK for you, I'd happy to merge it as a reader plugin, either experimental or built-in (the difference is mainly in the amount of testing). The get/set_feed_slug() and get_feed_by_slug() "methods" can be monkey-patched onto the reader instance as enclosure_dedupe does here.

(In theory I'd be fine with this being built into reader itself too, which would also get rid of the race condition, but I think it'd probably be overkill.)

nobrowser commented 3 weeks ago

Hmmm, so no, looking at the database I can't use a tag for what I want, as there's only an index on the composite key (feed, key) -- nothing on the value of the key/tag, and even if I went with the hack of a unique key, linear scan of the feed_tags table would still be needed. So I really need a new (unique) attribute, which indeed would be a kind of "surrogate key". To avoid any confusion I propose to call it short_name -- it doesn't have to be a slug as the term is normally used.

Would a PR to this effect be considered?

lemon24 commented 3 weeks ago

My suggestion was to use tag keys with get_feeds(tags=[tag]), with the value being ignored/unused (also see the collapsed code at the end of my first comment); for slug My Feed, the matching feed tag would be .plugin.slugs.My Feed.

You are right in that there is a scan involved, since there's no index on the feed_tags key column alone. My results were reasonably fast because I only have a ~small amount of feed tags (177 feeds * ~2 tags/feed). However, indexes on tag keys can and should be added, since they exists for this kind of access pattern; I opened https://github.com/lemon24/reader/issues/359 for this, and may work on it reasonably soon.

Would a PR to this effect [new feed attribute] be considered?

Yes, definitely, but I'm somewhat attached to the "slug" naming, since I think it best conveys the idea that it's used as an identifier – "the part of a URL that identifies a page in human-readable keywords" – and it also matches the publishing term.

lemon24 commented 2 weeks ago

Added a cleaned up version of the code in https://github.com/lemon24/reader/issues/358#issuecomment-2439459213 as a plugin recipe.

Closing this for now, feel free to reopen if needed.