lemon24 / reader

A Python feed reader library.
https://reader.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Enable search by default #252

Closed lemon24 closed 2 years ago

lemon24 commented 3 years ago

Enable search by default, because there's less explaining to do for new users: no explicit pip install reader[search], enable_search(), or update_search() needed.

Thoughts:

lemon24 commented 2 years ago

Possible improvements

There are multiple changes that can be made, with different levels of impact on current users. This is kinda covered in the first comment, but it's worth discussing them individually.

Regardless of the default, make_reader(..., search_enabled=True) must fail fast for unsupported versions of SQLite (<3.18). It should also fail if the dependencies were not installed – if I'm requesting search explicitly, I don't want update_feed() to fail.
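A minimal sketch of what such a fail-fast check could look like, using only the standard library. The names `check_search_available()` and `SearchUnavailableError` are hypothetical and not part of reader's API, and `bs4` is only assumed here to be the module the search extra needs; the 3.18 minimum is the one mentioned above.

```python
import importlib.util
import sqlite3

# Minimum SQLite version for search, as stated in the comment above.
MINIMUM_SQLITE_VERSION = (3, 18)


class SearchUnavailableError(Exception):
    """Hypothetical error type for this sketch; not part of reader's API."""


def check_search_available(required_modules=("bs4",), sqlite_version=None):
    """Raise SearchUnavailableError if search cannot work in this environment.

    required_modules: top-level module names search depends on
    (assumed for illustration). sqlite_version: override for testing;
    defaults to the version of the SQLite library linked into Python.
    """
    if sqlite_version is None:
        sqlite_version = sqlite3.sqlite_version_info
    if tuple(sqlite_version) < MINIMUM_SQLITE_VERSION:
        found = ".".join(str(part) for part in sqlite_version)
        raise SearchUnavailableError(f"search requires SQLite >= 3.18, found {found}")

    # find_spec() returns None for a top-level module that is not installed.
    missing = [
        name for name in required_modules
        if importlib.util.find_spec(name) is None
    ]
    if missing:
        raise SearchUnavailableError(
            "missing search dependencies: " + ", ".join(missing)
        )
```

Running the check once at construction time means the error shows up when the reader is created, not later in the middle of an update.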

Open questions:

lemon24 commented 2 years ago

Alternate search API proposal

I'm getting a feeling I may be solving the wrong problem.

For instance, I'm having trouble determining how search_enabled=... interacts with enable/disable_search():

```python
reader = make_reader(search_enabled=True)
reader.disable_search()
# SearchNotEnabledError? silently not update search?
reader.update_feeds()

reader = make_reader(search_enabled=False)
reader.enable_search()
# update search? not update search?
reader.update_feeds()

reader = make_reader(search_enabled=None)
assert reader.is_search_enabled() is True
# update search? not update search?
reader.update_feeds()
```

Initially, we didn't know when updating search should happen, and the search API wasn't (isn't) stable, so we just exposed everything. It seems we've now converged to "updating search happens offline" (like updating feeds). Also, the pluggable search design (#168) came after we added enable/disable_search() (#122), but because I didn't actually implement it, I didn't notice the overlap.

So maybe we shouldn't burden the user with details about that – most of the time, people just care about search_entries(). If we expose the Search object directly, we don't need to expose enable/disable_search(), is_search_enabled(), and update_search().

Here's a proposal for a new search API:


def make_reader(url, search=None):
    # TBD if it's none:/default: or :none:/:default:

    if search is None or search == 'none:':
        search_obj = None
    elif search == 'default:':
        search_obj = Search(...)           
    else:
        raise ValueError("bad search")

    if search_obj:
        search_obj.check_dependencies()

    return Reader(..., search_obj)

# Search remains as-is

class Reader:

    search: Search

    def _update_feeds(self, feed, updates_enabled, search):
        results = ...

        for result in results:
            # also runs plugin hooks
            rv = self._update_feed(feed)

            # not ideal, since we do many small updates instead of a big one;
            # but if we move it after the loop, search needs to be able to do the same kind of filtering
            if search and self.search:
                try:
                    self.search.update(feed)
                except SearchNotEnabledError:
                    self.search.enable()
                    self.search.update(feed)

            yield rv

    def update_feeds_iter(self, feed=None, updates_enabled=True, search=True):
        yield from self._update_feeds(feed, updates_enabled, search=search)
        # maybe update search for disabled feeds too here,
        # if feed is None and updates_enabled was not provided explicitly;
        # also not ideal, since we get a big pause at the end of the iterator

    def update_feeds(self, ...):
        for _ in self.update_feeds_iter(...): pass

    def update_feed(self, feed, search=True):
        return zero_or_one(self._update_feeds(feed=feed, search=search))

    def search_entries(self, ...):
        if not self.search:
            raise NoSearchAvailableError
        try:
            yield from self.search.search_entries(...)
        except SearchNotEnabledError:
            # even if we enabled search here,
            # we'd still get no result until an update()
            pass

    # removed: enable/disable_search(), is_search_enabled(), update_search()

# normal usage

reader = make_reader("db.sqlite", search="default:")
reader.update_feeds()
reader.search_entries(...)

# no search

reader = make_reader("db.sqlite", search="none:")
reader.update_feeds()
reader.search_entries(...)  # raises NoSearchAvailableError

# enable/update/disable search explicitly

reader = make_reader("db.sqlite", search="default:")
reader.search.enable()
reader.search.update()
reader.search.is_enabled()
reader.search.disable()

Update: I don't like this. On the surface it looks OK, but it makes things harder to implement correctly, OR still requires explaining why things behave differently than expected. Better to have the primitives exposed, and let users mix and match.

lemon24 commented 2 years ago

Part of my unease with the search API has to do with its statefulness – some methods may or may not work depending on whether search is enabled, or if there is a search in the first place.

However, besides being connected to a specific search backend, part of this statefulness comes from outside a specific Reader object: e.g. one Reader expects search to be enabled, but another one disables it after the first was instantiated.
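To make the cross-object case concrete, here is a small sketch using plain sqlite3 connections in place of Reader objects. The `entries_search` table is a hypothetical stand-in for the search index (an ordinary table, not an FTS one), and `search_is_enabled()` is an illustrative helper, not reader's implementation.

```python
import os
import sqlite3
import tempfile


def search_is_enabled(db):
    """Treat "the index table exists" as "search is enabled" (an assumption)."""
    rows = db.execute(
        "SELECT 1 FROM sqlite_master "
        "WHERE type = 'table' AND name = 'entries_search'"
    ).fetchall()
    return bool(rows)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "db.sqlite")
        first = sqlite3.connect(path)   # stands in for the first Reader
        second = sqlite3.connect(path)  # stands in for the second Reader
        try:
            # The first object enables search and sees it enabled.
            first.execute("CREATE TABLE entries_search (title TEXT)")
            first.commit()
            enabled_before = search_is_enabled(first)

            # The second object disables search behind its back.
            second.execute("DROP TABLE entries_search")
            second.commit()

            # The first object's earlier check is now stale.
            try:
                first.execute("SELECT title FROM entries_search").fetchall()
                query_failed = False
            except sqlite3.OperationalError:
                query_failed = True

            return enabled_before, search_is_enabled(first), query_failed
        finally:
            first.close()
            second.close()
```

Here `demo()` returns `(True, False, True)`: the check passed, then stopped being true, and the query that relied on it failed. No amount of state kept inside one object can rule this out.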

At least part of it is essential complexity (us doing migrations automatically in the Storage constructor is trying to hide that, but long-term it makes sense to have that as a separate, explicit operation, like we're doing with search). I don't know what the "correct" API to present this is, but it seems pretty hard/premature to get it right now, especially without other search/storage implementations. We should only solve the problem at hand.

So, what problem am I trying to solve?

Conclusions

lemon24 commented 2 years ago

The table below shows timings on how search affects other methods (for my database with ~160 feeds and 13K entries).

Some notes:

| | cold | hot |
|---|---|---|
| make_reader(), migration | 24 ms | |
| make_reader(), no migration | 1.4 ms | |
| enable_search(), not enabled | 34 ms | |
| enable_search(), enabled | 0.2 ms | |
| is_search_enabled() | 0.2 ms | 0.2 ms |
| set_feed_user_title(), not enabled | 60 ms | 16 ms |
| set_feed_user_title(), enabled | 140 ms | 16 ms |
| set_feed_user_title(), enabled, updated | 150 ms | 16 ms |
| delete_feed(), not enabled | 1.2 s | |
| delete_feed(), enabled | 1.7 s | |
| delete_feed(), enabled, updated | 1.7 s | |
| _delete_from_search(), enabled | 1.7 ms | |
| _delete_from_search(), enabled, updated | 2.2 ms | |
| _delete_from_search(), enabled, feeds deleted | 163 ms | |
| _delete_from_search(), enabled, updated, feeds deleted | 4.7 s | |

Detailed timing output. Blocks separated by blank lines were run in new processes. Everything below "all starting from search not enabled" was run after this code:

```python
from reader import *
from shutil import copyfile
copyfile('db.sqlite.orig', 'db.sqlite')
reader = make_reader('db.sqlite')
assert not reader.is_search_enabled()
for _ in reader.get_entries(): pass
for _ in reader.get_feeds(): pass
```

Detailed timings:

```python
>>> reader.get_feed_counts().total
158
>>> reader.get_entry_counts().total
12975

# search enabled, but it doesn't really matter

>>> %time make_reader('db.sqlite')  # migration
Wall time: 24 ms
>>> %time make_reader('db.sqlite')  # no migration
Wall time: 1.4 ms

>>> %time reader.enable_search()  # not enabled
Wall time: 34 ms
>>> %time reader.enable_search()  # enabled
Wall time: 177 µs

# all starting from search not enabled

>>> %time for feed in reader.get_feeds(): reader.set_feed_user_title(feed, feed.title + '...')
Wall time: 60.6 ms
>>> %time for feed in reader.get_feeds(): reader.set_feed_user_title(feed, feed.title + '...')
Wall time: 16 ms

>>> reader.enable_search()
>>> %time for feed in reader.get_feeds(): reader.set_feed_user_title(feed, feed.title + '...')
Wall time: 139 ms
>>> %time for feed in reader.get_feeds(): reader.set_feed_user_title(feed, feed.title + '...')
Wall time: 15.5 ms

>>> %time for feed in reader.get_feeds(): reader.set_feed_user_title(feed, feed.title + '...')
Wall time: 59.9 ms
>>> reader.enable_search()
>>> %time for feed in reader.get_feeds(): reader.set_feed_user_title(feed, feed.title + '...')
Wall time: 16.1 ms

>>> reader.enable_search(); reader.update_search()
>>> %time for feed in reader.get_feeds(): reader.set_feed_user_title(feed, feed.title + '...')
Wall time: 150 ms
>>> %time for feed in reader.get_feeds(): reader.set_feed_user_title(feed, feed.title + '...')
Wall time: 15.8 ms

>>> %time for feed in reader.get_feeds(): reader.delete_feed(feed)
Wall time: 1.2 s

>>> reader.enable_search()
>>> %time for feed in reader.get_feeds(): reader.delete_feed(feed)
Wall time: 1.72 s

>>> reader.enable_search(); reader.update_search()
>>> %time for feed in reader.get_feeds(): reader.delete_feed(feed)
Wall time: 1.74 s

>>> reader.enable_search()
>>> %time reader._search._delete_from_search()
Wall time: 1.74 ms

>>> reader.enable_search(); reader.update_search()
>>> %time reader._search._delete_from_search()
Wall time: 2.15 ms

>>> reader.enable_search()
>>> for feed in reader.get_feeds(): reader.delete_feed(feed)
>>> %time reader._search._delete_from_search()
Wall time: 163 ms

>>> reader.enable_search(); reader.update_search()
>>> for feed in reader.get_feeds(): reader.delete_feed(feed)
>>> %time reader._search._delete_from_search()
Wall time: 4.73 s

>>> %time reader.is_search_enabled()
Wall time: 160 µs
False
>>> %time reader.is_search_enabled()
Wall time: 156 µs
False
>>> reader.enable_search()
>>> %time reader.is_search_enabled()
Wall time: 174 µs
True
```
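For anyone wanting to repeat this kind of measurement outside IPython, a rough stand-in for the `%time` calls above might look like the sketch below. It assumes "cold" means the first run in a fresh process and "hot" the immediate second run, which is how the set_feed_user_title() pairs read; `time_cold_hot()` and `format_duration()` are hypothetical helper names.

```python
import time


def time_cold_hot(operation):
    """Run operation twice and return (first, second) wall times in seconds.

    Only meaningful as cold vs. hot if called in a fresh process,
    before anything else has touched the database.
    """
    timings = []
    for _ in range(2):
        start = time.perf_counter()
        operation()
        timings.append(time.perf_counter() - start)
    cold, hot = timings
    return cold, hot


def format_duration(seconds):
    """Format a duration roughly the way the timings above are written."""
    if seconds >= 1:
        return f"{seconds:.2f} s"
    if seconds >= 1e-3:
        return f"{seconds * 1e3:.1f} ms"
    return f"{seconds * 1e6:.0f} µs"
```

Usage would be something like `cold, hot = time_cold_hot(lambda: reader.is_search_enabled())`, with `reader` created as in the setup code above.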

Conclusions (assuming we're OK with the cost of having search enabled by default)

lemon24 commented 2 years ago

To do:

lemon24 commented 2 years ago

Spent 14h on this, 8 of them thinking; that's arguably too much...