Enable search by default

Enable search by default, because there's less explaining to do for new users: no explicit pip install reader[search], enable_search(), or update_search() needed.

Thoughts:

- ~~We need make_reader(..., search='default:') (default) and make_reader(..., search='none:'), per https://github.com/lemon24/reader/issues/168#issuecomment-642002049.~~
  - 2021-10 update: Why?
- We need to make the reader[search] dependencies required (keep the extra for backwards compatibility, though).
- We need update_feed() etc. to update search by default, but only if not none: and enabled.
  - It should be possible to opt out: update_feed(..., search=False); we likely need update_search(feed=...) for this.
- reader requires SQLite 3.15, the current search implementation requires 3.18; make_reader() should fail gracefully (and explicitly) for SQLite <3.18.
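
A minimal sketch of the SQLite version check from the last item. The helper check_sqlite_version() and the exception SearchUnavailableError are hypothetical names, not part of reader; only sqlite3.sqlite_version_info is a real standard library attribute.

```python
import sqlite3

# Minimum SQLite version needed by the current search implementation (per the list above).
MINIMUM_SEARCH_VERSION = (3, 18, 0)


class SearchUnavailableError(Exception):
    """Hypothetical exception; the real name is not decided in this thread."""


def check_sqlite_version(version_info=None, minimum=MINIMUM_SEARCH_VERSION):
    """Raise an explicit error if the SQLite library is too old for search."""
    if version_info is None:
        # Version of the SQLite library the sqlite3 module is linked against.
        version_info = sqlite3.sqlite_version_info
    if tuple(version_info) < tuple(minimum):
        have = ".".join(map(str, version_info))
        need = ".".join(map(str, minimum))
        raise SearchUnavailableError(f"search requires SQLite >= {need}, got {have}")
```

Calling something like this from make_reader() would surface the problem at construction time rather than on the first search call.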

Possible improvements

There are multiple changes that can be made, with different levels of impact on current users. This is kinda covered in the first comment, but it's worth discussing them individually.

- make_reader(..., search_enabled: Optional[bool] = None), where "search_enabled: Whether to enable or disable search. None means do nothing."
  - search_enabled defaulting to None is backwards-compatible.
  - search_enabled defaulting to True is not backwards-compatible, assuming update_feeds() calls update_search() automatically.
    - First, it adds a sudden performance penalty to update_feeds() (triggers + update_search()). This is not necessarily OK for a minor version release.
    - Second, doing this requires the search dependencies to be always installed, otherwise update_feeds() may fail if the user didn't install reader[search].
- update_feeds() calls update_search() automatically.
  - As long as it happens only when search is enabled, this is backwards compatible.
  - It's worth mentioning there are other times when update_search() needs to be called.
    - AFAICT, set_feed_user_title() is the only one right now.
    - It's OK to not call update_search() automatically for them – the `update...()` methods are expected to take a long time; the others aren't.
  - Assuming update_feeds() is called regularly, update_search() will happen eventually for enabled feeds anyway. One may still want to call update_search() for disabled feeds, or to have the other changes happen more often.
- Always install the reader[search] extra.
  - As mentioned in the first point, if search_enabled defaults to True, this is required.
  - I would prefer not to do this, since people may not need the additional dependency – either because they just don't need it, or because they use a different search implementation.
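
To make the three-valued search_enabled argument concrete, here is a rough sketch of the semantics quoted in the first item ("None means do nothing"). FakeSearch and apply_search_enabled() are hypothetical stand-ins for illustration; they are not reader APIs.

```python
from typing import Optional


class FakeSearch:
    """In-memory stand-in for a search backend (hypothetical)."""

    def __init__(self, enabled: bool = False) -> None:
        self.enabled = enabled

    def enable(self) -> None:
        self.enabled = True

    def disable(self) -> None:
        self.enabled = False


def apply_search_enabled(search: FakeSearch, search_enabled: Optional[bool]) -> None:
    """Apply the proposed make_reader(..., search_enabled=...) semantics."""
    if search_enabled is None:
        # Do nothing: keep whatever state the database is already in.
        return
    if search_enabled:
        search.enable()
    else:
        search.disable()
```

With None as the default, an existing database keeps its current state, which is why that default is the backwards-compatible one.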

Regardless of the default, make_reader(..., search_enabled=True) must fail fast for unsupported versions of SQLite (<3.18). It should also fail if the dependencies were not installed – if I'm requesting search explicitly, I don't want update_feed() to fail.
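
One possible way to fail early on missing dependencies, sketched with importlib. The function names are made up for this example, and the default module list assumes the extra pulls in beautifulsoup4 (import name bs4); adjust it to whatever the extra actually requires.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level modules from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


def check_search_dependencies(names=("bs4",)):
    """Fail early, with an explicit message, if a search dependency is missing."""
    missing = missing_modules(names)
    if missing:
        raise ImportError(f"search dependencies not installed: {', '.join(missing)}")
```

find_spec() only looks the module up without importing it, so the check stays cheap enough to run on every make_reader() call.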

Open questions:

- Does "only when search is enabled" mean just make_reader(..., search_enabled=True), or whenever is_search_enabled() is True?
- I'm not sure why I mentioned the search='none:' make_reader() argument. Presumably, it could help make a reader that isn't disabling the search, but isn't doing automatic update_search() calls either.
  - If "only when search is enabled" means just make_reader(..., search_enabled=True), search_enabled=None achieves the same thing.
- What happens with SearchErrors raised during the update_feed...() calls?
  - It is backwards compatible to raise a new exception when make_reader(..., search_enabled=True) was used.
  - update_feeds() only raises StorageError.
  - update_feeds_iter() only raises StorageError, but can yield ParseError or other ReaderError.
  - update_feed() can raise ParseError, but cannot raise other ReaderError; I likely missed this when adding update_feeds_iter().

Alternate search API proposal

I'm getting a feeling I may be solving the wrong problem.

For instance, I'm having trouble determining how search_enabled=... interacts with enable/disable_search():

```python
reader = make_reader(search_enabled=True)
reader.disable_search()
# SearchNotEnabledError? silently not update search?
reader.update_feeds()

reader = make_reader(search_enabled=False)
reader.enable_search()
# update search? not update search?
reader.update_feeds()

reader = make_reader(search_enabled=None)
assert reader.is_search_enabled() is True
# update search? not update search?
reader.update_feeds()
```

Initially, we didn't know when updating search should happen, and the search API wasn't (isn't) stable, so we just exposed everything. It seems we've now converged to "updating search happens offline" (like updating feeds). Also, the pluggable search design (#168) came after we added enable/disable_search() (#122), but because I didn't actually implement it, I didn't notice the overlap.

So maybe we shouldn't burden the user with details about that – most of the time, people just care about search_entries(). If we expose the Search object directly, we don't need to expose enable/disable_search(), is_search_enabled(), and update_search().

Here's a proposal for a new search API:


```python
def make_reader(url, search=None):
    # TBD if it's none:/default: or :none:/:default:

    if search is None or search == 'none:':
        search_obj = None
    elif search == 'default:':
        search_obj = Search(...)
    else:
        raise ValueError("bad search")

    if search_obj:
        search_obj.check_dependencies()

    return Reader(..., search_obj)

# Search remains as-is

class Reader:

    search: Search

    def _update_feeds(self, feed, updates_enabled, search):
        results = ...

        for result in results:
            # also runs plugin hooks
            rv = self._update_feed(feed)

            # not ideal, since we do many small updates instead of a big one;
            # but if we move it after the loop, search needs to be able to do the same kind of filtering
            if search and self.search:
                try:
                    self.search.update(feed)
                except SearchNotEnabledError:
                    self.search.enable()
                    self.search.update(feed)

            yield rv

    def update_feeds_iter(self, feed=None, updates_enabled=True, search=True):
        yield from self._update_feeds(feed, updates_enabled, search=search)
        # maybe update search for disabled feeds too here,
        # if feed is None and updates_enabled was not provided explicitly;
        # also not ideal, since we get a big pause at the end of the iterator

    def update_feeds(self, ...):
        for _ in self.update_feeds_iter(...): pass

    def update_feed(self, feed, search=True):
        return zero_or_one(self._update_feeds(feed=feed, search=search))

    def search_entries(self, ...):
        if not self.search:
            raise NoSearchAvailableError
        try:
            yield from self.search.search_entries(...)
        except SearchNotEnabledError:
            # even if we enabled search here,
            # we'd still get no result until an update()
            pass

    # removed: enable/disable_search(), is_search_enabled(), update_search()

# normal usage

reader = make_reader("db.sqlite", search="default:")
reader.update_feeds()
reader.search_entries(...)

# no search

reader = make_reader("db.sqlite", search="none:")
reader.update_feeds()
reader.search_entries(...)  # raises NoSearchAvailableError

# enable/update/disable search explicitly

reader = make_reader("db.sqlite", search="default:")
reader.search.enable()
reader.search.update()
reader.search.is_enabled()
reader.search.disable()
```

Update: I don't like this. On the surface it looks OK, but it makes things harder to implement correctly, OR still requires explaining why things behave differently than expected. Better to have the primitives exposed, and let users mix and match.

Part of my unease with the search API has to do with its statefulness – some methods may or may not work depending on whether search is enabled, or if there is a search in the first place.

However, besides being connected to a specific search backend, part of this statefulness comes from outside a specific Reader object: e.g. one Reader expects search to be enabled, but another one disables it after the first was instantiated.
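
The "another Reader disables it" scenario can be reproduced with plain sqlite3, without reader at all. In this sketch a regular table named entries_search stands in for the search index (the name and schema are illustrative only), and two connections to the same file stand in for two Reader objects.

```python
import os
import sqlite3
import tempfile


def search_disabled_elsewhere():
    """Return the error seen by a connection whose table was dropped by another one."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "db.sqlite")

        first = sqlite3.connect(path)
        second = sqlite3.connect(path)
        try:
            # The first connection "enables search" (here: creates a plain table).
            first.execute("CREATE TABLE entries_search (title TEXT)")
            first.commit()
            # Queries work as expected at this point.
            first.execute("SELECT * FROM entries_search").fetchall()

            # The second connection "disables search" behind its back.
            second.execute("DROP TABLE entries_search")
            second.commit()

            try:
                first.execute("SELECT * FROM entries_search").fetchall()
            except sqlite3.OperationalError as e:
                return e
            return None
        finally:
            first.close()
            second.close()
```

The first connection did nothing wrong, yet its next query fails; no amount of checking at construction time can rule this out.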

At least part of it is essential complexity (us doing migrations automatically in the Storage constructor is trying to hide that, but long-term it makes sense to have that as a separate, explicit operation, like we're doing with search). I don't know what the "correct" API to present this is, but it seems pretty hard/premature to get it right now, especially without other search/storage implementations. We should only solve the problem at hand.

So, what problem am I trying to solve?

- Avoid having to explain search to users.

  No one complained of this; it seems like an imagined problem.

- Reduce boilerplate by doing enable_search(), update_search() automatically.

  Same as above. Also, the granularity is useful; we can't get rid of it completely.

- update_search(), search_entries() may raise SearchNotEnabledError.

  Can't be solved entirely, but enabling it in make_reader() can help – at least it makes the issue more visible.

- update_search() may fail late due to missing dependencies.

  Enabling search in make_reader() and checking for dependencies there seems like an acceptable solution.

Conclusions

- Always install the reader[search] extra.

  No. The granularity is good.

- def make_reader(..., search_enabled=None)

  Maybe.

  Seems like a good idea, at least until we think of a better way of saying "give me a Reader with working search".

  With search_enabled true, we can check for missing dependencies and fail early.

  - def make_reader(..., search_enabled=True) (default true)

    No.

    Requires having reader[search].

    Also, for other search implementations, this may not be desirable (e.g. too slow).

    We may consider enabling search by default if an overwhelming majority of users use search (I have no data for this).

    FWIW, for SQLite, enable_search() takes ~50ms for 12k entries, which seems acceptable (if search is already enabled, it's much less). I don't know how other methods are impacted by search being enabled.

- update_feeds() calls update_search() automatically.

  No. Hard to implement right, breaks expectations.

The table below shows timings on how search affects other methods (for my database with ~160 feeds and 13K entries).

Some notes:

- make_reader() is shown as a reference.
- The various get_...() methods shouldn't be affected, so I didn't measure them.
- Adding entries will be affected, but that happens during update_feeds(), so we don't really care.
- enable_search() is negligible compared to make_reader() if search is already enabled.
- is_search_enabled() takes the same amount of time as enable_search() if search is already enabled.
- set_feed_user_title() and delete_feed() time increases seem acceptable; the time is for all the feeds.
- If search is enabled, and the user adds and then removes a lot of feeds, but never calls update_search(), rows for deleted entries keep accumulating in entries_search_sync_state. These must be cleaned up by calling _delete_from_search() regularly (likely in update_feeds()).
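
To illustrate the cleanup concern in the last note, here is a toy version of the idea. The two-column schema is a simplified assumption, and delete_orphans() is a hypothetical helper; this is not how reader implements _delete_from_search().

```python
import sqlite3


def make_db():
    db = sqlite3.connect(":memory:")
    # Simplified, hypothetical schema; the real tables have more columns.
    db.execute("CREATE TABLE entries (id TEXT, feed TEXT)")
    db.execute("CREATE TABLE entries_search_sync_state (id TEXT, feed TEXT)")
    return db


def delete_orphans(db):
    """Delete sync-state rows whose entry no longer exists; return how many."""
    cursor = db.execute(
        """
        DELETE FROM entries_search_sync_state
        WHERE NOT EXISTS (
            SELECT 1 FROM entries
            WHERE entries.id = entries_search_sync_state.id
            AND entries.feed = entries_search_sync_state.feed
        )
        """
    )
    db.commit()
    return cursor.rowcount
```

If nothing ever runs a cleanup like this, the sync-state table only grows, which is the accumulation described above.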

| | cold | hot |
| --- | --- | --- |
| make_reader(), migration | 24 ms | |
| make_reader(), no migration | 1.4 ms | |
| enable_search(), not enabled | 34 ms | |
| enable_search(), enabled | 0.2 ms | |
| is_search_enabled() | 0.2 ms | 0.2 ms |
| set_feed_user_title(), not enabled | 60 ms | 16 ms |
| set_feed_user_title(), enabled | 140 ms | 16 ms |
| set_feed_user_title(), enabled, updated | 150 ms | 16 ms |
| delete_feed(), not enabled | 1.2 s | |
| delete_feed(), enabled | 1.7 s | |
| delete_feed(), enabled, updated | 1.7 s | |
| _delete_from_search(), enabled | 1.7 ms | |
| _delete_from_search(), enabled, updated | 2.2 ms | |
| _delete_from_search(), enabled, feeds deleted | 163 ms | |
| _delete_from_search(), enabled, updated, feeds deleted | 4.7 s | |

Detailed timing output.

Blocks separated by blank lines were run in new processes. Everything below "all starting from search not enabled" was run after this code:

```python
from reader import *
from shutil import copyfile
copyfile('db.sqlite.orig', 'db.sqlite')
reader = make_reader('db.sqlite')
assert not reader.is_search_enabled()
for _ in reader.get_entries(): pass
for _ in reader.get_feeds(): pass
```

Detailed timings:

```python
>>> reader.get_feed_counts().total
158
>>> reader.get_entry_counts().total
12975

# search enabled, but it doesn't really matter

>>> %time make_reader('db.sqlite')  # migration
Wall time: 24 ms
>>> %time make_reader('db.sqlite')  # no migration
Wall time: 1.4 ms

>>> %time reader.enable_search()  # not enabled
Wall time: 34 ms
>>> %time reader.enable_search()  # enabled
Wall time: 177 µs

# all starting from search not enabled

>>> %time for feed in reader.get_feeds(): reader.set_feed_user_title(feed, feed.title + '...')
Wall time: 60.6 ms
>>> %time for feed in reader.get_feeds(): reader.set_feed_user_title(feed, feed.title + '...')
Wall time: 16 ms

>>> reader.enable_search()
>>> %time for feed in reader.get_feeds(): reader.set_feed_user_title(feed, feed.title + '...')
Wall time: 139 ms
>>> %time for feed in reader.get_feeds(): reader.set_feed_user_title(feed, feed.title + '...')
Wall time: 15.5 ms

>>> %time for feed in reader.get_feeds(): reader.set_feed_user_title(feed, feed.title + '...')
Wall time: 59.9 ms
>>> reader.enable_search()
>>> %time for feed in reader.get_feeds(): reader.set_feed_user_title(feed, feed.title + '...')
Wall time: 16.1 ms

>>> reader.enable_search(); reader.update_search()
>>> %time for feed in reader.get_feeds(): reader.set_feed_user_title(feed, feed.title + '...')
Wall time: 150 ms
>>> %time for feed in reader.get_feeds(): reader.set_feed_user_title(feed, feed.title + '...')
Wall time: 15.8 ms

>>> %time for feed in reader.get_feeds(): reader.delete_feed(feed)
Wall time: 1.2 s

>>> reader.enable_search()
>>> %time for feed in reader.get_feeds(): reader.delete_feed(feed)
Wall time: 1.72 s

>>> reader.enable_search(); reader.update_search()
>>> %time for feed in reader.get_feeds(): reader.delete_feed(feed)
Wall time: 1.74 s

>>> reader.enable_search()
>>> %time reader._search._delete_from_search()
Wall time: 1.74 ms

>>> reader.enable_search(); reader.update_search()
>>> %time reader._search._delete_from_search()
Wall time: 2.15 ms

>>> reader.enable_search()
>>> for feed in reader.get_feeds(): reader.delete_feed(feed)
>>> %time reader._search._delete_from_search()
Wall time: 163 ms

>>> reader.enable_search(); reader.update_search()
>>> for feed in reader.get_feeds(): reader.delete_feed(feed)
>>> %time reader._search._delete_from_search()
Wall time: 4.73 s

>>> %time reader.is_search_enabled()
Wall time: 160 µs
False
>>> %time reader.is_search_enabled()
Wall time: 156 µs
False
>>> reader.enable_search()
>>> %time reader.is_search_enabled()
Wall time: 174 µs
True
```
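
The timings above use IPython's %time magic. For anyone wanting to repeat the cold/hot measurement in a plain script, something like the following should be roughly equivalent; wall_time() is a hypothetical helper written for this comment, built only on time.perf_counter().

```python
import time


def wall_time(func, *args, repeat=2, **kwargs):
    """Call func `repeat` times; return the wall time of each call, in seconds."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return timings
```

With the default repeat=2, the first value corresponds to the "cold" column and the second to "hot", e.g. `cold, hot = wall_time(reader.is_search_enabled)`.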

Conclusions (assuming we're OK with the cost of having search enabled by default)

- Always install the reader[search] extra.

  Maybe. Currently, this adds only beautifulsoup4 and soupsieve, which seems acceptable.

- def make_reader(..., search_enabled=None)

  Yes.

  Seems like a good idea, at least until we think of a better way of saying "give me a Reader with working search".

  With search_enabled true, we can check for missing dependencies and fail early.

  - def make_reader(..., search_enabled=True) (default true)

    Maybe.

    Requires having reader[search].

    While for other search implementations this may not be desirable (e.g. too slow), we don't have any planned at the moment. It's better to be friendly to the users now.

    This affects the other methods in acceptable ways speed-wise. If the user never calls update_search(), disk usage is only ~2% higher.

    _delete_from_search() needs to be called regularly.

  - def make_reader(..., search_enabled=auto) (enable search on first use)

    Maybe.

    Same as above, but enable search on the first update_search() call.

    Zero-cost until enabled. Doesn't require _delete_from_search() to be called regularly (the user obviously knows about update_search()).

    TODO: We may need a better argument name.

- update_feeds() calls update_search() automatically.

  No. Hard to implement right, breaks expectations.
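
A rough sketch of what the "enable search on first use" option could look like. FakeSearch is an in-memory stand-in and update_search() here is a free function written for illustration; the auto_enable flag is a placeholder name, since the argument name is still a TODO.

```python
class SearchNotEnabledError(Exception):
    """Stand-in for the exception of the same name mentioned in this thread."""


class FakeSearch:
    """In-memory stand-in for a search backend (hypothetical)."""

    def __init__(self):
        self.enabled = False
        self.updates = 0

    def enable(self):
        self.enabled = True

    def update(self):
        if not self.enabled:
            raise SearchNotEnabledError()
        self.updates += 1


def update_search(search, auto_enable=True):
    """Update search, enabling it first if this is the first use."""
    try:
        search.update()
    except SearchNotEnabledError:
        if not auto_enable:
            raise
        search.enable()
        search.update()
```

Nothing is enabled until the user actually calls update_search(), which is where the "zero-cost until enabled" property comes from.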

To do:

- [x] always install the search extra, remove dependency check, update install docs and scripts
- ~~call _delete_from_search() during update_feeds()~~
- [x] add search_enabled, docstring
- [x] user guide, readme
- [x] check SQLite version in make_reader() (?)
- [x] changelog

Spent 14h on this, 8 of them thinking; that's arguably too much...
