lemon24 / reader

A Python feed reader library.
https://reader.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Search sync simplification #323

Closed lemon24 closed 4 months ago

lemon24 commented 9 months ago

Two semi-related ideas:

  1. (from One process programming...) Split the FTS table into a separate, ATTACH-ed database (see the sketch after this list).
    • Pro: Actual data is separated from transient data (mainly useful for backups).
    • Pro (?): Search doesn't need to be enabled/disabled anymore. Maybe we can deprecate those APIs?
    • Con: We cannot use triggers to update a "needs to be synced" table (the search database may not even be attached).
    • Pro: Fewer triggers (see previous point) => somewhat faster updates.
  2. (consequence of 1) Don't use a separate sync table; use "last_modified >= now - last_sync" queries instead.
    • Pro: Decouples main and search databases/logic to some extent (pull instead of push); more similar to how an "external" search implementation would work (although external searches would use some storage internal APIs).
    • Con: Reliance on the wall clock, which is unreliable. Will need extra thought on how to handle this correctly (we're aiming for eventual consistency here).
    • Update: Con: What about keeping track of deleted entries?
    • Pro: More "robust" sync mechanism (aside from the wall clock issue in the previous point).
    • Con (?): Two separate schemas to maintain (although for search we can just recreate everything).
    • Con: Two places to run PRAGMA optimize etc. (also, FTS5 has INSERT ... VALUES('optimize'), which we're not using at the moment but likely should).
    • Update: PRAGMA optimize does all of the schemas, so this is not an issue.
    • Update: We don't need to do FTS5 optimize, it happens automatically, see the automerge command.
    • Con: Different queries based on different use cases; e.g. entry changed, feed title changed (which is not the same as feed changed!).
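
A rough sketch of what ideas 1 and 2 could look like together; the file, table, and column names below are illustrative, not reader's actual schema:

import sqlite3

# the main database holds the actual data...
db = sqlite3.connect("main.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS entries"
    "(id TEXT PRIMARY KEY, title TEXT, last_updated TEXT)"
)

# ...while the FTS table lives in a separate, ATTACH-ed database,
# so backing up main.db does not drag the (re-creatable) index along
db.execute("ATTACH DATABASE 'search.db' AS search")
db.execute(
    "CREATE VIRTUAL TABLE IF NOT EXISTS search.entries_search "
    "USING fts5(id UNINDEXED, title)"
)

# idea 2: pull changed entries with a query instead of trigger-maintained rows
last_sync = '2024-01-01 00:00:00'
changed = db.execute(
    "SELECT id, title FROM entries WHERE last_updated >= ?", (last_sync,)
).fetchall()

# the FTS5 index maintenance command mentioned above would be:
# INSERT INTO entries_search(entries_search) VALUES('optimize')
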
lemon24 commented 5 months ago

Using model-based testing on a search sync implementation: https://gist.github.com/lemon24/558955ad82ba2e4f50c0184c630c668c

Assuming that (1) Sync models things correctly and (2) there are no exceedingly rare bugs Hypothesis couldn't find, a correct solution looks like this:

  • Each entry has a "sequence" that changes every time the entry changes in a way the search index cares about. The sequence can be a global counter, a random number, or a timestamp; the only requirement is that it won't be used again (or that reuse is extremely unlikely).

  • Each sequence change gets recorded. For changes to an existing entry, a delete+update pair is recorded, carrying the old and new sequence, respectively.

  • Some time later, search processes pending changes and deletes them. For updates, search checks the change sequence against the entry sequence; if the sequences do not match, the change is ignored (deleted without updating the index). See the sketch below.
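
A toy, in-memory sketch of this mechanism (not reader's actual implementation, and ignoring entry deletion), using a global counter as the sequence:

from itertools import count

class Model:
    """Toy model: entries plus a change log on one side, a search index
    on the other, kept eventually consistent via sequence checks."""

    def __init__(self):
        self.sequences = count(1)  # a counter satisfies "never reused"
        self.entries = {}          # id -> (sequence, text)
        self.changes = []          # ('delete' | 'insert', id, sequence)
        self.index = {}            # id -> (sequence, text)

    def set_entry(self, id, text):
        old = self.entries.get(id)
        new_sequence = next(self.sequences)
        self.entries[id] = (new_sequence, text)
        if old:
            # existing entry: record a pair, old sequence for the delete...
            self.changes.append(('delete', id, old[0]))
        # ...new sequence for the insert
        self.changes.append(('insert', id, new_sequence))

    def sync(self):
        changes, self.changes = self.changes, []
        for action, id, sequence in changes:
            if action == 'delete':
                # remove only the version this change refers to
                if self.index.get(id, (None,))[0] == sequence:
                    del self.index[id]
            else:
                # insert: skip if the entry changed again in the meantime;
                # the newer sequence will come through as a separate change
                entry = self.entries.get(id)
                if entry and entry[0] == sequence:
                    self.index[id] = entry

model = Model()
model.set_entry('one', 'hello')
model.set_entry('one', 'world')  # changed again before sync
model.sync()
assert model.index == {'one': (2, 'world')}
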

lemon24 commented 5 months ago

Requirements:

Proposed minimal storage API:

class Storage:
    ...
    def enable_changes(self) -> None: ...
    def disable_changes(self) -> None: ...

    def get_changes(
        self, action: Action | None = None, limit: int | None = None
    ) -> list[Change]:
        """Return the next batch of changes, if any.

        Return at most `limit` changes; may return fewer,
        depending on storage internal limits.

        """

    def changes_done(self, changes: list[Change]) -> None:
        """Mark changes as done. Ignore unknown changes.

        Raise if more changes are passed than get_changes() returns;
        changes_done(get_changes()) should always work.

        """

class Change:
    action: Action
    sequence: bytes
    feed_url: str | None = None
    entry_id: str | None = None
    tag: str | None = None

class Action(Enum):
    # resource needs to be added to the index
    INSERT = 1
    # resource needs to be deleted from the index
    DELETE = 2
    # (optional) backfill needed to catch up (not all changes are available);
    # added by enable_changes(), or if old unprocessed changes are dropped
    BACKFILL = 3

class Entry:
    ...
    _sequence: bytes

This is used by search implementations as follows:

def enable(self):
    self.storage.enable_changes()
    ...

def disable(self):
    self.storage.disable_changes()
    ...

def update(self):
    changes = self.storage.get_changes(Action.BACKFILL)
    if changes:
        assert changes == [Change(Action.BACKFILL, b'')]
        self._do_backfill()
        self.storage.changes_done(changes)

    while True:
        changes = self.storage.get_changes(Action.DELETE)
        if not changes:
            break
        self._do_delete(changes)
        self.storage.changes_done(changes)

    while True:
        changes = self.storage.get_changes(Action.INSERT)
        if not changes:
            break
        self._do_insert(changes)
        self.storage.changes_done(changes)    

Later, we can add a batch get-entries Storage method that returns a requested entry only if its sequence matches:

def get_entries_for_search(
    self, entries: list[tuple[str, str, bytes]]
) -> list[Entry]: ...
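
For example, the insert step above could then look like this (hypothetical helper names, and assuming each tuple is a (feed_url, entry_id, sequence) triple taken from a Change):

def _do_insert(self, changes):
    keys = [(c.feed_url, c.entry_id, c.sequence) for c in changes]
    # entries whose sequence changed in the meantime are simply not returned;
    # their newer sequence will come through as a separate change
    for entry in self.storage.get_entries_for_search(keys):
        self._index_entry(entry)  # hypothetical indexing helper
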

Support for feeds can be added by giving them a _sequence attribute. Support for tags is TBD, as it depends on how tag search works (but they'd likely get a sequence too, at least those that need to be searched).

lemon24 commented 5 months ago

OK, the initial change tracking implementation is mostly done; I'm relatively confident there aren't any corner cases the current triggers aren't handling, but additional tests should clear that up.
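
For context, a rough sketch of what such change-tracking triggers can look like; the table and column names are illustrative, not reader's actual schema:

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE entries(id TEXT PRIMARY KEY, title TEXT, sequence BLOB);
    CREATE TABLE changes(sequence BLOB, id TEXT, action INTEGER);

    -- new entry: record an INSERT (1) for its sequence
    CREATE TRIGGER entries_inserted
    AFTER INSERT ON entries
    BEGIN
        INSERT INTO changes VALUES (new.sequence, new.id, 1);
    END;

    -- changed entry: record a DELETE (2) for the old sequence
    -- and an INSERT (1) for the new one
    CREATE TRIGGER entries_changed
    AFTER UPDATE OF title, sequence ON entries
    BEGIN
        INSERT INTO changes VALUES (old.sequence, old.id, 2);
        INSERT INTO changes VALUES (new.sequence, new.id, 1);
    END;
""")

# the sequence gets bumped as part of the update
db.execute("INSERT INTO entries VALUES ('one', 'hello', randomblob(16))")
db.execute(
    "UPDATE entries SET title = 'world', sequence = randomblob(16) "
    "WHERE id = 'one'"
)
print(db.execute("SELECT action, hex(sequence) FROM changes").fetchall())
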

What's left to be done can be split into two phases: first update search to use the changes API for sync, then make it use an attached database; changing the search table schema to support feeds and tags can be done later.

To do: