Closed · lemon24 closed 4 months ago
Using model-based testing on a search sync implementation: https://gist.github.com/lemon24/558955ad82ba2e4f50c0184c630c668c
Assuming that (1) Sync models things correctly and (2) there are no exceedingly rare bugs Hypothesis couldn't find, a correct solution looks like this:
Each entry has a "sequence" that changes every time the entry changes in a way the search index cares about. The sequence can be a global counter, a random number, or a timestamp; the only requirement is that it won't be used again (or it is extremely unlikely that will happen).
Each sequence change gets recorded. For changes to an existing entry, a delete+insert pair is recorded, carrying the old and the new sequence, respectively.
Some time later, search processes the pending changes and deletes them. For inserts, search checks the change sequence against the current entry sequence; if the sequences do not match, the change is stale and is ignored (deleted without updating the index).
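Concretely, the protocol above can be sketched with an in-memory model (a global counter as the sequence source; all names here are illustrative, not the actual reader API):

```python
import itertools

# Hypothetical in-memory sketch; the real implementation records changes
# with SQLite triggers. A global counter serves as the sequence source.
_next_sequence = itertools.count().__next__

entries = {}  # entry_id -> current sequence
pending = []  # (action, entry_id, sequence) tuples awaiting search

def change_entry(entry_id):
    """Record that an entry changed in a way the search index cares about."""
    old = entries.get(entry_id)
    new = _next_sequence()
    entries[entry_id] = new
    if old is not None:
        # existing entry: delete the old document, insert the new one
        pending.append(('delete', entry_id, old))
    pending.append(('insert', entry_id, new))

def process_changes(index):
    """Apply pending changes to a search index (a plain dict here)."""
    while pending:
        action, entry_id, sequence = pending.pop(0)
        if action == 'delete':
            index.pop((entry_id, sequence), None)
        elif sequence == entries.get(entry_id):
            # sequences match: the entry did not change since the change was recorded
            index[(entry_id, sequence)] = f"document for {entry_id}"
        # otherwise the change is stale; ignore it (delete without insert)
```

If an entry changes twice before search runs, the first insert's sequence no longer matches and that change is skipped, so only the latest version gets indexed.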
Requirements:
Proposed minimal storage API:
```python
class Storage:
    ...

    def enable_changes(self) -> None: ...

    def disable_changes(self) -> None: ...

    def get_changes(
        self, action: Action | None = None, limit: int | None = None
    ) -> list[Change]:
        """Return the next batch of changes, if any.

        Return at most `limit` changes; may return fewer,
        depending on storage internal limits.

        """

    def changes_done(self, changes: list[Change]) -> None:
        """Mark changes as done. Ignore unknown changes.

        Raise if more changes are passed than get_changes() returned;
        changes_done(get_changes()) should always work.

        """
```
```python
class Change:
    action: Action
    sequence: bytes
    feed_url: str | None = None
    entry_id: str | None = None
    tag: str | None = None


class Action(Enum):
    # resource needs to be added to the index
    INSERT = 1
    # resource needs to be deleted from the index
    DELETE = 2
    # (optional) backfill needed to catch up (not all changes are available);
    # added by enable_changes(), or if old unprocessed changes are dropped
    BACKFILL = 3


class Entry:
    ...
    _sequence: bytes
```
This is used by search implementations as follows:
```python
def enable(self):
    self.storage.enable_changes()
    ...

def disable(self):
    self.storage.disable_changes()
    ...

def update(self):
    changes = self.storage.get_changes(Action.BACKFILL)
    if changes:
        assert changes == [Change(Action.BACKFILL, b'')]
        self._do_backfill()
        self.storage.changes_done(changes)

    while True:
        changes = self.storage.get_changes(Action.DELETE)
        if not changes:
            break
        self._do_delete(changes)
        self.storage.changes_done(changes)

    while True:
        changes = self.storage.get_changes(Action.INSERT)
        if not changes:
            break
        self._do_insert(changes)
        self.storage.changes_done(changes)
```
Later, we can add a batch-get-entries Storage method that only returns the requested entry if the sequence matches:
```python
def get_entries_for_search(
    self, entries: list[tuple[str, str, bytes]]
) -> list[Entry]: ...
```
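A toy version of the sequence check such a method would do (hypothetical names; the real method would query the database instead of a dict):

```python
# Hypothetical sketch: return only entries whose sequence still matches,
# so search never indexes data newer than the change it is processing.
def get_entries_for_search(db, keys):
    """db maps (feed_url, entry_id) -> (sequence, entry); keys are
    (feed_url, entry_id, sequence) triples taken from pending changes."""
    rv = []
    for feed_url, entry_id, sequence in keys:
        current = db.get((feed_url, entry_id))
        if current and current[0] == sequence:
            rv.append(current[1])
    return rv
```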
Support for feeds can be added by giving them a _sequence attribute. Support for tags is TBD, as it depends on how tag search works (but they'd likely get a sequence too, at least those that need to be searched).
OK, the initial change-tracking implementation is mostly done; I'm relatively confident there aren't any corner cases the current triggers don't handle, but additional tests should clear that up.
What's left to be done can be split into two phases: first update search to use the changes API for sync, then make it use an attached database; changing the search table schema to support feeds and tags can be done later.
To do:
Two semi-related ideas:
Pro: Fewer triggers (see previous point) => somewhat faster updates.

`PRAGMA optimize` does all of the schemas, so this is not an issue.
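To illustrate (using the stdlib sqlite3 module): a single `PRAGMA optimize` on the connection covers the attached search schema as well as main, so the attached-database approach needs no extra maintenance call:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# attach a second database, as the search implementation would
con.execute("ATTACH DATABASE ':memory:' AS search")
con.execute("CREATE TABLE main.entries (id TEXT PRIMARY KEY)")
con.execute("CREATE TABLE search.documents (id TEXT PRIMARY KEY)")
# one call analyzes/optimizes all attached schemas, not just main
con.execute("PRAGMA optimize")
```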