untitled-pit-group / foxhound

PIFS standard backend
BSD Zero Clause License

Search infra #5

Closed paulsnar closed 2 years ago

paulsnar commented 2 years ago

I have only a vague idea of how Solr works. This needs to be researched (in the context of what's required) and documented, and presumably a service provider needs to be written to discover and interact with the local Solr instance.

paulsnar commented 2 years ago

The primary motivation for me proposing Solr over just using Postgres's built-in FTS was that I had assumed Lucene's (and therefore Solr's) support for Latvian processing was significantly better than what could be accomplished with Postgres. After some research, that assumption seems to have been overstated: pretty much all that Lucene provides is a stemmer and a stopword list, both of which are almost direct ports of Kreslins' '96 PhD thesis.

Running Solr is a fairly large dependency that doesn't seem to bring much real benefit, and interfacing with it only to provide the equivalent of FTS seems like overkill. Therefore I propose to move all search functionality into Postgres instead. Both of the aforelinked parts should be possible to port to Postgres as dictionaries or the like, and this would significantly simplify the architecture of Foxhound and, by extension, Chihuahua.
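For illustration only, here's roughly what porting those two pieces into Postgres could look like. This assumes the stemmer and stopword list could be packaged as an ispell-format dictionary (latvian.dict / latvian.affix) plus a stopword file (latvian.stop) dropped into $SHAREDIR/tsearch_data/ — Postgres doesn't ship any of these out of the box, and all file and dictionary names here are made up:

```sql
-- Sketch only: assumes latvian.dict, latvian.affix and latvian.stop exist
-- in $SHAREDIR/tsearch_data/; none of these ship with Postgres.

-- Stemming via an ispell-style dictionary, with the stopword list applied.
CREATE TEXT SEARCH DICTIONARY latvian_ispell (
    TEMPLATE = ispell,
    DictFile = latvian,
    AffFile = latvian,
    StopWords = latvian
);

-- Fallback for words the ispell dictionary doesn't recognize: keep them
-- as-is, but still drop stopwords.
CREATE TEXT SEARCH DICTIONARY latvian_simple (
    TEMPLATE = simple,
    StopWords = latvian
);

CREATE TEXT SEARCH CONFIGURATION latvian (COPY = simple);
ALTER TEXT SEARCH CONFIGURATION latvian
    ALTER MAPPING FOR asciiword, word, hword, hword_part
    WITH latvian_ispell, latvian_simple;

-- Quick sanity check that stemming and stopword removal take effect:
SELECT to_tsvector('latvian', 'Teksta meklēšana latviešu valodā');
```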

paulsnar commented 2 years ago

A proposal of how to move this forward:

This nonetheless might present a challenge in how to annotate which parts of the transcript correspond to which regions of the file in question. One way to do this would be to include the locators in-band with the content, wrapped in some distinctive delimiter (say, `<<0:43>> text 1 <<0:46>> text 2`, and similarly for paged media), but that might break FTS whenever the search phrase crosses a locator boundary, since there would then be no exact match. Alternatively, the locators could be stored out-of-band in a separate column, but then a way of matching the text up with the relevant locator needs to be defined. (Perhaps the simplest way would be to store the text both without locators and with them, and upon a match in the locatorless text, find it within the locator-annotated text by incrementally widening the search context, but that seems pretty wasteful because it would involve storing the whole transcript twice.) For now I'm postponing this onto the backlog as well, because for the MVP I believe returning a dummy locator should work fine.
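Purely to make the out-of-band variant concrete, a sketch under made-up names (the transcript table and its content / locators columns are hypothetical, and the latvian configuration refers to the sketch in the earlier comment). Recording character offsets next to each locator is just one possible way to do the text-to-locator matching mentioned above, not something I've settled on:

```sql
-- Hypothetical schema: locator-free text for FTS, locators kept out-of-band.
CREATE TABLE transcript (
    file_id  bigint NOT NULL,
    content  text   NOT NULL,             -- plain text, what FTS runs over
    locators jsonb  NOT NULL DEFAULT '[]' -- e.g. [{"offset": 120, "locator": "0:43"}, ...]
);

CREATE INDEX transcript_fts_idx
    ON transcript
    USING gin (to_tsvector('latvian', content));

-- Search the plain text; a result's position in content could then be
-- mapped back to the nearest preceding locator via the stored offsets.
SELECT file_id,
       ts_headline('latvian', content, query) AS snippet
  FROM transcript,
       websearch_to_tsquery('latvian', 'meklējamā frāze') AS query
 WHERE to_tsvector('latvian', content) @@ query;
```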