untitled-pit-group / foxhound

PIFS standard backend
BSD Zero Clause License

Search: Normalization for transcribed text #17

Open paulsnar opened 2 years ago

paulsnar commented 2 years ago

See #5 for context. No normalization will be implemented for the MVP phase, but it would be a nice-to-have.

Both Solr and Postgres appear to have acceptable support for normalizing English text. For Latvian, Solr provides a rudimentary normalization algorithm in the form of a stemmer and a stopword list, both sourced from Kārlis Krēsliņš' '96 thesis. Presumably these would have to be ported to Postgres somehow; preliminary research suggests Postgres supports text search dictionaries, which are intended for this purpose.
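To make the stopword-plus-stemmer idea concrete, here is a minimal Python sketch of what such a normalization step does to a piece of transcribed text. It is not Solr's or Postgres's implementation: the `normalize` function, the `STOPWORDS` sample and the pluggable `stem` callback are all hypothetical names made up for illustration, and the stopword set is a tiny hand-picked sample rather than the list from the thesis.

```python
import re
from typing import Callable, FrozenSet, List

# Hypothetical, tiny sample of Latvian stopwords for illustration only;
# a real list would be much longer.
STOPWORDS: FrozenSet[str] = frozenset({"un", "ir", "par", "ar", "no"})

# \w is Unicode-aware for str patterns, so letters such as "ā" are kept.
TOKEN_RE = re.compile(r"\w+")


def normalize(
    text: str,
    stem: Callable[[str], str] = lambda token: token,
    stopwords: FrozenSet[str] = STOPWORDS,
) -> List[str]:
    """Case-fold, tokenize, drop stopwords, then stem what is left."""
    tokens = (match.group(0).casefold() for match in TOKEN_RE.finditer(text))
    return [stem(token) for token in tokens if token not in stopwords]


if __name__ == "__main__":
    # With the default identity stemmer only case folding and
    # stopword removal are applied.
    print(normalize("Māja un dārzs ir par lielu"))
```

Keeping the stemmer as a parameter means the same pipeline could be tried with no stemming at all, or with whatever stemmer ends up being chosen.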

Alternatively, perhaps algorithms from AILab's projects (cf also) could be used instead, given that they've had a couple more years of effort put into them, but I'm not sure where to begin with those.

paulsnar commented 2 years ago

One of the ways Postgres supports custom languages is through a Snowball stemming program. As ever, there's support for Lithuanian but not Latvian.

That said, Krēsliņš' algorithm presumably shouldn't be too difficult to port to Snowball, given that Snowball is a language designed specifically for writing stemmers.
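For a rough feel of the kind of logic such a port would involve, here is a minimal Python sketch of longest-match suffix stripping with a minimum stem length, which is the general shape of many rule-based stemmers. It is not Krēsliņš' algorithm and not Snowball code: the `SUFFIXES` tuple is a handful of endings picked for illustration, and `strip_suffix` and `MIN_STEM` are hypothetical names.

```python
from typing import Sequence

# Hypothetical sample of endings for illustration only; this is NOT the
# rule set from the thesis, and it ignores any stem changes.
SUFFIXES = ("iem", "ām", "as", "am", "us", "a", "s", "u", "i")

# Assumed guard so that very short words are left alone.
MIN_STEM = 3


def strip_suffix(
    word: str,
    suffixes: Sequence[str] = SUFFIXES,
    min_stem: int = MIN_STEM,
) -> str:
    """Remove the longest matching suffix that leaves a long enough stem."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for form in ("māja", "mājas", "mājām", "galds", "galdiem", "es"):
        print(form, "->", strip_suffix(form))
```

Trying longer suffixes first keeps a short ending such as "s" from pre-empting a longer one such as "as", and the minimum stem length keeps short words like "es" from being reduced to a single letter.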