projectEndings / staticSearch

A codebase to support a pure JSON search engine requiring no backend for any XHTML5 document collection
https://endings.uvic.ca/staticSearch/docs/index.html
Mozilla Public License 2.0

Plan for multilingual stemming #296

Open martindholmes opened 5 months ago

martindholmes commented 5 months ago

@joeytakeda and I discussed the problem of multilingual texts and stemming today, and came up with what looks like a workable plan. This is a blue-sky enhancement that we don't intend to include in 2.0; it would come along afterwards.

  1. End-users may add @lang attributes to sections of text which are not in the default language of pages. (This would be good practice anyway.)
  2. Content in those contexts could be pre-stemmed, using span elements around words, with an attribute @data-ss-stem. This could be done either by end users as part of their own build process or, if staticSearch has a stemmer for the language, by a staticSearch process that runs before the main language stemming/tokenizing.
  3. When the main stemming process encounters one of these spans, it simply ignores the content.
  4. At build time, a JSON file is created which is basically a lookup table from full forms to stems, for all the terms which have @data-ss-stem.
  5. The StaticSearch object looks for this file, and loads it if it exists. If loaded, when running any text search, the JS first checks that lookup table for any matches, and if any are found, uses those stems. It also proceeds to apply the default stemming to the term; this is necessary because it's not possible to know which language a particular search term is supposed to be in, and there are of course cases where words are the same across languages.
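
Steps 4 and 5 could be sketched roughly as follows. This is only an illustration under assumptions, not staticSearch code: the function names (`buildStemLookup`, `stemsForTerm`), the shape of the lookup table (full form mapped to an array of stems), and the toy default stemmer are all hypothetical.

```javascript
// Hypothetical sketch; none of these names come from staticSearch itself.

// Build step (item 4): collect full form -> stems from pre-stemmed spans.
// `spans` stands in for the <span data-ss-stem="..."> elements found in pages.
function buildStemLookup(spans) {
  const lookup = {};
  for (const { text, stem } of spans) {
    const form = text.toLowerCase();
    if (!lookup[form]) lookup[form] = [];
    // The same full form may carry different stems in different languages.
    if (!lookup[form].includes(stem)) lookup[form].push(stem);
  }
  return lookup; // would be written out with JSON.stringify at build time
}

// Search step (item 5): use any stems from the lookup table, and always
// add the default-language stem as well, since the term's language is unknown.
function stemsForTerm(term, lookup, defaultStem) {
  const form = term.toLowerCase();
  const stems = new Set(lookup && lookup[form] ? lookup[form] : []);
  stems.add(defaultStem(form));
  return [...stems];
}

// Toy stand-in for the site's default-language stemmer (assumption).
const toyEnglishStem = (w) => w.replace(/(ing|s)$/, "");

const lookup = buildStemLookup([
  { text: "chantons", stem: "chant" },
  { text: "Mangeons", stem: "mang" },
]);
```

With this sketch, a search for "chantons" would be run against both the stem from the table and whatever the default stemmer produces, while a term absent from the table (or a site with no lookup file at all, modelled here by passing `null`) falls back to the default stem alone.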

This is a good solution to the problem of sites with one dominant language and potentially many other languages that appear in quotes and so on. Truly multilingual sites should of course have multiple searches.