projectEndings / staticSearch

A codebase to support a pure JSON search engine requiring no backend for any XHTML5 document collection
https://endings.uvic.ca/staticSearch/docs/index.html
Mozilla Public License 2.0

Allow specification by XPath of elements where the stopword list should be ignored when indexing #273

Open martindholmes opened 10 months ago

martindholmes commented 10 months ago

Working on a couple of Indigenous language dictionaries, we've encountered an intriguing problem. It's perfectly legitimate for a user/learner of the language to want to search for the equivalent in the other language of a common English word that might be in the stopword list. If you're learning prepositions of location, you would obviously want to search for "at", "in", "on", etc.

However, if we just nuke these items from the stopword list, we'll end up with a massive index, and most of the hits will not be relevant to the search.

I think the solution here is to have a config file component which allows you to specify, through XPath, elements where the stopword list will be ignored when indexing. For example, a <gloss> element inside a dictionary entry can be assumed to contain the English gloss for a term, so it could be indexed without the stopword list being invoked, generating an index entry for "in" if it contains that word. Instances of the stopwords would be ignored in all other contexts as normal.
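To make the idea concrete, here is a minimal sketch of the indexing rule in Python. It is not how staticSearch does it (the real indexer is not shown in this issue); the names `STOPWORDS`, `EXEMPT_XPATHS`, `build_index`, and the sample entry with its made-up `<form>` value are all hypothetical, and it uses the limited XPath subset supported by `xml.etree.ElementTree`.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for the stopword list and the proposed config option.
STOPWORDS = {"a", "at", "in", "on", "the"}
EXEMPT_XPATHS = [".//gloss"]  # elements indexed without stopword filtering

# Made-up dictionary entry, for illustration only.
SAMPLE = "<entry><form>tsa</form><gloss>in</gloss><def>Used in the house</def></entry>"


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", (text or "").lower())


def build_index(xml_string, stopwords=STOPWORDS, exempt_xpaths=EXEMPT_XPATHS):
    """Count tokens, ignoring the stopword list inside exempt elements."""
    root = ET.fromstring(xml_string)
    # Collect every element matched by any of the configured XPaths.
    exempt = {el for xpath in exempt_xpaths for el in root.findall(xpath)}
    index = {}

    def add_tokens(text, filter_stopwords):
        for token in tokenize(text):
            if filter_stopwords and token in stopwords:
                continue
            index[token] = index.get(token, 0) + 1

    def walk(element, inside_exempt):
        # Descendants of an exempt element are exempt too.
        inside_exempt = inside_exempt or element in exempt
        add_tokens(element.text, not inside_exempt)
        for child in element:
            walk(child, inside_exempt)
            # Tail text follows the child but belongs to this element's context.
            add_tokens(child.tail, not inside_exempt)

    walk(root, False)
    return index
```

Calling `build_index(SAMPLE)` yields an entry for "in" that comes only from the `<gloss>` element; the "in" and "the" inside `<def>` are still dropped.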

This doesn't seem like it would be too difficult. The only bit I haven't figured out is how to carry this functionality over to the JavaScript. Maybe all we need to do for a case like this is not use the stopword list at all in the search, on the assumption that there's no penalty when a common word is searched for. If there's a stem file for it, then good: it will have been constructed only from the specially-defined contexts, and shouldn't be too large. If there isn't, then the search just fails.
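The search-side behaviour described above can be sketched as follows. This is Python rather than the project's JavaScript, purely to illustrate the logic; the one-JSON-file-per-stem layout, the file naming, and the function names `lookup_term` and `search` are assumptions, and no stemming is applied.

```python
import json
from pathlib import Path


def lookup_term(term, stems_dir):
    """Return the parsed stem file for a term, or None if no file exists."""
    # Assumed layout: one JSON file per indexed stem, named after the stem.
    stem_file = Path(stems_dir) / f"{term.lower()}.json"
    if not stem_file.is_file():
        return None  # the term was never indexed, so this term simply fails
    with stem_file.open(encoding="utf-8") as handle:
        return json.load(handle)


def search(query, stems_dir):
    """Look up every query term; no stopword filtering happens here."""
    return {term: lookup_term(term, stems_dir) for term in query.lower().split()}
```

With this approach the client never consults the stopword list: a query for "in" succeeds only if the indexer produced a stem file for it from one of the exempt contexts.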

@joeytakeda Any thoughts?

martindholmes commented 9 months ago

After discussion, we will wait until we actually have a project for which simply using an empty stopword list does not solve this problem. If we do implement it, we should do it through contexts.