projectEndings / staticSearch

A codebase to support a pure JSON search engine requiring no backend for any XHTML5 document collection
https://endings.uvic.ca/staticSearch/docs/index.html
Mozilla Public License 2.0

do we need default files for a dictionary and stop words list #271

Open peterrobinson opened 11 months ago

peterrobinson commented 11 months ago

My adventures with static search continue: you can read more about them at my Scholarly Digital Editions blog, especially the sequence beginning at the Endings Project and the Canterbury Tales project. Here I will post particular issues that the grand gurus of static search may want to consider.

In setting up static search for my project, I found that it would not run unless I had set up values in the section of the config file for <stopwordsFile> and <dictionaryFile>. Further, these had to point at actual files present in the root (or related) directory in my project folder, thus:

    `<stopwordsFile>test_stopwords.txt</stopwordsFile>`
    `<dictionaryFile>english_words.txt</dictionaryFile>`

Then, I had to physically include the files test_stopwords.txt and english_words.txt in my project folder for static search to run successfully.
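For anyone hitting the same failure, a pre-flight check along these lines may help confirm that both elements are present and point at real files before starting a build. This is a minimal sketch and not part of staticSearch: the function name `check_config` is hypothetical, elements are matched by local name so no particular namespace is assumed, and it assumes relative paths resolve against the folder containing the config file.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Element names as they appear in the config snippet above.
REQUIRED = ("stopwordsFile", "dictionaryFile")


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def check_config(config_path):
    """Return a list of problems; an empty list means both files were found."""
    config_path = Path(config_path)
    root = ET.parse(config_path).getroot()

    # Collect the text of each required element, wherever it sits in the tree.
    found = {}
    for el in root.iter():
        name = local_name(el.tag)
        if name in REQUIRED:
            found[name] = (el.text or "").strip()

    problems = []
    for name in REQUIRED:
        value = found.get(name)
        if not value:
            problems.append(f"<{name}> is missing or empty")
            continue
        # Assumption: relative paths resolve against the config file's folder.
        target = config_path.parent / value
        if not target.is_file():
            problems.append(f"<{name}> points to a missing file: {target}")
    return problems
```

Calling `check_config("config.xml")` (the file name is only an example) returns an empty list when both files are in place, and otherwise one message per missing element or file.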

This seems to me to be a candidate for default values, to avoid being forced to choose something (anything!) to make the build work.

martindholmes commented 9 months ago

It's a bug that if nothing is supplied for these files the build fails. It would be quite rare that you wouldn't want to use a stoplist or a wordlist; the only context I can imagine not wanting a stoplist is in the case of a dictionary where words like "in", "at", or "here" might well be searched for, but issue #273 proposes a different solution to that (which if implemented will be dependent on a complete reworking of the tokenizing process, which @joeytakeda is thinking about now).

So I think the solution here is:

  1. Add these two config items to the list of mandatory items.
  2. Document how to create and use empty files if you do want to avoid a stoplist.
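
As a rough illustration of the second point, empty placeholder files could be created as below. This is only a sketch: the helper name and the file names `empty_stopwords.txt` and `empty_words.txt` are hypothetical, and it assumes (without having confirmed it here) that the build accepts zero-byte files for both settings.

```python
from pathlib import Path


def create_empty_list_files(project_dir,
                            names=("empty_stopwords.txt", "empty_words.txt")):
    """Create placeholder files in project_dir and return their paths.

    The default names are hypothetical; use whatever names the config
    file's <stopwordsFile> and <dictionaryFile> elements point to.
    """
    project_dir = Path(project_dir)
    created = []
    for name in names:
        path = project_dir / name
        # touch() creates a zero-byte file if none exists; an existing
        # file keeps its content, so nothing is overwritten by accident.
        path.touch(exist_ok=True)
        created.append(path)
    return created
```

The config would then name these files in `<stopwordsFile>` and `<dictionaryFile>`, in the same way as the non-empty files in the original report.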

martindholmes commented 9 months ago

This should be a patch to the 1.4 release branch and also implemented in dev.

joeytakeda commented 9 months ago

I think part of what @peterrobinson is asking for (and apologies if I'm misrepresenting here) is that the config file should be as minimal as possible in order to get staticSearch up and running—if you don't specify a stopwords element, then you just get whatever staticSearch thinks you should use (i.e. xsl/english_words.txt).

We've gone back and forth on the problem of default values (see #195), and I think this is a good case for stating that stopwords shouldn't be mandatory at all.

In terms of the dictionaryFile — I personally think we should just get rid of it entirely; it's only used when generating the report, but does create additional files and overhead unnecessarily, imo.

martindholmes commented 9 months ago

Just adding a reminder that the documentation will need substantial changes arising out of the decisions made here. The 1.4.5 documentation is updated for the mandatory status of those elements, but it seems likely that the dictionaryFile may be unnecessary and the stopwordsFile optional in 2.0. However, it's worth remembering that our documentation suggests that you might create or modify your stopwords file based on the output of the report generator, and that currently depends on the dictionary file.