guardian / typerighter

Even if you’re the right typer, couldn’t hurt to use Typerighter!
Apache License 2.0

Process dictionary rules in a matcher and provide as matches to Typerighter consumers #410

Closed rhystmills closed 1 year ago

rhystmills commented 1 year ago

What does this change?

This processes dictionary rules in the checker service, as published in the rule JSON artefact from the rule-manager service. It creates a DictionaryMatcher that checks incoming text blocks from the Typerighter consumer (Composer) against a list of dictionary rules, providing matches where invalid words are found, with suggestions ranked by edit distance from the invalid word and by word frequency.
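The ranking described above can be sketched as follows. This is an illustrative stand-in, not the real matcher code: the object and method names are hypothetical, and the real suggestion ranking lives inside Morfologik rather than in application code.

```scala
// Hypothetical sketch: rank candidate corrections the way the matcher's
// suggestions are described - by edit distance first, then by word frequency.
object SuggestionRanking {
  // Classic Levenshtein edit distance via dynamic programming.
  def editDistance(a: String, b: String): Int = {
    val dp = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
      if (i == 0) j else if (j == 0) i else 0
    }
    for (i <- 1 to a.length; j <- 1 to b.length) {
      val cost = if (a(i - 1) == b(j - 1)) 0 else 1
      dp(i)(j) = math.min(
        math.min(dp(i - 1)(j) + 1, dp(i)(j - 1) + 1),
        dp(i - 1)(j - 1) + cost
      )
    }
    dp(a.length)(b.length)
  }

  // Lower edit distance wins; ties are broken by higher frequency.
  def rank(misspelt: String, candidates: List[String], frequencies: Map[String, Int]): List[String] =
    candidates.sortBy(c => (editDistance(misspelt, c), -frequencies.getOrElse(c, 0)))
}
```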

We've had to include some unpublished Java files from org.languagetool: one that we didn't particularly modify beyond translating it to Scala (DictionaryBuilder), which we would like to replace by publishing the original version, and one that we did modify (SpellDictionaryBuilder), which will require some additional thought to replace because the class is marked final in languagetool itself.

We've had to write some original code in Java because of difficulty translating it to Scala while remaining compatible with languagetool and Morfologik. Perhaps we can translate these files to Scala with some more thought (CollinsEnglish and MorfologikCollinsSpellerRule).

There is much more work to do, but this PR should serve as a working baseline. For example, we probably don't want to provide spellchecks on 'Title Case' proper nouns, and we may want to reduce the number of suggestions provided to consumers (it will gladly suggest 50+).
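The two follow-ups above could look something like the sketch below. Everything here is hypothetical: the cap of 5 is an arbitrary placeholder, and the real checker would need a smarter proper-noun heuristic than "first letter capitalised".

```scala
// Illustrative sketch of two possible refinements: skip 'Title Case' words
// (likely proper nouns) and cap the number of suggestions sent to consumers.
object MatchFiltering {
  val maxSuggestions = 5 // assumed cap; the real limit is undecided

  // A word is 'Title Case' if its first letter is upper case and the rest lower.
  def isTitleCase(word: String): Boolean =
    word.nonEmpty && word.head.isUpper && word.tail.forall(_.isLower)

  def shouldSpellcheck(word: String): Boolean = !isTitleCase(word)

  // Trim a long suggestion list (the speller will gladly produce 50+).
  def capSuggestions(suggestions: List[String]): List[String] =
    suggestions.take(maxSuggestions)
}
```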

resources/dictionary/en_gb_wordlist.xml is also borrowed from languagetool to provide word frequency information. Do we want to include this in our source code or fetch it from somewhere, e.g. in the setup script?


How does this PR work and what does each file do?

This PR is quite involved so here I will provide an explanation of the work being done, and where it's being done:

Intro to Matchers:

In the Checker, we have a MatcherPool, which checks blocks of text sent from Composer against a set of Matchers.

We already have a RegexMatcher for checking text against our regex-based style guide rules, and a LanguageToolMatcher for checking text against LanguageTool rules (which are mostly XML-defined rules relating to punctuation and syntax that were considered too complex to define usefully via Regex).
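The relationship between the pool and its matchers can be sketched like this. The real interfaces in the checker are more involved (asynchronous, carrying rule metadata); these names and shapes are illustrative only.

```scala
// Simplified sketch of the Matcher/MatcherPool relationship described above.
object MatcherSketch {
  case class RuleMatch(fromPos: Int, toPos: Int, message: String)

  trait Matcher {
    def check(text: String): List[RuleMatch]
  }

  class MatcherPool(matchers: List[Matcher]) {
    // Run every registered matcher over a block and concatenate the matches.
    def checkBlock(text: String): List[RuleMatch] =
      matchers.flatMap(_.check(text))
  }

  // A toy matcher flagging one known typo, to show the shape of the interface.
  object TehMatcher extends Matcher {
    def check(text: String): List[RuleMatch] = {
      val i = text.indexOf("teh")
      if (i >= 0) List(RuleMatch(i, i + 3, "Did you mean 'the'?")) else Nil
    }
  }
}
```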

LanguageTool has the capability of doing spellchecking via the Morfologik library, and actually has some built-in spell checking rules that we turn off in the existing LanguageToolMatcher.

This PR adds the DictionaryMatcher. This is another LanguageTool instance that has been specially configured to do the job of a spellchecker using a custom wordlist.

The work in DictionaryMatcher.scala creates the new matcher instance, and the (currently commented-out) lines in MatcherProvisionerService.scala plug it into the existing MatcherPool.

The other files added in this PR are all related to converting our List[DictionaryRule] (derived from the artefact published by the Rule Manager) into something that LanguageTool can use to spellcheck with. Here's a quick run-through of what each file does:

CollinsEnglish

CollinsEnglish is the language for our spellchecker. It extends BritishEnglish (the LanguageTool Language that provides the default British English spellchecker), which seemed like the most sensible base class for our purposes. The main useful thing CollinsEnglish does is use our MorfologikCollinsSpellerRule instead of the existing MorfologikBritishSpellerRule.
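The override relationship can be sketched as below. These are simplified stand-ins, not the real LanguageTool Language API; the rule IDs are only for illustration.

```scala
// Illustrative sketch: CollinsEnglish swaps the default speller rule for the
// Collins one by overriding a member of its base language.
object LanguageSketch {
  class BritishEnglish {
    def spellerRuleId: String = "MORFOLOGIK_RULE_EN_GB" // assumed default ID
  }

  class CollinsEnglish extends BritishEnglish {
    override def spellerRuleId: String = "MORFOLOGIK_RULE_COLLINS"
  }
}
```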

MorfologikCollinsSpellerRule

This is very similar to the existing MorfologikBritishSpellerRule. We had to create our own to do two useful things: it overrides where LanguageTool expects to find the .dict file (more on that later), and it overrides the rule ID with "MORFOLOGIK_RULE_COLLINS" so that we can identify our own spell checking rules elsewhere in the application.

We couldn't extend MorfologikBritishSpellerRule itself because it sets the RULE_ID and RESOURCE_FILENAME as final properties, which means we can't override them in the child class - so we've essentially created a fork of it with MorfologikCollinsSpellerRule.
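The final-field problem above is the whole reason a fork was needed, and can be sketched like this. The class bodies and the Collins resource path are hypothetical stand-ins; only the rule ID strings mirror the PR's description.

```scala
// Sketch of why a fork was needed: a base rule with final constants can't be
// specialised by subclassing, so the fork redefines them in a sibling class.
object SpellerRuleSketch {
  class BritishSpellerRule {
    // `final` members can't be overridden in a subclass, which is the
    // restriction described above.
    final val ruleId = "MORFOLOGIK_RULE_EN_GB"
    final val resourceFilename = "/en/en_GB.dict" // illustrative path
  }

  // The "fork": same shape as the base rule, but with different constants.
  class CollinsSpellerRule {
    val ruleId = "MORFOLOGIK_RULE_COLLINS"
    val resourceFilename = "/dictionary/collins.dict" // assumed path
  }
}
```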

SpellDictionaryBuilder

That .dict file mentioned earlier is part of a compromise we made to avoid including too many forks of LanguageTool code.

LanguageTool provides some simple ways of extending the dictionary, for example adding a spelling.txt file to add some additional words, but this doesn't really suit our needs because we want very fine-grained control of our dictionary so that it only uses words from the Guardian's official Collins dictionary.

The way to do this is to create our own custom .dict dictionary file: a binary file, readable by LanguageTool, that represents a wordlist with accompanying English-language word frequencies.
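The data being combined can be sketched as below. This is only the shape of the input, with an assumed tab-separated layout; the actual on-disk format Morfologik expects is more involved and is handled by DictionaryBuilder.

```scala
// Hedged sketch of preparing dictionary-builder input: each word from the
// Collins list is paired with a frequency, defaulting to 0 for words that
// have no entry in the frequency file.
object DictInputSketch {
  def buildInputLines(words: List[String], frequencies: Map[String, Int]): List[String] =
    words.sorted.map(w => s"$w\t${frequencies.getOrElse(w, 0)}")
}
```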

The flow of steps to get a working speller looks like this. (N.b. MorfologikBritishSpellerRule extends AbstractEnglishSpellerRule, which extends MorfologikSpellerRule, which calls the getFilename method we provide to build the dictionary and passes the path to MorfologikMultiSpeller instances, which actually read the .dict file and convert it to an in-memory data structure.)

```mermaid
flowchart
  dictFile[".dict file"]
  wordFreq["Word frequencies file (.xml)"]
  LTLang["LanguageTool language, e.g. CollinsEnglish"]

  dictionaryRules--"Use to create .dict file"-->dictFile
  wordFreq--"Use to create .dict file"-->dictFile
  dictFile--"Consumed by"-->MorfologikMultiSpeller
  MorfologikMultiSpeller--"descendant class used as part of"-->LTLang
```

Ideally we could avoid this IO step altogether, and instead do something like this:

```mermaid
flowchart
  wordList["Word list in memory"]
  wordFreq["Word frequencies in memory"]
  LTLang["LanguageTool language, e.g. CollinsEnglish"]

  wordList--"Used directly to instantiate"-->MorfologikMultiSpeller
  wordFreq--"Used directly to instantiate"-->MorfologikMultiSpeller
  MorfologikMultiSpeller--"descendant class used as part of"-->LTLang
```

But to do so we'd need to replace some classes deep in the Morfologik hierarchy (notably MorfologikMultiSpeller) along with the tree of parent classes. In testing, we found that performance was fine despite the IO inefficiency, so we decided against replacing those classes.

LanguageTool provides some CLI scripts that allow you to create a .dict dictionary as a one-off, and SpellDictionaryBuilder is one of those. We wanted to use the class directly in our application rather than using it in a CLI context, so we made some modifications to the class to allow that.

DictionaryBuilder

This does some of the heavy lifting of actually compiling the .dict file, and is unmodified other than being converted to Scala. If we publish the original version from LanguageTool in the future, we should be able to import it from there instead.

collins.info

This is a configuration file based on the one used for BritishEnglish, and is found by LanguageTool via a getResource() call somewhere in its class tree.

This line makes sure that word frequencies are included in the .dict binary:

```
fsa.dict.frequency-included=true
```

en_gb_wordlist.xml

This is a list of word frequencies used when compiling the .dict. It doesn't have all the words in the Collins dictionary, but it does provide a frequency for the most commonly used words in the English language. Words that appear in Collins but not this file will have the lowest priority when we provide spell checking suggestions.
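The fallback described above can be sketched like this. The frequency values and the object name are illustrative stand-ins for the parsed en_gb_wordlist.xml data, not the real checker code.

```scala
// Sketch: words present in Collins but absent from the frequency list get
// the lowest possible frequency, so they sort last among suggestions.
object FrequencyLookup {
  // Illustrative subset of the word-frequency data.
  val wordFrequencies: Map[String, Int] = Map("the" -> 255, "dictionary" -> 120)

  // Unknown words default to 0, the lowest priority.
  def frequency(word: String): Int = wordFrequencies.getOrElse(word, 0)

  // Most frequent words first.
  def bySuggestionPriority(words: List[String]): List[String] =
    words.sortBy(w => -frequency(w))
}
```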


How to test

  1. Run the rule-manager and the checker services according to the instructions in the readme.
  2. Make sure your rule manager has access to dictionary words by running the setup script and hitting the /api/refreshDictionary endpoint.
  3. This should publish the rules to Checker. You can publish again by publishing an arbitrary change to a non-dictionary rule.
  4. Run Composer (aka flexible-content) locally according to the instructions in that repo. Create an article with some incorrect words, and run a Typerighter check.
    • Do you get spellcheck suggestions for mis-typed words?
    • Do you get any false positives?

How can we measure success?

We see dictionary matches coming through in the consumer (i.e. Composer).

[Screen recording: Kapture 2023-08-17 at 13 07 06]

rhystmills commented 1 year ago

I've commented out the code that can add a dictionary matcher to the pool.

This is probably more cautious than necessary: we only add a dictionary matcher if there are dictionary rules, and those should only be present if there is a wordlist in the S3 bucket and someone has hit the /api/refreshDictionary endpoint on PROD.

rhystmills commented 1 year ago

@samanthagottlieb spotted an error with the /api/dictionary endpoint while pairing - now sorted in this PR.