Closed rhystmills closed 1 year ago
I've commented out the code that can add a dictionary matcher to the pool.
This is probably more cautious than is necessary, as we're only adding a dictionary matcher if there are dictionary rules, which should only be added if there is a wordlist in the S3 bucket and someone hits the /api/refreshDictionary endpoint on PROD.
@samanthagottlieb spotted an error with the /api/dictionary endpoint while pairing - now sorted in this PR.
What does this change?
This processes dictionary rules in the checker service, published in the rule JSON artefact from the rule-manager service. It creates a DictionaryMatcher that checks incoming text blocks from the Typerighter consumer (Composer) against a list of dictionary rules, providing matches where invalid words are found, with suggestions based on edit distance from the invalid word, and word frequency.
We've had to include some unpublished Java files from org.languagetool: one which we didn't particularly modify other than translating to Scala (DictionaryBuilder), which we would like to replace by publishing their original version, and one which we did modify (SpellDictionaryBuilder), which will require some additional thought to replace due to a final restriction on the class in languagetool itself.
We've had to create some original code in Java due to trouble translating to Scala while remaining compatible with languagetool and Morfologik. Perhaps we can translate these files (CollinsEnglish and MorfologikCollinsSpellerRule) to Scala with some more thought.
There is much more work to do, but this PR should serve as a working baseline. For example, we probably don't want to provide spellchecks on 'Title Case' proper nouns, and we may want to reduce the number of suggestions provided to consumers (it will gladly suggest 50+).
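As a rough illustration of how suggestions could be ordered by edit distance and then by word frequency, and capped at a fixed number, here is a minimal sketch. It is not the code in this PR and does not use LanguageTool or Morfologik: SuggestionRanker, suggest, maxDistance and limit are hypothetical names, and the dictionary and frequency map are plain in-memory stand-ins. Words missing from the frequency map are treated as frequency 0, so they sort last among candidates at the same distance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not part of LanguageTool or this PR.
class SuggestionRanker {

    // Levenshtein distance computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns at most `limit` dictionary words within `maxDistance` edits of `word`,
    // ordered by distance (ascending), then frequency (descending), then alphabetically.
    static List<String> suggest(String word, Set<String> dictionary,
                                Map<String, Integer> frequencies, int maxDistance, int limit) {
        List<String> candidates = new ArrayList<>();
        for (String candidate : dictionary) {
            if (editDistance(word, candidate) <= maxDistance) {
                candidates.add(candidate);
            }
        }
        candidates.sort((x, y) -> {
            int byDistance = Integer.compare(editDistance(word, x), editDistance(word, y));
            if (byDistance != 0) {
                return byDistance;
            }
            // Assumption: words without a known frequency get 0, i.e. lowest priority.
            int byFrequency = Integer.compare(
                frequencies.getOrDefault(y, 0), frequencies.getOrDefault(x, 0));
            if (byFrequency != 0) {
                return byFrequency;
            }
            return x.compareTo(y);
        });
        return new ArrayList<>(candidates.subList(0, Math.min(limit, candidates.size())));
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("colour", "color", "colon", "cooler"));
        Map<String, Integer> frequencies = new HashMap<>();
        frequencies.put("colour", 100);
        frequencies.put("colon", 50);
        frequencies.put("cooler", 10);
        System.out.println(suggest("colur", dictionary, frequencies, 2, 3));
    }
}
```

Capping with a limit parameter like this is one simple way to avoid handing 50+ suggestions to a consumer; where that cap should live in the real matcher is left open.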
resources/dictionary/en_gb_wordlist.xml is also borrowed from languagetool to provide word frequency information. Do we want to include this in our source code or fetch it from somewhere, e.g. in the setup script?
How does this PR work and what does each file do?
This PR is quite involved so here I will provide an explanation of the work being done, and where it's being done:
Intro to Matchers:
In the Checker, we have a MatcherPool, which checks blocks of text sent from Composer against a set of Matchers.
We already have a RegexMatcher for checking text against our regex-based style guide rules, and a LanguageToolMatcher for checking text against LanguageTool rules (which are mostly XML-defined rules relating to punctuation and syntax that were considered too complex to define usefully via regex).
LanguageTool can do spellchecking via the Morfologik library, and actually has some built-in spellchecking rules that we turn off in the existing LanguageToolMatcher.
This PR adds the DictionaryMatcher. This is another LanguageTool instance that has been specially configured to do the job of a spellchecker using a custom wordlist.
The work in DictionaryMatcher.scala creates the new matcher instance, and the (currently commented-out) lines in MatcherProvisionerService.scala plug it into the existing MatcherPool.
The other files added in this PR are all related to converting our List[DictionaryRule], derived from the artefact published by the Rule Manager, into something that LanguageTool can use to spellcheck with. Here's a quick runthrough of what each is doing:
CollinsEnglish
CollinsEnglish is the language for our spellchecker. It extends BritishEnglish, the LanguageTool Language that provides its default British English spellchecker, as it seemed like the most sensible base class for our purposes. The main useful thing CollinsEnglish does is use our MorfologikCollinsSpellerRule instead of the existing MorfologikBritishSpellerRule.
MorfologikCollinsSpellerRule
This is very similar to the existing MorfologikBritishSpellerRule. We had to create our own to do two useful things: It overrides where LanguageTool expects to find the .dict file (more on that later). It overrides the rule ID with "MORFOLOGIK_RULE_COLLINS" so that we can identify our own spellchecking rules elsewhere in the application.
We couldn't extend MorfologikBritishSpellerRule itself because it sets the RULE_ID and RESOURCE_FILENAME as final properties, which means we can't override them in the child class, so we've essentially created a fork of it with MorfologikCollinsSpellerRule.
SpellDictionaryBuilder
That .dict file mentioned earlier is part of a compromise we have made to avoid including too many forks of LanguageTool code.
LanguageTool provides some simple ways of extending the dictionary, for example adding a spelling.txt file to add some additional words, but this doesn't really suit our needs because we want very fine-grained control of our dictionary so that it only uses words from the Guardian's official Collins dictionary.
The way to do this is to create our own custom .dict dictionary file: a binary file that represents a wordlist with accompanying English-language word frequencies, which can be read by LanguageTool.
The flow of steps to get a working speller would look like this. (N.b. MorfologikBritishSpellerRule extends AbstractEnglishSpellerRule, which extends MorfologikSpellerRule, which calls the getFilename method we provide to build the dictionary, and passes the path to MorfologikMultiSpeller instances, which actually read the .dict and convert it to an in-memory data structure.)
Ideally we could avoid this IO step altogether and instead do something like this:
But to do so we'd need to replace some classes deep in the Morfologik hierarchy (notably MorfologikMultiSpeller) along with the tree of parent classes. In testing, we found that performance was fine despite the IO inefficiency, so we decided against replacing those classes.
LanguageTool provides some CLI scripts that allow you to create a .dict dictionary as a one-off, and SpellDictionaryBuilder is one of those. We wanted to use the class directly in our application rather than in a CLI context, so we made some modifications to the class to allow that.
DictionaryBuilder
This does some of the heavy lifting of actually compiling the .dict file, and is unmodified other than converting to Scala. If we publish the original version from LanguageTool in the future, we should be able to import from there instead.
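As a rough illustration of the build-then-read flow described under SpellDictionaryBuilder above, the sketch below writes a wordlist with frequencies to a temporary file and hands only the path to a separate reader, which loads it back into memory. This is a simplified stand-in, not the LanguageTool or Morfologik API: DictionaryFileFlow, buildDictionaryFile and loadDictionary are hypothetical names, and the file here is tab-separated text rather than the real binary .dict format.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration only; the real .dict is a binary file
// compiled by DictionaryBuilder and read by Morfologik.
class DictionaryFileFlow {

    // Step 1: the "builder" writes the wordlist to disk and returns only its path.
    static Path buildDictionaryFile(Map<String, Integer> wordFrequencies) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : wordFrequencies.entrySet()) {
            lines.add(entry.getKey() + "\t" + entry.getValue());
        }
        Path file = Files.createTempFile("collins", ".dict");
        file.toFile().deleteOnExit();
        Files.write(file, lines, StandardCharsets.UTF_8);
        return file;
    }

    // Step 2: the "speller" receives the path and rebuilds an in-memory structure from it.
    static Map<String, Integer> loadDictionary(Path file) throws IOException {
        Map<String, Integer> result = new HashMap<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            if (line.isEmpty()) {
                continue;
            }
            String[] parts = line.split("\t");
            result.put(parts[0], Integer.parseInt(parts[1]));
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Map<String, Integer> words = new HashMap<>();
        words.put("colour", 100);
        words.put("aardvark", 0);
        Path dictFile = buildDictionaryFile(words);
        System.out.println(loadDictionary(dictFile).equals(words));
    }
}
```

The round trip through disk is the IO step mentioned above: the data already exists in memory before it is written, and is rebuilt in memory after it is read.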
collins.info
This is a configuration file based on the one used for BritishEnglish, and is found by LanguageTool via a getResource() call somewhere in its class tree.
This line makes sure that word frequencies are included in the .dict binary:
en_gb_wordlist.xml
This is a list of word frequencies used when compiling the .dict. It doesn't have all the words in the Collins dictionary, but it does provide a frequency for the most commonly used words in the English language. Words that appear in Collins but not in this file will have the lowest priority when we provide spellchecking suggestions.
How to test
./api/refreshDictionary endpoint.
How can we measure success?
We see dictionary matches coming through in the consumer (i.e. Composer).