guardian / typerighter

Even if you’re the right typer, couldn’t hurt to use Typerighter!
Apache License 2.0
276 stars 12 forks source link

Invalidate Morfologik dictionary cache on DictionaryMatcher instantiation #449

Closed rhystmills closed 1 year ago

rhystmills commented 1 year ago

Co-authored with @jonathonherbert

What does this change?

Changes to dictionary rules (additions, edits, or unpublish actions) don't cause any user-facing changes in the DictionaryMatcher, or at least take a long time to surface if they do - for example - creating a new dictionary rules doesn't cause that word to be recognised as valid in the matches sent to Composer by the Checker service. This is due to dictionary caching behaviour in MorfologikSpeller, used indirectly as part of LanguageTool.

This PR makes two changes:

  1. Ensures that the JLanguageTool instance created as part of a new DictionaryMatcher is only instantiated after the collins.dict binary has been created (i.e. after we call new SpellDictionaryBuilder().buildDictionary - otherwise there's a chance that the matcher will reflect an older version of our dictionary rules corpus.
  2. Invalidate the dictionary LoaderCache from MorfologikSpeller when a DictionaryMatcher is instantiated. The cache is constructed quite deep in the Morfologik class hierarchy, and we don't want to recreate our own versions of all those classes to allow us to access it because we'll have to add thousands of lines to Typerighter and not benefit from maintenance to their equivalent files in LanguageTool. Currently, the dictionary binary is cached for ten minutes, and any changes made to DictionaryRules within those ten minutes won't be reflected in the matches we send to Composer - until we re-instantiate the dictionary matcher due to an unrelated rule change after those ten minutes have elapsed. This makes local testing of DictionaryRules especially difficult.

Our cache invalidation is a pretty unpleasant hack using Java reflection - circumventing the LoadingCache being a private property of MorfologikSpeller. It would be good to raise a PR in LanguageTool in the future to surface a way to invalidated the dictionary cache, and use that mechanism instead of this one, but (for now) this will help us make DictionaryRules editable and reach our KR.

How to test

  1. Run the Typerighter Checker and Rule Manager services and flexible-content (i.e. Composer) locally, according to the instructions in the readmes.
  2. Make sure you have the following line in Composer's config in ~/.gu/flexible-composerbackend.properties in order to use local Typerighter Checker rather than CODE: typerighter.url=https://checker.typerighter.local.dev-gutools.co.uk/
  3. Create, edit, and unpublish dictionary rules.
  4. Are your changes reflected in matches to Composer, with the Collins feature switch and toggle active?
jonathonherbert commented 1 year ago

~This looks good, and the two times I've published a new artefact in the manager in CODE, also invariably knocks over the checker service with java.lang.OutOfMemoryError: Java heap space (logs) 😢~

~Wonder if reflection has an unexpected impact on our memory footprint here?~

~Another thing to try: swap out or reconfigure the cache with another cache with a shorter or nonexistent TTL. That'd avoid us having to reflect and alter the class whenever we create a new dictionary.~

EDIT: this may be due to another memory-related issue found in https://github.com/guardian/typerighter/pull/449 – disregard!