guardian / typerighter


Add second dictionary file alongside lemmatised list #442

Closed · jonathonherbert closed this 1 year ago

jonathonherbert commented 1 year ago

NB: Depends on #440. Please review that first!

What does this change?

At the moment, we use a lemmatised list provided by Collins to provide our dictionary word list. That list, however, isn't complete. It's missing critical tokens, like

waterpark
Novichok
Covid-19
unshowiest 
snuggly
kombucha

This PR parses their original dictionary file, which includes the words above and 39,000 more. Most of them are combinations of words; 2,028 are single words like those above.
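For a rough sense of what parsing the original dictionary file involves, here's a minimal Scala sketch. The element names (`entry`, `headword`) and the overall schema are placeholders for illustration, not the real Collins format this PR handles:

```scala
import scala.xml.XML

object CollinsDictionarySketch {
  final case class DictionaryEntry(headword: String)

  // Pull every headword out of the raw dictionary XML.
  // NB: "entry" and "headword" are placeholder element names, not the Collins schema.
  def parse(rawXml: String): List[DictionaryEntry] =
    (XML.loadString(rawXml) \\ "entry")
      .map(node => DictionaryEntry((node \ "headword").text.trim))
      .filter(_.headword.nonEmpty)
      .toList

  // Split single tokens ("waterpark") from multi-word combinations ("water park").
  def partitionByWordCount(entries: List[DictionaryEntry]): (List[DictionaryEntry], List[DictionaryEntry]) =
    entries.partition(entry => !entry.headword.exists(_.isWhitespace))
}
```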

Dev notes

I've spun the dictionary-related code in BucketRuleResource off into DictionaryResource, as it's something only the rule-manager service should care about.

In doing that, I've refactored the dictionary-related code into two files, DictionaryResource (fetching from S3) and utils.Dictionary (munging XML). This lets us test the latter without implicating the former, so we don't have to mock S3 calls when testing the munging logic.
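A sketch of that split, assuming AWS SDK v2 and method names that are illustrative rather than the actual typerighter code: the pure XML munging can be unit tested with a fixture string, while the S3 I/O simply delegates to it.

```scala
import scala.xml.XML
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.GetObjectRequest

// Pure munging logic: no S3, trivially unit-testable (names are illustrative).
object Dictionary {
  def wordsFromDictionaryXml(rawXml: String): Set[String] =
    (XML.loadString(rawXml) \\ "entry").map(_.text.trim).filter(_.nonEmpty).toSet
}

// I/O only: fetch the raw XML from S3 and hand it to the pure code above.
class DictionaryResource(s3: S3Client, bucket: String, key: String) {
  def getDictionaryWords(): Set[String] = {
    val request = GetObjectRequest.builder().bucket(bucket).key(key).build()
    val rawXml  = s3.getObjectAsBytes(request).asUtf8String()
    Dictionary.wordsFromDictionaryXml(rawXml)
  }
}
```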

How to test

The local tests should pass, and you should be convinced that they extract word lists from the relevant XML.
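A test in that spirit might look roughly like the ScalaTest sketch below; the fixture XML and the `wordsFromDictionaryXml` name are assumptions carried over from the sketch above, not the actual test cases in this PR.

```scala
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers

class DictionarySpec extends AnyFlatSpec with Matchers {
  // Placeholder fixture: the real tests run against the actual Collins XML format.
  val fixtureXml =
    """<dictionary>
      |  <entry>waterpark</entry>
      |  <entry>Novichok</entry>
      |  <entry>water park</entry>
      |</dictionary>""".stripMargin

  "wordsFromDictionaryXml" should "extract every word in the XML, single or multi-word" in {
    Dictionary.wordsFromDictionaryXml(fixtureXml) should contain allOf ("waterpark", "Novichok", "water park")
  }
}
```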

Running locally or in CODE, hit the big scary dictionary ingest button. After a long delay (#441 adds logging to make what's happening more visible in CODE and PROD), you should be treated to a cornucopia of new words in the rule manager:

(Screenshot, 2023-10-02: newly ingested dictionary words visible in the rule manager)

😌

Added relevant S3 files to

jonathonherbert commented 1 year ago

> I also wonder whether it's worth leaving out multi-word phrases with spaces in them - what do you think? This was already the case and is not a new problem, though there are more examples now. I don't think they're used downstream and they create noise in the manager and the published artefact.

I'd be inclined to keep them so we don't have to ask, 'which bits of the Collins corpus are in the system' – reasoning about importing things and which bit of the corpus is where is not easy. There's an implication that there are ways of at least ignoring valid multi-word phrases in https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt – it might be worth digging into this in some future piece of work.
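If we ever do want to drop them, the naive version of that filter is tiny. Purely illustrative, not something this PR does:

```scala
// Keep only entries with no whitespace, i.e. drop multi-word phrases.
def withoutMultiWordPhrases(words: Set[String]): Set[String] =
  words.filterNot(_.exists(_.isWhitespace))
```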

> I'm not totally clear on what we're using the definitions for - is that data being used for anything at the moment?

This was in the code I'd already written, but it might be YAGNI – what do you think? It crossed my mind that we might store the dictionary definitions as part of the description.
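Something like this hypothetical shape, purely to illustrate the idea (not the existing rule model):

```scala
object DictionaryDefinitionsSketch {
  // Hypothetical: carry the Collins definition through so it could populate a rule description.
  final case class DictionaryWord(word: String, definition: Option[String])

  def toRuleDescription(entry: DictionaryWord): String =
    entry.definition.getOrElse(s"Collins dictionary entry for '${entry.word}'")
}
```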

rhystmills commented 1 year ago

> I'd be inclined to keep them so we don't have to ask, 'which bits of the Collins corpus are in the system' – reasoning about importing things and which bit of the corpus is where is not easy. There's an implication that there are ways of at least ignoring valid multi-word phrases in languagetool-org/languagetool@master/languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/hunspell/spelling.txt – it might be worth digging into this in some future piece of work.

Good to know! Let's keep them then.

> I'm not totally clear on what we're using the definitions for - is that data being used for anything at the moment? This was in the code I'd already written, but it might be YAGNI – what do you think? It crossed my mind that we might store the dictionary definitions as part of the description.

If it's already being parsed then it's more work to take it out than leave it in, and I can see that we'd keep it for descriptions at some point. Was just checking that that wasn't meant to be the case already.

I'm good with merging 👍