guardian / typerighter

Even if you’re the right typer, couldn’t hurt to use Typerighter!
Apache License 2.0

Allow list of words not to publish #433

rhystmills closed this 1 year ago

rhystmills commented 1 year ago

What does this change?

This PR adds support for a list of words not to publish as official dictionary words.

Why?

Our first audience for the Collins dictionary will be writing in British English, as this is the primary language variant for the majority of our staff. We will want to avoid publishing (for example) American regional variants of words, so that our Dictionary matcher does not recognise them as valid words and instead suggests changing to the British English variant.

(In the future we will want to add support for American and Australian regional spellings, but that's not in scope for our pre-MVP work).

Being a record of the English language, the Collins dictionary also contains a lot of offensive words that we wouldn't want to suggest as autocorrect options, so we may want to disallow them too (LanguageTool should prevent a few of these, but its coverage is by no means exhaustive).

How?

We will include a list of "words to not publish" in S3, to be used alongside our existing config. I've also modified the setup script (and other local setup-related files) to pull the same file from the dev S3, if it exists, and use it locally.

The JSON is expected to be an array of objects containing a tag name, and a list of associated words. A very short file might look like this:

[
   {
      "tag":"American region spelling",
      "words":[
         "color"
      ]
   },
   {
      "tag":"Offensive word",
      "words":[
         "cretin"
      ]
   }
]
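
To make the expected shape concrete, here is a minimal Python sketch that parses a file like the one above and checks its structure. It is an illustration only, not the code in this PR: the function name `load_words_to_not_publish` and the `SAMPLE` constant are hypothetical, and the validation rules are an assumption based on the description above.

```python
import json
from collections import defaultdict
from typing import Dict, Set

# Sample contents, mirroring the short file shown above.
SAMPLE = """
[
  {"tag": "American region spelling", "words": ["color"]},
  {"tag": "Offensive word", "words": ["cretin"]}
]
"""


def load_words_to_not_publish(raw: str) -> Dict[str, Set[str]]:
    """Parse the JSON text and return a mapping of word -> set of tags.

    Hypothetical helper: raises ValueError if the structure is not an
    array of objects with a string "tag" and a list of string "words".
    """
    entries = json.loads(raw)
    if not isinstance(entries, list):
        raise ValueError("expected a top-level JSON array")

    tags_by_word: Dict[str, Set[str]] = defaultdict(set)
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {index} is not an object")
        tag = entry.get("tag")
        words = entry.get("words")
        if not isinstance(tag, str) or not isinstance(words, list):
            raise ValueError(f"entry {index} needs a string 'tag' and a list 'words'")
        for word in words:
            if not isinstance(word, str):
                raise ValueError(f"entry {index} contains a non-string word")
            # A word may appear under more than one tag, so collect them all.
            tags_by_word[word].add(tag)
    return dict(tags_by_word)


if __name__ == "__main__":
    for word, tags in sorted(load_words_to_not_publish(SAMPLE).items()):
        print(word, sorted(tags))
```

Keying the result by word rather than by tag makes the lookup "should this word be held back, and why?" a single dictionary access.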

We decided to keep the list in S3 rather than having it here in version control, to avoid the prospect of our repo housing a long list of incredibly offensive words. The tags will help us keep track of excluded words, and perhaps allow a filtering mechanism for consumers in the future (e.g. opt into Americanisms).
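
As a rough illustration of how tag-based opt-in could work, the Python sketch below computes which words stay unpublished once a consumer has opted into certain tags. Nothing like this exists in this PR: `words_to_withhold` and its parameters are hypothetical names, and the rule that a word is withheld if any of its tags is not opted into is an assumption.

```python
from typing import Iterable, List, Mapping, Set, Union

# An entry has the same shape as the objects in the JSON file:
# {"tag": <str>, "words": [<str>, ...]}
Entry = Mapping[str, Union[str, List[str]]]

ENTRIES: List[Entry] = [
    {"tag": "American region spelling", "words": ["color"]},
    {"tag": "Offensive word", "words": ["cretin"]},
]


def words_to_withhold(entries: Iterable[Entry], allowed_tags: Set[str]) -> Set[str]:
    """Return the words that should stay unpublished.

    A word is withheld if it appears under at least one tag the consumer
    has NOT opted into (an assumed rule, chosen to err on the side of
    withholding).
    """
    withheld: Set[str] = set()
    for entry in entries:
        if entry["tag"] not in allowed_tags:
            withheld.update(entry["words"])
    return withheld


if __name__ == "__main__":
    # No opt-ins: every listed word is withheld.
    print(sorted(words_to_withhold(ENTRIES, set())))
    # Opting into Americanisms releases "color" but not "cretin".
    print(sorted(words_to_withhold(ENTRIES, {"American region spelling"})))
```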

How to test

  1. Run Typerighter Rule Manager locally according to the instructions in the readme.
  2. Run the setup script, or otherwise add the above to ~/.gu/typerighter/words-to-not-publish.json
  3. Enable the only feature switch in the top right dropdown of the manager
  4. Click the scary red button that reloads dictionary rules
  5. Search for dictionary words that should not be live according to your config. Are they in draft? Do they have the expected tags?