What does this change?
This PR adds support for a list of words to not publish as official dictionary words.
Why?
Our first audience for the Collins dictionary will be writing in British English, as this is the primary language variant for the majority of our staff. We will want to avoid publishing (for example) American regional variants of words, so that our Dictionary matcher does not recognise them as valid words and instead suggests changing them to the British English variant.
(In the future we will want to add support for American and Australian regional spellings, but that's not in scope for our pre-MVP work).
Being a record of the English language, the Collins dictionary also contains a lot of offensive words that we wouldn't want to suggest as autocorrect options, so we may want to disallow them too (LanguageTool should prevent a few of these, but it's by no means an exhaustive list).
How?
We will include a list of "words to not publish" in S3, to be used alongside our existing config. I've also modified the setup script (and other local setup-related files) to pull the same file from the dev S3, if it exists, and use it locally.
The JSON is expected to be an array of objects, each containing a tag name and a list of associated words. A very short file might look like this:
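The exact key names are not spelled out in this description, so the following is only a minimal sketch: it assumes each object uses a `tag` key and a `words` key, and the tags and words in the sample are made up for illustration. The hypothetical helper `parse_words_to_not_publish` checks that a file has the expected shape.

```python
import json

# Hypothetical sample file contents. The key names "tag" and "words",
# and the example tags and words, are assumptions for illustration.
SAMPLE = """
[
  {"tag": "americanism", "words": ["color", "favorite"]},
  {"tag": "offensive", "words": ["exampleword"]}
]
"""


def parse_words_to_not_publish(raw: str) -> list:
    """Parse the JSON and check it is an array of {tag, words} objects."""
    entries = json.loads(raw)
    if not isinstance(entries, list):
        raise ValueError("expected a JSON array at the top level")
    for entry in entries:
        if not isinstance(entry, dict):
            raise ValueError(f"expected an object, got: {entry!r}")
        if not isinstance(entry.get("tag"), str):
            raise ValueError(f"missing or non-string tag in: {entry!r}")
        words = entry.get("words")
        if not isinstance(words, list) or not all(isinstance(w, str) for w in words):
            raise ValueError(f"words must be a list of strings in: {entry!r}")
    return entries


entries = parse_words_to_not_publish(SAMPLE)
```

Validating up front like this means a malformed file fails loudly at load time rather than silently excluding nothing.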
We decided to keep the list in S3 rather than having it here in version control, to avoid the prospect of our repo housing a long list of incredibly offensive words. The tags will help us keep track of excluded words, and perhaps allow a filtering mechanism for consumers in the future (e.g. opt into Americanisms).
How to test
Run Typerighter Rule Manager locally according to the instructions in the readme.
Run the setup script, or otherwise add the above to ~/.gu/typerighter/words-to-not-publish.json
Enable the only feature switch in the top right dropdown of the manager
Click the scary red button that reloads dictionary rules
Search for dictionary words that should not be live according to your config. Are they in draft? Do they have the expected tags?
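To make the last step easier, a small script can list which words to search for and which tags to expect on each. This is a rough sketch rather than part of the PR: it reads the local file path mentioned above, assumes each object has `tag` and `words` keys, and the helper name `expected_tags_by_word` is hypothetical.

```python
import json
from collections import defaultdict
from pathlib import Path

# Local path taken from the test steps; the "tag" and "words" key names
# are assumptions about the file's shape.
DEFAULT_PATH = Path.home() / ".gu" / "typerighter" / "words-to-not-publish.json"


def expected_tags_by_word(path=DEFAULT_PATH):
    """Map each excluded word to the set of tags it is listed under."""
    entries = json.loads(Path(path).read_text(encoding="utf-8"))
    tags_by_word = defaultdict(set)
    for entry in entries:
        for word in entry["words"]:
            tags_by_word[word].add(entry["tag"])
    return dict(tags_by_word)


if __name__ == "__main__":
    # Only print when the local file is actually present.
    if DEFAULT_PATH.exists():
        for word, tags in sorted(expected_tags_by_word().items()):
            print(f"{word}: {', '.join(sorted(tags))}")
```

A word that appears under more than one tag ends up with every tag in its set, which is what you would look for on the draft rule in the manager.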