Closed – simonbyford closed this 1 year ago
This looks good! With the offset problem corrected, there's definitely a reduction of noise:
| Before | After |
| --- | --- |
| *(screenshot)* | *(screenshot)* |
Having said that, there are two problems, one of which might block this approach:
Interestingly, the vanilla spaCy model gets pretty much everything right, and is much quicker – this comes back in ~100-120ms:
```json
[
  { "label": "PERSON", "text": "Lisa Simone", "start": 0, "end": 11 },
  { "label": "PERSON", "text": "Nina Simone", "start": 113, "end": 124 },
  { "label": "ORG", "text": "Broadway", "start": 464, "end": 472 },
  { "label": "PERSON", "text": "Nina", "start": 892, "end": 896 },
  { "label": "PERSON", "text": "Lisa Simone", "start": 935, "end": 946 },
  { "label": "PERSON", "text": "Nina Simone’s", "start": 1108, "end": 1121 },
  { "label": "PERSON", "text": "Medgar Evers", "start": 1327, "end": 1339 },
  { "label": "PERSON", "text": "Lorraine Hansberry", "start": 1457, "end": 1475 },
  { "label": "PERSON", "text": "Weldon Irvine", "start": 1492, "end": 1505 },
  { "label": "PERSON", "text": "Bojangles", "start": 1744, "end": 1753 }
]
```
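For anyone following along, this is roughly how entity spans like the ones above get applied to the dictionary results – a minimal sketch with illustrative type names, not the actual Typerighter code:

```scala
object EntityFiltering {
  // Illustrative types: mirror the fields in the NER output above
  case class EntitySpan(label: String, text: String, start: Int, end: Int)
  case class DictionaryMatch(text: String, start: Int, end: Int)

  // Two half-open character ranges overlap if each starts before the other ends
  private def overlaps(entity: EntitySpan, m: DictionaryMatch): Boolean =
    m.start < entity.end && entity.start < m.end

  // Drop any dictionary match whose range overlaps a recognised entity
  def excludeEntities(
      matches: List[DictionaryMatch],
      entities: List[EntitySpan]
  ): List[DictionaryMatch] =
    matches.filterNot(m => entities.exists(e => overlaps(e, m)))
}
```

So, for example, the PERSON span covering characters 0–11 above would suppress a dictionary match on "Simone" anywhere within that range.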
I wonder whether it's worth spiking a call out to the ner.gutools service to see how it performs? It should be fairly quick to try.
Tested locally and on CODE – works as expected. Performance is significantly improved; we can keep an eye on the response times in PROD. The model is still not great, but it does filter out some useful matches!
Thanks!
Logging the entities might be useful, so we can see just how much noise this technique is reducing in PROD – I've added an example commit on another branch @ 30d9bb9. WDYT?
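For clarity, this is the rough shape of what I mean – illustrative only, not the contents of 30d9bb9:

```scala
import org.slf4j.LoggerFactory

object EntityLogging {
  private val logger = LoggerFactory.getLogger(getClass)

  // label/text/start/end mirror the fields in the NER output above;
  // one line per excluded entity lets us count them easily in the logs
  def logExcludedEntity(label: String, text: String, start: Int, end: Int): Unit =
    logger.info(s"""NER excluded entity: label=$label text="$text" offsets=$start-$end""")
}
```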
I like it very much – I've cherry-picked it onto this branch.
What does this change?
Right now, proper nouns cause false positives in Typerighter because they're not found in Collins Dictionary. This creates a lot of noise and erodes trust in the tool.
This PR attempts to improve matters by identifying named entities in the text being checked and excluding them from dictionary matches. We use the named-entity recognition (NER) library OpenNLP to achieve this.
To start with, we use three models: people, organisations and locations. The full list of available models is here:
https://opennlp.sourceforge.net/models-1.5/
In the future, we might wish to use the model developed internally by the data science team as it was trained on Guardian content and will likely do a better job:
https://github.com/guardian/data-science-ner-service
We just went with OpenNLP as the simplest proof of concept.
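For illustration, the shape of the OpenNLP usage looks roughly like this – a simplified sketch rather than the exact code in this PR; the model path is an assumption, and the organisation and location finders work the same way:

```scala
import java.io.FileInputStream
import opennlp.tools.namefind.{NameFinderME, TokenNameFinderModel}
import opennlp.tools.tokenize.SimpleTokenizer

object NerSketch {
  // Pre-trained model from https://opennlp.sourceforge.net/models-1.5/
  // (file path is illustrative)
  private val personFinder = new NameFinderME(
    new TokenNameFinderModel(new FileInputStream("models/en-ner-person.bin"))
  )

  /** Returns (start, end) character offsets of detected PERSON entities. */
  def findPersonOffsets(text: String): Seq[(Int, Int)] = {
    // tokenizePos gives character offsets per token, which lets us map
    // entity spans (token-indexed) back to positions in the original text
    val tokenSpans = SimpleTokenizer.INSTANCE.tokenizePos(text)
    val tokens = tokenSpans.map(s => text.substring(s.getStart, s.getEnd))
    personFinder.find(tokens).toSeq.map { entity =>
      // OpenNLP spans are end-exclusive over token indices
      (tokenSpans(entity.getStart).getStart, tokenSpans(entity.getEnd - 1).getEnd)
    }
  }
}
```

One caveat worth noting: `NameFinderME` isn't thread-safe, so a real implementation needs a finder per request (or synchronised access) rather than a shared instance like the one sketched here.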
How to test
Run Typerighter and Composer locally, ensuring Composer is pointing at the local Typerighter instance as per the docs. Ensure the Collins Dictionary feature switch is enabled. Try adding some proper nouns and see whether they are flagged as misspellings.
How can we measure success?
Fewer false positives caused by proper nouns in text.
Have we considered potential risks?
There is a risk that genuinely misspelled words might be incorrectly identified as named entities and therefore excluded from spell-checking (i.e. false negatives).