What does this change?
Our first attempt at NER (named entity recognition) resulted in a reduction in noise, but many proper nouns remain unidentified and are therefore flagged as typos by Typerighter.
This PR swaps our NER implementation from the OpenNLP library to the Guardian's own NER service, which is accessed via a REST API. This results in a significant further reduction in noise. However, there are a few risks associated with this approach, as outlined below.
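For context, the shape of the integration is roughly as sketched below: send the document text to the service, get back entity spans, and exclude any spelling matches that fall inside those spans. This is a minimal Python sketch only; the /v1/entities path and the text/entities/start/end field names are assumptions for illustration, not the service's actual contract.

```python
import requests

NER_URL = "https://ner.code.dev-gutools.co.uk/v1/entities"  # path is hypothetical

def entity_ranges(text: str) -> list[tuple[int, int]]:
    """Ask the Guardian NER service which character spans of `text` are named entities."""
    resp = requests.post(NER_URL, json={"text": text}, timeout=2)
    resp.raise_for_status()
    # Assumed response shape: {"entities": [{"start": 0, "end": 7, ...}, ...]}
    return [(e["start"], e["end"]) for e in resp.json()["entities"]]

def is_named_entity(word_start: int, word_end: int, ranges: list[tuple[int, int]]) -> bool:
    """A spelling match is suppressed if it falls entirely inside any entity span."""
    return any(start <= word_start and word_end <= end for start, end in ranges)
```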
If we decide to go ahead with this approach, we can strip out the OpenNLP implementation in a subsequent PR. (We ended up removing it here instead.)
How to test
Either run this branch locally (in conjunction with Composer) or deploy it to CODE. Paste in some text and see how many proper nouns are flagged as spelling mistakes, compared with doing the same thing on main. There are a few test articles below.
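To sanity-check that the CODE NER service is reachable before testing end to end, a one-off request along these lines should work (same caveat as above: the request path and payload shape are assumed):

```python
import requests

sample = "Marina Hyde wrote about Westminster for the Guardian."
resp = requests.post(
    "https://ner.code.dev-gutools.co.uk/v1/entities",  # hypothetical path under the CODE service
    json={"text": sample},
    timeout=5,
)
print(resp.status_code, resp.json())
```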
Have we considered potential risks?
There are 3 main risks:
1) As before, there is a risk with any NER solution that genuinely misspelled words might be incorrectly identified as named entities and therefore excluded from spell-checking (i.e. false negatives).
2) There is a risk that adding this integration degrades or overwhelms the Guardian NER service because of the increased traffic to the API. This would be bad because the service is used by other production systems. We can mitigate this by working closely with the data science team to monitor the service.
3) There is a risk that introducing a network request increases the latency of the checker enough that the delay is noticeable to the user. A short client-side timeout that fails open would bound the worst case for both this risk and the previous one (see the sketch after this list).
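One mitigation that addresses both of the last two risks is to call the service with a short timeout and fail open: if NER is slow or unavailable, we suppress nothing and the checker behaves exactly as it does on main. A hedged sketch, with the same hypothetical endpoint and response shape as above:

```python
import requests

def entity_ranges_with_fallback(text: str) -> list[tuple[int, int]]:
    """Fail open: if the NER service is slow or down, suppress nothing and let
    the spell-checker behave exactly as it does today."""
    try:
        resp = requests.post(
            "https://ner.code.dev-gutools.co.uk/v1/entities",  # hypothetical path
            json={"text": text},
            timeout=0.5,  # keep the worst-case added latency bounded
        )
        resp.raise_for_status()
        return [(e["start"], e["end"]) for e in resp.json()["entities"]]
    except requests.RequestException:
        return []  # no suppression; same behaviour as before this PR
```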
How can we measure success?
Fewer proper nouns flagged as spelling mistakes, with no significant hit to the usability of the tool.
Co-authored-by: @rhystmills
Images
Ukraine benchmarking article
Israel benchmarking article
Art benchmarking article
Guardian NER Links
Repo: https://github.com/guardian/data-science-ner-service
Logs: https://logs.gutools.co.uk/s/data-science/app/discover
PROD Service: https://ner.gutools.co.uk/v1/
CODE Service: https://ner.code.dev-gutools.co.uk/v1/