guardian / typerighter


Use Guardian NER service for entity recognition #453

Closed simonbyford closed 1 year ago

simonbyford commented 1 year ago

Co-authored-by: @rhystmills

What does this change?

Our first attempt at NER (named entity recognition) reduced noise, but many proper nouns still go unrecognised and are therefore flagged as typos by Typerighter.

[Image: peckham]

This PR swaps our NER implementation from the OpenNLP library to the Guardian's own NER service, which is accessed via a REST API. This results in a significant further reduction in noise. However, there are a few risks associated with this approach, as outlined below.

If we decide to go ahead with this approach, we can strip out the OpenNLP implementation in a subsequent PR. (Update: we ended up removing it in this PR instead.)
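To illustrate the idea of excluding named entities from spell-checking, here is a minimal sketch in Python. It is not the Typerighter implementation: the `Span` type, the `span_of` helper and the function names are hypothetical, and it assumes only that the NER step yields character ranges for the entities it finds.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A half-open character range [start, end) in the checked text."""
    start: int
    end: int


def overlaps(a: Span, b: Span) -> bool:
    """True when the two ranges share at least one character."""
    return a.start < b.end and b.start < a.end


def drop_matches_in_entities(matches: list[Span], entities: list[Span]) -> list[Span]:
    """Keep only the spelling matches that do not touch any named entity."""
    return [m for m in matches if not any(overlaps(m, e) for e in entities)]


def span_of(text: str, word: str) -> Span:
    """Hypothetical helper: locate the first occurrence of `word` in `text`."""
    start = text.index(word)
    return Span(start, start + len(word))


text = "I walked through Peckham and saw a beutiful mural."

# Assumed NER output: "Peckham" is recognised as an entity.
entities = [span_of(text, "Peckham")]

# Assumed spell-checker output: both words were flagged as typos.
matches = [span_of(text, "Peckham"), span_of(text, "beutiful")]

remaining = drop_matches_in_entities(matches, entities)
remaining_words = [text[m.start:m.end] for m in remaining]
```

In this sketch only the genuine typo survives the filter. The same overlap test also shows how risk 1 below arises: a real misspelling that the NER step labels as an entity would be dropped in exactly the same way.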

How to test

Either run this branch locally (in conjunction with Composer) or deploy it to CODE. Paste in some text and compare how many proper nouns are flagged as spelling mistakes with the result for the same text on main. There are a few test articles below.

Have we considered potential risks?

There are 3 main risks:

1) As before, there is a risk with any NER solution that genuinely misspelled words might be incorrectly identified as named entities and therefore excluded from spell-checking (i.e. false negatives).

2) There is a risk that adding this integration degrades or overwhelms the Guardian NER service because of the increased traffic to the API. This would be bad because it is used by other production systems. We can mitigate this by working closely with the data science team to monitor the service.

3) There is a risk that introducing a network request increases the latency of the checker to the point where the delay is noticeable to the user.
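One common way to bound the latency described in risk 3 is to give the NER call a deadline and fall back to "no entities" when it is missed. The sketch below is an assumption about how that could look, not what this PR does: `entities_or_empty` and the three stand-in fetchers are hypothetical, and no real NER endpoint is called. Note that it caps the wait but does not reduce traffic to the service, so it does not address risk 2 on its own.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared pool so a slow call does not block the caller's thread.
_executor = ThreadPoolExecutor(max_workers=4)


def entities_or_empty(fetch, text, timeout_s):
    """Run fetch(text), but give up after timeout_s seconds.

    On timeout or any error, return an empty list so that the checker
    carries on and simply spell-checks everything.
    """
    future = _executor.submit(fetch, text)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # best effort; a call that is already running continues
        return []
    except Exception:
        return []


# Hypothetical stand-ins for a call to the NER service.
def fast_fetch(text):
    return [(0, len(text))]


def slow_fetch(text):
    time.sleep(0.3)  # simulates a slow network round trip
    return [(0, len(text))]


def broken_fetch(text):
    raise RuntimeError("service unavailable")


fast_result = entities_or_empty(fast_fetch, "Peckham", timeout_s=1.0)
slow_result = entities_or_empty(slow_fetch, "Peckham", timeout_s=0.05)
broken_result = entities_or_empty(broken_fetch, "Peckham", timeout_s=1.0)
```

The trade-off in this design is that a timeout brings back the old noise (proper nouns flagged as typos) for that request, but it never hides a genuine typo and never leaves the user waiting on the NER service.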

How can we measure success?

Fewer proper nouns being flagged as spelling mistakes. No significant hit to the usability of the tool.
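As a rough illustration of how the first criterion could be counted, the sketch below compares the words flagged on main with those flagged on this branch for the same text. The word sets are invented for the example and are not results from the benchmarking articles; `proper_nouns` stands for a hand-labelled list that a reviewer would have to supply.

```python
def proper_nouns_flagged(flagged, proper_nouns):
    """Number of flagged words that are really proper nouns (noise)."""
    return len(flagged & proper_nouns)


def lost_typos(flagged_before, flagged_after, proper_nouns):
    """Genuine typos flagged before the change but no longer flagged after it."""
    genuine_before = flagged_before - proper_nouns
    return genuine_before - flagged_after


# Illustrative, made-up inputs (not measurements).
proper_nouns = {"Peckham", "Kharkiv"}
flagged_on_main = {"Peckham", "Kharkiv", "beutiful"}
flagged_on_branch = {"beutiful"}

noise_before = proper_nouns_flagged(flagged_on_main, proper_nouns)
noise_after = proper_nouns_flagged(flagged_on_branch, proper_nouns)
missed = lost_typos(flagged_on_main, flagged_on_branch, proper_nouns)
```

A drop from `noise_before` to `noise_after` corresponds to fewer proper nouns being flagged, while a non-empty `missed` set would point to the false negatives described in risk 1.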

Images

Ukraine benchmarking article

Before: [Screenshot 2023-10-23 at 10 17 14]
After: [Screenshot 2023-10-23 at 10 17 36]

Israel benchmarking article

Before: [Screenshot 2023-10-23 at 10 16 58]
After: [Screenshot 2023-10-23 at 10 17 06]

Art benchmarking article

Before: [Screenshot 2023-10-23 at 10 53 06]
After: [Screenshot 2023-10-23 at 10 53 28]

Guardian NER Links

Repo: https://github.com/guardian/data-science-ner-service
Logs: https://logs.gutools.co.uk/s/data-science/app/discover
PROD Service: https://ner.gutools.co.uk/v1/
CODE Service: https://ner.code.dev-gutools.co.uk/v1/

rhystmills commented 1 year ago

New commits look good to me 👍