What does this change?
Our first attempt at NER (named entity recognition) resulted in a reduction in noise, but many proper nouns remain unidentified and are therefore flagged as typos by Typerighter.
This PR swaps our NER implementation from the OpenNLP library to the Guardian's own NER service, which is accessed via a REST API. This results in a significant further reduction in noise. However, there are a few risks associated with this approach, as outlined below.
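For context, the shape of the integration is roughly as sketched below: send the document text to the service, get back entity spans, and exclude any spelling matches that fall inside those spans. This is a minimal Python sketch only; the /v1/entities path and the text/entities/start/end field names are assumptions for illustration, not the service's actual contract.

```python
import requests

NER_URL = "https://ner.code.dev-gutools.co.uk/v1/entities"  # path is hypothetical

def entity_ranges(text: str) -> list[tuple[int, int]]:
    """Ask the Guardian NER service which character spans of `text` are named entities."""
    resp = requests.post(NER_URL, json={"text": text}, timeout=2)
    resp.raise_for_status()
    # Assumed response shape: {"entities": [{"start": 0, "end": 7, ...}, ...]}
    return [(e["start"], e["end"]) for e in resp.json()["entities"]]

def is_named_entity(word_start: int, word_end: int, ranges: list[tuple[int, int]]) -> bool:
    """A spelling match is suppressed if it falls entirely inside any entity span."""
    return any(start <= word_start and word_end <= end for start, end in ranges)
```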
If we decide to go ahead with this approach, we can strip out the OpenNLP implementation in a subsequent PR. (We ended up removing it here instead.)
How to test
Either run this branch locally (in conjunction with Composer) or deploy it to CODE. Paste in some text and see how many proper nouns are flagged as spelling mistakes, compared with doing the same thing on main. There are a few test articles below.
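To sanity-check that the CODE NER service is reachable before testing end to end, a one-off request along these lines should work (same caveat as above: the request path and payload shape are assumed):

```python
import requests

sample = "Marina Hyde wrote about Westminster for the Guardian."
resp = requests.post(
    "https://ner.code.dev-gutools.co.uk/v1/entities",  # hypothetical path under the CODE service
    json={"text": sample},
    timeout=5,
)
print(resp.status_code, resp.json())
```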
Have we considered potential risks?
There are 3 main risks:
1) As before, there is a risk with any NER solution that genuinely misspelled words might be incorrectly identified as named entities and therefore excluded from spell-checking (i.e. false negatives).
2) There is a risk that adding this integration degrades or overwhelms the Guardian NER service because of the increased traffic to the API. This would be bad because the service is used by other production systems. We can mitigate this by working closely with the data science team to monitor the service.
3) There is a risk that introducing a network request increases the latency of the checker enough that the delay is noticeable to the user. A short client-side timeout that fails open would bound the worst case for both this risk and the previous one (see the sketch after this list).
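One mitigation that addresses both of the last two risks is to call the service with a short timeout and fail open: if NER is slow or unavailable, we suppress nothing and the checker behaves exactly as it does on main. A hedged sketch, with the same hypothetical endpoint and response shape as above:

```python
import requests

def entity_ranges_with_fallback(text: str) -> list[tuple[int, int]]:
    """Fail open: if the NER service is slow or down, suppress nothing and let
    the spell-checker behave exactly as it does today."""
    try:
        resp = requests.post(
            "https://ner.code.dev-gutools.co.uk/v1/entities",  # hypothetical path
            json={"text": text},
            timeout=0.5,  # keep the worst-case added latency bounded
        )
        resp.raise_for_status()
        return [(e["start"], e["end"]) for e in resp.json()["entities"]]
    except requests.RequestException:
        return []  # no suppression; same behaviour as before this PR
```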
How can we measure success?
Fewer proper nouns flagged as spelling mistakes, with no significant hit to the usability of the tool.
Co-authored-by: @rhystmills
Images
Ukraine benchmarking article
Israel benchmarking article
Art benchmarking article
Guardian NER Links
Repo: https://github.com/guardian/data-science-ner-service
Logs: https://logs.gutools.co.uk/s/data-science/app/discover
PROD Service: https://ner.gutools.co.uk/v1/
CODE Service: https://ner.code.dev-gutools.co.uk/v1/