guardian / typerighter

Even if you’re the right typer, couldn’t hurt to use Typerighter!
Apache License 2.0

Do not discard legitimate word starts in SentenceHelper #167

jonathonherbert closed 3 years ago

jonathonherbert commented 3 years ago

What does this change?

We use sentence chunking in Typerighter to detect sentence starts. This helps us work out when to capitalise suggestions. For example, when detecting a typo like

... to holiday accomodation. cafs, bars, and ...

We would want to provide the suggestion Cafes – to accommodate both the typo and the fact that words at the start of a sentence should be capitalised.

Our sentence helper tokenises our sentence and discovers its first word to enable this feature. This lets us compare our suggestion to the first word in the sentence. If our suggestion case-insensitively matches the first word and its position, we can be sure that we're suggesting something at the start of a sentence, and capitalise accordingly.
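The check described above could be sketched roughly as follows. This is an illustrative Python sketch, not the Typerighter implementation: `WordInSentence`, `suggestion_for`, and their parameters are hypothetical names, and it assumes the comparison is made between the matched text and the first word of each sentence.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class WordInSentence:
    """Hypothetical record: the first word of a sentence and its offset in the text."""
    word: str
    start: int


def suggestion_for(suggestion: str, matched_text: str, match_start: int,
                   first_words: Sequence[WordInSentence]) -> str:
    """Capitalise the suggestion only when the match is the first word of a sentence."""
    at_sentence_start = any(
        fw.start == match_start and fw.word.lower() == matched_text.lower()
        for fw in first_words
    )
    if at_sentence_start and suggestion:
        # Upper-case the first character only; leave the rest of the suggestion alone.
        return suggestion[0].upper() + suggestion[1:]
    return suggestion


text = "to holiday accomodation. cafs, bars, and"
# Assumed output of the sentence helper for this fragment.
first_words = [WordInSentence("to", 0), WordInSentence("cafs", text.index("cafs"))]

print(suggestion_for("cafes", "cafs", text.index("cafs"), first_words))  # Cafes
```

Matching on position as well as text matters here: a later, mid-sentence occurrence of the same word would otherwise be capitalised too.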

The first word is not always the first token. For example, the tokens for the fragment "(Oh no" would be "-LRB- Oh no", where LRB stands for Left Round Bracket. So to detect the first word, we must discard any tokens that consist entirely of non-word characters.

This PR fixes a bug where if we detected any non-word characters in a token, we'd discard it. This meant that, in an example given to us from Editorial, the sentence helper thought the first word in the sentence beginning "Anti-immigrant flyers" was "flyers", because "Anti-immigrant" contains a non-word character (the hyphen).
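The difference between the two behaviours can be illustrated with a small sketch. It is written in Python for brevity rather than taken from the Typerighter source: the function names are hypothetical, and it assumes the check runs on each token's surface text (for example "(" rather than a normalised bracket marker).

```python
import re
from typing import Callable, Optional, Sequence


def is_discarded_buggy(token: str) -> bool:
    """Old behaviour: discard a token if it contains ANY non-word character."""
    return re.search(r"\W", token) is not None


def is_discarded_fixed(token: str) -> bool:
    """New behaviour: discard a token only if EVERY character is a non-word character."""
    return re.fullmatch(r"\W+", token) is not None


def first_word(tokens: Sequence[str],
               is_discarded: Callable[[str], bool]) -> Optional[str]:
    """Return the first token that survives the filter, or None if none does."""
    return next((t for t in tokens if not is_discarded(t)), None)


tokens = ["Anti-immigrant", "flyers", "landed"]
print(first_word(tokens, is_discarded_buggy))  # flyers
print(first_word(tokens, is_discarded_fixed))  # Anti-immigrant

# A leading bracket is still skipped under the fixed rule.
print(first_word(["(", "Oh", "no"], is_discarded_fixed))  # Oh
```

The only change is from a "contains" test to a "whole token" test, so hyphenated words survive while pure punctuation is still skipped.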

How to test

Deploy this branch to CODE and test the following sentence:

As politics kept us from one home, it marginalised us in the other. The Dursleys of Essex nailed their racism to the mast time and again. Anti-immigrant flyers landed on the doorstep, Ukip banners appeared in windows, two of the top five “vote Leave” constituencies were within a few miles of our family home.

Before this change, you should see flyers flagged as red, with a suggestion Flyers.

After it, flyers should be green.

jonathonherbert commented 3 years ago

Provisionally claiming bonus points w/o review after refactor to satisfy user as tests are green, thanks for the feedback – feel free to invalidate bonus points at any time in post-hoc review 😁