What does this change?
We use sentence chunking in Typerighter to detect sentence starts. This helps us figure out when to cap up suggestions. For example, when detecting a typo like
`... to holiday accomodation. cafs, bars, and ...`
we would want to provide the suggestion `Cafes`, to accommodate both the typo and the fact that words at the start of a sentence should be capitalised.
Our sentence helper tokenises our sentence and discovers its first word to enable this feature. This lets us compare our suggestion to the first word in the sentence. If our suggestion case-insensitively matches the first word and its position, we can be sure that we're suggesting something at the start of a sentence, and capitalise accordingly.
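The capitalisation decision described above can be sketched roughly as follows. This is an illustrative Python sketch, not Typerighter's actual code: the `WordInSentence` type, the `cap_up_if_sentence_start` function and the character offsets are all hypothetical. It also assumes that the flagged word, rather than the suggestion text itself, is what gets compared with the first word, because a typo's correction will usually differ from the original text. The real helper may compare differently.

```python
from dataclasses import dataclass


@dataclass
class WordInSentence:
    """A word and its character offset in the document (hypothetical type)."""
    text: str
    start: int


def cap_up_if_sentence_start(
    suggestion: str, flagged: WordInSentence, first_word: WordInSentence
) -> str:
    """Capitalise the suggestion when the flagged word opens the sentence."""
    # Assumption: both the position and the case-insensitive text must match.
    at_sentence_start = (
        flagged.start == first_word.start
        and flagged.text.lower() == first_word.text.lower()
    )
    if not at_sentence_start:
        return suggestion
    # Upper-case only the first character; str.capitalize() would also
    # lower-case the rest of the suggestion.
    return suggestion[:1].upper() + suggestion[1:]


# "... to holiday accomodation. cafs, bars, and ..." (offsets are illustrative)
first_word = WordInSentence("cafs", 30)
print(cap_up_if_sentence_start("cafes", WordInSentence("cafs", 30), first_word))  # Cafes
print(cap_up_if_sentence_start("bars", WordInSentence("bras", 36), first_word))   # bars
```

Comparing the position as well as the text means a later repeat of the same word in the sentence is not capitalised by mistake.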
The first word is not always the first token. For example, the tokens for the fragment `(Oh no` would be `-LRB-`, `Oh`, `no`, where `LRB` stands for Left Round Bracket. So in order to detect the first word, we must discard any tokens which are composed entirely of non-word characters.
This PR fixes a bug where we discarded any token that contained a non-word character, rather than only tokens made up entirely of them. This meant that, in an example given to us by Editorial, the sentence helper thought the first word in the sentence `Anti-immigrant flyers` was `flyers`, because `Anti-immigrant` contained a non-word character.
How to test
Deploy this branch to CODE and test the following sentence:
As politics kept us from one home, it marginalised us in the other. The Dursleys of Essex nailed their racism to the mast time and again. Anti-immigrant flyers landed on the doorstep, Ukip banners appeared in windows, two of the top five “vote Leave” constituencies were within a few miles of our family home.
Before this change, you should see `flyers` flagged as red, with a suggestion `Flyers`. After it, `flyers` should be green.
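
To make the difference between the old and new token filters concrete, here is a minimal Python sketch. It is not the code in this PR: the `Token` type and the function names are hypothetical, and it assumes each token carries its original text (so `-LRB-` is checked as `(`). It contrasts discarding tokens that contain any non-word character with discarding only tokens made up entirely of them.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Token:
    label: str  # the tokeniser's label, e.g. "-LRB-" (hypothetical field)
    text: str   # the original text, e.g. "(" (hypothetical field)


def has_any_non_word_char(token: Token) -> bool:
    """Old, buggy predicate: true if the token contains any non-word character."""
    return re.search(r"\W", token.text) is not None


def is_entirely_non_word(token: Token) -> bool:
    """Fixed predicate: true only if every character is a non-word character."""
    return re.fullmatch(r"\W+", token.text) is not None


def first_word(
    tokens: List[Token], should_discard: Callable[[Token], bool]
) -> Optional[str]:
    """Return the text of the first token that is not discarded, if any."""
    for token in tokens:
        if not should_discard(token):
            return token.text
    return None


bracketed = [Token("-LRB-", "("), Token("Oh", "Oh"), Token("no", "no")]
hyphenated = [Token("Anti-immigrant", "Anti-immigrant"), Token("flyers", "flyers")]

print(first_word(bracketed, is_entirely_non_word))    # Oh
print(first_word(hyphenated, has_any_non_word_char))  # flyers (the bug)
print(first_word(hyphenated, is_entirely_non_word))   # Anti-immigrant
```

Both predicates skip the bracket in `(Oh no`, which is why the bug only showed up for words such as `Anti-immigrant` that mix word and non-word characters.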