guardian / typerighter


RegexMatcher: address capitalisation problems at sentence starts #108

Closed: jonathonherbert closed this 3 years ago

jonathonherbert commented 3 years ago

What does this change?

At the moment, when we apply suggestions from regular expressions, there's sometimes an unexpected side effect: we overwrite casing in the matched text.

For example, with the regex (?i)\bmedia?eval (note the (?i) flag, which makes it case-insensitive), we always suggest the lowercase word medieval.

This causes problems when a matched word begins a sentence. E.g. in "End of sentence. Medieval", the correctly capitalised Medieval produces a match suggesting medieval.
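The problem can be reproduced outside the codebase with a few lines. This is only an illustrative sketch in Python, not the RegexMatcher code itself: the pattern and replacement come from the description above, while `naive_suggestions` is a hypothetical helper invented for the example.

```python
import re

# Pattern and replacement taken from the PR description.
PATTERN = re.compile(r"(?i)\bmedia?eval")
REPLACEMENT = "medieval"


def naive_suggestions(text):
    """Return (matched_text, suggestion) pairs.

    Hypothetical stand-in for the old behaviour: the replacement is
    suggested verbatim, whatever the casing of the matched text.
    """
    return [(m.group(0), REPLACEMENT) for m in PATTERN.finditer(text)]


# The capitalised sentence start still matches, and the suggestion
# would lowercase it.
print(naive_suggestions("End of sentence. Medieval"))
# [('Medieval', 'medieval')]
```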

This PR preserves case where possible, by detecting sentence starts. When a regex match falls at a sentence start, and the starting characters of the suggestion and the match are the same when ignoring case, we keep the casing of the match.

So given the regex (?i)\bmedia?eval with the replacement medieval where square brackets denote a match:

The reason we preserve the suggestion on a perfect caseless match, rather than just stripping it, is to preserve the match's 'mark as correct' behaviour.
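A rough sketch of the rule described above, for illustration only. It is written in Python rather than taken from the PR, and it rests on assumptions: `is_sentence_start` is a hypothetical heuristic (start of text, or after '.', '!' or '?' followed by whitespace), and "keep the casing of the match" is read as keeping the casing of the match's first character. The actual sentence detection in the PR may well differ.

```python
import re

# Pattern and replacement taken from the PR description.
PATTERN = re.compile(r"(?i)\bmedia?eval")
REPLACEMENT = "medieval"


def is_sentence_start(text, index):
    """Hypothetical heuristic, not the PR's detection logic.

    True if the match opens the text, or follows '.', '!' or '?'
    plus at least one whitespace character.
    """
    before = text[:index]
    stripped = before.rstrip()
    if stripped == "":
        return True
    return stripped[-1] in ".!?" and stripped != before


def suggest(text):
    """Return (matched_text, suggestion) pairs, preserving case at sentence starts."""
    results = []
    for m in PATTERN.finditer(text):
        matched = m.group(0)
        suggestion = REPLACEMENT
        same_first_char = matched[0].lower() == suggestion[0].lower()
        if is_sentence_start(text, m.start()) and same_first_char:
            # Assumption: only the first character's casing is carried over.
            suggestion = matched[0] + suggestion[1:]
        results.append((matched, suggestion))
    return results


print(suggest("End of sentence. Mediaeval"))  # [('Mediaeval', 'Medieval')]
print(suggest("End of sentence. Medieval"))   # [('Medieval', 'Medieval')]
print(suggest("A mediaeval castle"))          # [('mediaeval', 'medieval')]
```

In the second case the suggestion is identical to the matched text. The sketch still returns it rather than dropping the match, in line with the reasoning above about keeping the suggestion on a perfect caseless match.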

How to test

The unit tests should pass.

E.g. the text "End of sentence. Mediaeval" should offer a correction to Medieval. Before, it would offer medieval.

How can we measure success?

Fewer complaints about mistakes with casing in suggestions.

Have we considered potential risks?

This is probably still not perfect, but the rules to which this applies are probably better addressed in the long run with a dictionary. Wonky edge cases gratefully received.