keymanapp / keyman

Keyman cross platform input methods system running on Android, iOS, Linux, macOS, Windows and mobile and desktop web
https://keyman.com/

bug(web): auto-correct polish needed #11963

Closed jahorton closed 1 month ago

jahorton commented 1 month ago

Using the current state of auto-correct this weekend, I ran into a few issues:

  1. Type a word that's in the model, then add a standard punctuation mark.
    • The word will still be visible as a correction; additionally, it will likely be auto-selected.
    • Hit space - the punctuation mark will be deleted. (Good luck getting the engine to let you leave it!)
    • Interestingly... there's a good chance that the applied suggestion will also be lower-cased, even if the version in context is title-cased.
  2. Type a word not in the model... perhaps a name.
    • Take "Esther", which is a single letter away from "either". The latter is a fairly common word, while the former is a name.
    • Good luck getting the engine to let you actually leave "esther" in place.
      • "Either" is "close enough" to trigger auto-correct mode.
      • "Esther" is not an available option, so there's no way to select something else or disable the auto-correct mode.
      • If you let it apply then move to revert it... well, auto-correct reactivates immediately, wanting to replace "Esther" with "either" again.
  3. So, regarding that last bit above: if a suggestion has been reverted, especially one that was auto-selected, auto-correction should be disabled until further keystrokes have been typed (see the sketch below).
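A minimal sketch of what point 3 could look like; the class and method names here are hypothetical, not the engine's actual API:

```typescript
// Hypothetical sketch: gate auto-correct behind a "recently reverted" flag.
// None of these names come from the actual Keyman engine.
class AutoCorrectGate {
  private suppressed = false;

  // Called when the user reverts an applied (especially auto-selected) suggestion.
  onSuggestionReverted(): void {
    this.suppressed = true;
  }

  // Called on each new keystroke; fresh input re-enables auto-correction.
  onKeystroke(): void {
    this.suppressed = false;
  }

  // The correction pipeline would consult this before auto-applying anything.
  mayAutoCorrect(): boolean {
    return !this.suppressed;
  }
}
```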
jahorton commented 1 month ago

Regarding point 1 above - the fact that punctuation is not handled well by auto-correct:

Step 1: Consider the following two possible contexts: `apple.` vs. `apple .` (note the space before the period). When no whitespace tokens are returned and the original string indices have been stripped, the prior form of the `tokenize()` function made these two contexts indistinguishable. This is a "problem."
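To illustrate with a simplified, stand-in tokenizer (not the engine's actual `tokenize()`): once whitespace tokens and string indices are discarded, both contexts reduce to the same token sequence.

```typescript
// Simplified stand-in tokenizer, for illustration only: split at word
// boundaries, then discard whitespace tokens.
function naiveTokenize(context: string): string[] {
  return context
    .split(/\b/)
    .map(token => token.trim())
    .filter(token => token.length > 0);
}

console.log(naiveTokenize('apple.'));   // ['apple', '.']
console.log(naiveTokenize('apple .'));  // ['apple', '.'] -- indistinguishable from the above
```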

Why so?

Step 2: Suppose the pre-existing context is `apple`. Broadly speaking, there are three different cases a follow-up transform might produce:

  1. `apples`
  2. `apple ` (note the trailing space)
  3. `apple.`

(Also consider that there is technically nothing that prohibits a keyboard from specifying a single key that emits a space followed by a '.')

For case 1, we simply continue adding more text to the existing word. No problems here. Predictions are based upon `apples`.

For case 2, at present, we explicitly check for a whitespace-only transform. (Whitespace is the most common wordbreaking trigger.) If one occurs, we note it down and prepend it to generated predictions in order to preserve the whitespace. We also generate predictions based on the new, empty token following the whitespace.
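A rough sketch of that check, under assumed names (`isWhitespaceOnly` and `rootForPredictions` are hypothetical; `Transform` here is just the usual insert / deleteLeft pair):

```typescript
// Hypothetical helpers illustrating the case-2 handling described above.
interface Transform {
  insert: string;
  deleteLeft: number;
}

function isWhitespaceOnly(transform: Transform): boolean {
  return transform.deleteLeft === 0 && /^\s+$/.test(transform.insert);
}

// Decide what predictions should be rooted on, and what (if anything) must be
// prepended to each suggestion so the whitespace survives.
function rootForPredictions(context: string, transform: Transform): { prepend: string; root: string } {
  if (isWhitespaceOnly(transform)) {
    // Case 2: keep the whitespace, then predict from a new, empty token.
    return { prepend: transform.insert, root: '' };
  }
  // Case 1: keep extending the existing word.
  return { prepend: '', root: lastWord(context + transform.insert) };
}

function lastWord(text: string): string {
  const words = text.split(/\s+/).filter(w => w.length > 0);
  return words.length ? words[words.length - 1] : '';
}
```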

For case 3... at present, we treat it the same as case 1... despite the fact that wordbreaking will split the context into two pieces: `apple` and `.`.

It would be best to treat case 3 like case 2. That said, another issue would still need to be resolved to handle this completely: we'd prefer not to 'correct' word-breaking punctuation marks, just as we currently don't 'correct' whitespace when starting a new token after a space.
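In other words, the preferred case-3 handling might look something like the following sketch (hypothetical names again): the punctuation token is preserved and never corrected, and predictions are rooted on a fresh, empty token after it, mirroring the whitespace handling in case 2.

```typescript
// Hypothetical sketch of the preferred case-3 handling.
function rootAfterPunctuation(punctuation: string): { prepend: string; root: string; correctable: boolean } {
  return {
    prepend: punctuation,  // keep the '.' in every generated suggestion
    root: '',              // predict from the new, empty token after the punctuation
    correctable: false     // never try to 'correct' the punctuation mark itself
  };
}

// After the context `apple` receives a '.' transform:
rootAfterPunctuation('.');  // => { prepend: '.', root: '', correctable: false }
```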


To address the issue raised in that last paragraph (a sketch of both options follows the list below):

  1. We could add a "punctuation list" to models.
    • If the new token is a listed punctuation token, automatically insert a blank token afterward and predict based on that.
    • EXTRA: if the user types a punctuation token after the standard `insertAfterWord` appended to predictions, this would make a fantastic check for auto-removing the appended `insertAfterWord`.
  2. Alternatively, we could specify and add a `skipCorrection` function to models.
    • Default: if the token is whitespace, no correcting.
    • Adding extra cases (say, to match punctuation tokens) would then tell the engine to bypass the token and root any predictions/corrections after it.
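As a rough sketch of how the two options above might hang together; both `punctuationList` and `skipCorrection` are proposed names, not existing model properties:

```typescript
// Proposed, not existing, model extensions; names follow the two options above.
interface LexicalModelCorrectionHints {
  // Option 1: word-breaking punctuation that should never itself be corrected.
  punctuationList?: string[];
  // Option 2: a predicate consulted before attempting correction of a token.
  skipCorrection?: (token: string) => boolean;
}

// Default behavior: whitespace is never corrected.
const defaultSkipCorrection = (token: string): boolean => /^\s+$/.test(token);

function shouldSkipCorrection(model: LexicalModelCorrectionHints, token: string): boolean {
  if (defaultSkipCorrection(token)) {
    return true;
  }
  if (model.skipCorrection?.(token)) {
    return true;
  }
  // A listed punctuation token behaves like whitespace: leave it untouched and
  // root any predictions/corrections on a new, empty token after it.
  return model.punctuationList?.includes(token) ?? false;
}
```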