Trigram context should be stored only in lower case and without accents in the database

See the discussion here:

https://github.com/mike-fabian/ibus-typing-booster/issues/251#issuecomment-973038344

Testing revealed that storing the trigram context only in lower case and without accents does not make the prediction quality worse at all.

We have already been ignoring punctuation for a long time: input like “(Test” is just stored as “Test” in the database.

Removing that “(” from the token and not saving it in the database might slightly reduce the prediction quality because the information that the “(” was there is lost. On the other hand, it avoids saving an extra row in the database which is not likely to be used again soon.
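As a rough illustration of that kind of punctuation stripping, here is a minimal sketch. The function name `strip_surrounding_punctuation` is hypothetical and is not taken from the ibus-typing-booster code; it assumes that "punctuation" means any character in a Unicode `P*` category and that only leading and trailing punctuation is removed.

```python
import unicodedata


def strip_surrounding_punctuation(token: str) -> str:
    """Remove leading and trailing punctuation from a token.

    Hypothetical helper for illustration only. Punctuation inside
    the token (for example the apostrophe in "don't") is kept.
    """
    def is_punctuation(char: str) -> bool:
        # All Unicode punctuation categories start with "P"
        # (Ps, Pe, Pi, Pf, Po, Pd, Pc).
        return unicodedata.category(char).startswith("P")

    start = 0
    end = len(token)
    while start < end and is_punctuation(token[start]):
        start += 1
    while end > start and is_punctuation(token[end - 1]):
        end -= 1
    return token[start:end]
```

With this sketch, `strip_surrounding_punctuation("(Test")` gives `"Test"`, so both “(Test” and “Test” would end up in the same database row.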

There is a tradeoff in how exactly the input should be recorded. Recording it too exactly just increases the size of the database (or, if the database is limited to a fixed size, it takes away space from other rows which might be more useful).

A database which is too big just makes things very slow without offering significantly better predictions.

So introducing a bit more “fuzziness” in the trigrams by not recording upper/lower case and not recording accents saves a few rows in the database, because there are no longer rows which are almost identical and differ only in case or accents.
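The following sketch shows how such a normalization could merge near-identical rows. It is an assumption-laden illustration, not the actual implementation: the name `normalize_context` and the row layout `(phrase, previous word, word before that)` are hypothetical, and accent removal is approximated here by Unicode NFD decomposition followed by dropping combining marks. That approach does not touch letters such as “ø” or “ß”, which have no decomposition.

```python
import unicodedata


def normalize_context(word: str) -> str:
    """Return a context word in lower case and without accents.

    Hypothetical helper: decomposes the word (NFD), drops the
    combining marks (category Mn), recomposes and lower-cases it.
    """
    decomposed = unicodedata.normalize("NFD", word)
    without_marks = "".join(
        char for char in decomposed
        if unicodedata.category(char) != "Mn"
    )
    return unicodedata.normalize("NFC", without_marks).lower()


# Assumed row layout for illustration: (phrase, p_phrase, pp_phrase),
# where p_phrase and pp_phrase are the two preceding context words.
rows = [
    ("Paris", "à", "Été"),
    ("Paris", "a", "été"),
    ("Paris", "À", "ete"),
]

# Only the context is normalized; the phrase itself is kept as typed.
merged_rows = {
    (phrase, normalize_context(p_phrase), normalize_context(pp_phrase))
    for phrase, p_phrase, pp_phrase in rows
}
```

In this example the three rows differ only in the case and accents of their context, so after normalization they collapse into the single row `("Paris", "a", "ete")`.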

Somewhat surprisingly, extensive testing showed that this did not reduce the prediction quality at all but saved around 1.3% of database rows.

So doing this is a small improvement.

mike-fabian/ibus-typing-booster, issue #256