Thanks, Mike. Have a good day!
https://copr.fedorainfracloud.org/coprs/mfabian/ibus-typing-booster/builds/ has 2.15.1 builds which show only capitalized candidates if the user starts a word with a capital letter.
Thanks, Mike! It seems to be working fine.
Do you think it is an improvement? I tend to think it is an improvement but I am not sure yet...
Yes, Mike, to me it is a big improvement. It is hard to understand why you are not yet sure about that. However, if you are not very convinced, you may want to consider making it optional.
Ah, no, I don't want to make that optional, I wouldn't even know how to name that option so that it is understandable.
I just didn't test enough yet.
I thought there might be cases where one wants lower case even though one accidentally started typing with upper case. But that is probably rare and then one can still go to lower case by pressing the right Shift key.
So I think I'll leave that feature.
I might change the way it is implemented though; the current implementation makes it slightly slower, and I have a different idea of how it could be done.
Thanks, Mike. I am afraid that my suggestions may be putting a heavy load on your shoulders. That is not my intention -- I just want to make ibus-typing-booster even greater, as it simplifies my life a lot. However, if you do not feel like accepting or implementing my suggestions, please feel free to follow your own judgment rather than my ideas and suggestions, as I understand that working on ibus-typing-booster is not how you earn your salary. Moreover, you are the only one with full rights to decide what is best for ibus-typing-booster.
I think I might leave the current implementation of capitalizing the candidates when the user types a capital letter as it is. It has a slight performance penalty because it does this:
def best_candidates(
        self,
        phrase_frequencies: Dict[str, int],
        title=False) -> List[Tuple[str, int]]:
    '''Sorts the phrase_frequencies dictionary and returns the best
    candidates.

    Should *not* change the phrase_frequencies dictionary!
    '''
    if title:
        phrase_frequencies_title: Dict[str, int] = {}
        for phrase in phrase_frequencies:
            phrase_title = phrase[:1].title() + phrase[1:]
            if phrase_title in phrase_frequencies_title:
                phrase_frequencies_title[phrase_title] += (
                    phrase_frequencies[phrase])
            else:
                phrase_frequencies_title[phrase_title] = (
                    phrase_frequencies[phrase])
        return sorted(phrase_frequencies_title.items(),
                      key=lambda x: (
                          -1*x[1],   # user_freq descending
                          len(x[0]), # len(phrase) ascending
                          x[0]       # phrase alphabetical
                      ))[:20]
    return sorted(phrase_frequencies.items(),
                  key=lambda x: (
                      -1*x[1],   # user_freq descending
                      len(x[0]), # len(phrase) ascending
                      x[0]       # phrase alphabetical
                  ))[:20]
I.e. when the function best_candidates() is called with title=True, then all phrases in the phrase_frequencies dictionary which are identical except for the capitalization of the first letter are merged into one and their user frequencies are added.
For example, if phrase_frequencies contains
{'Ratatouille': 5, 'ratatouille': 3, 'ratatouiller': 1}
then the new phrase_frequencies_title will be:
{'Ratatouille': 8, 'Ratatouiller': 1}
and then this is sorted as usual and the first 20 candidates are returned.
Going through all the entries in the original dictionary and merging them into a new one is some extra work, especially because the original list of phrases before sorting and cutting it down to the first 20 can be long, possibly hundreds of entries. But I don’t really notice any slowness while typing, so it is probably OK.
The other idea I had is this:
There is already the feature that you can switch between the three case modes 'capitalize', 'upper' and 'lower' using the left and the right Shift key, forwards with the left one, backwards with the right one, and if you don’t like any of these you can go back to the 'orig' case mode (original, as typed by the user) by pressing Escape.
This is somewhat faster than the above way of changing the case in the candidate list because it only works on the 20 candidates which are already in the candidate list and just changes their case; if that creates identical entries, they are not merged. For example, if the candidate list contains:
1 Ratatouille
2 ratatouille
3 ratatouiller
and you press Shift you get:
1 Ratatouille
2 Ratatouille
3 Ratatouiller
so the first two entries are now identical. This is good because before pressing Shift you might have seen candidate 3 and thought “I want that, but in upper case”, so you press Shift and then 3. But if entries 1 and 2 were merged, the numbers would change and candidate 3 would not exist anymore.
So switching through the case modes leaves the candidate list unchanged except for the change in case.
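Roughly, applying a case mode to the already built candidate list works like this (a simplified standalone sketch for illustration, not the actual code):
from typing import List, Tuple

def apply_case_mode(
        candidates: List[Tuple[str, int]], mode: str) -> List[Tuple[str, int]]:
    '''Return the candidate list with only the case changed.
    Order and length stay the same, duplicates are *not* merged.
    (Hypothetical helper, for illustration only.)
    '''
    if mode == 'capitalize':
        return [(x[0][:1].title() + x[0][1:], x[1]) for x in candidates]
    if mode == 'upper':
        return [(x[0].upper(), x[1]) for x in candidates]
    if mode == 'lower':
        return [(x[0].lower(), x[1]) for x in candidates]
    return list(candidates)  # 'orig'

candidates = [('Ratatouille', 5), ('ratatouille', 3), ('ratatouiller', 1)]
print(apply_case_mode(candidates, 'capitalize'))
# [('Ratatouille', 5), ('Ratatouille', 3), ('Ratatouiller', 1)]
# entries 1 and 2 became identical but are not merged, so entry 3 keeps its number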
My other idea was to switch to the case mode 'capitalize' automatically when the user types a word starting with a capital letter. That would be very fast but would often create duplicates in the candidate list shown.
My current implementation for the case when the user typed a word starting with a capital letter gives you a different candidate list without duplicates. Therefore it might have a different length and a different order, because it gives higher weight to a word if it was already in the list twice with different capitalization.
I think removing the duplicates when the user starts typing with a capital letter is probably better.
Now I understand your reluctance regarding this new feature, Mike! But I was not thinking of anything as complicated as what you describe!
Let me explain my idea. Nothing would need to change but the following:
This should not impose any degradation in speed, since it only implies:
- capitalizing 9 words;
- and using a unique function to remove the duplicates.
Removing the duplicates is slightly more complicated than using a unique function on the words only, because the candidates are pairs of a word and an integer number; a higher number means higher priority, i.e. the word should show up higher in the lookup table.
When merging {'Ratatouille': 5, 'ratatouille': 3} into one entry, one needs to look at the numbers 5 and 3 as well: either add them and get 8 (which is what I did and which increases the priority of words which happened to be in the original list twice) or take the maximum.
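Something like this (just a sketch to illustrate the two options, not the actual code):
from typing import Dict

def merge_first_letter_case(
        phrase_frequencies: Dict[str, int], use_max: bool = False) -> Dict[str, int]:
    '''Merge entries which differ only in the case of the first letter.
    (Hypothetical helper, for illustration only.)
    '''
    merged: Dict[str, int] = {}
    for phrase, freq in phrase_frequencies.items():
        phrase_title = phrase[:1].title() + phrase[1:]
        if phrase_title in merged:
            if use_max:
                merged[phrase_title] = max(merged[phrase_title], freq)
            else:
                merged[phrase_title] += freq
        else:
            merged[phrase_title] = freq
    return merged

print(merge_first_letter_case({'Ratatouille': 5, 'ratatouille': 3}))
# {'Ratatouille': 8}   (adding: boosts words which occurred with both cases)
print(merge_first_letter_case({'Ratatouille': 5, 'ratatouille': 3}, use_max=True))
# {'Ratatouille': 5}   (maximum: just keeps the higher of the two frequencies)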
The final list presented to the user can have more than 9 entries because the candidate list can be paged down. I cut it down to 20. If emoji are matched as well, the final list can get longer than 20 again because the emoji are added.
(You see that when you type cat you get 20 candidates, when typing cat_ you get 40.)
Currently I remove the duplicates before cutting down to 20 candidates and then resort, because adding the frequencies of duplicates together requires a sort to get the entries into the right order of frequencies. But before cutting down to 20, the list may be long.
I could make it slightly faster by first sorting and cutting down to 20 as usual and after that removing the duplicates without changing the sort order (if the list is already sorted, the first one of a duplicate pair is the one with the higher frequency). This will probably be slightly faster because one never goes through more than 20 candidates that way.
Right, Mike!
I might try that, but it could be that the difference in speed is not significant; the current, more “exact” implementation seems fast as well.
Up to now, Mike, I have not noticed any slowdown. So, the exact approach seems to be appropriate.
https://copr.fedorainfracloud.org/coprs/mfabian/ibus-typing-booster/builds/ has 2.15.2 builds now.
These builds have the slightly faster implementation of title-casing the candidates if the user input starts with an upper case letter.
The difference is very small, as I expected; it is barely measurable.
But the new code is cleaner:
def best_candidates(
        self,
        phrase_frequencies: Dict[str, int],
        title=False) -> List[Tuple[str, int]]:
    '''Sorts the phrase_frequencies dictionary and returns the best
    candidates.

    Should *not* change the phrase_frequencies dictionary!
    '''
    candidates = sorted(phrase_frequencies.items(),
                        key=lambda x: (
                            -1*x[1],   # user_freq descending
                            len(x[0]), # len(phrase) ascending
                            x[0]       # phrase alphabetical
                        ))[:20]
    if not title:
        return candidates
    candidates_title = []
    phrases_title = set()
    for candidate in candidates:
        phrase = candidate[0]
        phrase_title = phrase[:1].title() + phrase[1:]
        if phrase_title in phrases_title:
            continue
        candidates_title.append((phrase_title, candidate[1]))
        phrases_title.add(phrase_title)
    return candidates_title
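For example, with the dictionary from further above, the new logic gives this (a standalone sketch of the same steps, without the class around it):
phrase_frequencies = {'Ratatouille': 5, 'ratatouille': 3, 'ratatouiller': 1}

candidates = sorted(phrase_frequencies.items(),
                    key=lambda x: (-1*x[1], len(x[0]), x[0]))[:20]
candidates_title = []
phrases_title = set()
for phrase, freq in candidates:
    phrase_title = phrase[:1].title() + phrase[1:]
    if phrase_title in phrases_title:
        continue  # a later duplicate always has a lower or equal frequency
    candidates_title.append((phrase_title, freq))
    phrases_title.add(phrase_title)

print(candidates_title)
# [('Ratatouille', 5), ('Ratatouiller', 1)]
# The old implementation gave [('Ratatouille', 8), ('Ratatouiller', 1)] here,
# because it added the frequencies of the merged duplicates before sorting.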
Unless this makes anything worse, I’ll keep that.
Thanks, Mike. Apparently, everything seems to be working fine.
Maybe in the future ibus-typing-booster will be able to work without recording uppercase words in the database and even faster. Notwithstanding, ibus-typing-booster is already fast and great!
I am not sure what you mean by “without recording uppercase words”.
Currently entries in the database look like this:
55897|mo|Mondays|hates|Charlie|2|1637249680.12463
The fields are
id input_phrase phrase p_phrase pp_phrase user_freq timestamp
The input_phrase, i.e. mo, is what the user typed (he might have typed Mo!); it is already always recorded in lower case. That is how I implemented the case insensitive matching: input_phrase is always recorded in lower case now, and new user input is also converted to lower case before attempting to match it against input_phrase in the database.
phrase, i.e. Mondays, is the candidate the user committed. Of course that has to be recorded with the case which was actually used, in this case upper case. I guess you don’t mean that you want to change that.
p_phrase is the previous word and pp_phrase the word before that. I.e. the user typed Charlie hates mo and then selected Mondays.
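To make that concrete, here is a tiny self-contained sqlite3 sketch (the real schema and queries in Typing Booster are more involved; the table name and column types here are just assumptions):
import sqlite3

con = sqlite3.connect(':memory:')
con.execute(
    'CREATE TABLE phrases ('
    'id INTEGER PRIMARY KEY, input_phrase TEXT, phrase TEXT, '
    'p_phrase TEXT, pp_phrase TEXT, user_freq INTEGER, timestamp REAL)')
# input_phrase is stored in lower case, phrase keeps the case which was committed:
con.execute(
    'INSERT INTO phrases VALUES (55897, ?, ?, ?, ?, 2, 1637249680.12463)',
    ('mo', 'Mondays', 'hates', 'Charlie'))

user_input = 'Mo'  # the user may type upper case ...
rows = con.execute(
    'SELECT phrase, user_freq FROM phrases WHERE input_phrase LIKE ?',
    (user_input.lower() + '%',)).fetchall()  # ... but matching is done in lower case
print(rows)  # [('Mondays', 2)]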
This context of the last two words is currently recorded with the case which was actually used in the context.
That means if you type Charlie hates mo or Charlie hates Mo again, Mondays will be suggested with a very high priority because the context matches exactly. Most likely it will be the first candidate.
However, if you type charlie hates mo or charlie hates Mo, Mondays will still be in the suggestions because the mo or Mo matches and the hates matches, but charlie does not match Charlie, so the priority number calculated for that match will be somewhat smaller.
Now the question is how picky one should be about differences in case in the context. Should Charlie hates and charlie hates be considered to be the identical context or not?
Is there any need to make these two contexts distinct and possibly suggest different things in these two contexts?
Or should contexts which differ only in case be considered identical and produce exactly the same suggestions?
I am not sure what is better, it is possible that ignoring the case in the context is better, very hard to tell.
Punctuation, for example, is already ignored in contexts. Whether one types (Charlie hates mo or Charlie hates mo, the context is Charlie hates in both cases; the ( is ignored. That makes it impossible to suggest different things in these two cases, but probably suggesting something different just because one started with the ( would not be helpful.
It might be that suggesting different things for contexts which differ only in case is not helpful either.
If suggesting different things when the context differs only by case is not helpful, then not recording the case of the context in the database would use only one row instead of two for Charlie hates mo and charlie hates mo with the result Mondays. As we limit the database to a fixed maximum of 50000 rows, this would make room for a different entry, which might be more useful than storing almost identical entries which differ only in case.
I can try that, but I wonder how to test whether it is really better or decreases the quality of the suggestions.
By the way, if we think about ignoring case in the context, we could also think about ignoring accents in the context.
Both ignoring case and ignoring accents in the context make the context less precise and therefore the suggestions less precise, but it is unclear how much precision is helpful there. Too much precision can have the result that there are too many rows in the database which are never typed exactly like that again, while something similar which differs only by case or accents is typed again. So a somewhat more “fuzzy” context could be better.
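Such a normalization of the context words could look roughly like this (a sketch only; the real code would rather reuse the existing remove_accents() in itb_util.py):
import unicodedata

def normalize_context(word: str) -> str:
    '''Make the stored context fuzzier: ignore case and accents.
    (Sketch only; accents are stripped here with a simple NFD decomposition.)
    '''
    decomposed = unicodedata.normalize('NFD', word.lower())
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

# 'Charlie', 'charlie' and 'Chárlie' would all end up as the same context word:
print({normalize_context(word) for word in ('Charlie', 'charlie', 'Chárlie')})
# {'charlie'}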
But it is really not easy to say when making such a change whether it improves things or makes things worse.
I wonder if I should think about a way to measure improvement in a more precise way than just typing for a few hours and then deciding “It seems better to me ...”.
Thanks, Mike. I think you have interpreted correctly my idea. I conjecture that, in this way, we can save database space without sacrificing prediction accuracy (or only very, very residually).
As a side comment, I have just checked GBoard (the smartphone Google keyboard): When one starts writing, some suggestions are shown and the same exact suggestions are turned capitalized if one starts writing with an uppercase letter!
I can try that soon and make a test build and then you can try to test whether it is helpful or not.
Should I start with recording the context in lower case and still keep the accents or do both at the same time?
SwiftKey on Android also capitalizes its suggestions when one starts typing with an uppercase letter, so this is probably a good thing; we should keep that (and not make it optional either).
I would not start with both at the same time, Mike. And I would start with the accents.
Maybe one could test the viability of the idea without implementing it in ibus-typing-booster: doing some SQL queries directly on the database would perhaps suffice.
Implementing is not so difficult, I can probably do that in a few hours.
I wonder more how to evaluate whether it is helpful or not.
I just got this idea of a way of testing:
Letting ibus-typing-booster read whole books to train the input turned out to be less useful than I thought it would be, because some random book is not specific to the stuff a certain user usually types.
But of course I’ll keep that feature because there are maybe some text files a user has which can be useful.
But letting ibus-typing-booster read some book might be a useful test case: read the book into an empty database, then simulate typing the same book and measure how many characters the predictions save (the “% of letters saved”).
Now change something in the implementation, like storing the context case insensitively, repeat the above test and see whether the “% of letters saved” changes.
That might be a good test to see whether a change in implementation achieves an improvement or not.
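The “% of letters saved” reported below is calculated roughly like this (a sketch of the arithmetic only, not the actual test code):
def percent_saved(total_length_typed: int, total_length_committed: int) -> float:
    '''Negative result means the predictions saved characters.'''
    total_length_saved = total_length_typed - total_length_committed
    return 100.0 * total_length_saved / total_length_committed

# with the “Dorian Gray” numbers further below:
print(percent_saved(222279, 341048))  # -34.82... i.e. 34.8% of the typing saved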
Go ahead, Mike! It seems to be a very ingenious testing idea!
I implemented that test case now and checked what the difference is before and after this commit:
commit e252672cf50d3a778e6e243354702909065c9dd5
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Wed Nov 17 12:47:30 2021 +0100
Title case all candidates if input_phrase is in title case
(Resolves: https://github.com/mike-fabian/ibus-typing-booster/issues/253)
I used the Project Gutenberg version of “The Picture of Dorian Gray” as a test case.
That book has 9195 lines and 80687 words:
$ wc the_picture_of_dorian_gray.txt
9195 80687 451987 the_picture_of_dorian_gray.txt
After reading that into an empty in memory database, that database has 70594 rows.
Testing with the behaviour before the above commit (i.e. even when a word starting with a capital letter is typed, there may be lower case candidates) gave:
total_length_typed=222279
total_length_committed=341048
total_length_saved=-118769
total_percent_saved=-34.82471675541273
I.e. 118769 characters were saved in typing which is 34.8%.
Testing after the above commit (When a word starting with a capital letter is typed, all candidates start with upper case) gave:
total_length_typed=216400
total_length_committed=341048
total_length_saved=-124648
total_percent_saved=-36.54852102929793
I.e. 124648 characters were saved in typing which is 36.5%.
So this change was helpful, although less than I would have guessed from my feelings after manual testing.
If I tested correctly, storing the context only in lower case and matching the context only in lower case does not change the savings at all; I get exactly the same:
total_length_typed=216400
total_length_committed=341048
total_length_saved=-124648
total_percent_saved=-36.54852102929793
but it reduces the number of database rows stored from 70594 to 69743 (i.e. by 1.2%).
So it is an improvement, as it achieves exactly the same prediction quality with a somewhat smaller database.
But a very small improvement.
I’ll test with accent insensitive context next but I need to use a different book, “The Picture of Dorian Gray” has no accents at all.
That is good news, Mike! I expect that the savings in the database size will be larger than in the case you have just studied.
Actually I expect the savings from doing it without the accents to be even smaller than from doing it case insensitively.
Let’s see ... Tests are still running ...
I made some tests with “Notre-Dame de Paris by Victor Hugo”:
https://www.gutenberg.org/ebooks/2610
Results case and accent insensitive context:
Database rows: 156245
total_length_typed=619150
total_length_committed=849995
total_length_saved=-230845
total_percent_saved=-27.158395049382644
Results case insensitive context (but accent sensitive):
Database rows: 156301
total_length_typed=619150
total_length_committed=849995
total_length_saved=-230845
total_percent_saved=-27.158395049382644
Results case sensitive and accent sensitive context:
Database rows: 157587
total_length_typed=619150
total_length_committed=849995
total_length_saved=-230845
total_percent_saved=-27.158395049382644
Results before:
commit e252672cf50d3a778e6e243354702909065c9dd5
Author: Mike FABIAN <mfabian@redhat.com>
Date: Wed Nov 17 12:47:30 2021 +0100
Title case all candidates if input_phrase is in title case
(Resolves: https://github.com/mike-fabian/ibus-typing-booster/issues/253)
total_length_typed=629175
total_length_committed=849995
total_length_saved=-220820
total_percent_saved=-25.978976346919687
I.e. just as with my tests with the English book “The Picture of Dorian Gray”, doing the “Title case all candidates if input_phrase is in title case” is definitely helpful although the difference is again not very big.
It only saves a small number of database rows.
Going to case insensitive context saves 1286 database rows (0.82% of the rows).
Doing accent insensitive context as well saves another 56 database rows (0.035% of the rows).
Making the context case insensitive and accent insensitive does not change the prediction accuracy at all in this test; the number of characters saved is exactly the same.
When I think about it, this seems to make sense to me. It does not surprise me much that context which is identical except for accents is very rare. Context which is identical except for the case of one of the context words will occur regularly, depending on whether a sequence of 3 words occurs at the beginning of a sentence or in the middle of a sentence. But I already guessed that context which is identical except for accent differences is something very unusual.
By the way, the 27.2% saved for “Notre-Dame de Paris by Victor Hugo” was less than the 36.5% saved for “The Picture of Dorian Gray”.
I am not sure why that is; maybe Typing Booster works less well for French than for English? It could also be because of the size of the book, “Notre-Dame de Paris by Victor Hugo” is more than twice as long:
$ wc /home/mfabian/tmp/the_picture_of_dorian_gray.txt
9195 80687 451987 /home/mfabian/tmp/the_picture_of_dorian_gray.txt
$ wc /home/mfabian/tmp/victor_hugo_notre_dame_de_paris.txt
21676 175408 1117074 /home/mfabian/tmp/victor_hugo_notre_dame_de_paris.txt
and the percentage saved tends to be bigger for shorter texts according to the tests I did yesterday.
Two shorter texts I tested yesterday:
$ wc the_road_not_taken.txt chant_d_automne.txt
27 151 770 the_road_not_taken.txt
39 226 1396 chant_d_automne.txt
(“The Road Not Taken” is the poem by Robert Frost, “Chant d’automne” is the poem by Charles Baudelaire).
Savings were -51.3% for “The Road Not Taken” and -37.6% for “Chant d’automne”.
So what do I do now after these tests?
I think I’ll do both the case insensitive and accent insensitive context.
The improvement from doing this is small: it does not seem to change the prediction accuracy at all, but it saves a (very small) number of database rows.
As the room in the database is limited (currently we cut it down to 50000 rows on each restart of Typing Booster), saving a few rows without changing the prediction accuracy at all makes some room for other additional rows which might actually improve prediction accuracy.
So doing this seems to be an improvement, but a very small one.
Excellent news, Mike!
The following is perhaps a too crazy idea (I still have to think much more deeply about it): ignoring the order of the context words may have only a negligible impact on prediction accuracy while saving a lot of database rows.
Here I would guess that this will significantly worsen prediction accuracy.
That would destroy the whole trigram idea. Recently you quoted something which had some numbers on how much prediction accuracy increases if you use a longer context than just 2 words, i.e. four-grams or five-grams.
According to what I have read so far, prediction accuracy increases the longer you make the context used, but the speed goes down fast: the calculation time increases extremely fast and the accuracy gains are minor.
Trigrams usually still seem to be worth it though.
Your idea is almost like using bigrams with the additional twist of adding fake bigrams where the context skips one word.
I really doubt that this would help anything, most likely this makes it worse.
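Just to make the difference concrete, here is a toy sketch of what an “unordered” context key would mean (purely an illustration, nothing like this exists in the code):
# ordered bigram context: ('charlie', 'hates') and ('hates', 'charlie')
# are two different keys, i.e. two different database rows
ordered_key = ('charlie', 'hates')

# unordered context: both collapse into one key, which saves rows, but
# “Charlie hates mo” and “hates Charlie mo” can no longer be told apart
unordered_key = frozenset(('charlie', 'hates'))

print(('hates', 'charlie') == ordered_key)                # False
print(frozenset(('hates', 'charlie')) == unordered_key)   # True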
As I remarked in my previous message, Mike, I have not yet thought thoroughly about the idea. Moreover, this is only a speculative exercise, since ibus-typing-booster is already fast -- it does not need significant improvements. However, my idea is to try to understand whether an ordered 3-gram model is worse than an unordered 4- or 5-gram one.
At least I have some code now which helps me testing such ideas. So we can get actual results now when trying out ideas.
Is there any literature about “unordered n-grams”?
Yeah, your system to test new ideas is indeed a great improvement, Mike!
Yes, there is little literature regarding unordered n-grams, but I have found the following technical report:
Unordered n-grams are mentioned on p. 7. However, as far as I understand, this technical report is not about word prediction.
A colleague found this:
https://towardsdatascience.com/next-word-prediction-with-nlp-and-deep-learning-48b9fe0a17bf
but reading this I have the impression that they do less than I already do and that in a more complicated way.
They even write in a note:
Note: There are certain cases where the program might not return the expected result. This is obvious because each word is being considered only once. This will cause certain issues for particular sentences and you will not receive the desired output. To improve the accuracy of the model you can consider trying out bi-grams or tri-grams. We have only used uni-grams in this approach. Also, a few more additional steps can be done in the pre-processing steps. Overall, there is a lot of scope for improvement.
I found a side effect of making the user input into the database case insensitive and opened a new issue for this:
https://github.com/mike-fabian/ibus-typing-booster/issues/255
New issue for the case and accent insensitivity in the context:
https://github.com/mike-fabian/ibus-typing-booster/issues/256
I did another interesting test:
- Let Typing Booster read the book victor_hugo_notre_dame_de_paris.txt into an empty, in-memory database. After doing that, the database has 156245 rows. We know already from the tests above that a simulated typing of that book will now save about -27% of the characters.
But that database is huge and will be cut down to the 50000 “best” entries on the next restart of Typing Booster. So how much will that degrade the prediction quality? So I do these next steps in the test:
- Call cleanup_database() on that in-memory database
- Now the database has only 50000 entries
- Now do the typing simulation
Result: Only -24% are saved instead of -27%
But that doesn’t seem bad to me, only a small loss of 3% prediction accuracy with a database of less than 1/3 the original size.
Getting higher quality data into a smaller database is more useful than a huge database with low quality data.
The database trained by the huge book “Notre-Dame de Paris” is probably mostly useful only for retyping exactly that book and of very limited use to a “normal” user.
A “normal” user is probably much better served by a smaller database with contents which fit better to his style of writing.
Regarding the article your colleague found: I do not think, Mike, it would be a good idea to use deep learning methods in ibus-typing-booster, as such methods would require periodic training.
Yes, Mike, that is great that saving a lot of database size impacts so little on accuracy!
I totally agree with you, Mike!
https://copr.fedorainfracloud.org/coprs/mfabian/ibus-typing-booster/builds/ has 2.15.4 now with these changes:
Thanks, Mike. If I find any problem, I will let you know.
https://copr.fedorainfracloud.org/coprs/mfabian/ibus-typing-booster/builds/ The 2.15.7 build contains an additional small tweak:
While reading training data from a file, the context in the database is converted to lower case and the accents are removed.
So if you want to convert the context in the existing rows of your database, you can read training data from a file.
The size of the file doesn’t matter, it can even be empty.
Thanks, Mike. That is a useful tweak!
What do you think about this?:
https://github.com/mike-fabian/ibus-typing-booster/issues/257
Do you have any opinions? If yes, please comment.
I do not think that it is a much-needed feature, Mike, given that it only requires copying a file. But, who knows, it may be useful for unsophisticated users.
If you choose to add such a feature, then maybe it should include not only the database but all configuration settings, and maybe use a single compressed file as the export format.
I hope to have helped, Mike!
https://copr.fedorainfracloud.org/coprs/mfabian/ibus-typing-booster/builds/ has 2.15.8 builds.
Almost everything works 30%-40% faster with this experimental version (but it uses somewhat more memory, I am not sure how much).
Thanks, Mike. With “uses somewhat more memory”, do you mean RAM or disk?
RAM.
I didn't measure how much more RAM it uses and I am not sure how to measure that.
The change which achieves this big speedup is actually only two lines:
diff --git a/engine/itb_util.py b/engine/itb_util.py
index ea55b7b7..9b4cfee6 100644
--- a/engine/itb_util.py
+++ b/engine/itb_util.py
@@ -33,6 +33,7 @@ from enum import Enum, Flag
 import sys
 import os
 import re
+import functools
 import collections
 import unicodedata
 import locale
@@ -2784,6 +2785,7 @@ TRANS_TABLE = {
     ord('Ŧ'): 'T',
 }
 
+@functools.cache
 def remove_accents(text: str, keep: str = '') -> str:
     # pylint: disable=line-too-long
     '''Removes accents from the text
I noticed that the function which removes accents from a string is a major bottleneck.
I couldn’t find a way to make that function itself faster, but I tried caching the results. The easiest way to do that is to add that function decorator.
That means if this function is called twice with the same arguments, for example if you call something like this twice:
remove_accents('abcÅøßẞüxyz', keep='åÅØø')
then the second call will return the result 'abcÅøssSSuxyz' from the cache, which is of course much faster.
As this remove_accents() function is used really a lot in Typing Booster, caching the results of only that function already achieves that huge speedup of 30%-40%.
But as I didn’t limit the size of the cache, every time this function is called with a different word during a Typing Booster session, that word gets added to the cache. And this function really is used a lot, i.e. the cache might get quite big.
I could use something like:
@functools.lru_cache(maxsize=100000)
to limit the maximum size of the cache. That would limit the cache to up to the most recent one hundred thousand calls. As each call typically has a word of input and a word of output, that would limit the maximum size of the cache to a few megabytes.
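One can at least watch how many entries end up in the cache, because functools exposes cache_info() on the decorated function. A small sketch with a simplified stand-in function:
import functools
import unicodedata

@functools.lru_cache(maxsize=100000)
def remove_accents_demo(text: str) -> str:
    '''Simplified stand-in for the real remove_accents() in itb_util.py.'''
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

for word in ('café', 'café', 'naïve', 'café'):
    remove_accents_demo(word)

print(remove_accents_demo.cache_info())
# CacheInfo(hits=2, misses=2, maxsize=100000, currsize=2)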
According to the documentation (https://docs.python.org/3/library/functools.html), adding such a limit makes it slightly slower though:
Returns the same as lru_cache(maxsize=None), creating a thin wrapper around a dictionary lookup for the function arguments. Because it never needs to evict old values, this is smaller and faster than lru_cache() with a size limit.
I have not yet measured how much slower.
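If I wanted to measure it, something like this would do (a sketch with a simplified stand-in for remove_accents(), so only the relative difference would mean anything):
import functools
import timeit
import unicodedata

def strip_accents(text: str) -> str:
    '''Simplified stand-in for remove_accents(), just for the measurement.'''
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

unbounded = functools.cache(strip_accents)
bounded = functools.lru_cache(maxsize=100000)(strip_accents)

words = ['Ratatouille', 'café', 'naïve', 'Mondays'] * 1000
for name, func in (('functools.cache', unbounded),
                   ('lru_cache(maxsize=100000)', bounded)):
    seconds = timeit.timeit(lambda: [func(word) for word in words], number=100)
    print(f'{name}: {seconds:.3f} s')  # mostly cache hits, so this measures the wrapper overhead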
While I am in the process of changing the timestamps of my database, I have something that I would like to suggest to you: to make the search case-insensitive. For instance, if I write mike, nothing is found by ibus-typing-booster, but if I start writing Mi... the Mike suggestion emerges immediately. I think it would be useful to be able to write mi... and then have the suggestion Mike offered immediately.
What do you think about this?