mike-fabian / ibus-typing-booster

ibus-typing-booster is a completion input method for faster typing
https://mike-fabian.github.io/ibus-typing-booster/

Should search be case-sensitive? #251

Closed psads-git closed 3 years ago

psads-git commented 3 years ago

While I am in the process of changing the timestamps of my database, I have a suggestion: make the search case-insensitive. For instance, if I write mike, nothing is found by ibus-typing-booster, but if I start writing Mi... the Mike suggestion emerges immediately. I think it would be useful to be able to write mi... and have the suggestion Mike offered immediately.

What do you think about this?

psads-git commented 3 years ago

Thanks, Mike. Have a good day!

mike-fabian commented 3 years ago

https://copr.fedorainfracloud.org/coprs/mfabian/ibus-typing-booster/builds/ has 2.15.1 builds which show only capitalized candidates if the user starts a word with a capital letter.

psads-git commented 3 years ago

Thanks, Mike! It seems to be working fine.

mike-fabian commented 3 years ago

> Thanks, Mike! It seems to be working fine.

Do you think it is an improvement? I tend to think it is an improvement but I am not sure yet...

psads-git commented 3 years ago

Yes, Mike, to me it is a big improvement. It is hard to understand why you are not yet sure about that. However, if you are not fully convinced, you may want to consider making it optional.

mike-fabian commented 3 years ago

Ah, no, I don't want to make that optional; I wouldn't even know how to name the option so that it is understandable.

I just didn't test enough yet.

I thought there might be cases where one wants lower case even though one accidentally started typing with upper case. But that is probably rare, and then one can still switch to lower case by pressing the right Shift key.

So I think I'll leave that feature in.

I might change the way it is implemented, though; the current implementation makes it slightly slower, and I have a different idea of how it could be done.

psads-git commented 3 years ago

Thanks, Mike. I am afraid that my suggestions may be putting a heavy load on your shoulders. That is not my intention -- I just want to make ibus-typing-booster even greater, as it simplifies my life a lot. However, if you do not feel like accepting or implementing my suggestions, please feel free to follow your own judgement rather than my ideas and suggestions; I understand that working on ibus-typing-booster is not how you earn your salary. Moreover, you are the only one with full rights to decide what is best for ibus-typing-booster.

mike-fabian commented 3 years ago

I think I might leave the current implementation of capitalizing the candidates when the user types a capital letter as it is. It has a slight performance penalty because it does this:

https://github.com/mike-fabian/ibus-typing-booster/blob/release-candidate-2.15.1/engine/tabsqlitedb.py#L379

    def best_candidates(
            self,
            phrase_frequencies: Dict[str, int],
            title=False) -> List[Tuple[str, int]]:
        '''Sorts the phrase_frequencies dictionary and returns the best
        candidates.

        Should *not* change the phrase_frequencies dictionary!
        '''
        if title:
            phrase_frequencies_title: Dict[str, int] = {}
            for phrase in phrase_frequencies:
                phrase_title = phrase[:1].title() + phrase[1:]
                if phrase_title in phrase_frequencies_title:
                    phrase_frequencies_title[phrase_title] += (
                        phrase_frequencies[phrase])
                else:
                    phrase_frequencies_title[phrase_title] = (
                        phrase_frequencies[phrase])
            return sorted(phrase_frequencies_title.items(),
                          key=lambda x: (
                              -1*x[1],   # user_freq descending
                              len(x[0]), # len(phrase) ascending
                              x[0]       # phrase alphabetical
                          ))[:20]
        return sorted(phrase_frequencies.items(),
                      key=lambda x: (
                          -1*x[1],   # user_freq descending
                          len(x[0]), # len(phrase) ascending
                          x[0]       # phrase alphabetical
                      ))[:20]

I.e. when the function best_candidates() is called with title=True, then all phrases in the phrase_frequencies dictionary which are identical except for the capitalization of the first letter are merged into one and their user frequencies added.

For example, if phrase_frequencies contains

{'Ratatouille': 5, 'ratatouille': 3, 'ratatouiller': 1}

then the new phrase_frequencies_title will be:

{'Ratatouille': 8, 'Ratatouiller': 1}

and then this is sorted as usual and the first 20 candidates returned.
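
For illustration, here is a standalone sketch of just that merging step (not the actual Typing Booster code, only the idea):

    # Standalone sketch of the merging described above: title-case each phrase
    # and add up the frequencies of phrases that become identical.
    from typing import Dict

    def merge_title_case(phrase_frequencies: Dict[str, int]) -> Dict[str, int]:
        merged: Dict[str, int] = {}
        for phrase, freq in phrase_frequencies.items():
            phrase_title = phrase[:1].title() + phrase[1:]
            merged[phrase_title] = merged.get(phrase_title, 0) + freq
        return merged

    print(merge_title_case({'Ratatouille': 5, 'ratatouille': 3, 'ratatouiller': 1}))
    # {'Ratatouille': 8, 'Ratatouiller': 1}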

Going through all the entries in the original dictionary and merging them into a new one is some extra work, especially as the original list of phrases, before sorting and cutting it down to the first 20, can be hundreds of entries long. But I don’t really notice any slowness while typing, so it is probably OK.

The other idea I had is this:

There is already the feature that you can switch between three case modes using the left and the right Shift key, forwards with the left one, backwards with the right one.

'capitalize', 'upper', 'lower'

and if you don’t like any of these you can go back to the 'orig' case mode (original, as typed by the user) by pressing Escape. This is somewhat faster than the way described above of changing the case in the candidate list, because it only works on the 20 candidates which are already in the candidate list and just changes their case; if that creates identical entries, they are not merged. For example, if the candidate list contains:

1 Ratatouille
2 ratatouille
3 ratatouiller

and you press Shift you get:

1 Ratatouille
2 Ratatouille
3 Ratatouiller

so the first two entries are now identical. This is good because before pressing Shift you might have seen candidate 3 and thought “I want that but in upper case”, so you press Shift and then 3. But if entries 1 and 2 were merged, the numbers would change and a candidate 3 would not exist anymore.

So switching through the case modes leaves the candidate list unchanged except for the change in case.
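
To make that concrete, a hypothetical helper (not the actual Typing Booster code) which applies a case mode to the visible candidates in place, without merging duplicates, so the numbering stays stable:

    # Hypothetical sketch: apply a case mode to the candidates already shown,
    # never merging entries, so candidate numbers do not shift.
    from typing import List, Tuple

    def apply_case_mode(candidates: List[Tuple[str, int]],
                        mode: str) -> List[Tuple[str, int]]:
        def change(phrase: str) -> str:
            if mode == 'capitalize':
                return phrase[:1].title() + phrase[1:]
            if mode == 'upper':
                return phrase.upper()
            if mode == 'lower':
                return phrase.lower()
            return phrase  # 'orig': leave as stored

        return [(change(phrase), freq) for phrase, freq in candidates]

    print(apply_case_mode(
        [('Ratatouille', 5), ('ratatouille', 3), ('ratatouiller', 1)],
        'capitalize'))
    # [('Ratatouille', 5), ('Ratatouille', 3), ('Ratatouiller', 1)]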

My other idea was to switch to the 'capitalize' case mode automatically when the user types a word starting with a capital letter. That would be very fast but would often create duplicates in the candidate list shown.

My current implementation for the case when the user typed a word starting with a capital letter gives you a different candidate list without duplicates. It might therefore have a different length and a different order, because it gives higher weight to a word that was already in the list twice with different capitalization.

I think removing the duplicates when the user starts typing with a capital letter is probably better.

psads-git commented 3 years ago

Now I understand your reluctance regarding this new feature, Mike! But I was not thinking of anything as complicated as what you describe!

Let me explain my idea. Nothing would need to change but the following:

This should not impose any degradation on speed, since it only implies

mike-fabian commented 3 years ago

Removing the duplicates is slightly more complicated than using a unique function on the words only, because the candidates are pairs of a word and an integer number; a higher number means higher priority, i.e. the candidate should show up higher in the lookup table.

When merging {'Ratatouille': 5, 'ratatouille': 3} into one entry, one needs to look at the numbers 5 and 3 as well. Either add them and get 8 (which is what I did and which increases the priority of words which happened to be in the original list twice) or take the maximum.
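
For the example pair, the two strategies would give (illustrative only):

    pair = {'Ratatouille': 5, 'ratatouille': 3}
    added = sum(pair.values())    # 8: boosts words that occurred with both cases
    maximum = max(pair.values())  # 5: just keeps the strongest entry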

The final list presented to the user can have more than 9 entries because the candidate list can be paged down. I cut it down to 20. If emojis are matched as well, the final list can get longer than 20 again because the emoji are added.

(You see that when you type cat you get 20 candidates; when typing cat_ you get 40 candidates.)

Currently I remove the duplicates before cutting down to 20 candidates and re-sort, because adding the frequencies of duplicates together requires a sort to get the entries back into the right order of frequencies. But before cutting down to 20 the list may be long.

I could make it slightly faster by first cutting down to 20 and sorting as usual and after that removing the duplicates without changing the sort order (the first one of a duplicate is the one with the higher frequency if the list is already sorted). This will probably be slightly faster because one never goes through more than 20 candidates that way.

psads-git commented 3 years ago

> I could make it slightly faster by first cutting down to 20 and sorting as usual and after that removing the duplicates without changing the sort order (the first one of a duplicate is the one with the higher frequency if the list is already sorted). This will probably be slightly faster because one never goes through more than 20 candidates that way.

Right, Mike!

mike-fabian commented 3 years ago

I might try that. It could be that the difference in speed is not significant; the current, more “exact” implementation seems fast as well.

psads-git commented 3 years ago

Up to now, Mike, I have not noticed any slowdown. So, the exact approach seems to be appropriate.

mike-fabian commented 3 years ago

https://copr.fedorainfracloud.org/coprs/mfabian/ibus-typing-booster/builds/ has 2.15.2 builds now.

These builds have the slightly faster implementation of title-casing the candidates if the user input starts with an upper case letter.

The difference is very small; as I expected, it is barely measurable.

But the new code is cleaner:

    def best_candidates(
            self,
            phrase_frequencies: Dict[str, int],
            title=False) -> List[Tuple[str, int]]:
        '''Sorts the phrase_frequencies dictionary and returns the best
        candidates.

        Should *not* change the phrase_frequencies dictionary!
        '''
        candidates = sorted(phrase_frequencies.items(),
                            key=lambda x: (
                                -1*x[1],   # user_freq descending
                                len(x[0]), # len(phrase) ascending
                                x[0]       # phrase alphabetical
                            ))[:20]
        if not title:
            return candidates
        candidates_title = []
        phrases_title = set()
        for candidate in candidates:
            phrase = candidate[0]
            phrase_title = phrase[:1].title() + phrase[1:]
            if phrase_title in phrases_title:
                continue
            candidates_title.append((phrase_title, candidate[1]))
            phrases_title.add(phrase_title)
        return candidates_title

Unless this makes anything worse, I’ll keep that.
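
Called on the example from above (assuming an instance db of the database class in tabsqlitedb.py), the new version keeps the frequency of the first, highest-ranked duplicate instead of adding the frequencies up:

    # Illustrative call, not a real test case from the repository:
    db.best_candidates({'Ratatouille': 5, 'ratatouille': 3, 'ratatouiller': 1},
                       title=True)
    # -> [('Ratatouille', 5), ('Ratatouiller', 1)]
    # i.e. the duplicate keeps the 5 of its best occurrence rather than the
    # merged 8 from the previous implementation.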

psads-git commented 3 years ago

Thanks, Mike. Apparently, everything seems to be working fine.

Maybe in the future ibus-typing-booster will be able to work without recording uppercase words in the database and even faster. Notwithstanding, ibus-typing-booster is already fast and great!

mike-fabian commented 3 years ago

> Maybe in the future ibus-typing-booster will be able to work without recording uppercase words in the database and even faster. Notwithstanding, ibus-typing-booster is already fast and great!

I am not sure what you mean by “without recording uppercase words”.

Currently entries in the database look like this:

55897|mo|Mondays|hates|Charlie|2|1637249680.12463

The fields are

id input_phrase phrase p_phrase pp_phrase user_freq timestamp

The input_phrase, i.e. mo, is what the user typed (they might have typed Mo!); it is now always recorded in lower case. That is how I implemented the case-insensitive matching: input_phrase is always recorded in lower case, and new user input is also converted to lower case before attempting to match it against input_phrase in the database.
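
For concreteness, a minimal sketch of that kind of matching (the table layout here is only modelled on the example row above; the real Typing Booster schema and queries are more involved):

    # Minimal sketch, not the real Typing Booster code: input_phrase is stored
    # in lower case, and the user's input is lowered before matching.
    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.execute('''CREATE TABLE phrases
                    (id INTEGER PRIMARY KEY, input_phrase TEXT, phrase TEXT,
                     p_phrase TEXT, pp_phrase TEXT,
                     user_freq INTEGER, timestamp REAL)''')
    conn.execute('INSERT INTO phrases VALUES (?, ?, ?, ?, ?, ?, ?)',
                 (55897, 'mo', 'Mondays', 'hates', 'Charlie', 2, 1637249680.12463))

    typed = 'Mo'  # the user may type any case
    rows = conn.execute(
        'SELECT phrase, user_freq FROM phrases WHERE input_phrase LIKE ?',
        (typed.lower() + '%',)).fetchall()
    print(rows)  # [('Mondays', 2)]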

phrase, i.e. Mondays, is the candidate the user committed. Of course that has to be recorded with the case which was actually used, in this case upper case. I guess you don’t mean you want to change that.

p_phrase is the previous word and pp_phrase the word before that. I.e. the user typed Charlie hates mo and then selected Mondays.

This context of the last two words is currently recorded with the case which was actually used in the context.

That means if you type Charlie hates mo or Charlie hates Mo again, Mondays will be suggested with a very high priority because the context matches exactly. Most likely it will be the first candidate.

However, if you type charlie hates mo or charlie hates Mo, Mondays will still be in the suggestions because the mo or Mo matches and the hates matches, but charlie does not match Charlie, so the priority number calculated for that match will be somewhat smaller.

Now the question is how picky one should be about differences in case in the context. Should Charlie hates and charlie hates be considered to be the identical context or not?

Is there any need to make these two contexts distinct and possibly suggest different things in these two contexts?

Or should contexts which differ only in case be considered identical and produce exactly the same suggestions?

I am not sure what is better; it is possible that ignoring the case in the context is better, but it is very hard to tell.

Punctuation, for example, is already ignored in contexts. Whether one types (Charlie hates mo or Charlie hates mo, the context is Charlie hates in both cases; the ( is ignored. That makes it impossible to suggest different things in these two cases, but suggesting something different just because one started with the ( would probably not be helpful.

It might be that suggesting different things for contexts which differ only in case is not helpful either.

If suggesting different things when the context differs only by case is not helpful, then not recording the case of the context in the database would use up only one row instead of two for Charlie hates mo and charlie hates mo with the result Mondays. As we limit the database to a fixed maximum of 50000 rows, this would make room for a different entry, which might be more useful than storing almost identical entries which differ only in case.

I can try that, but I wonder how to test whether it is really better or decreases the quality of the suggestions.

mike-fabian commented 3 years ago

By the way, if we think about ignoring case in the context, we could also think about ignoring accents in the context.

Both ignoring case and ignoring accents in the context make the context less precise and therefore the suggestions less precise, but it is unclear how much precision is helpful there. Too much precision can have the result that there are many rows in the database which are never typed exactly like that again, while something similar which differs only by case or accents is typed again. So a somewhat more “fuzzy” context could be better.
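
As a rough sketch of what such a “fuzzy” context could look like (not the actual Typing Booster code, which has its own remove_accents() with a translation table), a context word could be folded like this before it is stored or matched:

    # Rough sketch: fold a context word to lower case and strip accents.
    import unicodedata

    def fold_context_word(word: str) -> str:
        lowered = word.lower()
        decomposed = unicodedata.normalize('NFD', lowered)
        stripped = ''.join(c for c in decomposed
                           if not unicodedata.combining(c))
        return unicodedata.normalize('NFC', stripped)

    print(fold_context_word('Élève'))    # 'eleve'
    print(fold_context_word('Charlie'))  # 'charlie'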

But it is really not easy to say when making such a change whether it improves things or makes things worse.

I wonder if I should think about a way to measure improvement in a more precise way than just typing for a few hours and then deciding “It seems better to me ...”.

psads-git commented 3 years ago

Thanks, Mike. I think you have interpreted my idea correctly. I conjecture that, in this way, we can save database space without sacrificing prediction accuracy (or only very, very marginally).

As a side comment, I have just checked GBoard (the smartphone Google keyboard): when one starts writing, some suggestions are shown, and those same suggestions become capitalized if one starts writing with an uppercase letter!

mike-fabian commented 3 years ago

> Thanks, Mike. I think you have interpreted my idea correctly. I conjecture that, in this way, we can save database space without sacrificing prediction accuracy (or only very, very marginally).

I can try that soon and make a test build and then you can try to test whether it is helpful or not.

Should I start with recording the context in lower case and still keep the accents or do both at the same time?

> As a side comment, I have just checked GBoard (the smartphone Google keyboard): when one starts writing, some suggestions are shown, and those same suggestions become capitalized if one starts writing with an uppercase letter!

SwiftKey on Android also does this, so this is probably a good thing, we should keep that (and not make it optional either).

psads-git commented 3 years ago

I would not start with both at the same time, Mike. And I would start with the accents.

Maybe one could test the viability of the idea without implementing it in ibus-typing-booster: Doing some SQL queries directly on the database would perhaps suffice.

mike-fabian commented 3 years ago

> I would not start with both at the same time, Mike. And I would start with the accents.
>
> Maybe one could test the viability of the idea without implementing it in ibus-typing-booster: Doing some SQL queries directly on the database would perhaps suffice.

Implementing is not so difficult, I can probably do that in a few hours.

I wonder more how to evaluate whether it is helpful or not.

I just got this idea of a way of testing:

Letting ibus-typing-booster read whole books to train the input turned out to be less useful than I thought it would be, because some random book is not specific to the stuff a certain user usually types.

But of course I’ll keep that feature because there are maybe some text files a user has which can be useful.

But letting ibus-typing-booster read some book might be a useful test case:

Measure the “% of letters saved” with the current implementation, then change something in the implementation, like storing the context case insensitively, repeat the test, and see whether the “% of letters saved” changes.

That might be a good test to see whether a change in implementation achieves an improvement or not.

psads-git commented 3 years ago

Go ahead, Mike! It seems to be a very ingenious testing idea!

mike-fabian commented 3 years ago

I implemented that test case now and checked what the difference is before and after this commit:

commit e252672cf50d3a778e6e243354702909065c9dd5
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Wed Nov 17 12:47:30 2021 +0100

    Title case all candidates if input_phrase is in title case

    (Resolves: https://github.com/mike-fabian/ibus-typing-booster/issues/253)

I used the Project Gutenberg version of “The Picture of Dorian Gray” as a test case.

That book has 9195 lines and 80687 words:

$ wc the_picture_of_dorian_gray.txt 
  9195  80687 451987 the_picture_of_dorian_gray.txt

After reading that into an empty in memory database, that database has 70594 rows.

Testing with the behaviour before the above commit (i.e. even when a word starting with a capital letter is typed, there may be lower case candidates) gave:

total_length_typed=222279
total_length_committed=341048
total_length_saved=-118769
total_percent_saved=-34.82471675541273

I.e. 118769 characters were saved in typing which is 34.8%.
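
(For clarity, the reported percentage follows from the two totals like this, assuming saved = typed − committed, which matches the numbers above:)

    total_length_typed = 222279
    total_length_committed = 341048
    total_length_saved = total_length_typed - total_length_committed  # -118769
    total_percent_saved = 100 * total_length_saved / total_length_committed
    print(total_percent_saved)  # -34.82...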

Testing after the above commit (When a word starting with a capital letter is typed, all candidates start with upper case) gave:

total_length_typed=216400
total_length_committed=341048
total_length_saved=-124648
total_percent_saved=-36.54852102929793

I.e. 124648 characters were saved in typing which is 36.5%.

So this change was helpful, although less than I would have guessed from my feelings after manual testing.

mike-fabian commented 3 years ago

If I tested correctly, storing the context only in lower case and matching the context only in lower case does not change the savings at all, I get exactly the same:

total_length_typed=216400
total_length_committed=341048
total_length_saved=-124648
total_percent_saved=-36.54852102929793

but it reduces the number of database rows stored from 70594 to 69743 (i.e. by 1.2%)

So it is an improvement, as it achieves exactly the same prediction quality with a somewhat smaller database.

But a very small improvement.

I’ll test with accent insensitive context next but I need to use a different book, “The Picture of Dorian Gray” has no accents at all.

psads-git commented 3 years ago

That is good news, Mike! I expect that the savings in the database size will be larger than in the case you have just studied.

mike-fabian commented 3 years ago

Actually, I expect the savings from ignoring the accents to be even smaller than from ignoring the case.

Let’s see ... Tests are still running ...

mike-fabian commented 3 years ago

I made some tests with “Notre-Dame de Paris by Victor Hugo”:

https://www.gutenberg.org/ebooks/2610

Results case and accent insensitive context:

Database rows: 156245
total_length_typed=619150
total_length_committed=849995
total_length_saved=-230845
total_percent_saved=-27.158395049382644

Results case insensitive context (but accent sensitive):

Database rows: 156301
total_length_typed=619150
total_length_committed=849995
total_length_saved=-230845
total_percent_saved=-27.158395049382644

Results case sensitive and accent sensitive context:

Database rows: 157587
total_length_typed=619150
total_length_committed=849995
total_length_saved=-230845
total_percent_saved=-27.158395049382644

Results before:
commit e252672cf50d3a778e6e243354702909065c9dd5
Author: Mike FABIAN <mfabian@redhat.com>
Date:   Wed Nov 17 12:47:30 2021 +0100

    Title case all candidates if input_phrase is in title case

    (Resolves: https://github.com/mike-fabian/ibus-typing-booster/issues/253)

total_length_typed=629175
total_length_committed=849995
total_length_saved=-220820
total_percent_saved=-25.978976346919687

I.e. just as with my tests with the English book “The Picture of Dorian Gray”, doing the “Title case all candidates if input_phrase is in title case” is definitely helpful although the difference is again not very big.

It only saves a small number of database rows.

Going to case insensitive context saves 1286 database rows (0.82% of the rows).

Doing accent insensitive context as well saves another 56 database rows (0.035% of the rows).

Making the context case insensitive and accent insensitive does not change prediction accuracy at all in this test; the number of characters saved is exactly the same.

When I think about it, this seems to make sense to me. It does not surprise me much that context which is identical except for accents is very rare. Context which is identical except for the case of one of the context words will occur regularly, depending on whether a sequence of 3 words occurs at the beginning of a sentence or in the middle of a sentence. But I had already guessed that context which is identical except for accent differences is something very unusual.

By the way, the 27.2% saved for “Notre-Dame de Paris by Victor Hugo” was less than the 36.5% saved for “The Picture of Dorian Gray”.

I am not sure why that is; maybe Typing Booster works less well for French than for English? It could also be because of the size of the book: “Notre-Dame de Paris by Victor Hugo” is more than twice as long:

$ wc /home/mfabian/tmp/the_picture_of_dorian_gray.txt 
  9195  80687 451987 /home/mfabian/tmp/the_picture_of_dorian_gray.txt
$ wc /home/mfabian/tmp/victor_hugo_notre_dame_de_paris.txt
  21676  175408 1117074 /home/mfabian/tmp/victor_hugo_notre_dame_de_paris.txt

and the percentage saved tends to be bigger for shorter texts according to the tests I did yesterday.

Two shorter texts I tested yesterday:

$ wc the_road_not_taken.txt chant_d_automne.txt 
  27  151  770 the_road_not_taken.txt
  39  226 1396 chant_d_automne.txt

(“The Road Not Taken” is the poem by Robert Frost, “Chant d’automne” is the poem by Charles Baudelaire.)

Savings were -51.3% for “The Road Not Taken” and -37.6% for “Chant d’automne”.

mike-fabian commented 3 years ago

So what do I do now after these tests?

I think I’ll do both the case insensitive and accent insensitive context.

The improvement from doing this is small: it does not seem to change the prediction accuracy at all, but it saves a (very small) number of database rows.

As the room in the database is limited (currently we cut it down to 50000 rows on each restart of Typing Booster), saving a few rows without changing the prediction accuracy at all makes some room for other additional rows which might actually improve prediction accuracy.

So doing this seems to be an improvement, but a very small one.

psads-git commented 3 years ago

Excellent news, Mike!

The following is perhaps a too crazy idea (I still have to think much more deeply about it): Ignoring the order of the context words may have only a negligible impact on prediction accuracy, while saving a lot of database rows.

mike-fabian commented 3 years ago

> Excellent news, Mike!
>
> The following is perhaps a too crazy idea (I still have to think much more deeply about it): Ignoring the order of the context words may have only a negligible impact on prediction accuracy, while saving a lot of database rows.

Here I would guess that this will significantly worsen prediction accuracy.

That would destroy the whole trigram idea. Recently you quoted something which had some numbers on how much prediction accuracy increases if you use a longer context than just 2 words, i.e. four-grams or five-grams.

According to what I have read so far, prediction accuracy increases the longer you make the context used, but the speed goes down fast: the calculation time increases extremely fast and the accuracy gains are minor.

Trigrams usually still seem to be worth it, though.

Your idea is almost like using bigrams, with the additional twist of adding fake bigrams where the context skips one word.

I really doubt that this would help anything, most likely this makes it worse.

psads-git commented 3 years ago

As I remarked in my previous message, Mike, I have not yet thought thoroughly about the idea. Moreover, this is only a speculative exercise, since ibus-typing-booster is already fast -- it does not need significant improvements. However, my idea is to try to understand whether an ordered 3-gram model is worse than an unordered 4- or 5-gram one.

mike-fabian commented 3 years ago

At least I have some code now which helps me test such ideas. So we can get actual results when trying out ideas.

Is there any literature about “unordered n-grams”?

psads-git commented 3 years ago

Yeah, your system to test new ideas is indeed a great improvement, Mike!

Yes, there is little literature regarding unordered n-grams, but I have found a technical report on Natural Language Generation. Unordered n-grams are mentioned on p. 7. However, as far as I understand, this technical report is not about word prediction.

mike-fabian commented 3 years ago

A colleague found this:

https://towardsdatascience.com/next-word-prediction-with-nlp-and-deep-learning-48b9fe0a17bf

but reading this I have the impression that they do less than I already do and that in a more complicated way.

They even write in a note:

> Note: There are certain cases where the program might not return the expected result. This is obvious because each word is being considered only once. This will cause certain issues for particular sentences and you will not receive the desired output. To improve the accuracy of the model you can consider trying out bi-grams or tri-grams. We have only used uni-grams in this approach. Also, a few more additional steps can be done in the pre-processing steps. Overall, there is a lot of scope for improvement.

mike-fabian commented 3 years ago

I found a side effect of making the user input stored in the database case insensitive and opened a new issue for it:

https://github.com/mike-fabian/ibus-typing-booster/issues/255

mike-fabian commented 3 years ago

New issue for the case and accent insensitivity in the context:

https://github.com/mike-fabian/ibus-typing-booster/issues/256

mike-fabian commented 3 years ago

I did another interesting test:

  • Let Typing Booster read the book victor_hugo_notre_dame_de_paris.txt into an empty, in-memory database.

After doing that, the database has 156245 rows. We know already from the tests above that a simulated typing of that book will now save about -27% of the characters.

But that database is huge and will be cut down to the 50000 “best” entries on the next restart of Typing Booster. So how much will that degrade the prediction quality? So I do these next steps in the test:

  • Call cleanup_database() on that in-memory database
  • Now the database has only 50000 entries
  • Now do the typing simulation

Result: Only -24% are saved instead of -27%

But that doesn’t seem bad to me, only a small loss of 3% prediction accuracy with a database of less than 1/3 the original size.

mike-fabian commented 3 years ago

Getting higher quality data into a smaller database is more useful than a huge database with low quality data.

The database trained by the huge book “Notre dame de Paris” is probably mostly useful only to retype exactly that book and of very limited use to a “normal” user.

A “normal” user is probably much better served by a smaller database with contents which fit better to his style of writing.

psads-git commented 3 years ago

> A colleague found this:
>
> https://towardsdatascience.com/next-word-prediction-with-nlp-and-deep-learning-48b9fe0a17bf
>
> but reading this I have the impression that they do less than I already do and that in a more complicated way.
>
> They even write in a note:
>
> Note: There are certain cases where the program might not return the expected result. This is obvious because each word is being considered only once. This will cause certain issues for particular sentences and you will not receive the desired output. To improve the accuracy of the model you can consider trying out bi-grams or tri-grams. We have only used uni-grams in this approach. Also, a few more additional steps can be done in the pre-processing steps. Overall, there is a lot of scope for improvement.

I do not think, Mike, it would be a good idea to use deep learning methods in ibus-typing-booster, as such methods would require periodic training.

psads-git commented 3 years ago

> I did another interesting test:
>
>   • Let Typing Booster read the book victor_hugo_notre_dame_de_paris.txt into an empty, in-memory database.
>
> After doing that, the database has 156245 rows. We know already from the tests above that a simulated typing of that book will now save about -27% of the characters.
>
> But that database is huge and will be cut down to the 50000 “best” entries on the next restart of Typing Booster. So how much will that degrade the prediction quality? So I do these next steps in the test:
>
>   • Call cleanup_database() on that in-memory database
>   • Now the database has only 50000 entries
>   • Now do the typing simulation
>
> Result: Only -24% are saved instead of -27%
>
> But that doesn’t seem bad to me, only a small loss of 3% prediction accuracy with a database of less than 1/3 the original size.

Yes, Mike, that is great that saving a lot of database size impacts so little on accuracy!

psads-git commented 3 years ago

> Getting higher quality data into a smaller database is more useful than a huge database with low quality data.
>
> The database trained by the huge book “Notre dame de Paris” is probably mostly useful only to retype exactly that book and of very limited use to a “normal” user.
>
> A “normal” user is probably much better served by a smaller database with contents which fit better to his style of writing.

I totally agree with you, Mike!

mike-fabian commented 3 years ago

https://copr.fedorainfracloud.org/coprs/mfabian/ibus-typing-booster/builds/ has 2.15.4 builds now with these changes:

psads-git commented 3 years ago

Thanks, Mike. If I find any problem, I will let you know.

mike-fabian commented 3 years ago

https://copr.fedorainfracloud.org/coprs/mfabian/ibus-typing-booster/builds/ The 2.15.7 build contains an additional small tweak:

While reading training data from a file, the context in the database is converted to lower case and the accents are removed.

So if you want to convert the context in the existing rows of your database, you can read training data from a file.

The size of the file doesn’t matter; it can even be empty.

psads-git commented 3 years ago

Thanks, Mike. That is a useful tweak!

mike-fabian commented 3 years ago

What do you think about this?:

https://github.com/mike-fabian/ibus-typing-booster/issues/257

Do you have any opinions? If yes, please comment.

psads-git commented 3 years ago

I do not think that it is a much-needed feature, Mike, given that it only requires copying a file. But, who knows, it may be useful for less sophisticated users.

If you choose to add such a feature, then maybe it should include not only the database but all the configuration as well. And maybe use a single compressed file as the exported file.

I hope to have helped, Mike!

mike-fabian commented 3 years ago

https://copr.fedorainfracloud.org/coprs/mfabian/ibus-typing-booster/builds/ has 2.15.8 builds.

Almost everything works 30%-40% faster with this experimental version (but it uses somewhat more memory; I am not sure how much).

psads-git commented 3 years ago

Thanks, Mike. By “uses somewhat more memory”, do you mean RAM or disk?

mike-fabian commented 3 years ago

RAM.

I didn't measure how much more RAM it uses and I am not sure how to measure that.

The change which achieves this big speedup is actually only two lines:

diff --git a/engine/itb_util.py b/engine/itb_util.py
index ea55b7b7..9b4cfee6 100644
--- a/engine/itb_util.py
+++ b/engine/itb_util.py
@@ -33,6 +33,7 @@ from enum import Enum, Flag
 import sys
 import os
 import re
+import functools
 import collections
 import unicodedata
 import locale
@@ -2784,6 +2785,7 @@ TRANS_TABLE = {
     ord('Ŧ'): 'T',
 }

+@functools.cache
 def remove_accents(text: str, keep: str = '') -> str:
     # pylint: disable=line-too-long
     '''Removes accents from the text

I noticed that the function which removes accents from a string is a major bottleneck.

I couldn’t find a way to make that function faster but I tried caching the results. The easiest way to do that is to add that function decorator.

That means if this function is called twice with the same arguments, for example if you call something like this twice:

remove_accents('abcÅøßẞüxyz', keep='åÅØø')

then the second call will return the result

'abcÅøssSSuxyz'

from the cache which is of course much faster.

As this remove_accents() function is used really a lot in Typing Booster, caching results from only that function already achieves that huge speedup of 30%-40%.

But as I didn’t limit the size of the cache, this means that every time this function is called with a different word during a typing booster session, that word gets added to the cache. And this function is really used a lot, i.e. the cache might get quite big.

I could use something like:

@functools.lru_cache(maxsize=100000)

to limit the maximum size of the cache. That would limit the cache to up to the most recent one hundred thousand calls. As each call typically has a word of input and a word of output, that would limit the maximum size of the cache to a few megabytes.

According to the documentation

https://docs.python.org/3/library/functools.html

adding such a limit makes it slightly slower though:

> Returns the same as lru_cache(maxsize=None), creating a thin wrapper around a dictionary lookup for the function arguments. Because it never needs to evict old values, this is smaller and faster than lru_cache() with a size limit.

I have not yet measured how much slower.
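
For reference, a minimal sketch of what the bounded variant could look like and how the cache can be watched with cache_info() (the accent handling here is simplified; the real remove_accents() lives in engine/itb_util.py):

    import functools
    import unicodedata

    @functools.lru_cache(maxsize=100000)
    def remove_accents_cached(text: str) -> str:
        # Simplified stand-in for the real function; the point is the caching.
        decomposed = unicodedata.normalize('NFD', text)
        return unicodedata.normalize(
            'NFC', ''.join(c for c in decomposed
                           if not unicodedata.combining(c)))

    remove_accents_cached('Noël')
    remove_accents_cached('Noël')  # second call is served from the cache
    print(remove_accents_cached.cache_info())
    # CacheInfo(hits=1, misses=1, maxsize=100000, currsize=1)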