Should search be case-sensitive?

psads-git commented 3 years ago

While I am in the process of changing the timestamps of my database, I have something that I would like to suggest to you: to make search case-insensitive. For instance, if I write mike, nothing is found by ibus-typing-booster, but if I start writing Mi... the Mike suggestion emerges immediately. I think it would be useful to be able to write mi... and have the suggestion Mike offered immediately.

What do you think about this?

mike-fabian commented 3 years ago

Yes, I think search should be case-insensitive and if this is possible it should be case insensitive by default.

mike-fabian commented 3 years ago

I don’t want to make this optional though, I think it should always be case insensitive.

psads-git commented 3 years ago

Thanks, Mike. That will significantly improve ibus-typing-booster.

mike-fabian commented 3 years ago

https://copr.fedorainfracloud.org/coprs/mfabian/ibus-typing-booster/builds/ has 2.14.18 builds now which have the case insensitive matching.

psads-git commented 3 years ago

Thanks, Mike. This version of ibus-typing-booster is almost unusable, since the search is extremely slow.

Let me add that my database is about 70 MB (I fed several books into it to help the prediction).

mike-fabian commented 3 years ago

Is it slower than before? I didn’t notice any slowdown.

psads-git commented 3 years ago

Yes, Mike, several orders of magnitude slower.

mike-fabian commented 3 years ago

But probably because of your huge database, or do you think the case insensitive matching changed anything?

mike-fabian commented 3 years ago

If you downgrade to 2.14.17, is that faster?

psads-git commented 3 years ago

If I downgrade to 2.14.17, ibus-typing-booster is very fast.

I do not have any solid idea why that is happening, but since the result of each search has many more elements, the ordering by use frequency may be extremely slow because of the number of elements that it has to order -- if I remember well, the sorting algorithms are O(n^2) (or something close to that).

mike-fabian commented 3 years ago

> Let me add that my database is about 70 MB (I fed several books into it to help the prediction).

By the way, I found reading books into Typing Booster not to be helpful. For example, in 2014, I let Typing Booster read the “Hitchhikers Guide to the Galaxy” and “The Picture of Dorian Gray”. It didn’t seem to help my typing at all. I could easily open any page in one of these books and type any sentence from these books easily with good prediction, but what I really wanted to type (almost) never seemed to occur in these books. I guess this is true for almost any book except possibly if you wrote it yourself. User input is so variable, it doesn’t help much to read text from other writers, even a huge amount of such text doesn’t seem to help.

Now when I implemented the expiry of old entries and looked at what entries were expired, I noticed that most entries from both these books had been created in 2014 and had never been touched since then.

So I think reading books doesn’t help much.

Maybe I should rethink how to read huge texts, I have no good idea at the moment though. Learning from your own input is helpful, learning from other people's input seems to have very limited value. Maybe, when reading a text above a certain size, I should only add stuff from that text to the database if it really appears very often. That still might not help very much, for example “The Picture of Dorian Gray” contains 47 times the text “said Lord Henry”. I will probably never type that, I will probably type “said” often but “Lord” and “Henry” very rarely and it is unlikely that I ever type the complete text “said Lord Henry”.
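The "only add what appears very often" idea could be pictured as a counting pass before anything is written to the database. This is a minimal sketch, not Typing Booster code: the function name `frequent_ngrams` and the threshold `min_count` are made up for illustration.

```python
import re
from collections import Counter

def frequent_ngrams(text, n=3, min_count=5):
    """Return {ngram: count} for word n-grams seen at least min_count times."""
    words = re.findall(r"\w+", text)
    counts = Counter(
        " ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return {gram: count for gram, count in counts.items() if count >= min_count}

# Toy text: one phrase repeated 47 times plus one ordinary sentence.
sample = "said Lord Henry. " * 47 + "It was a quiet evening."
result = frequent_ngrams(sample, n=3, min_count=40)
print(result)
```

As the comment above already suspects, a pure frequency threshold still keeps “said Lord Henry”, so such a filter would reduce the amount of imported text without making it more relevant to what the user actually types.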

mike-fabian commented 3 years ago

Now I have measured and indeed it is much slower.

My database has 140000 rows and a size of 10MB and with that it is about 10 times slower.

I measured both the time it takes to do the case insensitive lookup in the hunspell dictionaries and in the database.

The difference for the lookup in the dictionaries is insignificant, the difference in speed between the case insensitive regular expressions used now and the .startswith() used before seems to be within measurement error.
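For readers who want to see the two dictionary lookups side by side, here is a small sketch under stated assumptions: the word list is assumed to be already stripped of hunspell flags such as `/M`, and `prefix_matches` is a hypothetical helper, not the function Typing Booster uses.

```python
import re

def prefix_matches(words, typed, ignore_case=True):
    """Return the words that start with `typed`, optionally ignoring case."""
    if not ignore_case:
        # The old behaviour: plain case sensitive prefix test.
        return [word for word in words if word.startswith(typed)]
    # re.escape keeps characters like '.' or '-' in the input literal.
    pattern = re.compile(re.escape(typed), re.IGNORECASE)
    return [word for word in words if pattern.match(word)]

words = ["Mike", "mike", "Cormeilles-en-Parisis", "Paris"]
print(prefix_matches(words, "mi"))
print(prefix_matches(words, "corme"))
print(prefix_matches(words, "mi", ignore_case=False))
```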

But the difference when doing the database lookup is huge. As soon as I use

PRAGMA case_sensitive_like = false;

for the sqlite database, it becomes about 10 times slower for me.

So probably there are two things slowing this down:

1. The case insensitive LIKE operator is probably slow already.
2. When doing the LIKE case insensitive, the number of records matched is probably a lot bigger, and then calculating the linear combinations of the counts of the previous two words has a lot more work to do and becomes much slower.
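The second point can be reproduced on a toy table. The sketch below uses Python's sqlite3 module with an in-memory database; only the column names `input_phrase` and `user_freq` come from this thread, the rest of the schema and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE phrases (input_phrase TEXT, phrase TEXT, user_freq INTEGER)")
conn.executemany(
    "INSERT INTO phrases VALUES (?, ?, ?)",
    [("Thi", "This", 3), ("thi", "this", 5), ("thi", "think", 2),
     ("Te", "Test", 1)])
conn.commit()

def match_stats(prefix, case_sensitive):
    """Return (matched rows, summed user_freq) for a LIKE prefix match."""
    conn.execute(
        "PRAGMA case_sensitive_like = "
        + ("true" if case_sensitive else "false"))
    rows = conn.execute(
        "SELECT count(*), coalesce(sum(user_freq), 0) FROM phrases "
        "WHERE input_phrase LIKE ?", (prefix + "%",)).fetchall()
    return rows[0]

print(match_stats("Thi", True))   # only the row typed with a capital T
print(match_stats("Thi", False))  # rows typed as "Thi" and as "thi"
```

Every additional matched row is one more row that the later scoring step has to process, which fits the observation that the slowdown grows with the size of the database.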

psads-git commented 3 years ago

So, Mike, case-insensitiveness is perhaps a bad idea! ;-)

mike-fabian commented 3 years ago

Some examples from my current database:

sqlite> PRAGMA case_sensitive_like = true;
sqlite> select sum(user_freq) from phrases where input_phrase like "Thi%" ;
411
sqlite> PRAGMA case_sensitive_like = false;
sqlite> select sum(user_freq) from phrases where input_phrase like "Thi%" ;
2768
sqlite>

sqlite> PRAGMA case_sensitive_like = true;
sqlite> select sum(user_freq) from phrases where input_phrase like "Th%" ;
2712
sqlite> PRAGMA case_sensitive_like = false;
sqlite> select sum(user_freq) from phrases where input_phrase like "Th%" ;
18887
sqlite> 

sqlite> PRAGMA case_sensitive_like = true;
sqlite> select sum(user_freq) from phrases where input_phrase like "Te%" ;
188
sqlite> PRAGMA case_sensitive_like = false;
sqlite> select sum(user_freq) from phrases where input_phrase like "Te%" ;
2199
sqlite> 

sqlite> PRAGMA case_sensitive_like = true;
sqlite> select sum(user_freq) from phrases where input_phrase like "te%" ;
1998
sqlite> PRAGMA case_sensitive_like = false;
sqlite> select sum(user_freq) from phrases where input_phrase like "te%" ;
2199
sqlite>

Looks like the number of entries matched by a case insensitive LIKE operator is really a lot bigger.

Probably we cannot do case insensitive matching in the database then.

Maybe only in the dictionaries.

So for example when using the en_US dictionary, “mike” and “Mike” would both match:

$ grep -i mike /usr/share/myspell/en_US.dic 
Mike/M
mike/MGDS
mfabian@taka:~
$

but not when using the en_GB dictionary:

mfabian@taka:~
$ grep -i mike /usr/share/myspell/en_GB.dic 
mike/DMGS
mfabian@taka:~
$

It looks like I can do the case insensitive match in the dictionaries without measurable slowdown and at least that makes it possible to type “corme” or “Corme” and get “Cormeilles-en-Parisis” when using the French dictionary:

mfabian@taka:~
$ grep Paris /usr/share/myspell/fr_FR.dic 
Cormeilles-en-Parisis
Paris
Seyssinet-Pariset
Tout-Paris
mfabian@taka:~
$

mike-fabian commented 3 years ago

Would a “halfway case insensitive” solution, i.e. case insensitive in the dictionaries but not in the database as described above be good?

I tend to think this is better than nothing.

psads-git commented 3 years ago

Yes, Mike, a “halfway case insensitive” solution would definitely be better than nothing. And to use case insensitiveness in the database is, as is very clear, impractical.

mike-fabian commented 3 years ago

I just noticed that case insensitive matching in the database is even possible with no loss in performance at all if I do it the same way I do it for accent insensitive matching.

Accent insensitive matching in the database is currently done like this:

Before the input the user has typed (input_phrase) is saved to the database all accents from the user input are removed, i.e. the input_phrase is always stored without accents in the database.

If the user then types something again, the accents are removed from the input and then the match against the database is done.
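A common way to strip accents in Python is Unicode decomposition followed by dropping the combining marks. This is only a sketch of the general technique: `remove_accents` here is a stand-in, and Typing Booster's own routine may differ, for example in its language dependent exceptions.

```python
import unicodedata

def remove_accents(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        char for char in decomposed if not unicodedata.combining(char))
    return unicodedata.normalize("NFC", stripped)

# The same normalisation is applied when saving and when matching,
# so "dé" and "de" end up as the same lookup key.
print(remove_accents("déjà"))
print(remove_accents("dé") == remove_accents("de"))
```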

This works well and does not cause any loss in performance. The disadvantage is of course that this is an option one cannot switch immediately. When I make accent sensitive matching in the database configurable as discussed in

https://github.com/mike-fabian/ibus-typing-booster/issues/231

Then a change in that option will only be effective for new input after that option was changed. Old rows in the database cannot be changed anymore, they would stay as they were.

One could do the same with upper and lower case: Always store input_phrase in lower case in the database, always convert user input to lower case before matching.

That would be fast.

But if this would be added as an option in the setup tool, then changing that option would only have an effect on new input.

psads-git commented 3 years ago

Thanks, Mike, but I am not sure whether it will work. Suppose that one wants to write

Mike

and types mike. How can ibus-typing-booster automatically transform mike into Mike?

My idea of case-insensitive search was to avoid typing uppercase letters mainly in names.

I believe that you are thinking of something like the following: One types

Mi (notice the capital M)

and ibus-typing-booster will suggest

Mike

This would be great progress though, as it would allow a substantial size-reduction of the database while maintaining the same prediction performance.

mike-fabian commented 3 years ago

> Thanks, Mike, but I am not sure whether it will work. Suppose that one wants to write
>
> Mike
>
> and types mike. How can ibus-typing-booster automatically transform mike into Mike?

That can be done easily in Python:

mfabian@taka:~
$ python3
Python 3.10.0 (default, Oct  4 2021, 00:00:00) [GCC 11.2.1 20210728 (Red Hat 11.2.1-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 'Mike'.lower()
'mike'
>>> 'Mike'.upper()
'MIKE'
>>> 'mike'.title()
'Mike'
>>> 'mike'.upper()
'MIKE'
>>> 'mIKe'.upper()
'MIKE'
>>> 'mIKe'.lower()
'mike'
>>> 'mIKe'.title()
'Mike'
>>>

> My idea of case-insensitive search was to avoid typing uppercase letters mainly in names.
>
> I believe that you are thinking of something like the following: One types
>
> Mi (notice the capital M)
>
> and ibus-typing-booster will suggest
>
> Mike

I think that you can type either Mi or mi and in both cases you will get the suggestion Mike.

That is almost the same as with the accents, for example my database currently contains:

sqlite> select * from phrases where input_phrase  == "deja";
426586|deja|déjà|tu|As|1|1636959600.50043
426719|deja|déjà assisté|tu|as|1|1636960347.05661
427645|deja|déjà|||2|1636983062.65822
sqlite>

The input_phrase column always contains deja without the accents, even when I actually typed it with the accents.

When saving to the database, that column is always saved without the accents (there are some language dependent exceptions, we talked about these before, but let’s ignore these for this explanation). And from the user input the accents are also removed before attempting a match in the database. That means no matter whether the user types de or dé, both will match the above rows.

One could do the same with upper and lower case. For example when the user currently types Mi and selects Mike, or types mi and selects Mike, that would create two database entries like this:

426586|Mi|Mike|||1|1636959600.50043
426719|mi|Mike|||1|1636960347.05661

If the user input was always converted to lower case when saving to the database and when matching against the database, one would have only one entry with count 2 like this:

426719|mi|Mike|||2|1636960347.05661

and typing Mi would still match that row because Mi would be converted to mi before trying the match.
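The merge described above can be sketched with a simplified table. The helper names `remember` and `lookup` are hypothetical, and the context columns for the previous words that the real rows contain are left out; `PRAGMA case_sensitive_like = true` is set so that the match itself stays case sensitive.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA case_sensitive_like = true")
conn.execute(
    "CREATE TABLE phrases (input_phrase TEXT, phrase TEXT, "
    "user_freq INTEGER, timestamp REAL)")

def remember(typed, phrase):
    """Store the typed input in lower case, counting repeats in user_freq."""
    key = typed.lower()
    cursor = conn.execute(
        "UPDATE phrases SET user_freq = user_freq + 1, timestamp = ? "
        "WHERE input_phrase = ? AND phrase = ?", (time.time(), key, phrase))
    if cursor.rowcount == 0:
        conn.execute(
            "INSERT INTO phrases VALUES (?, ?, 1, ?)",
            (key, phrase, time.time()))

def lookup(typed):
    """Lower the input first, then do the usual case sensitive prefix match."""
    return conn.execute(
        "SELECT phrase, user_freq FROM phrases WHERE input_phrase LIKE ? "
        "ORDER BY user_freq DESC", (typed.lower() + "%",)).fetchall()

remember("Mi", "Mike")   # typed with a capital M
remember("mi", "Mike")   # typed in lower case
print(lookup("Mi"))
print(lookup("mi"))
```

Both inputs end up in one row with count 2, and either spelling finds it again.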

But switching that option would only have an effect on new entries. The old ones would very slowly disappear because of the expiry.

Theoretically one could convert the database to all lower case input_phrase automatically when the option case insensitive matching is switched on.

But when it is switched off again, there would be no way to automatically convert it back as some information was lost when doing the conversion to lower case.

I think it might be better not to attempt any conversion of the existing rows at all and let the value of that option only have an effect on new rows.
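To illustrate why such a conversion would be one-way, here is a sketch of what lower-casing the existing rows could look like. It is not something the project does; the table layout is simplified and `py_lower` is a name chosen here. SQLite's built-in lower() only folds ASCII letters by default, which is why a Python function is registered instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Register Python's Unicode-aware lower() under a hypothetical SQL name.
conn.create_function(
    "py_lower", 1, lambda text: text.lower() if text is not None else None)
conn.execute(
    "CREATE TABLE phrases (input_phrase TEXT, phrase TEXT, user_freq INTEGER)")
conn.executemany(
    "INSERT INTO phrases VALUES (?, ?, ?)",
    [("Mi", "Mike", 1), ("mi", "Mike", 1), ("Мо", "Москва", 1)])

conn.execute("UPDATE phrases SET input_phrase = py_lower(input_phrase)")
conn.commit()

rows = conn.execute(
    "SELECT input_phrase, phrase, user_freq FROM phrases "
    "ORDER BY rowid").fetchall()
print(rows)
```

After the update the first two rows are indistinguishable, so nothing in the table records which one was originally typed as Mi. This sketch also does not merge the duplicates into a single row with a summed count; a real migration would have to decide how to do that.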

> This would be great progress though, as it would allow a substantial size-reduction of the database while maintaining the same prediction performance.

I am not sure whether this is a good idea or not, probably I would make it an option if I implement this. I think the reduction in the number of rows would not be substantial, there are not so many words which usually start with upper case in most languages. There are more in German because all nouns start with upper-case in German, but even in the case of German I think the number of rows which would be merged into one because of this would be rather small.

Whether prediction performance gets better or worse is partly a matter of opinion. If you type more exactly, the prediction can be more exact. If you always type upper-case letters correctly, then the prediction has fewer candidates to choose from. It is the same as with the accents, I think. If you type de now, words starting with de and dé can be predicted because we use accent insensitive matching. Accent sensitive matching would reduce the number of matching candidates.

This is exactly the same with case insensitive matching, it increases the number of matching candidates.

If there are more candidates, it may take more time to scroll through the candidate list and select the correct one.

So in the long run, both accent insensitive matching and case insensitive matching should be options. Both on by default probably.

psads-git commented 3 years ago

Thanks, Mike. It is not only names that need to be capitalized, but all first words after a period! Example:

This is a period. Now is the first word after period.

So, the need of capitalization is very recurrent!

mike-fabian commented 3 years ago

There is an autocapitalization feature after sentence endings.

psads-git commented 3 years ago

Thanks, Mike, for reminding me of that feature, which I had, meanwhile, forgotten about.

mike-fabian commented 3 years ago

By the way, with the current database limited to 50000 rows, doing case insensitive matching by using

 PRAGMA case_sensitive_like = false;

causes a slowdown of about a factor of 2. Enough that I can notice the slowdown while typing.

The advantage of this method is of course that one could switch the option in an instant and it would have an effect on all rows existing in the database.

Whereas the other method of storing input phrases only in lower case in the database and converting each new input to lower case before doing the matching would affect only new rows.

1. Use PRAGMA case_sensitive_like = false;
   - Disadvantage: Slowdown by a factor of 2
   - Advantage: If this is made into an option, it can be switched immediately
2. Use python lower()
   - Disadvantage: If this is made into an option, switching the option affects only new rows
   - Advantage: No slowdown
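The two methods can be compared with a small timing harness like the one below. It is a sketch on synthetic data, not a reproduction of the measurements in this thread: the table layout, the index and the row count are assumptions, and the timings will vary by machine. One plausible contributor to the difference, going by SQLite's documentation of its LIKE optimization, is that an ordinary (BINARY) index can only serve a prefix LIKE when case_sensitive_like is enabled.

```python
import sqlite3
import time

def build(rows, lowercase_keys):
    """Create an in-memory table, optionally storing keys in lower case."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE phrases (input_phrase TEXT, user_freq INTEGER)")
    conn.execute("CREATE INDEX idx_input ON phrases (input_phrase)")
    conn.executemany(
        "INSERT INTO phrases VALUES (?, ?)",
        [((key.lower() if lowercase_keys else key), freq)
         for key, freq in rows])
    conn.commit()
    return conn

def timed_sum(conn, prefix, case_sensitive_like):
    """Return (summed user_freq, elapsed seconds) for one prefix query."""
    conn.execute(
        "PRAGMA case_sensitive_like = "
        + ("true" if case_sensitive_like else "false"))
    start = time.perf_counter()
    total = conn.execute(
        "SELECT coalesce(sum(user_freq), 0) FROM phrases "
        "WHERE input_phrase LIKE ?", (prefix + "%",)).fetchall()[0][0]
    return total, time.perf_counter() - start

# Synthetic keys, half typed with a capital T and half in lower case.
rows = [(("Th" if i % 2 else "th") + "word" + str(i), 1) for i in range(20000)]

# Method 1: mixed-case keys, case insensitive LIKE.
method1 = timed_sum(build(rows, lowercase_keys=False), "Th", False)
# Method 2: lower-cased keys, lower-cased input, case sensitive LIKE.
method2 = timed_sum(build(rows, lowercase_keys=True), "Th".lower(), True)
print(method1, method2)
```

Both methods find the same rows here; only the elapsed times differ.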

I tend to think 2. is better because one will probably not switch that option very often. One will probably figure out what setting one likes and then just keep that. So one will have the disadvantage that it affects only new rows for a limited time but will have the speed advantage forever.

And I am not even sure whether I want to make this into an option, maybe just do case insensitive matching always with no option to switch to case sensitive matching. I don’t know yet whether I should add this as an option or not.

First I’ll make a build with case insensitive matching now with no option to switch it off and let you test and hear your opinion.

mike-fabian commented 3 years ago

While doing the case insensitive match according to method 2, I found this small problem when learning by reading from text files which contain accented words:

https://github.com/mike-fabian/ibus-typing-booster/issues/252

psads-git commented 3 years ago

Thanks, Mike. I also prefer the option 2 (python lower()). A power user can change all database entries to lowercase...

And I do not see any important reason to leave it up to the user whether the search is case-insensitive or case-sensitive.

psads-git commented 3 years ago

> While doing the case insensitive match according to method 2, I found this small problem when learning by reading from text files which contain accented words:
>
> https://github.com/mike-fabian/ibus-typing-booster/issues/252

That is not a big problem, Mike!

mike-fabian commented 3 years ago

> And I do not see any important reason to leave it up to the user whether the search is case-insensitive or case-sensitive.

I think I also prefer not to add an option to switch this off. Let’s see what we think after testing it for a while.

I already implemented option 2 and found that it does not cause any slowdown. (Actually it was even slightly faster in my measurement, but that was well within measurement error. I think it certainly is a tiny bit slower, but not enough to be easily measurable.)

psads-git commented 3 years ago

Great that you have that already implemented, Mike!

mike-fabian commented 3 years ago

https://copr.fedorainfracloud.org/coprs/mfabian/ibus-typing-booster/builds/ has 2.14.20 builds now which have case insensitive matching both in the dictionaries and in the database and they also have a fix for

https://github.com/mike-fabian/ibus-typing-booster/issues/252

psads-git commented 3 years ago

Thanks, Mike. I will try and will let you know whether something goes wrong.

This may not be related, but the option auto capitalize does not work when typing one's answer here in GitHub.

mike-fabian commented 3 years ago

> This may not be related, but the option auto capitalize does not work when typing one's answer here in GitHub.

It works even here after a sentence end character. I.e. if you type test . test you will get test. Test.

It does not work if you move the focus with the mouse to some other window and come back to the github comment and place the cursor after a sentence end character. Neither does it work if you move the cursor with the arrow keys. And it doesn't work if you place the cursor at the beginning of the github comment field. All these use cases when changing the cursor position would need well working surrounding text. Typing Booster must be able to read the surrounding text and find out whether there is a sentence end character or the beginning of the entry box left of the cursor. This does not work well with the current broken surrounding text support.

Currently, self._new_sentence is set to True only when a non-empty commit ends with a sentence end character:

        if not commit_phrase.isspace():
            # If the commit space contains only white space
            # leave self._new_sentence as it is!
            self._new_sentence = False
            if itb_util.text_ends_a_sentence(commit_phrase):
                self._new_sentence = True

And it is reset to False again in do_reset():

    def do_reset(self) -> None:
        '''Called when the mouse pointer is used to move to cursor to a
        different position in the current window.

        Also called when certain keys are pressed:

            Return, KP_Enter, ISO_Enter, Up, Down, (and others?)

        Even some key sequences like space + Left and space + Right
        seem to call this.

        '''
        ...
        self.clear_context()
        ...

and in do_focus_in():

    def do_focus_in(self) -> None:
        '''Called when a window gets focus while this input engine is enabled

        '''
        ...
        self.clear_context()
        ...

Both do_reset() and do_focus_in() call clear_context(), which resets self._new_sentence:

    def clear_context(self) -> None:
        '''Clears the context stack which remembers the last two words typed
        '''
        if DEBUG_LEVEL > 1:
            LOGGER.debug(
                'context=“%s” “%s” “%s”',
                self._ppp_phrase, self._pp_phrase, self._p_phrase)
        self._ppp_phrase = ''
        self._pp_phrase = ''
        self._p_phrase = ''
        self._new_sentence = False

There is a get_context() which tries to get the context again using surrounding text. get_context() currently doesn’t attempt to update self._new_sentence. I didn’t try that yet because surrounding text is so unreliable.

So currently it works only if you commit something which ends with a sentence end character and then do not move the cursor and continue typing the next word.
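The behaviour described above can be modelled in a few lines. This is a stand-alone toy, not the engine code: the class `SentenceTracker`, its method `next_word` and the set of sentence end characters are assumptions made for the illustration.

```python
# Assumption: a small set of sentence end characters; the real set may differ.
SENTENCE_END = (".", "!", "?")

class SentenceTracker:
    """Toy model of the new-sentence flag and the context reset."""

    def __init__(self):
        self.new_sentence = False

    def commit(self, commit_phrase):
        # A commit containing only white space leaves the flag untouched.
        if not commit_phrase.isspace():
            self.new_sentence = commit_phrase.rstrip().endswith(SENTENCE_END)

    def clear_context(self):
        # Stands in for what happens on reset or focus-in.
        self.new_sentence = False

    def next_word(self, typed):
        """Capitalize the next word only while the flag is set."""
        if self.new_sentence:
            return typed[:1].upper() + typed[1:]
        return typed

tracker = SentenceTracker()
tracker.commit("test. ")
print(tracker.next_word("test"))   # flag is set, word gets capitalized
tracker.clear_context()            # e.g. the cursor was moved
print(tracker.next_word("test"))   # flag is gone, word stays as typed
```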

This should be improved of course but to really improve this, first surrounding text must work.

psads-git commented 3 years ago

That is fine, Mike, as one can write one's answers in gedit and paste them here.

mike-fabian commented 3 years ago

But in gedit the problem is the same. It works only if you don’t move the cursor.

psads-git commented 3 years ago

Thanks, Mike, but I had never noticed that!

mike-fabian commented 3 years ago

I just noticed that reopening preëdits works in firefox entry fields (for example here in github comments). In gedit it stopped working after recent ibus updates.

psads-git commented 3 years ago

> I just noticed that reopening preëdits works in firefox entry fields (for example here in github comments). In gedit it stopped working after recent ibus updates.

I can confirm that, Mike!

mike-fabian commented 3 years ago

I think the case insensitive match is an improvement, after testing this for a while I like it better than the old behaviour.

Maybe I could make an official release now...

psads-git commented 3 years ago

I also find the case-insensitive search an improvement, Mike.

For now, I have only one suggestion to offer. I use inline completion, but when all suggestions come from dictionaries, no inline completion is done. What do you think about this, Mike?

mike-fabian commented 3 years ago

I think this has nothing to do with whether the completions come from dictionaries or not.

Do you have a specific example where you think it doesn’t work as you think it should?

psads-git commented 3 years ago

For instance, see the screenshot below. Why is longe not offered as an inline completion?

[Screenshot: inline]

mike-fabian commented 3 years ago

First of all longe is not a dictionary completion, it is from the user database. You see that because of the black colour and the star ⭐. longe might also be in the dictionary, but apparently you typed it before already so it is in the user database.

Inline completions are only shown when the first candidate is an exact continuation of what you typed:

        if (not first_candidate.startswith(typed_string)
            or first_candidate == typed_string):
            # The first candidate is not a direct completion of the
            # typed string. Trying to show that inline gets very
            # confusing.  Don’t do that, show standard lookup table:
            self.update_lookup_table(self.get_lookup_table(), True)
            self._update_preedit()
            return

And in that case it is not an exact continuation of what you typed, you typed an uppercase L and the completion has a lower case l.

(Before the case insensitive match, this wouldn't even have matched. The first match would have been Lon then, but this would not have been shown as an inline completion either because there would have been nothing to complete: the preedit would have been exactly equal to the first candidate already in that case.)

I found it far too confusing when an inline completion changes anything the user has typed already. If such a change is there, better show the lookup table and have a closer look.
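The condition quoted above can be restated as a small predicate and checked against the examples from this thread. The function name `can_complete_inline` is made up here; the logic is just the negation of the quoted `if`.

```python
def can_complete_inline(typed_string, first_candidate):
    """True only if the candidate strictly extends what was typed."""
    return (first_candidate.startswith(typed_string)
            and first_candidate != typed_string)

print(can_complete_inline("ratato", "ratatouille"))  # exact continuation
print(can_complete_inline("Ratato", "ratatouille"))  # case differs
print(can_complete_inline("Lon", "longe"))           # case differs
print(can_complete_inline("Lon", "Lon"))             # nothing to complete
```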

mike-fabian commented 3 years ago

In this video, I checked on the command line that there is neither ratat nor Ratat in the user database.

Then I type ratato and get ratatouille as an inline completion. It obviously comes from the dictionary: I confirmed already that it is not in the user database, and in the candidate list it is shown in gray and with the 📖 emoji.

Then I empty the preedit with Backspace and type Ratato and do not get an inline completion but see that the first two candidates in the lookup table are

ratatouille
ratatouiller

This is because of the capital R in my input.

https://user-images.githubusercontent.com/2330175/141991259-cb4184d9-e803-4edb-9aa0-b3de3268c7a4.mp4

psads-git commented 3 years ago

OK, Mike. After having thought a bit about that matter, I have not come up with any improvement.

I think you can go ahead with the new official release!

mike-fabian commented 3 years ago

Theoretically I could change inline completion to allow changes in case or accents. If there is a bigger change than that by doing the inline completion, i.e. something like adding or removing letters or transposing letters, it is a serious spellchecking issue and changing this automatically by inline completion would be surprising. I am very upset when typing on the phone using SwiftKey and it replaces what I typed with something else. I have tried to disable this kind of “auto correction”.

But maybe one could make a case that changes from upper-case to lower case or adding or removing accents when doing the inline completion is “mostly harmless” and could be done without too many surprises. I guess it would still annoy me more than helping me but maybe I’ll try that sometime in the future.

Currently egalit, Egalit, Ègalit will not complete inline to ègalitè. ègalitè will be among the candidates in the lookup table though.

Wouldn’t it be weird if you type Ègalit and suddenly get the black part of the preedit change to ègalit followed by a gray è? This might destroy a capital letter one has typed on purpose, because one is starting a new sentence for example, and it would also interfere with auto-capitalization, sometimes lowercasing something to offer an inline completion when it was capitalized on purpose after a “.”. One could try to avoid that by doing a case changing inline completion only if self._current_case_mode == 'orig', i.e. when the case has not been changed from what the user typed by auto-capitalization.
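A relaxed check along these lines might look like the sketch below. It is hypothetical and not implemented anywhere in this thread: `fold` and `can_complete_inline_relaxed` are invented names, and the only case mode value taken from the discussion is 'orig'.

```python
import unicodedata

def fold(text):
    """Lower-case the text and strip combining accents."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(
        char for char in decomposed if not unicodedata.combining(char))

def can_complete_inline_relaxed(typed_string, first_candidate,
                                case_mode="orig"):
    """Allow inline completion across case or accent differences only."""
    if (first_candidate.startswith(typed_string)
            and first_candidate != typed_string):
        return True  # exact continuation, the current behaviour
    if case_mode != "orig":
        return False  # case was changed by auto-capitalization, keep it
    folded_typed = fold(typed_string)
    folded_candidate = fold(first_candidate)
    return (folded_candidate.startswith(folded_typed)
            and folded_candidate != folded_typed)

print(can_complete_inline_relaxed("Ègalit", "ègalitè"))
print(can_complete_inline_relaxed("Lon", "longe"))
print(can_complete_inline_relaxed("Lon", "longe", case_mode="other"))
print(can_complete_inline_relaxed("Lxn", "longe"))
```

Whether accepting such a completion should then rewrite the already typed part of the preedit is exactly the open usability question raised above; the sketch only decides when a completion would be offered.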

When I first implemented inline completion and did not yet have the

        if (not first_candidate.startswith(typed_string)
            or first_candidate == typed_string):
            # The first candidate is not a direct completion of the
            # typed string. Trying to show that inline gets very
            # confusing.  Don’t do that, show standard lookup table:
            self.update_lookup_table(self.get_lookup_table(), True)
            self._update_preedit()
            return

it was extremely confusing and annoying to get the preedit changed to the first candidate always even if that changed the spelling a lot.

So I had to add the above limitation to make it usable. Maybe the current limitation is too strict and there are a few more circumstances where it might be useful to allow inline completion.

Would you really want that your input Lon gets completed to lon + ge?

Somehow I doubt this would be useful, but maybe I have to try it and see how it feels.

psads-git commented 3 years ago

Mostly, I agree with you Mike, but if the user writes

Lon

why are not all suggestions capitalized?

mike-fabian commented 3 years ago

> Mostly, I agree with you Mike, but if the user writes
>
> Lon
>
> why are not all suggestions capitalized?

Adapting the case of the suggestions to the case the user typed may result in suggesting the wrong case.

This is also a difficult problem.

When I type portuguese in English, it is probably wrong and should be capitalized to Portuguese. In English this is written in upper case no matter whether Portuguese is used as a noun or as an adjective.

In German however it depends on whether it is used as an adjective or as a noun:

“Ich spreche kein Portugiesisch.” (noun) “Ich spreche nicht portugiesisch.” (adjective)

So when typing German, adapting the suggestion to the case the user typed will probably not make it worse, both could be correct.

But in English, when typing portuguese, adapting the suggestion Portuguese to the lower case the user typed makes it worse.

So whether suggestions should be converted automatically to the case typed by the user is unfortunately also a difficult question. I am not sure if they should. Always? Or only in some cases? If only in some cases, when exactly?

psads-git commented 3 years ago

In my opinion, Mike, if the user starts typing the word with a capital letter, the suggestions should always be capitalized words. However, if the user starts typing in lowercase, then the suggestions should mix (if appropriate) capitalized words with non-capitalized ones.

To type a capital letter is more costly than to type a non-capitalized one. Consequently, when a user chooses to start typing with an uppercase letter, the user means that only capitalized words are wanted.
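The proposal can be written down as a one-directional rule: capitalize candidates when the input starts with a capital letter, and otherwise leave them alone. The sketch below is only an illustration of that rule; `adapt_candidates` is a hypothetical name.

```python
def adapt_candidates(typed, candidates):
    """Capitalize all candidates if the input starts with an upper-case letter."""
    if typed[:1].isupper():
        capitalized = [cand[:1].upper() + cand[1:] for cand in candidates]
        # Capitalizing can create duplicates such as "mike" and "Mike".
        return list(dict.fromkeys(capitalized))
    # Lower-case input: keep the mix of cases as it is.
    return list(candidates)

print(adapt_candidates("Lon", ["longe", "London"]))
print(adapt_candidates("Mi", ["mike", "Mike"]))
print(adapt_candidates("portu", ["Portuguese", "portugal"]))
```

Because candidates are never lower-cased, typing portuguese still leaves the suggestion Portuguese as it is, which avoids the English problem mentioned earlier in the thread.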

Does this make sense, Mike?

mike-fabian commented 3 years ago

> In my opinion, Mike, if the user starts typing the word with a capital letter, the suggestions should always be capitalized words. However, if the user starts typing in lowercase, then the suggestions should mix (if appropriate) capitalized words with non-capitalized ones.
>
> To type a capital letter is more costly than to type a non-capitalized one. Consequently, when a user chooses to start typing with an uppercase letter, the user means that only capitalized words are wanted.
>
> Does this make sense, Mike?

Maybe, I’ll think about it.

psads-git commented 3 years ago

OK, Mike. Thanks!

mike-fabian commented 3 years ago

I opened the new issue https://github.com/mike-fabian/ibus-typing-booster/issues/253 for this so that this idea does not get lost.

mike-fabian / ibus-typing-booster

Should search be case-sensitive? #251
