sailfish-keyboard / sailfishos-presage-predictor

Presage based input predictor for the Sailfish OS
https://openrepos.net/content/sailfishkeyboard/maliit-plugin-presage
GNU General Public License v3.0
7 stars 4 forks source link

General discussion #16

Closed rinigus closed 6 years ago

rinigus commented 6 years ago

I am opening this issue for general discussion and updates/notes. Please feel free to close it at any time or move it to some other channel, as it is the best. Right now, we have been using 'English keyboard' issue for such discussion, maybe its better to make something more dedicated.

As a heads up, after build environment adjustments, I have updated presage library at https://github.com/rinigus/presage to

I have worked a bit on improving the performance, so far about 20% gain. I have plans on how to improve it further by using other database format, let's see if it will work.

I will read now the plugin code and will make probably some small adjustments. Will submit PR when ready.

martonmiklos commented 6 years ago

I am fine with this issues to be used as a general discussion.

20% is a great improvement! Have you implemented the index based lookup mentioned by the author in the TODO or used different approaches?

Thanks for working on it I would happily help, but now I am not well equipped with free time :(

rinigus commented 6 years ago

20% was relatively easy - presage once in a while asks for sum of counts for all n-grams. By caching it, you avoid full database scan.

I'll have to look at TODO and see the earlier plans

rinigus commented 6 years ago

I have a working prototype which is using MARISA together with counts file as n-gram database. The speed on my test case is about 3.8x faster than the original Presage version. Total benchmark run went from 90s to 25s with 15s spent in UserSmoothedNgramPredictor.

In addition of being faster, the prototype MARISA based database is also smaller: 3.4MB vs 20MB of the same data in SQLite format.

I will work on it a bit more now to change few things in the code before merging into the master branch. Then the import tools will be needed and we could after that test it on devices as well.

martonmiklos commented 6 years ago

It sounds awesome!

rinigus commented 6 years ago

And the first test results on device are very good! I generated Estonian database using 1.7GB text and text2ngram leading to 16GB n-gram database in SQLite format. From that SQLite database, I made MARISA database by applying threshold of 10 counts (only ngrams with 11 or more counts made a cut) leading to the database

Loading n-gram: 1
Number of n-grams to load: 515218
Range of n-gram counts: 7041791 11

Loading n-gram: 2
Number of n-grams to load: 1898011
Range of n-gram counts: 488578 11

Loading n-gram: 3
Number of n-grams to load: 833865
Range of n-gram counts: 28505 11

Data loaded, saving
Keys:  3247094

with 3247094 n-grams. Size is 24MB for that database

When used with marisa-branch SFOS predictor, it felt very snappy and predicted quite well.

I will have to write docs on how to package dictionaries for SFOS and then merge with the SFOS predictor changes to make MARISA the default predictor. As for how to generate MARISA database itself, this is documented in https://github.com/rinigus/presage/blob/master/README.md

martonmiklos commented 6 years ago

Sounds great!

Only one thing came into my mind: what about the memory usage?

rinigus commented 6 years ago

Good question!

maliit-server memory consumption is

With Estonian predictor: 318MB VIRT / 48MB RES / 24MB shared No predictors (after restart of keyboard): 292MB VIRT / 33MB RES / 24MB shared

The database is split into MARISA trie (that maybe loaded into RAM, 13MB) and counts file (11MB) which is MMAP'ed into RAM.

From these numbers, I would say consumption is reasonable.

rinigus commented 6 years ago

PS: I have added stats collection for maliit-server to collectd configuration. This will let us monitor maliit server over longer time and see if there are any issues with it as well as access its RAM consumption better

rinigus commented 6 years ago

Since I monitor maliit-server by collectd, I can report that during the last 2 days max RSS memory usage was 60.6MB. That is with this Estonian database.

martonmiklos commented 6 years ago

Sounds great!

Do you have any hints for release plans? I have seen your Marisadb PR, I would like to test drive it and create a new release based on that.

Before the release I would like to sort out this issue at least https://github.com/rinigus/presage/issues/11

And I have had an email what I could not type through a point because the keyboard is crashed always. I will try to reproduce that as well and see in gdb what happends. BTW. if you have some guidance how to attach the SDK's QtCreator's gdb to the presage library/ the predictor that would be great.

rinigus commented 6 years ago

I think we should be getting close to the next release. However, it will be more clear after you tested Marisa version.

I don't know how to attach SDK's gdb to presage library. I presume that you need to attach it to maliit-server. What you could do though, is to get that email text and test if presage is crashing on desktop through src/tools/presage_simulator. The similar approach should help with https://github.com/rinigus/presage/issues/11 - try to use the same databases and start typing in presage demo program on desktop with all predictors in DEBUG mode.

As for blocking issues:

Anything else is a bonus. The release doesn't stop the development. I would like to address https://github.com/martonmiklos/sailfishos-presage-predictor/issues/9 and, after that, https://github.com/rinigus/presage/issues/4 . But we all have our priorities.

rinigus commented 6 years ago

I have adopted NLTK for generation of n-gram database for English. It seem to work quite well and I can get words like it's predicted. See updated readme in https://github.com/rinigus/presage and scripts in utils for details.

It seems to me that we will have to adjust tokenization by Presage (or Maliit?) to each language separately. Such adjustments shouldn't be too hard, I just would have to find the right knobs in the code. But it will be more clear when the feedback will start floating in regarding errors in predictions.

As it is, on my part, I would be happy with the release of the current state. That would allow users to start testing and give feedback. Let me know how does your testing go and when you are ready. I'll probably setup also language support package at OBS to let users get development versions easily.

rinigus commented 6 years ago

And an update, I managed to get English and Estonian keyboards and language support into OBS as well. So, for testing, you could just install English layout and language support together with Maliit plugin from http://repo.merproject.org/obs/home:/rinigus:/keyboard/sailfish_latest_armv7hl/

ljo commented 6 years ago

@rinigus Great! I am also still looking at the language model and tokenization per language.

rinigus commented 6 years ago

@ljo: nice! In which context do you look into LM and tokenization? To generate ngram database or use it in presage? For ngram database generation, I am using NLTK now with relatively simple script, slightly altered tokenization to keep contracted words (it's didn't ...) which worked for English (allowing to use it for presage as well). Or do you look how to make tokenizer in presage language-specific?

ljo commented 6 years ago

@rinigus yes, the tokenization need to be same both in n-gram generation and in use of it in library/keybord like I wrote in mid-December. So this must be a language specific configuration even if many languages can share. I started with the tokenizer in presage, but any generator capable of csv/tsv should use a configured tokenizer otherwise there will be missmatch in some places. There are also other areas where the LM should be language specific both in the keyboard part and in the library. I am also working on improving the suggestions since they are too strict wrt keeping already typed characters.

rinigus commented 6 years ago

@ljo - awesome! Would you mind to base your changes on https://github.com/rinigus/presage fork? This will allow you to use much faster and compact database for n-grams storage and keep our work easily combined via merging. Although, if you want to generate the database via text2ngram, you would have to use SQLite as intermediate step before generating MARISA database (that's documented in README.md).

I guess, if tokenization is right, then there is no problem with mixing it. The problems will arise if its wrong :) . But its great that you look into it and we are all looking forward to better tokenizer.

I will start then working on async mode of the plugin (https://github.com/martonmiklos/sailfishos-presage-predictor/issues/9) that should allow me later to implement predictor based on a speller (https://github.com/rinigus/presage/issues/4) - I expect it to be a bit more processing requiring and, as a result, slower. After that, I'll look into the next outstanding issues around the plugin

ljo commented 6 years ago

@rinigus I will base it on your fork. I have followed the discussion and progress, but have not had much time for commenting and interaction for a while. Sounds great!

rinigus commented 6 years ago

Hi

one aspect that hit me while working on Hunspell based predictor: we specify the keyboard by language only: HU, EN, ... Actually, we should use hu_HU, en_US and so on instead as a languageCode in the keyboard configuration file.

I guess such syntax is supported, just would have to test it

martonmiklos commented 6 years ago

By seeing that my free time is one of the narrow cross-section in the process of moving forward with bringing open source prediction solution to SFOS I would propose to create an organisation here on Github for coordinating development. Presage-SFOS-hacking or similar. In this case we could still do PR based code reviews, but it won't be necessary to always waiting for me.

@ljo @rinigus what do you think?

rinigus commented 6 years ago

Its quite late over here to discuss this, let me think about it. It seems like a good idea, but not due to the limited amount of your time - that can strike each of us any moment, but since it will allow us to setup a single place with all relevant repositories. We'll probably want to add separate repo for configuration dialog as well. It will also ensure that my presage commits are reviewed, surely good thing!

I'll have to sleep now. Just wanted to let you folks know that I managed to add Hunspell predictor to Presage and hook it to the keyboard plugin. It works on top of async, so we don't see any sluggishness induced by it (if there is any). So, if you misspell the word, just keep typing and you should get speller predictions when its ready to figure it out and when there are no prefixed words in n-gram databases. I am using for that full language codes. So, for English, keyboard has to specify en_US in its conf file. Its available at OBS as well, under https://build.merproject.org/package/show/home:rinigus:keyboard:devel with Hunspell dictionaries for English (and Estonian).

But as for original question, let's think about it. I don't know how to set it up, though.

rinigus commented 6 years ago

If we are going to make such reorganization, we may as well propose UBTouch developers to join. There are probably more environments that use Maliit keyboard and they could be interested as well. So, this organization could have some more general name.

If we do it that way, we could still keep Sailfish code there as well, together with the code for other OSes.

rinigus commented 6 years ago

I looked into UBPorts keyboard and a little bit into Maliit. To be honest, I am a bit confused what are we exactly developing? :)

Looks like UBPorts keyboard has presage and hunspell support. These are also implemented as separate threads, as done in async PR. From the look at the source tree, its quite modular and has been incorporated into maliit repositories at https://github.com/maliit/keyboard .

The plugin that we work on here hooks to jolla-keyboard, which is proprietary, right? Should we consider switching to FOSS solution? Although that may require too much work, no idea about it.

rinigus commented 6 years ago

OK, I think I learned a bit regarding this infrastructure and now I do wonder what's the long term plan for the keyboard development for SFOS. I'll send the email around at sailfish devel list to get some feedback. I presume that you both (@martonmiklos and @ljo) are there as well :)

martonmiklos commented 6 years ago

I do wonder what's the long term plan for the keyboard development for SFOS

From my side I could describe the goal in the following manner: I do use SFOS every day with a "community suppored language" where the Xt9 is not available and I would have a proper text prediction. I know that we might never get the same usability as the Xt9, but we can try to do the best.

Given the number of SFOS users in Hungary I doubt that Jolla will ever add support in Xt9 for my language, so I looked for other ways. If I do things like this I usually try to make it scalable. In the current situation I mean under scalability that adding support for a random language is documented and possible easily. Having fun, and learning new technologies during the development is a bonus.

rinigus commented 6 years ago

I agree with that. Also ported devices don't have access to Xt9 - so, for me using 1+X, this is a great and interesting project. But its interesting to hear what's the long term plan for SFOS, maybe we can contribute to that as well .

martonmiklos commented 6 years ago

I think we could try to go down the same path with this like we did with the community translations. First just get it working (like on transifex at the beginning of the translations). Then polish it until a point when it get attention and even some kind of support (when Jolla provided us the Pootle). And then as the time goes it might land eventually the official releases for the community supported languages.

rinigus commented 6 years ago

I have just sent email to the list. Let's see what we get from there.

I agree, let's get it working and polish. The large fraction of the things we worked on (porting presage, understanding how it works, fixing, extending it, hooking to the keyboard) would had to be done anyway. As well as dealing with the bugs :)

Now a question - should I push hunspell branch on top of async or wait for you to test async till the end first? I have to change one small thing in hunspell support, but then it will be ready for submission.

rinigus commented 6 years ago

Looks like hunspell will have to wait. I have hit an issue with using Estonian: when using a word starting with Õ, Presage couldn't make a lower case letter out of it. Which is due to use of rather simple lower case algorithm.

I think we need full UTF8 support in Presage and I will look into it. It seems to me that we should move Presage internally to ICU strings and require that input is in UTF8 always. Any objections against it? Or should we always convert the strings from current locale to UTF8?

It will probably require to work through tokenizer as well. But I think that UTF8 would be handy there too

martonmiklos commented 6 years ago

I see no objections about making presage UTF-8 internally.

ljo commented 6 years ago

@rinigus Yes, we will probably hit a lot more of those issues (hunspell lowercasing algo etc) coming weeks. Getting it working and polishing iteratively is probably the best way forward. It is better to judge a switch UBPorts keyboard if we see where we are heading a bit clearer and how this initiative differs to theirs. I mean you mentioned a few points already, but a larger group will require more time on administration e.g. alignment and dialogue. Which I don't mind but preferably take later.

rinigus commented 6 years ago

@ljo: great to see the packaged swedish! I guess we'll need to do larger reorganization with these packages. I wonder if we can rename packages in openrepos or would have to make the new ones.

I have been reading on normalization of unicode strings (how to compare them) and stumbled on tokenizer that could possibly do a job for us as a part of ICU:

http://userguide.icu-project.org/boundaryanalysis http://userguide.icu-project.org/boundaryanalysis#TOC-Dictionary-Based-BreakIterator

I'll start on incorporating ICU for normalization and lowercase support, but we could also try to use it as a tokenizer. Or is there a better alternative for it?

I guess we need proper UTF support internally. Otherwise, it would work for English and will suck in Hungarian, Swedish and Estonian to name few languages :)

rinigus commented 6 years ago

@ljo: As for polishing and fixing - agreed! its needed for the presage part already irrespective to the keyboard engine that we use in the end.

rinigus commented 6 years ago

I have read up a bit on unicode and related subjects and here is what I learned. I am writing it up to get us on the same page. If you are familiar with the concepts, just jump towards the end :)

To do properly n-grams and search in them, we need to apply normalization, probably NFKC together with case folding (nice description at https://www.elastic.co/guide/en/elasticsearch/guide/current/case-folding.html). In addition, unicode string would have to be tokenized properly using tokenizer from some unicode library to avoid chopping the string in a middle of multibyte char. With the words detected, NFKC and case folding applied, we could search in n-gram database for occurrence of the n-grams fitting the prefix of the entered one. However, since case folding can lead to misspelled word (such as ß->ss), we would have to keep an correct NFKC representation of that word as well. That way, we would have in the database:

Table 1 | NFKC_case_folded ngrams -> number of times occurred, index of the last word
Table 2 | index of the last word -> word in correct NFKC form

Such approach should allow to support any language, as far as I can understand.

At present, Presage is targeting ISO 8859-1 which is a bit extended. For normalization it can use lowercase mode, but, as I found earlier, its far from perfect. In essence, any non ASCII char is kept as it is. With unicode multibyte chars that can probably mess up big time, so maybe lowercasing should be disabled for anything non-Latin.

The library that can do probably all is ICU. However, we would have to rip whole Presage apart by replacing all strings with UnicodeString. Then all processing of these should be replaced as well by the functions that would do similar things. As an input/output we can use UTF8 containing std::string, but internal processing is probably better to do using ICU.

In short, its a huge amount of work. Maybe we should focus on getting the first release out and decide which bugs have to be fixed before that. For hunspell I can make a simpler workaround by packaging dictionaries in UTF8 format since the strings are coming in it from SFOS. Maybe after release, someone with the experience in ICU and some worse supported language would like to help us out :)

ljo commented 6 years ago

@rinigus Yes, I am familiar with this and your analysis sounds about right. I have been using ICU for other projects, so if we can use it it will save us/(me) some time. About the procedure I agree we should focus on the first release and the bugs. The workaround for hunspell sounds safe within the SFOS context for now.

rinigus commented 6 years ago

@ljo: great that you've been using ICU already, I am not familiar with it. I can easily make an update to Marisa database that would work as described, but ICU expertise would be of great advantage when working on tokenization.

I can now confirm that packaging UTF8 hunspell dictionary indeed made Hunspell working as expected on SFOS. I'll polish it a bit and will ping when I'm ready. Then we can decide whether to submit it on top of Async or separately.

Any update on Async review status? Does it look (code) and feel fins or should I make some additional changes?

(typed as many other things these days using this plugin)

ljo commented 6 years ago

@rinigus It looks good and feels super. Typing a lot with it here too. :) Mostly it is the odd things in the suggestion handling and insertion that makes me react. So I look forward to your ping to test the hunspell predictor for other languages too. I agree we can then decide and start addressing the bugs (in suggestion handling) etc.

rinigus commented 6 years ago

@martonmiklos: "forgot" to reply regarding organization. If you want to do it, I will surely support it. I guess the main advantage would be keeping all projects together. In addition to sailfishos-presage-predictor and presage, we would probably have some configuration tool (https://github.com/martonmiklos/sailfishos-presage-predictor/issues/5) and can then think if we want to add other keyboard core as well. Maybe it would be good to keep n-grams also somewhere. Right now we are somewhat spread. Although, with only 3 of us, its not a major issue :) . (Would be good to keep the name relatively short though...)

I have now tuned hunspell plugin as well. So, presage has hunspell "predictor" that is very handy when you misspell the word. Its configured via its file basename (hunspell has 2 files, affix and dictionary, the both should have the same basename in this config to simplify the switch).

Now, since there is a distinction between language variants, I am using 'en_US' for specific English. This comes basically from https://github.com/rinigus/sailfishos-presage-predictor/blob/hunspell/src/presageworker.cpp#L104 and requires also that the keyboards specify their language in the same way. Such notation drops any lowercasing as was done earlier in the plugin and allows us to use any filesystem-allowed UTF8 sequence for keyboard language specification. From user's POV, we have "en_US" on the spacebar. Similar for any other language. I think we'll need it in the future for correct normalization using ICU as well (turkish i comes as a classic example used in all these normalization texts).

Hunspell, as mentioned above, requires UTF8 dictionaries. I have seen some at Github, but they were broken for Estonian. However, after converting system-installed myself to UTF8, I had no problems. See description at the end of https://github.com/rinigus/presage/blob/master/README.md

Hunspell builds are at https://build.merproject.org/project/show/home:rinigus:keyboard:devel .

Few questions:

martonmiklos commented 6 years ago

I have just read through the async implementation and LGTM got merged. As far as I see you have more experience with these stuff so I can only spot the low hanging fruits otherwise I have to delve myself to the docs.

rinigus commented 6 years ago

@martonmiklos: no worries, we do our best and its progressing nicely. Just think where we were a halfa a year ago, before you got the ball rolling :).

I'll prepare Hunspell for submission. Will start with the corresponding dictionaries so you could test. Will ping here as well when ready.

rinigus commented 6 years ago

Hi! Hunspell PR submitted and the packages are baked by our friendly OBS.

A bit annoying change that I had to introduce is the usage of full language spec (enUS) in all package names and configs. I hope that its the last change we have to do before releasing the packages in terms of package names. It was required to be able to differentiate between en and some other language variants. Seems unavoidable to me.

Now when you misspell the word, its adding it through hunspell predictor. If the word is spelled correctly, hunspell adds it to the "predictions" as well. That way it should give you a feedback regarding correct spelling too. Hunspell predictions are using fixed probability which is declying linearly along hunspell suggestions to keep the order the same. Seems to work decently well to me. Maybe probabilities have to be tuned, I used the value suggested for other dictionaries.

@ljo: I think the quality of predictions will depend on UTF8 support. Before that we are hit by not so good normalization before searching in ngram database (could influence its creation too).

Right now I am going to check for the bugs that you reported and see if I can reproduce them

rinigus commented 6 years ago

Before using the version with Hunspell, I would suggest to remove all presage packages. Then remove .local/share/presage and .presage (if you have any). After that check that you don't have old /usr/share/maliit/plugins/com/jolla/layouts/ with presage left. I had some and it created some confusion.

After cleanup, maybe reboot, and then install again. I have all working as expected on my device with the latest version. So far, no unexpected bugs (lowercasing of Õ will happen only with UTF8). In my eyes, its ready for preparation for release. If you see bugs, please report

rinigus commented 6 years ago

Morning! From Hunspell testing, any bug reports? Any bugs to fix blocking the release of the next version?

I guess before the release, @martonmiklos, it probably makes sense to remove all presage libraries and extra packages from OpenRepos (or just delete RPMs from those and state that they are not needed due to linking).

Do we go with the organization? If we do, it may be good to do that before release and merge all repositories in one place. Then we can collect all issues over there. Otherwise, let's advertise this repository as a main user's issue collector. If it's presage issue, we can move it as a part of resolution.

martonmiklos commented 6 years ago

I have not got to the point yet to install the hunspell stuff.

Organization: if we agree on the name I can create it any time.

rinigus commented 6 years ago

@martonmiklos - fair enough :) . just looking forward to the feedback...

As for the name, it should reflect what do we try to achieve. I presume its mainly focused of Sailfish keyboard. As examples of names:

I presume we'll move main development of this plugin and presage over to that organization. What else? Maybe, in future, we will want to take a shot on syncing with UBPorts keyboard to make fully OSS keyboard component under SFOS. That would probably fit nicely there. I would suggest to make repository for n-grams and hunspell dictionaries to keep them in a single place.

I would suggest from such scope to name it sailfish-keyboard . Or shorter: sfos-kbd ? Folks, maybe there is something better? Just doesn't pop up into my mind...

martonmiklos commented 6 years ago

+1 for sailfish-keyboard. Waiting for @ljo comments.

rinigus commented 6 years ago

Edited out since not relevant :)

martonmiklos commented 6 years ago

Yeah notifications are tricky :D

martonmiklos commented 6 years ago

I have created the sailfish-keyboard organisation. If better names will proposed we can rename it. I am trying to figure out how can I move repos under the organization.

rinigus commented 6 years ago

Great ! We always have option to playback all commits by merging with the current ones. But that will leave issues behind. Maybe there is a better way