feat(web): add support for OS-provided user dictionaries - Githubissues

keymanapp / keyman

Keyman cross platform input methods system running on Android, iOS, Linux, macOS, Windows and mobile and desktop web

https://keyman.com/

Other

403 stars 112 forks source link

feat(web): add support for OS-provided user dictionaries #11872

Open jahorton opened 5 months ago

jahorton commented 5 months ago

As originally listed on https://github.com/keymanapp/keyman/issues/5673:

OS integration:

[ ] #3987
[ ] #344

While we aren't doing predictive-text for those platforms quite yet, the sentiment is real and is something we can address for Android and iOS:

https://developer.android.com/guide/topics/providers/content-provider-basics#Basics
For iOS: UILexicon apparently allows us to offer suggestions from user's local dictionary (contacts, etc). May be helpful. But yes, iOS so much more limited than Android.

This would allow us to include a user's contacts (names) and such as predictive-text suggestions. Names generally aren't in lexical models, yet can often be frequently used - as a result, this would be quite helpful to users.

jahorton commented 4 months ago

According to https://ioactive.com/discovering-and-exploiting-a-vulnerability-in-androids-personal-dictionary/, there used to a permission we'd need to request when running under Android, but now it supposedly checks to see if the request is coming from an IME or spellchecker... somehow. I've followed the guide stuff I could see in the original link above, and I'm able to avoid it throwing errors... but so far, I'm only ever getting an empty dictionary.

jahorton commented 4 months ago

I figured I'd try a similar approach out, but with contacts data. The documentation I've found suggests that contact names are not automatically included in the user dictionary, even if I were able to get data from it, so we'd need to do so separately. It does require requesting additional permission - permission that the user needs to enable, as far as I can see - but I have had success with it.

To be clear, after adding the permission request within the Android manifest, I had to toggle it manually within the settings menu for the contacts-data requests to work. Permissions name: android.permission.READ_CONTACTS.

Documentation re: the data source:

https://developer.android.com/identity/providers/contacts-provider#DataBasics

Requesting ContactsContract.Data.DISPLAY_NAME (querying against ContactsContract.Data.CONTENT_URI) will provide the full name of each of the user's contacts. I haven't (yet) found "given" vs "family" sections, which would be useful for languages like Khmer and Thai, but it's certainly a start.

jahorton commented 4 months ago

Found the incantations to get the name split into components, but something's a bit off at the moment.

ContactsContract.CommonDataKinds.StructuredName.GIVEN_NAME (and .FAMILY_NAME) give us the desired split.

That said, for some reason, when I query for these components, there's a spontaneous + malformed extra row that gets returned from the query. When just querying the DISPLAY_NAME, I don't have this issue. I haven't been able to determine a reliable 'filter' I can use to handle this yet.

jahorton commented 4 months ago

As for iOS, we can make a request for the lexicon both in-app and in system-keyboard mode, but we'll only get a UILexicon response when in system-keyboard mode. We're given an empty one when in-app.

That said, what we do get looks very straightforward to work with. Note that there doesn't seem to be any notion of probability or frequency associated with its values.

jahorton commented 4 months ago

OK, I figured out what's needed for filtering with Android.

There is a column within the table queriable by ContactsContract.Data.CONTENT_URI corresponding to the value held by ContactsContract.Data.MIMETYPE. From there, ContactsContract.CommonDataKinds.____.CONTENT_ITEM_TYPE (fill in the blank with any child type of CommonDataKinds) holds a matching value to the row's entry for this column if the row holds data of its type. That's the filter we can use to determine whether or not the row holds "structured name" data, and from there we can use it to build a "contact lexicon" or similar.

We are able to query this data in both in-app and system-keyboard modes.

jahorton commented 3 months ago

Interesting limitation to consider / design question: how do we search-term-key user dictionary entries?

This can cause some trickiness, as the on-device user dictionary... doesn't exactly have a pre-defined search-term keying function. We'd likely want to re-use the one from whatever lexical model is active... but that's buried within the model itself, which is only loaded within the predictive-text worker.

Toward my current design to address this:

In order to blend two models together into one, there's a natural underlying assumption: that both models are 'keyed' the same way.
- "Traversal compositing", the approach taken with #11994, aims to blend the models together - this appears to be a near-requirement to do so.
- Blending models allows us to search both simultaneously, through the same, single correction-search process. If the search-term keying were to differ between the two models, the difference in keying could easily "prune" paths in an unintended manner by making them unreachable.

@mcdurdin has mentioned an alternate idea, though it comes with its own notable caveats: "why not just search the two separately and combine results after?"

So, there are two main approaches in view... and there's actually a third I thought of originally but had eliminated as a possibility. It's probably worth documenting all of them, as whichever ones we reject now might end up having uses elsewhere down the road. So...

Model blending / traversal compositing

That is, merging the results from two or more models into one at runtime, then searching over that.

Note: not actually merging them. Merging their paths + traversability.
Pros:
- We won't need to alter the correction-search whatsoever; we can use it as-is.
- The two models do not have to use the same "model template".
- Is quite compatible + clear-cut in regard to dictionary-based wordbreaking efforts. (#10973)
- I've kinda already implemented the core components we'd need for this approach.
  - If nothing else, my work toward this approach gives me a clear picture on how to address the similar problem for #10973 should we take the approach from @mcdurdin's suggestion for user-dictionary implementation.
- This approach is likely decently tailored for supporting "learning" features in the future, where we'd "blend in" effects from learning to tailor word frequencies to better match the user's usage.
  - This feature is on the predictive-text roadmap.
  - Then again... it likely would be limited somewhat in regard to models for agglutinative languages.
    - Even so, it's decently generalizable for as-of-yet unknown / undefined model types.
- This approach will ensure that we search from "most likely" to "least likely" based on the final, composited model - rather than the best likelihood from each individual model, which would penalize words appearing in both.
Cons:
- Blending the models together at runtime does incur overhead costs.
  - It is likely that "suboptimal" cases can be constructed that suffer significant penalties from this.
    - There is potential for non-constant asymptotic time here... though we are at a low enough 'N' / iteration count on each step that it may not matter significantly.
  - This overhead will likely be applied in average cases, even in cases where no duplicates exist (with current implementation)
    - Optimization may be possible to allow paying this cost just once per internal node, rather than consistently.
- Needs compatible, if not identical, search-term keying for all models being blended
  - Is a 'natural fit' for learning efforts, though - tailoring a specific model to fit the user would naturally re-use that model's keying patterns.
- Assumes compatible, if not identical, wordbreaking for all models involved.

Parallel model searching

@mcdurdin's proposition - "Why blend? Just search separately, then merge."

Pros:
- There's no direct complexity from trying to force two models to act as if they were just one.
- The two models can safely use very different search-term keying approaches.
- The two models do not have to use the same "model template".
Cons:
- the current correction-search process isn't well-tailored to sharing runtime with another search instance
  - They'd need to share a single ExecutionTimer.
  - If model 1 has weight of 0.8 and model 2 has weight of 0.2, model 1 should get a lot more of the runtime in the majority of cases. We'd likely decrease our suggestion quality otherwise.
- we'd likely need to add correction-search overhead to allow prioritizing the "current potential best" entry at each step, regardless of how often "current best" flips back and forth between the two models.
  - One "low-level correction search" per model would be needed, which would likely require notable alterations of the existing getBestMatches search method.
    - It currently only yields once a decent correction has been found... which can take a while in certain scenarios.
    - We'd probably want the generator function to return notably more frequently and aggressively.
  - We'd likely need a priority queue setup to choose which model is best to search at each step.
    - Would be outside of the current getBestMatches, thus add execution-time overhead. Likely not much, but the potential exists.
    - This overhead is likely to be relatively low, with constant-time asymptotic complexity, at least.
- Searching two+ models means caching search progress on each, something we haven't done yet thus far.
- (Likely) niche cases can be constructed where things that should be suggested won't be due to being "too unlikely" from an individual model but "likely enough" when summing together the likelihood from all models involved.
  - So long as one model is likely enough, it's not hard to add in the results from the other models, though.
- Assumes compatible, if not identical, wordbreaking for all models involved.
- Results in the need for more complexity for dictionary-based wordbreaking. (#10973)
  - My current design there assumes a single traversal for word-searching and dictionary lookup.
  - Granted, I'm pretty confident we can use similar strategies to what supports "traversal compositing" within the dict-breaker in order to support multiple model sources instead of just one.
Neutral:
- Noting the needed alterations to correction-search noted above, this approach would not assist with future "learning" feature efforts.
  - It doesn't get in the way... it just... doesn't help.
    - It does avoid turning either of the other two approaches into "tech debt" if we decide to go a different way when actually doing "learning".
    - But it's also a guarantee to add the noted extra correction-search complexity now and have no chance for reuse for that feature... meaning a (likely) greater total-sum complexity down the road.
  - There's technically no reason we can't use either of the other approaches in the future on a model-specific level to facilitate learning while also using this approach to select from a group of models.
    - That said... if learning while two+ models are active... how do we pick which one is edited?
    - Naive answer to that: which model gave us the suggestion?
    - Next naive answer: if more than one, split likelihood changes among each contributor.
      - I could go on, but this issue isn't a feature request for learning; that's out of scope.

Live model editing:

A third possibility that I'd previously eliminated from consideration - Why stop at blending the results? Go for a full, complete blending!

Pros:
- Quite possibly the simplest and most efficient approach... assuming a Trie-based wordlist.
- The core idea: what if we directly edited the active lexical model, actively storing and persisting "delta" values and adding user-dictionary entries into the model once it's loaded?
  - With our current Trie implementation, we could efficiently edit the Trie at runtime.
  - In theory, this would even be compatible with "learning" feature efforts in the future.
    - I first thought of the 'delta' idea because of the potential for use in "learning" down the road.
- Does avoid ambiguity re: search term keying + wordbreaking. There's only one model, thus only one keying function and one wordbreaking function.
- Any "overhead penalty" is generally a single, up-front cost as far as correction-searching goes.
- Is quite compatible with dictionary-based wordbreaking efforts. (#10973)
Cons:
- Requires either caching the previously-edited version of the model or rebuilding it each time after it's loaded.
  - A "delta" cache is likely quite possible, to persist the changes as a whole and reapply just that in an efficient manner.
- Is very likely not to generalize to other model types.
  - It's not model-type agnostic - we are explicitly leveraging our knowledge of the structure used by our Trie-based wordlist models.
  - That said... considering what design might be needed for agglutinative models, we likely would want at least a component that is model-template tailored.
    - Learning deltas on word roots / affixes is likely better than whole-word deltas, so maybe this isn't as big a con as I first thought.

jahorton commented 3 months ago

I'm starting to think the "Parallel model searching" route might actually be the optimal way forward - being able to drop the "same search-term keying" requirement may actually be a pretty big win.

I only realized some of the notable performance-time caveats to "traversal compositing" once I got far enough into its implementation. Still, I'm glad I worked out the needed logic for it - it was a bit tough, but it was a fun problem with a solution that will likely see use in some form down the road, even if not now.