feat(common/models/wordbreakers): dictionary-based word breaking

mcdurdin commented 3 years ago

Predictive text: dictionary-based word-breaking (as in many SE Asia languages)

The one I’ve given Makara so far is a hack that can definitely break.
Definitely easier than attempting morphological modeling
Potential lead: https://www.geeksforgeeks.org/word-break-problem-dp-32/
It’s not that simple to import, given our sliding context window… but is definitely a great starting point.
Other possible leads: just search “dictionary-based wordbreaking.”
Marc has mentioned that ICU has existing implementation for a number of SE-Asian scripts. Especially for C; may not for JS.

jahorton commented 2 years ago

https://unicode-org.github.io/icu/userguide/boundaryanalysis/break-rules.html#details-about-dictionary-based-break-iteration looks to be quite informative; it documents the ICU approach, though it is somewhat outdated.

Then, match that against a dynamic programming approach: https://www.techiedelight.com/word-break-problem/

jahorton commented 2 years ago

Alternatively, https://www.npmjs.com/package/full-icu exists, which uses MIT licensing. I doubt trying to isolate and import its wordbreaking code would be simple, but at least the licensing is permissive, so extraction should be possible.

mcdurdin commented 2 years ago

full-icu is node-only -- it has a binary component. Intl exists on modern browsers and we could be using that. BUT: our use case is special: we are supporting scripts and languages which do not yet have data in Intl, nor in ICU. The writing system may not yet have stabilised. But users still need to be able to enter data. So we really can't rely on language data from other libraries for this.

We can use their algorithms, but must be able to provide our own data. The question is whether or not this is going to work for our users.

jahorton commented 2 years ago

There's enough of a description of their approach in the "looks to be quite informative" link that I can probably code soemthing similar, rather than try to full-dive the ICU code, though I'm still a little unclear of exactly how cases with text not in the dictionary operate. (I think it extrapolates from the same principles, but it still gets tricky to follow.)

I did start looking at the ICU node package, but I wouldn't be surprised if it's actually within the mentioned binary component.

jahorton commented 2 years ago

Defines parent types:

https://github.com/unicode-org/icu/blob/main/icu4c/source/common/brkiter.cpp

The real code:

https://github.com/unicode-org/icu/blob/main/icu4c/source/common/dictbe.cpp

Note: it seems that ICU has its own distinct license - https://github.com/unicode-org/icu/blob/main/icu4c/LICENSE.

Some of the dictionary-based word-breaking stuff appears to be sourced from elsewhere, based on certain contents of that license. The CJK breaker seems to come from Chrome code and follows BSD?

jahorton commented 2 years ago

Also, a "funny" side note:

* Other possible leads:  just search “dictionary-based wordbreaking.”

... and guess what link the first search result leads to now?

Fortunately, some of the other near-top results are useful for us.

jahorton commented 1 year ago

@srl295 Note the earlier comment with ICU links.

jahorton commented 1 year ago

A thought - it'd be a fair bit of work, and it's been a while since I last touched it, but toward the classic "name" problem for dictionary-based wordbreaking - the fact that names aren't included in dictionaries:

https://ufdcimages.uflib.ufl.edu/UF/E0/04/71/46/00001/HORTON_J.pdf

My ol' dissertation's application could be repurposed to approximately learn the general linguistic pattern of a language's words from its wordlist. Since names in a language generally do have the same linguistic structure as the language's words, that'd give us a way to provide a "probability of being a word" (or a name), even if not actually within the dictionary. It'd involve machine learning, but only at compile-time.

Then...

<word in dict> | well-structured possible word not in dict | <word in dict> - high probability of a good segmentation <word in dict> | text not matching valid word pattern | <word in dict> - lower probability of a good segmentation

Get the N best-looking segmentations & go from there.

jahorton commented 1 year ago

Notes from personal investigation and followup conversations with @srl295 in other communication channels:

The TrieBuilder class (uchar variant) is used to compile dictionary-tries used by Unicode's word-breaking algorithms and could, in theory, be used at runtime. However, it's very much worth noting that ICU4C does not do this.

Toward that end, ICU4C uses gendict (link)...

at build time to serialize the trie to a memory-mapped file.

In effect, the TrieBuilder class is used to precompile the relevant dictionaries and ensure byte-compatibility when loading said dictionaries from files distributed with ICU4C.

gendict is triggered by BUILDRULES.py.

Note that for dictionaries that do not use frequency weighting, the implementation actually uses weight 0 instead of weight 1 as we do in our lexical model Tries.

From there, the DictionaryMatcher classes preload the pre-compiled tries and set them to their internal characters field, which is used to initialize the UCharsTrie class (or its other-typed variant) each time the matches method is called. (Reference.)

If we were to try interfacing our predictive-text engine with ICU4C, perhaps via WASM cross-compilation, it's worth note that ICU4C expects its own format for dictionary tries. That said, while still conjecture at this stage, I believe it may be possible to implement a wrapper for our dictionary tries that can fulfill the ICU4C Trie interface. If so, we could load just the wordbreaker code and avoid needing to double-compile our tries. (Double-compile: once for ICU4C components, once for our TS components. Just to be clear.) It'd definitely take some work to pull off, though.

jahorton commented 8 months ago

After a rough brainstorming with a colleague a while back...

I decided to revisit this problem again in more detail today, so here are some extra thoughts to flesh things out a bit more, which will hopefully also explain what that note is even about.

Extra Context

For this version of the dictionary-based wordbreaking problem, we do not assume the context to be word-broken consists solely of words within the dictionary. It is quite possible to have accidental typos or proper names, for example, that are not found in the dictionary - we need a rough way to handle those too.

Central Question

What is the "cheapest way" / "optimal path" that can result in a word-break boundary existing after a character?

For any path that follows that starts with a boundary at this location, we should only ever continue from the lowest-cost / optimal way to reach it. It's safe to be "greedy" on this specific point.

The real complications arise when asking how to use the answer to this question at one location to obtain the answer for a future location. That said, note the implications of the former paragraph: we need only consider the cost of reaching the most recent preceding boundary + the path taken from there to the new potential boundary.

Observations

Our LexiconTraversal type is fantastic for efficiently navigating these paths.

https://github.com/keymanapp/keyman/blob/e8d0737467e48bb82b78af2ad885f9a5eac00213/common/models/types/index.d.ts#L22-L26

Iterating

When moving from one codepoint of the context to the next...

For all still-valid LexiconTraversals, does each have a branch representing the next codepoint?
- If yes, it's still valid - keep it in the running. (Also, remember the boundary at which it started.)
- If no, it's no longer useful as a possible way forward; discard it.
- Note which ones have leaf entries at their new positions.
Find the path of minimal cost, considering any traversal that would end at the new location (i.e., with a leaf "entry").
- This becomes the minimal cost to have the new codepoint be the end of a potential word.
- If one does not exist, create a new LexiconTraversal and apply a penalty cost for skipping one character after the previous codepoint; use it to serve the same role.
  - Add it to the active LexiconTraversal set. (same one as "if yes, it's still valid" in step 1)
- Discard no LexiconTraversals at this stage.

Iterate until the end. Note that at the last position, we may also consider incomplete paths, rather than only leaf entries - we may be in the middle of a word, and that's okay.

Potential needs & optimizations

To facilitate doing this efficiently:

LexiconTraversals should know the cost/weight of the most likely child path they represent.
- This would allow optimized searching of the traversals for the best path, possibly including path trimming.
- Basis: "most likely child path" makes a great A* heuristic.
LexiconTraversals should also report the cost/weight of leaf-entries reached.
- Basis 1: it is theoretically possible (albeit quite unlikely) for a shorter word to be the root of a longer, yet higher-frequency word. We need the leaf-node probability to finalize the search path noted in point 1.
- Basis 2: common/models/wordbreakers knows of the LexiconTraversal type, but not the TrieModel type, which is in common/models/templates (and which depends on common/models/wordbreakers). It'll keep a cleaner dependency graph.
LexiconTraversal should provide a way to directly request a child by codepoint-name, rather than have to iterate among all children. O(1) vs O(children) is an easy optimization, and the existing Trie implementation can provide this easily.

Approximate cost-analysis:

O(N), N = length of string: loop over each codepoint
- O(N [worst-case]): loop over each still-valid traversal at each codepoint
  - at most, matches the index of the current codepoint
  - in practice, will be NOTABLY fewer after the first few codepoints as some letter combinations become intractable as paths and/or become too long.
  - O(A), A = size of alphabet: find matching path for current traversal.
    - Unoptimized: must loop over traversal branches: .children
    - Optimized: becomes merely O(1) (see optimization 3)
  - track which traversal (if any) gives us the best result for a boundary after the current codepoint via comparisons
  - Note: dropping paths that cannot continue would come with... "some" cost.
    - If middle of an array, technically is O(N [worst-case]) per iteration
    - Dropping from a tree-like structure, such as a priority queue, should be roughly O(ln N [worst-case]) per iteration.
      - Either way, this does mean one less iteration on the next round of the top-level loop.

Optimizations 1 + 2 may allow us to evaluate the middle-layer loop less frequently by only evaluating paths that have a chance to be optimal. We'd also be able to use a priority queue to facilitate this and incorporate some of the needed comparisons there automatically.

Note that if the cost to continue a path ends up exceeding "previous minimal cost" + penalty to reach the same location, we become able to outright prune further processing for the path.

keymanapp / keyman