Predictive text: dictionary-based word-breaking (as in many SE Asia languages)

Open · mcdurdin opened 3 years ago
https://unicode-org.github.io/icu/userguide/boundaryanalysis/break-rules.html#details-about-dictionary-based-break-iteration looks to be quite informative; it documents the ICU approach, though it is somewhat outdated.
Then, match that against a dynamic programming approach: https://www.techiedelight.com/word-break-problem/
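For concreteness, here's a minimal TypeScript sketch of that dynamic-programming approach; the dictionary and all names are illustrative, not anything from ICU or our codebase:

```ts
// A minimal sketch of the dynamic-programming word-break idea from the
// linked article. Finds one segmentation using only dictionary words.
function segment(text: string, dictionary: Set<string>): string[] | null {
  // prev[i] = index of the boundary preceding i on a valid path, or -1.
  const prev = new Array<number>(text.length + 1).fill(-1);
  const reachable = new Array<boolean>(text.length + 1).fill(false);
  reachable[0] = true; // a boundary always exists before the first character

  for (let i = 1; i <= text.length; i++) {
    for (let j = 0; j < i; j++) {
      if (reachable[j] && dictionary.has(text.substring(j, i))) {
        reachable[i] = true;
        prev[i] = j;
        break;
      }
    }
  }

  if (!reachable[text.length]) {
    return null; // no segmentation consists solely of dictionary words
  }

  // Walk the recorded boundaries back to the start.
  const words: string[] = [];
  for (let i = text.length; i > 0; i = prev[i]) {
    words.unshift(text.substring(prev[i], i));
  }
  return words;
}
```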
Alternatively, https://www.npmjs.com/package/full-icu exists, which uses MIT licensing. I doubt trying to isolate and import its wordbreaking code would be simple, but at least the licensing is permissive, so extraction should be possible.
full-icu is node-only -- it has a binary component. Intl exists on modern browsers and we could be using that. BUT: our use case is special: we are supporting scripts and languages which do not yet have data in Intl, nor in ICU. The writing system may not yet have stabilised. But users still need to be able to enter data. So we really can't rely on language data from other libraries for this.
We can use their algorithms, but must be able to provide our own data. The question is whether or not this is going to work for our users.
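For reference, the relevant browser API is `Intl.Segmenter`; usage would look like the sketch below. But note it draws on the browser's bundled (ICU-derived) locale data, which is exactly what we can't rely on for these languages:

```ts
// Sketch only: Intl.Segmenter uses the browser's own locale data, so it
// can't serve languages whose data isn't in ICU/Intl yet.
const segmenter = new Intl.Segmenter('th', { granularity: 'word' });
for (const { segment, index, isWordLike } of segmenter.segment('สวัสดีครับ')) {
  console.log(index, segment, isWordLike);
}
```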
There's enough of a description of their approach in the "looks to be quite informative" link that I can probably code something similar, rather than try to fully dive into the ICU code, though I'm still a little unclear on exactly how cases with text not in the dictionary operate. (I think it extrapolates from the same principles, but it still gets tricky to follow.)
I did start looking at the ICU node package, but I wouldn't be surprised if the wordbreaking code is actually within the mentioned binary component.
Defines parent types:
https://github.com/unicode-org/icu/blob/main/icu4c/source/common/brkiter.cpp
The real code:
https://github.com/unicode-org/icu/blob/main/icu4c/source/common/dictbe.cpp
Note: it seems that ICU has its own distinct license - https://github.com/unicode-org/icu/blob/main/icu4c/LICENSE.
Some of the dictionary-based word-breaking stuff appears to be sourced from elsewhere, based on certain contents of that license. The CJK breaker seems to come from Chrome code and follows BSD?
Also, a "funny" side note:
* Other possible leads: just search “dictionary-based wordbreaking.”
... and guess what link the first search result leads to now?
Fortunately, some of the other near-top results are useful for us.
@srl295 Note the earlier comment with ICU links.
A thought - it'd be a fair bit of work, and it's been a while since I last touched it, but toward the classic "name" problem for dictionary-based wordbreaking - the fact that names aren't included in dictionaries:
https://ufdcimages.uflib.ufl.edu/UF/E0/04/71/46/00001/HORTON_J.pdf
My ol' dissertation's application could be repurposed to approximately learn the general linguistic pattern of a language's words from its wordlist. Since names in a language generally do have the same linguistic structure as the language's words, that'd give us a way to provide a "probability of being a word" (or a name), even if not actually within the dictionary. It'd involve machine learning, but only at compile-time.
Then...
`<word in dict> | well-structured possible word not in dict | <word in dict>`
- high probability of a good segmentation

`<word in dict> | text not matching valid word pattern | <word in dict>`
- lower probability of a good segmentation
Get the N best-looking segmentations & go from there.
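Sketching what that scoring might look like - everything here (`wordlikeness`, the discount value, the names) is hypothetical:

```ts
// Hypothetical sketch of the scoring idea above: `wordlikeness` stands in
// for the compile-time-learned "probability of being a word" model.
interface Candidate {
  words: string[];
}

declare function wordlikeness(span: string): number; // hypothetical model: P(span looks like a word)

function scoreSpan(span: string, dictionary: Map<string, number>): number {
  const dictProb = dictionary.get(span);
  // In-dictionary words use their frequency-derived probability;
  // out-of-dictionary spans fall back on the learned word-pattern model,
  // discounted so that real dictionary entries are preferred.
  const oovDiscount = 0.01;
  return Math.log(dictProb ?? (wordlikeness(span) * oovDiscount));
}

// A candidate's score is the sum of its spans' log-probabilities;
// rank all candidates by this and keep the N best.
function scoreCandidate(c: Candidate, dictionary: Map<string, number>): number {
  return c.words.reduce((sum, w) => sum + scoreSpan(w, dictionary), 0);
}
```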
Notes from personal investigation and followup conversations with @srl295 in other communication channels:
The `TrieBuilder` class (`uchar` variant) is used to compile dictionary-tries used by Unicode's word-breaking algorithms and could, in theory, be used at runtime. However, it's very much worth noting that ICU4C does not do this.
Toward that end, ICU4C uses `gendict` (link) at build time to serialize the trie to a memory-mapped file.
In effect, the `TrieBuilder` class is used to precompile the relevant dictionaries and ensure byte-compatibility when loading said dictionaries from files distributed with ICU4C.
`gendict` is triggered by `BUILDRULES.py`.
Note that for dictionaries that do not use frequency weighting, the implementation actually uses weight 0 instead of weight 1 as we do in our lexical model Tries.
From there, the `DictionaryMatcher` classes preload the pre-compiled tries and set them to their internal `characters` field, which is used to initialize the `UCharsTrie` class (or its other-typed variant) each time the `matches` method is called. (Reference.)
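To make that flow concrete, here's a rough TypeScript analogue of the pattern just described; `TrieCursor` and the other names are hypothetical stand-ins, not ICU4C's actual API:

```ts
// Hypothetical TS analogue of the DictionaryMatcher pattern described above;
// TrieCursor stands in for UCharsTrie and is not a real API.
declare class TrieCursor {
  constructor(data: Uint16Array); // wraps shared, immutable pre-compiled trie data
  next(codeUnit: number): boolean; // follow the branch for one UTF-16 code unit
  isWord(): boolean; // does the path walked so far end a dictionary word?
}

class MatcherAnalogue {
  // Loaded once; ICU4C memory-maps the gendict output into this role.
  constructor(private readonly characters: Uint16Array) {}

  // Returns the end indices of all dictionary words starting at `start`,
  // constructing a fresh trie cursor over the shared data on each call -
  // the same per-`matches` lifecycle described above.
  matches(text: string, start: number): number[] {
    const trie = new TrieCursor(this.characters);
    const ends: number[] = [];
    for (let i = start; i < text.length; i++) {
      if (!trie.next(text.charCodeAt(i))) break;
      if (trie.isWord()) ends.push(i + 1);
    }
    return ends;
  }
}
```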
If we were to try interfacing our predictive-text engine with ICU4C, perhaps via WASM cross-compilation, it's worth noting that ICU4C expects its own format for dictionary tries. That said, while still conjecture at this stage, I believe it may be possible to implement a wrapper for our dictionary tries that fulfills the ICU4C trie interface. If so, we could load just the wordbreaker code and avoid needing to double-compile our tries. (Double-compile: once for ICU4C components, once for our TS components. Just to be clear.) It'd definitely take some work to pull off, though.
After a rough brainstorming with a colleague a while back...
I decided to revisit this problem in more detail today, so here are some extra thoughts to flesh things out a bit more, which will hopefully also explain what that note is even about.
For this version of the dictionary-based wordbreaking problem, we do not assume the context to be word-broken consists solely of words within the dictionary. It is quite possible to have accidental typos or proper names, for example, that are not found in the dictionary - we need a rough way to handle those too.
What is the "cheapest way" / "optimal path" that can result in a word-break boundary existing after a character?
The real complications arise when asking how to use the answer to this question at one location to obtain the answer for a later location. That said, note the implication of the question above: we need only consider the cost of reaching the most recent preceding boundary plus the cost of the path taken from there to the new potential boundary.
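Stated as a recurrence - a quick illustrative sketch, where `pathCost` is a placeholder for whatever word/non-word cost model we land on:

```ts
// Sketch of the recurrence implied above (names illustrative):
//   bestCost[i] = min over j < i of bestCost[j] + pathCost(j, i)
// where pathCost(j, i) is the cost of reading text[j..i) as a single word
// (or a penalized non-word). Only the best path to each boundary matters.
function bestBoundaryCosts(
  text: string,
  pathCost: (j: number, i: number) => number
): number[] {
  const bestCost = new Array<number>(text.length + 1).fill(Infinity);
  bestCost[0] = 0; // a boundary always exists before the first character
  for (let i = 1; i <= text.length; i++) {
    for (let j = 0; j < i; j++) {
      bestCost[i] = Math.min(bestCost[i], bestCost[j] + pathCost(j, i));
    }
  }
  return bestCost;
}
```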
Our `LexiconTraversal` type is fantastic for efficiently navigating these paths.
When moving from one codepoint of the context to the next (see the sketch after this list):

1. For all active `LexiconTraversal`s: does each have a branch representing the next codepoint? If yes, it's still valid - advance it and keep it in the active set.
2. If not, keep the previous `LexiconTraversal` and apply a penalty cost for skipping one character after the previous codepoint; use it to serve the same role.
3. Wherever a path completes a dictionary word, a boundary may follow, so add a fresh root `LexiconTraversal` to the active set (same one as "if yes, it's still valid" in step 1). If no word completes here, there are no new `LexiconTraversal`s at this stage.

Iterate until the end. Note that at the last position, we may also consider incomplete paths, rather than only leaf entries - we may be in the middle of a word, and that's okay.
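As a rough sketch of that loop - the `Traversal` surface below (`child()`, `leafCost`) is an assumed shape for illustration, not our actual `LexiconTraversal` interface:

```ts
// Sketch only: `Traversal` approximates LexiconTraversal's role here.
interface Traversal {
  child(codepoint: string): Traversal | undefined; // assumed O(1) child lookup
  leafCost?: number; // cost of a completed word ending at this node, if any
}

interface Path {
  traversal: Traversal;
  cost: number; // accumulated cost, including the boundary this path started from
}

function advance(
  active: Path[],
  root: Traversal,
  codepoint: string,
  skipPenalty: number
): Path[] {
  const next: Path[] = [];
  for (const p of active) {
    const child = p.traversal.child(codepoint);
    if (child) {
      // Step 1: the branch exists, so this path remains valid.
      next.push({ traversal: child, cost: p.cost });
    } else {
      // Step 2: no branch - keep the old traversal with a penalty for
      // skipping this codepoint, so typos and unknown words survive.
      next.push({ traversal: p.traversal, cost: p.cost + skipPenalty });
    }
  }
  // Step 3: if any path just completed a word, a boundary may follow here;
  // seed a fresh root traversal carrying the best cost of reaching that boundary.
  const completed = next.filter(p => p.traversal.leafCost !== undefined);
  if (completed.length > 0) {
    const boundaryCost = Math.min(...completed.map(p => p.cost + p.traversal.leafCost!));
    next.push({ traversal: root, cost: boundaryCost });
  }
  return next;
}
```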
To facilitate doing this efficiently:
1. `LexiconTraversal`s should know the cost/weight of the most likely child path they represent.
2. `LexiconTraversal`s should also report the cost/weight of leaf entries reached.
3. `common/models/wordbreakers` knows of the `LexiconTraversal` type, but not the `TrieModel` type, which is in `common/models/templates` (and which depends on `common/models/wordbreakers`). It'll keep a cleaner dependency graph.
4. `LexiconTraversal` should provide a way to directly request a child by codepoint-name, rather than having to iterate among all `.children`. O(1) vs O(children) is an easy optimization, and the existing Trie implementation can provide this easily.
Optimizations 1 + 2 may allow us to evaluate the middle-layer loop less frequently by only evaluating paths that have a chance to be optimal. We'd also be able to use a priority queue to facilitate this and incorporate some of the needed comparisons there automatically.
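For instance - a rough sketch of that pruning, where the `PriorityQueue` is assumed, not an existing utility of ours:

```ts
// Rough sketch: order live paths by a lower bound on their eventual cost,
// and only keep those that could still beat the best completed segmentation.
declare class PriorityQueue<T> {
  constructor(compare: (a: T, b: T) => number);
  enqueue(item: T): void;
  dequeue(): T | undefined;
}

interface RankedPath {
  cost: number;          // accumulated cost so far
  bestChildCost: number; // optimization 1: cheapest continuation this traversal offers
}

// Lower bound on what this path can ever achieve; paths whose bound exceeds
// the incumbent best can be skipped without running the middle-layer loop.
const lowerBound = (p: RankedPath) => p.cost + p.bestChildCost;

function prune(paths: RankedPath[], incumbentBest: number): RankedPath[] {
  const queue = new PriorityQueue<RankedPath>((a, b) => lowerBound(a) - lowerBound(b));
  paths.forEach(p => queue.enqueue(p));
  const kept: RankedPath[] = [];
  for (let p = queue.dequeue(); p !== undefined; p = queue.dequeue()) {
    if (lowerBound(p) > incumbentBest) break; // everything after is worse still
    kept.push(p);
  }
  return kept;
}
```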