[Closed] jahorton closed this 1 week ago
Notes for a follow-up / further enhancement based on this:
https://github.com/keymanapp/keyman/pull/7279#issuecomment-1246117067
I've also thought of "a way" to shrink the size of the backing data table, but that would be its own beast of a side project and would result in a notably less human-readable file. [...]
(The idea: there's little reason we can't compress the table into two coded character strings - one for the BMP, one for the SMP. One char instead of the 4 or 5 [representing the numeric value] would make a big difference.)
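A rough sketch of that compression idea follows. Everything here is illustrative: the property subset, the 0x20 offset, and the helper names are assumptions, not the actual layout of data.ts.

```typescript
// Illustrative sketch only: the property list, offset, and function names
// are assumptions, not Keyman's actual data.ts layout.
const WB_PROPERTIES = ["Other", "ALetter", "Numeric", "MidLetter"]; // tiny subset

// Encode each code point's word-break property index as a single character,
// instead of a 4-5 character numeric literal per entry.
function encodeProperties(indices: number[]): string {
  // Offset into printable ASCII so the generated file stays vaguely readable.
  return indices.map((i) => String.fromCharCode(i + 0x20)).join("");
}

// Recover the property name for the entry at `pos`.
function decodeProperty(encoded: string, pos: number): string {
  return WB_PROPERTIES[encoded.charCodeAt(pos) - 0x20];
}
```

With one such string for the BMP and one for the SMP, lookups stay index-based while the table shrinks to one character per entry - at the cost of human readability, as noted above.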
A fun note from @srl295: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter provides a standardized implementation for some word-breaking and related functionality. It's not in all the browsers we aim to support - in fact, it's not in Firefox at all yet - but it's a promising detail for the future.
Yes, except that we can't just use system-supplied (browser or OS) functionality because we are enabling the bleeding edge of language support. We still need to be able to do this ourselves.
I think I've said this a few times, but it bears repeating: with Keyman, we can never rely on language support that is there in the system, whether that is segmentation, normalization, BCP 47, or anything else. We support languages that have never been supported and which may never be supported. And even if they are eventually supported, we aim to provide the functionality today.
I know; I just wanted to note that it exists; it may also be of some use for supplying default data during model development, for example. He had some other ideas too, but I'll let him write that comment.
Really keen to use existing functionality where it helps, so long as we have a way to roll our own also :grin:
Briefly: at the very least, this is the API we ought to use, even if the implementation behind it is something else.
Adding this as a related note: see #10568 for a reference on a license to copy over if/when implementing this potential update, especially if we copy over and check in the related source files.
Is your feature request related to a problem? Please describe.
The current data.ts for our predictive-text wordbreaker is based on Unicode 13.0 / https://www.unicode.org/reports/tr41/tr41-26.html#Props0, but more recent versions of Unicode are now available. We may want a mechanism to update the file periodically.
Note that the file is generated from code provided by @eddieantonio at https://github.com/eddieantonio/unicode-default-word-boundary/tree/master/libexec. (In fact, the rest of the wordbreaker code was developed there first, then replicated here in namespace format instead of the module format seen there!)
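One possible shape for such a check, sketched here as an assumption (the function names and overall mechanism are hypothetical; the URL and the file's `# WordBreakProperty-<version>.txt` header line are real UCD conventions):

```typescript
// Sketch of a staleness check against the latest published UCD data.
// The URL is real; the surrounding update mechanism is hypothetical.
const LATEST_URL =
  "https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/WordBreakProperty.txt";

// The file's first line looks like "# WordBreakProperty-15.1.0.txt".
function parseUnicodeVersion(fileText: string): string | null {
  const match = fileText.match(/WordBreakProperty-(\d+\.\d+\.\d+)/);
  return match ? match[1] : null;
}

// A periodic CI job could then fetch(LATEST_URL), run parseUnicodeVersion on
// the body, and flag a rebuild of data.ts when the result differs from the
// version it was generated against (13.0.0 today).
```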
Describe the solution you'd like
There are a few different approaches we could consider:
/latest/
part!