feat(common/models): update wordbreaker data

jahorton commented 2 years ago

Is your feature request related to a problem? Please describe.

The current data.ts for our predictive-text wordbreaker is based on Unicode 13.0 / https://www.unicode.org/reports/tr41/tr41-26.html#Props0, but there are more recent versions of Unicode available. We may want to consider some mechanism to update the file periodically.

Note that the file is generated from code provided by @eddieantonio at https://github.com/eddieantonio/unicode-default-word-boundary/tree/master/libexec. (In fact, the rest of the wordbreaker code was developed there first, then replicated here in namespace format instead of the module format seen there!)

Describe the solution you'd like

There are a few different approaches we could consider:

1. Just write up a readme about the process, including links to that repo, and remember to run an update manually once a release cycle or something.
   - For now, I suppose this issue is that "readme", in a sense.
2. Import the code used to generate the data.ts, tweak it if (and as) necessary, and write up a readme for that.
3. We could consider writing a tool to automate most, if not all, of the process!
   - Noting the format of the URLs provided by the Unicode reports, they may provide an evergreen link to the most current version of the files:
     - https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/WordBreakProperty.txt - note the /latest/ part!
   - It should be "simple enough" to write up a tool to poll the relevant URLs (there's an extra file that was originally 'baked in'), download 'em, and run the data.ts-generator on 'em.
   - If the URLs are indeed stable and always point to the 'latest', we could, in theory, include the update as a CI step.
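
As a rough illustration of the parsing half of such a tool, here is a minimal sketch. It assumes the text of WordBreakProperty.txt has already been downloaded (for example from the /latest/ URL above) and that each data line has the usual UCD shape `START[..END] ; Property # comment`. The `WordBreakRange` interface and `parseWordBreakProperty` function are hypothetical names for illustration; this is not the generator from the linked repo.

```typescript
// Hypothetical shape for one parsed data line.
interface WordBreakRange {
  start: number;    // first code point in the range
  end: number;      // last code point in the range (inclusive)
  property: string; // Word_Break property value, e.g. "ALetter"
}

// Parses the text of a WordBreakProperty.txt-style file into sorted ranges.
function parseWordBreakProperty(text: string): WordBreakRange[] {
  const ranges: WordBreakRange[] = [];
  for (const rawLine of text.split('\n')) {
    // Drop trailing comments and surrounding whitespace (including any '\r').
    const line = rawLine.replace(/#.*$/, '').trim();
    if (line === '') {
      continue;
    }
    const match = /^([0-9A-Fa-f]+)(?:\.\.([0-9A-Fa-f]+))?\s*;\s*(\w+)$/.exec(line);
    if (!match) {
      throw new Error('Unrecognized line: ' + rawLine);
    }
    const start = parseInt(match[1], 16);
    // A single code point is treated as a range of length one.
    const end = match[2] ? parseInt(match[2], 16) : start;
    ranges.push({ start: start, end: end, property: match[3] });
  }
  // The file groups entries by property, so sort by code point for lookup tables.
  ranges.sort((a, b) => a.start - b.start);
  return ranges;
}
```

The download step and the extra 'baked in' file are left out on purpose; a CI job would need to handle both before calling something like this.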

jahorton commented 1 year ago

Notes for a follow-up / further enhancement based on this:

https://github.com/keymanapp/keyman/pull/7279#issuecomment-1246117067

I've also thought of "a way" to shrink the size of the backing data table, but that would be its own beast of a side project and would result in a notably less human-readable file. [...]

(The idea: there's little reason we can't compress the table into two coded character strings - one for BMP, one for SMP. One char instead of 4 or 5 [representing the numeric value] would make a big difference.)
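
One possible reading of that idea, as a sketch only: store each range as two UTF-16 code units, one holding the range's starting code point relative to the plane base (0 for BMP, 0x10000 for SMP) and one holding a small property index. The `RangeEntry` type, `PROPERTY_BASE`, and the function names are assumptions for illustration, not the existing data.ts format.

```typescript
// Hypothetical entry: [first code point of the range, small property index].
type RangeEntry = [number, number];

// Assumption: shift property indices into printable ASCII, starting at 'A'.
const PROPERTY_BASE = 0x41;

// Encodes ranges from one plane into a string of (start, property) char pairs.
// planeBase is 0 for the BMP and 0x10000 for the SMP.
function encodeRanges(ranges: RangeEntry[], planeBase: number): string {
  let out = '';
  for (const [start, property] of ranges) {
    const offset = start - planeBase;
    if (offset < 0 || offset > 0xffff) {
      throw new Error('Code point is outside the plane: ' + start.toString(16));
    }
    out += String.fromCharCode(offset, PROPERTY_BASE + property);
  }
  return out;
}

// Reverses encodeRanges, reading two code units per entry.
function decodeRanges(encoded: string, planeBase: number): RangeEntry[] {
  const ranges: RangeEntry[] = [];
  for (let i = 0; i + 1 < encoded.length; i += 2) {
    const start = encoded.charCodeAt(i) + planeBase;
    const property = encoded.charCodeAt(i + 1) - PROPERTY_BASE;
    ranges.push([start, property]);
  }
  return ranges;
}
```

Two caveats with this sketch: range starts that fall on control characters or in the surrogate block would need escaping in the source file, and any ranges beyond the SMP would need a third string or a fallback table.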

jahorton commented 1 year ago

A fun note from @srl295: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter provides a standardized implementation for some word-breaking and related functionality. It's not in all the browsers we aim to support - in fact, it's not in Firefox at all yet - but it's a promising detail for the future.

mcdurdin commented 1 year ago

Yes, except that we can't just use system-supplied (browser or OS) functionality because we are enabling the bleeding edge of language support. We still need to be able to do this ourselves.

mcdurdin commented 1 year ago

I think I've said this a few times, but it bears repeating: with Keyman, we can never rely on language support that is there in the system, whether that is segmentation, normalization, BCP 47, or anything else. We support languages that have never been supported and which may never be supported. And even if they are eventually supported, we aim to provide the functionality today.

jahorton commented 1 year ago

I know; I just wanted to note that it exists; it may also be of some use for supplying default data during model development, for example. He had some other ideas too, but I'll let him write that comment.

mcdurdin commented 1 year ago

Really keen to use existing functionality where it helps, so long as we have a way to roll-our-own also :grin:

srl295 commented 1 year ago

Briefly, at the very least this is the API we ought to use, even if the implementation is something else.
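
To make that concrete, here is a minimal sketch of an `Intl.Segmenter`-shaped interface backed by a custom implementation. The `SegmentData` fields mirror the segment objects that `Intl.Segmenter` produces; the `WordSegmenter` interface and `WhitespaceSegmenter` class are hypothetical, the method returns an array rather than an iterable for simplicity, and the whitespace splitting is only a stand-in for the real wordbreaker.

```typescript
// Same fields as the segment objects produced by Intl.Segmenter.
interface SegmentData {
  segment: string;     // the text of this segment
  index: number;       // offset of the segment within the input
  input: string;       // the full string being segmented
  isWordLike: boolean; // whether the segment looks like a word
}

// Hypothetical interface with the same call shape as Intl.Segmenter's segment().
interface WordSegmenter {
  segment(input: string): SegmentData[];
}

// Stand-in implementation: splits into runs of whitespace and non-whitespace.
// A real implementation would apply the Unicode word boundary rules instead.
class WhitespaceSegmenter implements WordSegmenter {
  segment(input: string): SegmentData[] {
    const result: SegmentData[] = [];
    const pattern = /\s+|\S+/g;
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(input)) !== null) {
      result.push({
        segment: match[0],
        index: match.index,
        input: input,
        // Simplification: any non-whitespace run counts as word-like here,
        // whereas Intl.Segmenter reports false for punctuation as well.
        isWordLike: /\S/.test(match[0]),
      });
    }
    return result;
  }
}
```

Callers written against `WordSegmenter` would not need to change if the backing implementation were later swapped for a platform-provided one where available.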

jahorton commented 7 months ago

Adding this as a related note: see #10568 for a reference on a license to copy over if/when implementing this potential update, especially should we copy over and check in the related source files.

keymanapp / keyman

feat(common/models): update wordbreaker data #7224