keymanapp / keyman

Keyman cross platform input methods system running on Android, iOS, Linux, macOS, Windows and mobile and desktop web
https://keyman.com/

feat(core): normalization per spec for transforms/etc 🙀 #9468

Closed srl295 closed 4 months ago

srl295 commented 9 months ago

CLDR-16943 details (or will detail) SC consensus about the role of Unicode normalization. Implement it.

Split out to remaining issues under m:normalization

srl295 commented 7 months ago
mcdurdin commented 7 months ago

NFD marker tricks:

  1. U+0061 \m{marker1} U+0300 gives NFC of à. So what happens to the marker in the cached context? Is cached context NFD but app context is NFC? How do we sync?
  2. U+0061 \m{marker1} U+0300 \m{marker2} U+0320 normalizes to NFD U+0061 U+0320 U+0300, which breaks markers, because we no longer know where they go. Should this be a compiler error? Or is there some clever way we can work around this (e.g. normalize both app context and cached context before comparison for equality)? Do markers glue to the previous codepoint, e.g. U+0061 \m{marker1} U+0320 U+0300 \m{marker2}?
  3. keyboard rule gives U+00e0 U+0320 \m{marker1}, which we NFD to U+0061 U+0320 \m{marker1} U+0300??
  4. keyboard rule gives U+00e0 \m{marker1} U+0320, which we NFD to U+0061 U+0320 U+0300 \m{marker1}??
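
For reference, the plain normalization behavior behind these four cases can be checked with markers stripped out. This is only a sketch using Python's standard `unicodedata` module; the `codepoints` helper is hypothetical, and it says nothing about where the markers should end up.

```python
import unicodedata

def codepoints(s: str) -> str:
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# Case 1, marker stripped: a + combining grave composes under NFC.
nfc_case1 = unicodedata.normalize("NFC", "\u0061\u0300")

# Case 2, markers stripped: U+0320 (ccc 220) sorts before U+0300 (ccc 230).
nfd_case2 = unicodedata.normalize("NFD", "\u0061\u0300\u0320")

# Cases 3 and 4, markers stripped: U+00E0 decomposes, then reorders.
nfd_case3 = unicodedata.normalize("NFD", "\u00e0\u0320")

print(codepoints(nfc_case1))  # U+00E0
print(codepoints(nfd_case2))  # U+0061 U+0320 U+0300
print(codepoints(nfd_case3))  # U+0061 U+0320 U+0300
```

In every case the combining marks end up in a different position relative to where a marker was written, which is the crux of the questions above.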
srl295 commented 7 months ago

upstream CLDR normalization ticket was merged, but basically, we don't need the ticket, we need the behavior. So this is shovel ready.

srl295 commented 7 months ago

So, I'm kind of thinking at this moment about not trying to normalize in kmc at all. The reason is that the core side will already need to be able to normalize not just all strings in the compiled data, but also the context. Secondly, it gets us out of having to even consider what version of node (or browser!) kmc is running under. Normalizing in kmc could even lead to a class of non-determinism in the compiler, where two runs of kmc give different kmx depending on the node version. With a 'leave it alone' approach, we just write into kmx exactly whatever is in the xml.

mcdurdin commented 7 months ago

Hmm. I hear you on non-determinism. It should only impact unencoded scripts though, right? Given stability rules?

But... how will you do regex matching? It seems to me it would be difficult to normalize regexes post-construction.

srl295 commented 7 months ago

> Hmm. I hear you on non-determinism. It should only impact unencoded scripts though, right? Given stability rules?

previously unencoded, yes.

> But... how will you do regex matching? It seems to me it would be difficult to normalize regexes post-construction.

The regex pattern is just a string and can be normalized before constructing the matcher, I'd think? (He says, confidently)

mcdurdin commented 7 months ago

> The regex pattern is just a string and can be normalized before constructing the matcher, I'd think? (He says, confidently)

\uxxxx says otherwise.
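
A minimal sketch of the problem, using Python's `re` and `unicodedata` as stand-ins for whatever regex engine and normalizer are actually used. Python spells the escape `\u00e0` rather than LDML's `\u{00e0}`, and the variable names are hypothetical. Normalizing the pattern source leaves the escape untouched, because at that point it is only ASCII text:

```python
import re
import unicodedata

# The pattern source holds an escape, not the character itself:
# six ASCII characters, backslash, 'u', '0', '0', 'e', '0'.
pattern_src = r"\u00e0"

# Normalizing the source string is a no-op, since ASCII is already NFD.
normalized_src = unicodedata.normalize("NFD", pattern_src)
unchanged = normalized_src == pattern_src

# The regex engine resolves the escape to U+00E0 only at compile time.
matcher = re.compile(normalized_src)

nfc_text = "\u00e0"                                # precomposed
nfd_text = unicodedata.normalize("NFD", nfc_text)  # a + U+0300

matches_nfc = matcher.search(nfc_text) is not None
matches_nfd = matcher.search(nfd_text) is not None

print(unchanged, matches_nfc, matches_nfd)  # True True False
```

So escapes would have to be resolved before normalization for the "just normalize the pattern string" approach to work.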

mcdurdin commented 7 months ago

And even worse, [a-z]?

mcdurdin commented 7 months ago

Ref https://unicode.org/reports/tr18/#Canonical_Equivalents

Note the magical step 2: "Having the user design the regular expression pattern to match against that defined normalization form."

srl295 commented 7 months ago

> The regex pattern is just a string and can be normalized before constructing the matcher, I'd think? (He says, confidently)

> \uxxxx says otherwise.

If \u{xxxx} were processed later by the regex engine, then yes, but that should only be necessary for syntactical elements.

srl295 commented 7 months ago

> [a-z]

may need to parse and process such a range.

> For example, the pattern should contain no characters that would not occur in that normalization form, nor sequences that would not occur.

characters can be checked for perhaps… sequences may be more challenging.

mcdurdin commented 7 months ago

Precisely. If I have a transform from [\u{00E8}-\u{00EB}] (èéêë) to [\u{00EC}-\u{00EF}] (ìíîï), how will that work with decomposition? Do we need to expand ranges (beware 0020-10FFFF)?

srl295 commented 7 months ago

> Precisely. If I have a transform from [\u{00E8}-\u{00EB}] (èéêë) to [\u{00EC}-\u{00EF}] (ìíîï), how will that work with decomposition? Do we need to expand ranges (beware 0020-10FFFF)?

Could be a reason for limiting the size of ranges…if need be

We may end up from all of this needing to say, the regexes must be written in NFD and see the TR…
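
To make the range question concrete, here is a rough sketch of expanding a small range and decomposing each member, with a size cap. The function name `expand_range_nfd` and the `MAX_RANGE` limit are hypothetical, not anything kmc or core defines:

```python
import unicodedata
from typing import List

MAX_RANGE = 256  # hypothetical cap, guards against ranges like 0020-10FFFF

def expand_range_nfd(first: int, last: int, limit: int = MAX_RANGE) -> List[str]:
    """Expand an inclusive code point range and return the NFD form of each member."""
    if last < first:
        raise ValueError("range end precedes range start")
    if last - first + 1 > limit:
        raise ValueError("range too large to expand")
    return [unicodedata.normalize("NFD", chr(cp)) for cp in range(first, last + 1)]

# [\u{00E8}-\u{00EB}] is a contiguous range in NFC, but in NFD it is
# four two-code-point sequences: e + grave, acute, circumflex, diaeresis.
for seq in expand_range_nfd(0x00E8, 0x00EB):
    print(" ".join(f"U+{ord(c):04X}" for c in seq))
```

After decomposition the members no longer form a single-character class, so a range like this would have to become an alternation of sequences, or be rejected.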

srl295 commented 7 months ago

<transform from="[a][\u{0300}][\u{0320}]" />
<transform from="a\u{0300}\u{0320}" />
<transform from="à̠" />
<transform from="a\u{0320}\u{0300}" />

etc

mcdurdin commented 7 months ago

Let's discuss at our meeting tomorrow

srl295 commented 7 months ago

Ok. An issue with "push all normalization into core" is this … Identity. <key id="a" output="\u{00e0}" /><key id="b" output="\u{0061}\u{0300}" /> will create two strs entries, even though the two outputs are canonically equivalent.

At least it will mean that code can't use the strs index to negatively test for string identity.
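
A toy illustration of that identity issue. The `build_strs` function is hypothetical and only mimics a deduplicating string table; it is not the real kmx strs layout:

```python
import unicodedata
from typing import Dict, List

def build_strs(strings: List[str], normalize: bool = False) -> List[str]:
    """Build a deduplicated string table, optionally keyed on the NFD form."""
    table: List[str] = []
    index: Dict[str, int] = {}
    for s in strings:
        key = unicodedata.normalize("NFD", s) if normalize else s
        if key not in index:
            index[key] = len(table)
            table.append(key)
    return table

outputs = ["\u00e0", "\u0061\u0300"]  # key "a" and key "b" from the example

raw_table = build_strs(outputs)                   # written exactly as in the xml
nfd_table = build_strs(outputs, normalize=True)   # normalized at compile time

print(len(raw_table), len(nfd_table))  # 2 1
```

With the raw table, two different indexes can still refer to canonically equivalent strings, so index inequality does not imply string inequality.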

srl295 commented 7 months ago

https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classNormalizer2.html for reference

srl295 commented 7 months ago

From discussion, captured by @mcdurdin

Markers:

  1. e\u{0300}\m{problem_marker}\u{0320} (NFD) reorders to e\u{0320}\u{0300} (without markers)
  2. We then re-inject markers:
    • e\m{problem_marker}\u{0320}\u{0300} (tie marker to following char), or
    • e\u{0320}\u{0300}\m{problem_marker} (tie marker to preceding char), or
    • e\m{problem_marker}\u{0320}\u{0300} (insert marker as early as possible in norm. cluster)

Argument for following-char method is that it makes more sense for end-of-context:

Proposed Algorithm:

  1. First, we need to transform compat chars and NFC chars to NFD: break strings at markers, normalize each string to NFD
  2. Concatenate the result, including markers, and re-break at stable characters
  3. Re-normalize each chunk, removing markers and remembering attachment to following char, to get any NFD reordering, then re-inject markers

This algorithm can be applied to any string, including: keyboard source strings, input context from app (cached context), and output in various stages from the processor (i.e. before and after each transform step).
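
One possible reading of the three steps above, as a sketch only. It assumes a hypothetical token representation (plain strings plus `Marker` objects), treats any character with combining class 0 as a "stable character", and handles only canonical decomposition, not compatibility characters. Because each run is already decomposed after step 1, the re-normalization in step 3 is done here as a stable sort by combining class, with each marker carried along by the character that follows it:

```python
import unicodedata
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass(frozen=True)
class Marker:
    name: str

Token = Union[str, Marker]

def normalize_with_markers(tokens: List[Token]) -> List[Token]:
    # Step 1: break at markers and normalize each text run to NFD.
    decomposed: List[Token] = []
    run = ""
    for tok in tokens:
        if isinstance(tok, Marker):
            decomposed.extend(unicodedata.normalize("NFD", run))
            decomposed.append(tok)
            run = ""
        else:
            run += tok
    decomposed.extend(unicodedata.normalize("NFD", run))

    # Attach each marker to the character that follows it.
    pairs: List[Tuple[str, List[Marker]]] = []
    pending: List[Marker] = []
    for tok in decomposed:
        if isinstance(tok, Marker):
            pending.append(tok)
        else:
            pairs.append((tok, pending))
            pending = []
    trailing = pending  # markers at end of string have no following char

    # Steps 2 and 3: re-break before each starter (ccc 0), then reorder
    # each chunk by combining class; list.sort is stable, as canonical
    # reordering requires.
    out: List[Token] = []
    chunk: List[Tuple[str, List[Marker]]] = []

    def flush() -> None:
        chunk.sort(key=lambda pair: unicodedata.combining(pair[0]))
        for ch, markers in chunk:
            out.extend(markers)
            out.append(ch)
        chunk.clear()

    for pair in pairs:
        if unicodedata.combining(pair[0]) == 0:
            flush()
        chunk.append(pair)
    flush()
    out.extend(trailing)
    return out

m = Marker("problem_marker")
result = normalize_with_markers(["e", "\u0300", m, "\u0320"])
# result is e, \m{problem_marker}, U+0320, U+0300
```

Since the sort is stable, repeated identical combining marks keep their relative order, so a marker stays with the same occurrence it was attached to.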


What about repeated characters:

This hinges on:

Refs:


Alternative solution: markers always move to end of a normalized sequence.

<key to="[\u{0320}\u{0300}]\m{problem_marker}">

srl295 commented 7 months ago

I'm kind of leaning towards marker-moves-to-end of sequence.

It will make the marker make more sense with the NFC content.

I think we could say that markers stay in the same order they were injected in.
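
For comparison, a sketch of the marker-moves-to-end idea, assuming "sequence" means an NFD sequence (a starter plus its following combining marks) and using the same hypothetical `Marker` token representation. Markers keep their injection order:

```python
import unicodedata
from dataclasses import dataclass
from typing import List, Union

@dataclass(frozen=True)
class Marker:
    name: str

Token = Union[str, Marker]

def markers_to_end(tokens: List[Token]) -> List[Token]:
    out: List[Token] = []
    text = ""                    # code points of the current sequence
    markers: List[Marker] = []   # markers seen in it, in injection order

    def flush() -> None:
        nonlocal text, markers
        out.extend(unicodedata.normalize("NFD", text))
        out.extend(markers)
        text, markers = "", []

    for tok in tokens:
        if isinstance(tok, Marker):
            markers.append(tok)
            continue
        for ch in unicodedata.normalize("NFD", tok):
            # A starter (ccc 0) closes the previous sequence.
            if unicodedata.combining(ch) == 0 and (text or markers):
                flush()
            text += ch
    flush()
    return out

m = Marker("problem_marker")
moved = markers_to_end(["e", "\u0300", m, "\u0320"])
# moved is e, U+0320, U+0300, \m{problem_marker}
```

Under this rule the marker never sits between combining marks, which is simpler to implement but loses the position the keyboard author wrote.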

mcdurdin commented 7 months ago

marker-moves-to-end of sequence: NFC seq or NFD seq? That is, will the marker remain interleaved with combining diacritics? Because if not, I think that's going to be troublesome for keyboard devs to figure out.

srl295 commented 7 months ago

I think we landed on "tie marker to following char"

srl295 commented 7 months ago

i'll make that my password so i remember. tie,MARKER2following-char

Hmm, maybe some downsides to that 🤔