feat(core): normalization per spec for transforms/etc 🙀

srl295 commented 9 months ago

CLDR-16943 details (or will detail) SC consensus about the role of Unicode normalization. Implement it.

Split out into the remaining issues under m:normalization:

10320
10317

srl295 commented 7 months ago

Cached Context needs to stay in NFD
Engine needs to be able to do a normalization-insensitive compare
Normalization should be available to KMN but not switched on now - KeyboardProcessor

mcdurdin commented 7 months ago

NFD marker tricks:

- U+0061 \m{marker1} U+0300 gives NFC of à. So what happens to the marker in the cached context? Is cached context NFD but app context NFC? How do we sync?
- U+0061 \m{marker1} U+0300 \m{marker2} U+0320 goes to NFD U+0061 U+0320 U+0300, which breaks markers because we no longer know where they go. Should this be a compiler error? Or is there some clever way to work around this (e.g. normalize both app context and cached context before comparing for equality)? Do markers glue to the previous codepoint, e.g. U+0061 \m{marker1} U+0320 U+0300 \m{marker2}?
- A keyboard rule gives U+00e0 U+0320 \m{marker1}, which we NFD to U+0061 U+0320 \m{marker1} U+0300??
- A keyboard rule gives U+00e0 \m{marker1} U+0320, which we NFD to U+0061 U+0320 U+0300 \m{marker1}??
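
For reference, a minimal sketch of the reordering behind these cases, using Python's standard `unicodedata` module (markers left out; the `codepoints` helper is only for display and is not part of any Keyman API):

```python
import unicodedata

def codepoints(s: str) -> str:
    """Format a string as space-separated U+XXXX codepoints."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# a + COMBINING GRAVE ACCENT (ccc 230) + COMBINING MINUS SIGN BELOW (ccc 220)
typed = "\u0061\u0300\u0320"

nfd = unicodedata.normalize("NFD", typed)
nfc = unicodedata.normalize("NFC", typed)

print(codepoints(nfd))  # U+0061 U+0320 U+0300
print(codepoints(nfc))  # U+00E0 U+0320
```

The lower combining class sorts first, so U+0320 moves ahead of U+0300 in NFD, and NFC can then compose a + U+0300 into U+00E0. Any marker that sat between the two combining marks no longer has an obvious position.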

srl295 commented 7 months ago

upstream CLDR normalization ticket was merged, but basically, we don't need the ticket, we need the behavior. So this is shovel ready.

srl295 commented 7 months ago

So, at the moment I'm leaning towards not trying to normalize in kmc at all. First, the core side will already need to be able to normalize not just all strings in the compiled data, but also the context. Second, it gets us out of having to even consider what version of node (or browser!) kmc is running under. Normalizing in kmc could even lead to a class of non-determinism in the compiler, where two runs of kmc give different kmx depending on the node version. With a 'leave it alone' approach, we just write into kmx exactly whatever is in the xml.

mcdurdin commented 7 months ago

Hmm. I hear you on non-determinism. It should only impact unencoded scripts though, right? Given stability rules?

But... how will you do regex matching? It seems to me it would be difficult to normalize regexes post-construction.

srl295 commented 7 months ago

> Hmm. I hear you on non-determinism. It should only impact unencoded scripts though, right? Given stability rules?

Previously unencoded, yes.

> But... how will you do regex matching? It seems to me it would be difficult to normalize regexes post-construction.

The regex pattern is just a string and can be normalized before constructing the matcher, I'd think? (He says, confidently)

mcdurdin commented 7 months ago

> The regex pattern is just a string and can be normalized before constructing the matcher, I'd think? (He says, confidently)

\uxxxx says otherwise.

mcdurdin commented 7 months ago

And even worse, [a-z]?

mcdurdin commented 7 months ago

Ref https://unicode.org/reports/tr18/#Canonical_Equivalents

Note the magical step 2: "Having the user design the regular expression pattern to match against that defined normalization form."

srl295 commented 7 months ago

> > The regex pattern is just a string and can be normalized before constructing the matcher, I'd think? (He says, confidently)
>
> \uxxxx says otherwise.

\u{xxxx} is already processed by kmc, so \u{00E9} will already be stored as E9 00 (UTF-16LE) in the .kmx. When core pulls it in, it can be normalized.

If \u{xxxx} were processed later by the regex engine, then yes, it would be a problem. But that should only be necessary for syntactical elements.
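
A rough sketch of why the order matters: normalization cannot see through an escape, so the escape has to be resolved first. The helper names here are made up for illustration, and only the single-codepoint `\u{XXXX}` form is handled:

```python
import re
import unicodedata

# Hypothetical helper: matches the \u{XXXX} escape form used in this thread.
_ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]{1,6})\}")

def unescape(source: str) -> str:
    """Replace each \\u{XXXX} escape with the codepoint it names."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), source)

def load_string(source: str) -> str:
    """Resolve escapes first, then normalize to NFD."""
    return unicodedata.normalize("NFD", unescape(source))

raw = r"\u{00E9}"

# Normalizing the escaped text is a no-op: it is all ASCII.
still_escaped = unicodedata.normalize("NFD", raw)

# Resolving the escape first gives e + COMBINING ACUTE ACCENT.
loaded = load_string(raw)
```

If the compiler has already resolved the escapes, only the second path is ever exercised, which is the point being made above.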

srl295 commented 7 months ago

> [a-z]

We may need to parse and process such a range.

> For example, the pattern should contain no characters that would not occur in that normalization form, nor sequences that would not occur.

Characters can perhaps be checked for… sequences may be more challenging.

mcdurdin commented 7 months ago

Precisely. If I have a transform from [\u{00E8}-\u{00EB}] (èéêë) to [\u{00EC}-\u{00EF}] (ìíîï), how will that work with decomposition? Do we need to expand ranges (beware 0020-10FFFF)?

srl295 commented 7 months ago

> Precisely. If I have a transform from [\u{00E8}-\u{00EB}] (èéêë) to [\u{00EC}-\u{00EF}] (ìíîï), how will that work with decomposition? Do we need to expand ranges (beware 0020-10FFFF)?

Could be a reason for limiting the size of ranges…if need be

We may end up from all of this needing to say, the regexes must be written in NFD and see the TR…
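
To make the range problem concrete, here is a sketch in Python (its `re` module stands in for whatever regex engine core ends up using). It shows that normalizing the pattern text directly breaks the character class, and that one possible workaround is expanding a small range into NFD alternatives. `expand_range` and `MAX_RANGE` are hypothetical names, and the limit of 256 is an arbitrary assumption, not something from the spec:

```python
import re
import unicodedata

MAX_RANGE = 256  # hypothetical cap on range size, not from the spec

def expand_range(first: str, last: str, limit: int = MAX_RANGE) -> str:
    """Turn a codepoint range into an alternation of NFD alternatives."""
    size = ord(last) - ord(first) + 1
    if size < 1 or size > limit:
        raise ValueError(f"range of {size} codepoints is outside 1..{limit}")
    alternatives = []
    for cp in range(ord(first), ord(last) + 1):
        decomposed = unicodedata.normalize("NFD", chr(cp))
        if decomposed not in alternatives:
            alternatives.append(decomposed)
    return "(?:" + "|".join(re.escape(alt) for alt in alternatives) + ")"

# Naive approach: normalize the pattern text itself.
# [è-ë] becomes [e U+0300 - e U+0308], which is no longer a sensible class.
naive = unicodedata.normalize("NFD", "[\u00e8-\u00eb]")
try:
    re.compile(naive)
    naive_compiles = True
except re.error:
    naive_compiles = False  # Python rejects the reversed range U+0300-e

# Expanded approach: each member of the range is decomposed separately.
pattern = re.compile(expand_range("\u00e8", "\u00eb"))
matches_e_circumflex = pattern.fullmatch(unicodedata.normalize("NFD", "\u00ea")) is not None
matches_plain_e = pattern.fullmatch("e") is not None
```

A range like 0020-10FFFF is rejected by the size check rather than expanded. This does not address sequences that span several pattern elements, which is the harder case.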

srl295 commented 7 months ago

Maybe the spec should say that the transforms are actually in NFD space (may make the most sense). That is, the pattern becomes NFD.
In other words, this would never match, because the input text would always be normalized to a\u{0320}\u{0300}:

<transform from="[a][\u{0300}][\u{0320}]" />

However, all of these could match, because the pattern string would be normalized:

<transform from="a\u{0300}\u{0320}" />
<transform from="à̠" />
<transform from="a\u{0320}\u{0300}" />

etc
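
A quick check of that claim, sketched with Python's `unicodedata` and `re` (an assumption: the real matcher is not Python's, so this only illustrates the normalization side, and the literal à̠ is taken to be U+00E0 U+0320):

```python
import re
import unicodedata

def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)

# What the engine would see in context after normalization: a U+0320 U+0300
context = nfd("a\u0300\u0320")

# The three literal "from" strings all normalize to the same sequence.
literals = ["a\u0300\u0320", "\u00e0\u0320", "a\u0320\u0300"]
literal_results = [nfd(lit) == context for lit in literals]

# The bracketed form is untouched by string normalization, because the
# brackets sit between the combining marks and block reordering.
class_source = "[a][\u0300][\u0320]"
class_unchanged = nfd(class_source) == class_source
class_matches = re.search(class_source, context) is not None
```

Here every entry of `literal_results` is true, while `class_matches` is false: the class sequence still asks for U+0300 before U+0320, an order that never occurs in NFD text.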

mcdurdin commented 7 months ago

Let's discuss at our meeting tomorrow

srl295 commented 7 months ago

Ok. An issue with "push all normalization into core" is this … Identity. <key id="a" output="\u{00e0}" /><key id="b" output="\u{0061}\u{0300}" /> will create two strs entries.

At least it will mean that code can't use the strs index to negatively test for string identity.
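
A toy illustration of the identity problem. `StringTable` is a made-up stand-in, not the actual strs section layout or any kmc API:

```python
import unicodedata

class StringTable:
    """Toy string table: returns the index of each interned string."""

    def __init__(self, normalize: bool):
        self.normalize = normalize
        self.entries = []

    def intern(self, s: str) -> int:
        if self.normalize:
            s = unicodedata.normalize("NFD", s)
        if s not in self.entries:
            self.entries.append(s)
        return self.entries.index(s)

# "Leave it alone" compiler: strings are stored exactly as written.
verbatim = StringTable(normalize=False)
key_a = verbatim.intern("\u00e0")   # output of key "a"
key_b = verbatim.intern("a\u0300")  # output of key "b"

# Normalizing compiler: canonically equivalent strings share one entry.
normalized = StringTable(normalize=True)
norm_a = normalized.intern("\u00e0")
norm_b = normalized.intern("a\u0300")
```

With the verbatim table the two keys get different indexes even though their outputs are canonically equivalent, so differing indexes say nothing about whether the strings are "the same".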

srl295 commented 7 months ago

https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classNormalizer2.html for reference

srl295 commented 7 months ago

From discussion, captured by @mcdurdin

Markers:

1 e\u{0300}\m{problem_marker}\u{0320} (NFD) reorders to 1 e\u{0320}\u{0300} (without markers)
We then re-inject markers:
- 1 e\m{problem_marker}\u{0320}\u{0300} (tie marker to following char), or
- 1 e\u{0320}\u{0300}\m{problem_marker} (tie marker to preceding char), or
- 1 e\m{problem_marker}\u{0320}\u{0300} (insert marker as early as possible in norm. cluster)

Argument for following-char method is that it makes more sense for end-of-context:

1 e\u{0300}\u{0320}\m{problem_marker} ==> 1 e\u{0320}\u{0300}\m{problem_marker} (tie marker to following char, i.e. end-of-string)
To implement this, work backwards from end-of-context to the first stable char.

Proposed Algorithm:

1. First, we need to transform compat chars and NFC chars to NFD: break strings at markers, normalize each string to NFD
2. Concatenate the result, including markers, and re-break at stable characters
3. Re-normalize each chunk, removing markers and remembering attachment to following char, to get any NFD reordering, then re-inject markers

This algorithm can be applied to any string, including: keyboard source strings, input context from app (cached context), and output in various stages from the processor (i.e. before and after each transform step).
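
A minimal sketch of the proposed steps, in Python for brevity (the real implementation would live in core). Assumptions, none of which come from the spec: markers are written as `\m{name}` inside the string, a "stable" character is taken to be any character with canonical combining class 0, and re-normalizing an already decomposed chunk only needs a stable sort by combining class. `normalize_with_markers` is a hypothetical name:

```python
import re
import unicodedata

MARKER = re.compile(r"(\\m\{[^}]*\})")

def normalize_with_markers(source: str) -> str:
    # Step 1: break at markers and NFD each text piece. Every marker is
    # remembered on the character that follows it.
    units = []    # (markers_before, char) pairs
    pending = []  # markers still waiting for a following character
    for piece in MARKER.split(source):
        if MARKER.fullmatch(piece):
            pending.append(piece)
            continue
        for ch in unicodedata.normalize("NFD", piece):
            units.append((pending, ch))
            pending = []
    trailing = pending  # markers tied to end-of-string

    out = []
    run = []  # current run of combining marks (ccc != 0)

    def flush():
        # Step 3: canonical reordering of the run. The sort is stable, so
        # equal-class marks keep their order; markers travel with their char.
        run.sort(key=lambda unit: unicodedata.combining(unit[1]))
        for markers, ch in run:
            out.extend(markers)
            out.append(ch)
        run.clear()

    # Step 2: re-break at stable characters (assumed here: ccc == 0).
    for markers, ch in units:
        if unicodedata.combining(ch) == 0:
            flush()
            out.extend(markers)
            out.append(ch)
        else:
            run.append((markers, ch))
    flush()
    out.extend(trailing)
    return "".join(out)

# e U+0300 \m{problem_marker} U+0320: marker follows U+0320 to the front.
example = normalize_with_markers("e\u0300\\m{problem_marker}\u0320")
```

With the markers stripped, the output should equal plain NFD of the marker-free input, which is a cheap invariant to check in tests.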

What about repeated characters:

1 e\u{0300}\u{0320}\m{problem_marker}\u{0300} --> 1 e\u{0320}\u{0300}\m{problem_marker}\u{0300}. i.e. first match (iterating end-to-start)
1 e\u{0300}\u{0320}\m{problem_marker}\m{m2}\u{0300} --> 1 e\u{0320}\u{0300}\m{problem_marker}\m{m2}\u{0300}
1 e\m{m2}\u{0300}\m{m3}\u{0320}\m{m1}\u{0300} --> 1 e\m{m3}\u{0320}\m{m2}\u{0300}\m{m1}\u{0300}. keep same order.

This hinges on:

Given two independent normalized-to-NFD strings, when concatenated, can we ever have codepoints change? (it is recognized that reorder is possible.) --> my expectation is that if this is the case, it will be one or two old sequences that we can special-case on, so we should be able to move forward.

Refs:

https://unicode.org/reports/tr15/#Concatenation
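
A small check of the reordering half of that question, using Python's `unicodedata`. It shows one case only and does not settle whether codepoints can ever change:

```python
import unicodedata

left = unicodedata.normalize("NFD", "e\u0300")   # already NFD
right = unicodedata.normalize("NFD", "\u0320")   # already NFD

joined = left + right
renormalized = unicodedata.normalize("NFD", joined)

# The concatenation is not in NFD: U+0320 (ccc 220) must move before U+0300 (ccc 230).
order_changed = joined != renormalized

# In this case the codepoints themselves are the same, only their order differs.
same_codepoints = sorted(joined) == sorted(renormalized)
```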

Alternative solution: markers always move to end of a normalized sequence.

<key to="[\u{0320}\u{0300}]\m{problem_marker}">

<transform from="e\u{0320}\m{problem_marker}\u{0300}" to="HELLO">
<transform from="e\u{0320}\u{0300}\m{problem_marker}" to="GOODBYE">
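
One possible reading of the alternative, sketched in Python. The assumption here is that "normalized sequence" means a starter (combining class 0) plus the combining marks that follow it; the thread does not pin that down, and `markers_to_end` is a hypothetical name:

```python
import re
import unicodedata

MARKER = re.compile(r"(\\m\{[^}]*\})")

def markers_to_end(source: str) -> str:
    """Move every marker to the end of the sequence it appears in."""
    clusters = []  # each entry: [chars, markers]
    for piece in MARKER.split(source):
        if MARKER.fullmatch(piece):
            if not clusters:
                clusters.append([[], []])
            clusters[-1][1].append(piece)  # markers keep their injection order
            continue
        for ch in unicodedata.normalize("NFD", piece):
            # A starter opens a new sequence; combining marks join the current one.
            if unicodedata.combining(ch) == 0 or not clusters:
                clusters.append([[], []])
            clusters[-1][0].append(ch)
    out = []
    for chars, markers in clusters:
        out.append(unicodedata.normalize("NFD", "".join(chars)))
        out.extend(markers)
    return "".join(out)

hello_from = markers_to_end("e\u0320\\m{problem_marker}\u0300")
goodbye_from = markers_to_end("e\u0320\u0300\\m{problem_marker}")
```

Under this reading both `from` strings above end up identical, so the two transforms could no longer be told apart by marker position.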

srl295 commented 7 months ago

I'm kind of leaning towards marker-moves-to-end of sequence.

It will make the marker make more sense with the NFC content.

I think we could say that markers stay in the same order they were injected in.

mcdurdin commented 7 months ago

marker-moves-to-end of sequence: NFC seq or NFD seq? That is, will the marker remain interleaved with combining diacritics? Because if not, I think that's going to be troublesome for keyboard devs to figure out.

srl295 commented 7 months ago

I think we landed on "tie marker to following char"

srl295 commented 7 months ago

i'll make that my password so i remember. tie,MARKER2following-char

Hmm, maybe some downsides to that 🤔

keymanapp / keyman

feat(core): normalization per spec for transforms/etc 🙀 #9468
