Closed srl295 closed 4 months ago
NFD marker tricks:
U+0061 \m{marker1} U+0300 gives NFC of à. So what happens to the marker in the cached context? Is cached context NFD but app context is NFC? How do we sync?
U+0061 \m{marker1} U+0300 \m{marker2} U+0320 goes to NFD U+0061 U+0320 U+0300, which breaks markers, because we no longer know where they go. Should this be a compiler error? Or is there some clever way we can work around this (e.g. normalize both app context and cached context before comparison for equality?) Do markers glue to previous codepoint, e.g. U+0061 \m{marker1} U+0320 U+0300 \m{marker2}?
U+00e0 U+0320 \m{marker1}, which we NFD to U+0061 U+0320 \m{marker1} U+0300 ??
U+00e0 \m{marker1} U+0320, which we NFD to U+0061 U+0320 U+0300 \m{marker1} ??
Upstream CLDR normalization ticket was merged, but basically, we don't need the ticket, we need the behavior. So this is shovel ready.
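The marker-free reorderings in the "NFD marker tricks" list can be checked quickly. This is only a sketch using Python's standard `unicodedata` module, not anything from kmc or core; the `cps` helper is a made-up name for display purposes.

```python
import unicodedata

def cps(s: str) -> str:
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# a + COMBINING GRAVE ACCENT composes to a single code point under NFC.
print(cps(unicodedata.normalize("NFC", "\u0061\u0300")))        # U+00E0

# U+0300 (ccc=230) and U+0320 (ccc=220) swap places under NFD.
print(cps(unicodedata.normalize("NFD", "\u0061\u0300\u0320")))  # U+0061 U+0320 U+0300

# A precomposed à followed by U+0320 decomposes and then reorders the same way.
print(cps(unicodedata.normalize("NFD", "\u00e0\u0320")))        # U+0061 U+0320 U+0300
```

The third case is the awkward one for markers: the code point a marker sat next to in the source (U+00E0) no longer exists after NFD.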
So, I'm kind of thinking at this moment about not trying to normalize in kmc at all. The reason is that the core side will already need to be able to normalize not just all strings in the compiled data, but also the context. Secondly, it gets us out of having to even consider what version of node (or browser!) kmc is running under. Normalizing in kmc could even lead to a class of non-determinism in the compiler, where two runs of kmc give different kmx depending on the node version. With a 'leave it alone' approach, we just write into kmx exactly whatever is in the xml.
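To make the version dependence concrete, here is a small Python analogue (kmc itself runs on node, so this is an illustration, not the compiler's code). A runtime normalizes according to the Unicode data it ships with, and code points that data does not know are passed through untouched. The helper name `unassigned_code_points` is hypothetical.

```python
import unicodedata

# Version of the Unicode Character Database bundled with this interpreter.
# Two runtimes reporting different versions may normalize some text differently.
print(unicodedata.unidata_version)

def unassigned_code_points(s: str) -> list:
    """Return the code points in s that this runtime's Unicode data treats as unassigned (Cn)."""
    return [f"U+{ord(c):04X}" for c in s if unicodedata.category(c) == "Cn"]

# Ordinary, long-encoded text has nothing to flag.
print(unassigned_code_points("a\u0300"))  # []
```

A compiler that did normalize could use a check along these lines to warn when the source contains code points its runtime cannot yet normalize reliably.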
Hmm. I hear you on non-determinism. It should only impact unencoded scripts though, right? Given stability rules?
But... how will you do regex matching? It seems to me it would be difficult to normalize regexes post-construction.
> Hmm. I hear you on non-determinism. It should only impact unencoded scripts though, right? Given stability rules?

previously unencoded, yes.

> But... how will you do regex matching? It seems to me it would be difficult to normalize regexes post-construction.

The regex pattern is just a string and can be normalized before constructing the matcher, I'd think? (He says, confidently)

> The regex pattern is just a string and can be normalized before constructing the matcher, I'd think? (He says, confidently)

\uxxxx says otherwise.
And even worse, [a-z]?
Ref https://unicode.org/reports/tr18/#Canonical_Equivalents
Note the magical step 2: "Having the user design the regular expression pattern to match against that defined normalization form."
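A small experiment shows why "just normalize the pattern string" is not enough for character classes. This uses Python's `re` and `unicodedata` purely as a stand-in; the regex engine used by core may behave differently in detail.

```python
import re
import unicodedata

pattern = "[\u00e8-\u00eb]"  # class covering è, é, ê, ë as precomposed code points
nfd_pattern = unicodedata.normalize("NFD", pattern)

# The precomposed class cannot match decomposed text: NFD é is two code points.
print(re.fullmatch(pattern, unicodedata.normalize("NFD", "\u00e9")))  # None

# Normalizing the pattern text turns the class into [ e U+0300 - e U+0308 ],
# i.e. a range running from U+0300 down to "e", which Python's re rejects.
try:
    re.compile(nfd_pattern)
    compiled = True
except re.error:
    compiled = False
print(compiled)  # False
```

Even in an engine that accepted the rewritten class, it would match single code points rather than the two-code-point sequences the author meant.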
> The regex pattern is just a string and can be normalized before constructing the matcher, I'd think? (He says, confidently)
> \uxxxx says otherwise.

\u{xxxx} is already processed by KMC. \u{00E9} will already be E9 00 in UTF-16 in .kmx
if \u{xxxx} was processed later by the regex engine, then yes. but that should only be necessary for syntactical elements.

> [a-z]

may need to parse and process such a range.

> For example, the pattern should contain no characters that would not occur in that normalization form, nor sequences that would not occur.

characters can be checked for perhaps… sequences may be more challenging.
Precisely. If I have a transform from [\u{00E8}-\u{00EB}] (èéêë) to [\u{00EC}-\u{00EF}] (ìíîï), how will that work with decomposition? Do we need to expand ranges (beware 0020-10FFFF)?
> Precisely. If I have a transform from [\u{00E8}-\u{00EB}] (èéêë) to [\u{00EC}-\u{00EF}] (ìíîï), how will that work with decomposition? Do we need to expand ranges (beware 0020-10FFFF)?
Could be a reason for limiting the size of ranges…if need be
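One possible shape for range expansion with a size cap, as a sketch only: each code point in the range is decomposed and the class becomes an alternation of NFD sequences. The function name `expand_range_nfd` and the cap `MAX_RANGE` are assumptions for illustration, and this covers only the matching side, not the range-to-range mapping.

```python
import re
import unicodedata

MAX_RANGE = 256  # hypothetical limit; the thread does not settle on a number

def expand_range_nfd(first: str, last: str) -> str:
    """Rewrite the class [first-last] as an alternation of NFD sequences."""
    lo, hi = ord(first), ord(last)
    if hi < lo:
        raise ValueError("range is reversed")
    if hi - lo + 1 > MAX_RANGE:
        raise ValueError(f"range of {hi - lo + 1} code points exceeds {MAX_RANGE}")
    alternatives = [unicodedata.normalize("NFD", chr(cp)) for cp in range(lo, hi + 1)]
    # Longest first, so a longer sequence is never shadowed by a shorter prefix.
    alternatives.sort(key=len, reverse=True)
    return "(?:" + "|".join(re.escape(a) for a in alternatives) + ")"

pattern = expand_range_nfd("\u00e8", "\u00eb")
print(bool(re.fullmatch(pattern, "e\u0301")))  # True: decomposed é is in the range
print(bool(re.fullmatch(pattern, "e\u0303")))  # False: decomposed ẽ is not
```

With the cap in place, a range such as 0020-10FFFF is rejected up front instead of being expanded into over a million alternatives.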
We may end up from all of this needing to say, the regexes must be written in NFD and see the TR…
a\u{0320}\u{300}
<transform from="[a][\u{0300}][\u{0320}]" />
<transform from="a\u{0300}\u{0320}" />
<transform from="à̠" />
<transform from="a\u{0320}\u{0300}" />
etc
Let's discuss at our meeting tomorrow
Ok. An issue with "push all normalization into core" is this … Identity.
<key id="a" output="\u{00e0}" />
<key id="b" output="\u{0061}\u{0300}" />
will create two strs entries.
At least it will mean that code can't use the strs index to negatively test for string identity.
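The identity problem can be seen with a toy string table. The dictionary and set below are a stand-in for the idea of a deduplicated table, not the actual kmx `strs` layout.

```python
import unicodedata

# Outputs of the two keys above, written exactly as they appear in the XML.
outputs = {"a": "\u00e0", "b": "\u0061\u0300"}

# "Leave it alone": the table is deduplicated on raw code points.
raw_table = sorted(set(outputs.values()))

# Normalizing before deduplication collapses the two entries into one.
nfd_table = sorted({unicodedata.normalize("NFD", s) for s in outputs.values()})

print(len(raw_table), len(nfd_table))  # 2 1
```

So with unnormalized storage, two different indices do not prove that two strings differ; a comparison has to normalize both sides first.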
Markers:
e\u{0300}\m{problem_marker}\u{0320}
(NFD) reorders to e\u{0320}\u{0300} (without markers)
e\m{problem_marker}\u{0320}\u{0300} (tie marker to following char), or
e\u{0320}\u{0300}\m{problem_marker} (tie marker to preceding char), or
e\m{problem_marker}\u{0320}\u{0300} (insert marker as early as possible in norm. cluster)
Argument for following-char method is that it makes more sense for end-of-context:
e\u{0300}\u{0320}\m{problem_marker} ==> e\u{0320}\u{0300}\m{problem_marker} (tie marker to following char, i.e. end-of-string)
Proposed Algorithm:
This algorithm can be applied to any string, including: keyboard source strings, input context from app (cached context), and output in various stages from the processor (i.e. before and after each transform step).
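The steps of the proposed algorithm are not reproduced in this thread, so the following is only a sketch of the "tie marker to following char" idea, written to agree with the examples given here. It assumes the `\m{name}` and `\u{xxxx}` notation used in this thread; the function name `nfd_with_markers` is invented.

```python
import re
import unicodedata

# \m{name} marker, \u{hex} escape, or any single literal character.
TOKEN = re.compile(r"\\m\{(\w+)\}|\\u\{([0-9A-Fa-f]+)\}|(.)", re.DOTALL)

def nfd_with_markers(src: str) -> str:
    items = []    # [code point, markers tied to it (they sit just before it)]
    pending = []  # markers seen since the last code point
    for m in TOKEN.finditer(src):
        marker, hexcp, literal = m.groups()
        if marker is not None:
            pending.append(marker)
            continue
        ch = chr(int(hexcp, 16)) if hexcp is not None else literal
        # Decompose; waiting markers attach to the first resulting code point.
        for i, cp in enumerate(unicodedata.normalize("NFD", ch)):
            items.append([cp, pending if i == 0 else []])
            if i == 0:
                pending = []

    # Canonical reordering: stable sort of each run of non-starters by
    # combining class. Markers travel with the code point they are tied to.
    out, i = [], 0
    while i < len(items):
        j = i
        while j < len(items) and unicodedata.combining(items[j][0]) != 0:
            j += 1
        if j > i:
            out.extend(sorted(items[i:j], key=lambda it: unicodedata.combining(it[0])))
            i = j
        else:
            out.append(items[i])
            i += 1

    def show(cp: str) -> str:
        return cp if cp.isascii() and cp.isprintable() else f"\\u{{{ord(cp):04x}}}"

    def marks(names) -> str:
        return "".join(f"\\m{{{n}}}" for n in names)

    # Markers still pending at the end are tied to end-of-string.
    return "".join(marks(ms) + show(cp) for cp, ms in out) + marks(pending)

print(nfd_with_markers(r"e\u{0300}\m{problem_marker}\u{0320}"))
# e\m{problem_marker}\u{0320}\u{0300}
```

Because the sort is stable, repeated combining marks keep their relative order, so a marker placed before the second of two identical marks stays with the second one.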
What about repeated characters:
e\u{0300}\u{0320}\m{problem_marker}\u{0300} --> e\u{0320}\u{0300}\m{problem_marker}\u{0300}. i.e. first match (iterating end-to-start)
e\u{0300}\u{0320}\m{problem_marker}\m{m2}\u{0300} --> e\u{0320}\u{0300}\m{problem_marker}\m{m2}\u{0300}
e\m{m2}\u{0300}\m{m3}\u{0320}\m{m1}\u{0300} --> e\m{m3}\u{0320}\m{m2}\u{0300}\m{m1}\u{0300}. keep same order.
This hinges on:
Refs:
Alternative solution: markers always move to end of a normalized sequence.
<key to="[\u{0320}\u{0300}]\m{problem_marker}">
<transform from="e\u{0320}\m{problem_marker}\u{0300}" to="HELLO">
<transform from="e\u{0320}\u{0300}\m{problem_marker}" to="GOODBYE">
I'm kind of leaning towards marker-moves-to-end of sequence.
It will make the marker make more sense with the NFC content.
I think we could say that markers stay in the same order they were injected in.
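A sketch of the marker-moves-to-end alternative, under one loud assumption: "sequence" is taken to mean a starter plus the combining marks that follow it in NFD. The thread does not settle whether the NFC or NFD sequence is meant, so this is an interpretation rather than an agreed design, and `nfd_markers_to_end` is an invented name.

```python
import re
import unicodedata

TOKEN = re.compile(r"\\m\{(\w+)\}|\\u\{([0-9A-Fa-f]+)\}|(.)", re.DOTALL)

def nfd_markers_to_end(src: str) -> str:
    decomposed = []  # fully decomposed code points, markers removed
    marks = []       # (index of the code point the marker preceded, name)
    for m in TOKEN.finditer(src):
        marker, hexcp, literal = m.groups()
        if marker is not None:
            marks.append((len(decomposed), marker))
        else:
            ch = chr(int(hexcp, 16)) if hexcp is not None else literal
            decomposed.extend(unicodedata.normalize("NFD", ch))

    # Only reordering is left to do, so starters keep their positions.
    text = unicodedata.normalize("NFD", "".join(decomposed))

    # Push each marker forward past combining marks to the end of its sequence.
    slots = {}
    for pos, name in marks:
        while pos < len(text) and unicodedata.combining(text[pos]) != 0:
            pos += 1
        slots.setdefault(pos, []).append(name)  # source order is preserved

    def show(cp: str) -> str:
        return cp if cp.isascii() and cp.isprintable() else f"\\u{{{ord(cp):04x}}}"

    out = []
    for i in range(len(text) + 1):
        out.extend(f"\\m{{{n}}}" for n in slots.get(i, []))
        if i < len(text):
            out.append(show(text[i]))
    return "".join(out)

print(nfd_markers_to_end(r"e\u{0320}\m{problem_marker}\u{0300}"))
# e\u{0320}\u{0300}\m{problem_marker}
```

Under this reading the two transform `from` strings above normalize to the same text, so a keyboard could no longer send one to HELLO and the other to GOODBYE.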
marker-moves-to-end of sequence: NFC seq or NFD seq? That is, will the marker remain interleaved with combining diacritics? Because if not, I think that's going to be troublesome for keyboard devs to figure out.
I think we landed on "tie marker to following char"
i'll make that my password so i remember. tie,MARKER2following-char
Hmm, maybe some downsides to that 🤔
CLDR-16943 details (or will detail) SC consensus about the role of Unicode normalization. Implement it.
Split out to remaining issues under m:normalization
10320
10317