keymanapp / keyman

Keyman cross platform input methods system running on Android, iOS, Linux, macOS, Windows and mobile and desktop web
https://keyman.com/

chore(common): global input normalization in lexical models #9598

Open jahorton opened 11 months ago

jahorton commented 11 months ago
  1. Then, once we have (4) done, the model engine should inherit the normalization form of the associated keyboard -- so if the keyboard emits NFC, the model engine should normalize(nfc) its outputs. And the converse for NFD. The internals don't matter so much, but we use NFC throughout, so let's stick with that. Inputs to the model should be normalized to NFC (they probably already are?)

The lexical model compiler already does the normalization to NFC at build time, so I think we can declare this issue done. Remaining work is in referenced issues.

Originally posted by @mcdurdin in https://github.com/keymanapp/keyman/issues/2880#issuecomment-1726770721
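For illustration, the build-time step described in that quote amounts to something like the following minimal sketch (assumptions: the wordlist is handled as an array of strings, and `normalizeWordlist` is a hypothetical helper name, not the compiler's actual code):

```typescript
// Hypothetical sketch: normalize every wordlist entry to NFC once at build time,
// so the runtime model never has to reconcile mixed normalization forms.
function normalizeWordlist(entries: string[]): string[] {
  return entries.map(entry => entry.normalize('NFC'));
}
```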

If using a Trie-based wordlist, part of the model functionality converts both sides to NFD (with the default search-term keyer) when doing a word lookup. It's consistent and has been working well, so no worries here.
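For context, the idea behind keying both sides to NFD can be sketched roughly as follows (a simplified illustration only; the name `toLookupKey` is made up, and the engine's actual default keyer may do more than this):

```typescript
// Simplified sketch of an NFD-based search-term keyer: the typed term and each
// wordlist entry are keyed the same way, so composed and decomposed spellings
// resolve to the same Trie lookup key.
function toLookupKey(term: string): string {
  return term.normalize('NFD').toLowerCase();
}

// Both spellings of "école" produce identical keys:
console.log(toLookupKey('\u00E9cole') === toLookupKey('e\u0301cole')); // true
```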

For custom models, and on the global level... we technically don't yet enforce a normalization pattern.

MattGyverLee commented 11 months ago

From #2880 Comment

Some may argue this is "linguistically appropriate" because the decomposed diacritics are tone marks (I think?) but it's a bad situation caused by tech stack limitations.

Yeah, that's where I land. In most cases here in Cameroon, the diacritics are separate processes (autosegmental). Here, é and è are the same letter with different tone. In French, those are two different vowels, and fully composed makes sense since it is possible. Mass changes to diacritics and base characters are much easier when decomposed. Cameroon started with dead-key based diacritic-letter keyboards that would output multiple decomposed characters, and they have LOVED the letter-diacritic keyboards, especially in cases where they are stacking diacritics. I love the idea of the keyboard determining the composition of the language model output.

My admittedly biased opinion is that if a language can be represented as FULLY composed (not mixed), then it might as well be composed. If there are combinations in the language that cannot be composed (ə̀ for example), then it is more consistent to manipulate NFD. I think this is the underlying opinion for FLEx, which uses decomposed forms internally but outputs NFC. I don't have any cases where normalization doesn't perfectly round-trip, so I'm happy to see tools doing normalization before comparison.
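To make the "cannot be composed" case concrete, here is a quick check (illustrative only): é has a precomposed code point, so NFC collapses it, but ə̀ has none, so it stays decomposed even in NFC.

```typescript
// 'e' + combining acute composes to a single code point under NFC,
// but schwa + combining grave has no precomposed form and stays as two code points.
console.log([...'e\u0301'.normalize('NFC')].length);      // 1  (U+00E9, é)
console.log([...'\u0259\u0300'.normalize('NFC')].length); // 2  (ə̀ remains decomposed)
```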

The problem is that MOST users don't have access to business-facing tools that can do normalization. Word, Excel, PowerPoint, LibreOffice, etc. don't have this functionality out of the box without SIL Converters. Most non-linguistic users in this country only have a keyboard and whatever tools they use every day (an office suite and a browser).

Side Story: The App Builders were converting the text to NFC or consuming NFC output from FLEx, which makes sense for display reasons. However, searching with an NFD keyboard resulted in comparing forms that didn't match. They ended up composing the search term before comparing it to the data and things started working again. As long as the data is normalized one way or the other, things work out.
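That mismatch comes down to comparing strings in different normalization forms. A hedged illustration (not the App Builders' actual code; the variable names are made up):

```typescript
// Stored data is NFC; an NFD keyboard produces a decomposed search term.
const stored = 'caf\u00E9';    // "café", composed
const typed  = 'cafe\u0301';   // "café", decomposed
console.log(stored === typed);                   // false: raw code points differ
console.log(stored === typed.normalize('NFC'));  // true: normalize before comparing
```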

2. IMHO, NFC is the most appropriate default form for almost any language and situation, especially as the vast majority of content online is NFC. Keyboards should deal with backspacing scenarios in the case that you want consistent backspacing behavior that smooths over the normalization form peculiarities.

In a world where only Keyman keyboards are used, I agree that backspacing should be handled by the keyboard. Unfortunately I have MSKLC and XKB versions of the Cameroon Keyboard, and backspacing is left to the OS. Unless you know exactly which major-language combinations made it into Unicode before the cutoff, backspacing is entirely unpredictable in NFC with any keyboard other than a well-designed KM keyboard.
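As a rough illustration of that unpredictability, here is a toy model of a backspace that simply removes the last code point (an assumption about naive OS behaviour, not a description of any particular keyboard):

```typescript
// A code-point-level backspace behaves differently depending on whether the
// preceding text happens to be composed or decomposed.
const naiveBackspace = (s: string): string => [...s].slice(0, -1).join('');
console.log(naiveBackspace('caf\u00E9'));   // "caf"  -- composed é deleted entirely
console.log(naiveBackspace('cafe\u0301'));  // "cafe" -- only the combining accent deleted
```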

I know that Unicode's intention is that all applications handle composed and decomposed forms as equivalent. SIL's back-and-forth normalization depending on the case approaches this ideal. Spellcheckers have not traditionally done this, but I hear David Rowe just got hunspell working properly in LibreOffice with a decomposed dictionary (LO was assuming composed).

Side Question: I haven't played with LDML keyboards. Do LDML keyboards support backspace rules? Do we have any hope that OS-based keyboards (I'm thinking about MSKLC, XKB, Google and iOS keyboards) will eventually either support on-the-fly normalization of the context or backspace rules?

mcdurdin commented 11 months ago

So I guess the real question here is: do you want NFC, or do you want consistent backspacing in all scenarios? Choose one. And choose wisely. There are gotchas both ways.

I haven't played with LDML keyboards. Do LDML keyboards support backspace rules? Do we have any hope that OS-based keyboards (I'm thinking about MSKLC, XKB, Google and iOS keyboards) will eventually either support on-the-fly normalization of the context or backspace rules?

LDML keyboards do support backspace rules. They are not ready to work with yet, as the spec is still in draft, and Keyman's implementation is underway. Obviously we cannot know if operating system vendors will implement the spec, but we are certainly proceeding under the hope that they will, one day. But you are welcome to take a look at the spec and start to gain an understanding of it: https://github.com/unicode-org/cldr/blob/main/docs/ldml/tr35-keyboards.md shows the current draft document.

(One further wrinkle: even if backspacing is built into the keyboard, you are going to see divergent behaviour on backspace when switching between different language keyboards. Probably not a major problem though.)

MattGyverLee commented 11 months ago

Thanks.

I know I'm not going to change things. Data input is usually working around traditional limitations that don't exist anymore (255 characters per font, offset keys on typewriter posts, jamming typewriter pins from typing too fast, and more). I say this typing in Unicode on a columnar Dvorak computer keyboard. Call me a rebel, I dare ya!

I suppose what I would love to see is FD or FC on a per-language basis. Editing NFD makes my brain happy for its consistency. I'm so glad FLEx does this and I want to see the same smooth experience as an option in non-linguistic software. Composing to publish "shouldn't" be necessary in today's world, but PowerPoint is STILL the main holdout on letting fonts reliably place diacritics.

The "N" in NFC is what makes it messy. Major languages can choose FC because they got in before the cutoff. Many (most?) minority languages can't ever have FC because the composed combinations will never exist.

(Edit: I moved the rest of this post to https://github.com/keymanapp/keyman/issues/5809 .)

mcdurdin commented 11 months ago

The "N" in NFC is what makes it messy. Major languages can choose FC because they got in before the cutoff. Many (most?) minority languages can't ever have FC because the composed combinations will never exist.

Just for clarity: it is still NFC, per the definition of NFC, which is basically "maximal composition". Once we have maximal composition, any remaining combining diacritics are still legitimate and permitted to remain, and the text is normalized. So, NFC still remains the best choice. As discussed, we're working to get the input methods to catch up, and once they do, the remaining user interface issues are largely resolved.
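A small check of that point (illustrative only): a string containing a combining mark with no precomposed pairing is already valid NFC, so applying NFC again changes nothing.

```typescript
// ə̀ has no precomposed code point, yet the string is already in NFC.
const s = '\u0259\u0300';
console.log(s.normalize('NFC') === s); // true: already maximally composed
```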