w3c / i18n-activity

Home pages, charters, style-guides, and similar documents related to the W3C Internationalization Activity.

When should UAOA be used? #492

Closed r12a closed 4 years ago

r12a commented 6 years ago

This is a tracker issue. Only discuss things here if they are i18n WG internal meta-discussions about the issue. Contribute to the actual discussion at the following link:

https://www.unicode.org/review/pri359/


http://www.unicode.org/reports/tr53/

It's not clear from the UTR when applications should use UAOA. One implication seems to be that it should be used any time normalisation is applied; however, "5.6 Other uses for UAOA" seems to imply that its use is optional and intended only for certain operations.

If UAOA should be used any time Arabic text is normalised – and that seems likely, given that NFC/NFD forms are unable to represent the differences that UAOA can when string matching – it seems problematic to me to package the algorithm as a separate UTR. For example, a large number of W3C specs require normalisation prior to string matching (think, for example, of matching CSS selectors against HTML ids). It is unlikely that we could, even if we wanted to, go back to all those specifications and require instead that they both normalise AND apply UAOA in order to ensure that strings match correctly. This is particularly complicated because people may think that UAOA needs to be applied only if Arabic text is being matched. On the other hand, if this algorithm were simply part of the overall normalization algorithm, no change would be needed.
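As a concrete sketch of that matching problem (using Python's unicodedata module; the choice of marks is an illustrative assumption, not an example taken from the UTR): two Arabic sequences that differ only in the order of their combining marks become identical after NFC, so a match-after-normalisation step cannot distinguish them, and one ordering is silently rewritten into the other.

```python
import unicodedata

beh      = "\u0628"  # ARABIC LETTER BEH
shadda   = "\u0651"  # ARABIC SHADDA    (canonical combining class 33)
fathatan = "\u064B"  # ARABIC FATHATAN  (canonical combining class 27)

s1 = beh + shadda + fathatan   # marks in one order
s2 = beh + fathatan + shadda   # marks in the other (canonical) order

print(s1 == s2)                                # False: the stored code point orders differ
print(unicodedata.normalize("NFC", s1) ==
      unicodedata.normalize("NFC", s2))        # True: NFC collapses both to the same form
print(unicodedata.normalize("NFC", s1) == s2)  # True: s1's ordering is rewritten to s2's
```

The same mechanism is behind the editing concern further down: any tool that normalises on save will rewrite the first ordering into the second.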

If the application of this algorithm is actually optional, or intended only for certain operations (such as preparation for rendering), the situation is much worse, because that would increase the likelihood of mismatched strings in Arabic.

It's hard to escape the implication that this is actually an extension to normalization: if a content author uses an editor that automatically normalises text, any reordering of Arabic content away from the standard normalised order would be lost every time the author saves a file.

I find myself wondering whether it's really important that the Unicode stability rules should prevent changes to normalization output (which i assume to be the reason for UTR 53), rather than simply admitting that Arabic canonical combining class values were badly broken and should be changed. Given that normalization produces a mechanical and consistent output when it is applied to text, and that it can be reapplied to text if needed, is it so important not to change the normalization algorithm in this respect?†

The alternative seems to be that we are effectively introducing a new normalisation form, and in the process making NFD and NFC redundant, since they cannot preserve the differences needed for Arabic string matching.

I also have a concern that these Arabic fixes may not be the end of the story, but that other scripts will come up with situations where NFC/NFD are not able to represent significant differences in encoding, and that this will lead to new algorithms of this type – effectively producing a set of new normalisation forms that have to be applied on a script-by-script basis – not something that seems very workable.

This is not just an issue for string matching. If i read a document into Dreamweaver, edit a line, and resave, it will by default NFC-normalise the text – thereby destroying the combining character order carefully set up by the previous editor.

† (There's a question in my mind about whether the UAOA is compatible with NFC at all, even if it were implemented as part of the NFC algorithm, given that NFC recombines certain characters in possibly undesirable ways.)
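A one-line illustration of the kind of recombination meant here (assuming Python's unicodedata; the ALEF example is an illustrative assumption, not taken from the UTR):

```python
import unicodedata

# ALEF (U+0627) + MADDA ABOVE (U+0653) is canonically equivalent to the
# precomposed ALEF WITH MADDA ABOVE (U+0622), so NFC replaces the pair
# with the single precomposed code point.
print(unicodedata.normalize("NFC", "\u0627\u0653") == "\u0622")  # True
```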

behnam commented 6 years ago

I've filed a couple of separate issues to track most of what you mentioned here, @r12a.

Regarding UAOA vs Normalization, I think it's the case that it actually is not compatible. I have written some details here, which I'm going to expand with some examples soon: https://github.com/w3c/i18n-activity/issues/495

I find myself wondering whether it's really important that the Unicode stability rules should prevent changes to normalization output (which i assume to be the reason for UTR 53), rather than simply admitting that Arabic canonical combining class values were badly broken and should be changed. Given that normalization produces a mechanical and consistent output when it is applied to text, and that it can be reapplied to text if needed, is it so important not to change the normalization algorithm in this respect?*

Many systems depend on the fact that NFC is stable over the years. For example, any kind of data storage (applying toNFC() before storing in a DB and byte-matching to find it later). Another important aspect is security.
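A minimal sketch of that storage pattern, assuming Python with an in-memory dict standing in for the database (the function names are illustrative):

```python
import unicodedata

db = {}

def store(key, value):
    # Normalize once on write...
    db[unicodedata.normalize("NFC", key)] = value

def lookup(key):
    # ...and once on lookup, then rely on exact byte-for-byte equality.
    return db.get(unicodedata.normalize("NFC", key))
```

If the canonical ordering behavior of NFC were ever changed, keys written under the old rules would no longer byte-match keys normalized under the new ones, which is exactly the stability concern.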

But, this doesn't mean that there cannot be a Normalization 2.0 spec.

In a similar situation, the Unicode Grapheme Cluster spec was updated to better handle many cases; the new definition is called "extended" and the old one is called "legacy". Technically, this didn't break stability, but it provided a better way to handle text as a result.

Another example would be Emoji and ZWJ, which resulted in updates to many core specs, including Segmentation, to handle sequences that had never been recognized in Unicode before. There was no stability breakage here (the Unicode Grapheme Cluster spec is not covered by a stability guarantee), but many people have faced issues with editing text around these characters.

Now, comparing those cases to the problem UAOA is demonstrating here:

  1. The problem mentioned here is not actually that widespread, and changing the behavior of marks would result in confusing behavior for users. That's beside the fact that the transition (which usually takes 5-20 years in these areas) would make a mess for everyone. We have already learned that lesson from all the updates to the Bidi algorithm and to dates and numbers in bidi scripts.

  2. If the problem is as big as UAOA claims it to be, then (agreeing with you here, @r12a) there should be an updated Normalization spec that addresses the issue in general, for all processing, without requiring applications to optionally implement such a script-specific algorithm.

r12a commented 6 years ago

I'm beginning to suspect that the algorithm described is intended to just indicate how characters should be temporarily reordered prior to rendering, rather than describe the order in which code points should be stored. Since most fonts generally produce the behaviour described anyway, it would therefore amount to documenting expectations in terms of font behaviour, rather than specifying a new form of normalisation.

It's not at all clear from the document whether that is the case, however, so i think that has to be our first question, and perhaps we need to await an answer to that before sending in many of the other comments.

r12a commented 6 years ago

btw, this: http://www.unicode.org/faq/normalization.html#8

In retrospect, it would have been possible to have assigned combining classes for certain Arabic and Hebrew non-spacing marks (plus characters for a few other scripts) that would have done a better job of making a canonically ordered sequence reflect linguistic order or traditional spelling orders for such sequences. However, retinkerings at this point would conflict with stability guarantees made by the Unicode Standard when normalization was specified, and cannot be done now.

r12a commented 6 years ago

Here is the text of the Unicode issue i raised, using the Unicode feedback mechanism. There is no way to see that feedback on the Unicode site at the moment:

I'm sending this on behalf of the W3C i18n WG. It relates to UTR#53.

I'm hearing through other channels that the algorithm described is intended to just indicate how characters should be temporarily reordered prior to rendering, rather than describe the order in which code points should be stored. Since most fonts generally produce the behaviour described anyway, it presumably therefore amounts to documenting expectations in terms of font behaviour, rather than specifying a new form of normalisation.

It's not at all clear from the document that that is the case, however, which has caused the W3C WG significant alarm (and wasted discussion cycles). Please update the document to make this clearer. We will hold back the other comments we currently have queued up to send until we can re-evaluate them in the light of the changes to the document.

Btw, the understanding of the intended use of UAOA is not helped by the way the document mentions canonically equivalent character sequences, nor by the vague descriptions of when CGJ should be used.

behnam commented 6 years ago

I submitted a long individual feedback. Here's the part related to this issue:


2. Scope of the algorithm

The scope of the algorithm is not clear, either from its title or from its language.

The name “Unicode Arabic Mark Ordering Algorithm” suggests that this is expected to be the only way Arabic Marks should be ordered in Unicode. That’s clearly not the case. In fact, the document is proposing an algorithm for “reordering” Arabic Marks (not just describing how they should be ordered) to solve a problem in the “rendering” of the script. The title needs to be clear about this. Maybe “Unicode Arabic Mark Reordering Algorithm for Rendering” (AMRAR)?

Similarly, Section 2, “Background”, doesn’t clarify the scope of the algorithm; it only explains how something is not working for certain applications with the existing normalization methods.

r12a commented 6 years ago

In revision 2 the intended use is made much clearer: the document now states that it doesn't change or add to existing normalisation forms, and that it describes only a transient reordering of text within an internal rendering pipeline.

Suggest we close this issue.