w3c / charmod-norm

Character Model for the World Wide Web: String Matching and Searching
https://w3c.github.io/charmod-norm/

2.2.1 Canonical vs. Compatibility Equivalence vs Canonical non-equivalence #69

Closed klensin closed 5 years ago

klensin commented 8 years ago

Given your Latin-based examples, should the text comment on the relationship between U+00F8 and the combination U+006F U+0337? When correctly displayed, they have the same visual appearance. However, normalization is of no help at all.

This is a distant relative of the relationship between U+08A1 and the sequence U+0628 U+0654 and between U+0681 and U+076C and the sequences that can be used to form the same graphemes. In none of these cases (and many others, some subtle) is normalization helpful. In many of them, there are distinctions between the precomposed and combining sequence forms that are a function of language or locale within the same script.
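The point can be checked directly. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `equivalent_under` is hypothetical and chosen only for illustration. It contrasts a pair that canonical normalization does fold (U+00E9 and U+0065 U+0301) with two of the pairs mentioned above, which no normalization form brings together.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def equivalent_under(a: str, b: str) -> dict:
    """Map each normalization form to whether a and b normalize to the same string."""
    return {
        form: unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in FORMS
    }

# U+00E9 has a canonical decomposition to U+0065 U+0301, so every form folds the pair.
e_acute = equivalent_under("\u00E9", "e\u0301")

# U+00F8 has no decomposition mapping, so U+006F U+0337 never matches it.
o_slash = equivalent_under("\u00F8", "o\u0337")

# U+08A1 has no decomposition either, so U+0628 U+0654 stays a separate sequence.
beh_hamza = equivalent_under("\u08A1", "\u0628\u0654")

for label, result in (("e-acute", e_acute), ("o-slash", o_slash), ("beh-hamza", beh_hamza)):
    print(label, result)
```

A matching or searching routine that relies on normalization alone would therefore treat the last two pairs as different strings, even though they can render identically.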

asmusf commented 8 years ago

This is a good point. In the discussion of identifiers (that is, accessing resources) it should be noted that normalization by itself is not sufficient to guarantee a unique appearance for each member of any pair of character sequences.

In fact, canonical normalization is not primarily about appearance: it is about folding multiple ways of encoding "the same thing". Two graphemes can look the same, but not represent "the same thing". In that case, normalization would not fold them.

Whether it is useful to go into any details on this, and if so, which ones, is a matter of debate. Certainly, there are the instances of "apparent composition" that John mentions, but there is also the case of digraphs (not involving any combining marks). And finally, there are a few examples of different letters having exactly the same shape (the three instances of capital D with the left stroke barred are a clear example of the phenomenon, even if IDNA2008 happens to avoid the problem, because it is lowercase only). (There's at least one lowercase example; I'm leaving that as an exercise for the reader.)
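As an illustration of the barred capital D case, here is a small sketch with Python's standard `unicodedata` module. It assumes the three letters meant are U+00D0, U+0110 and U+0189, which is a reading of the comment rather than something it states. The variable names are for illustration only.

```python
import unicodedata

# Assumed to be the three capital letters referred to above.
LOOKALIKES = ["\u00D0", "\u0110", "\u0189"]

# Each code point is a distinct letter with its own character name.
names = [unicodedata.name(ch) for ch in LOOKALIKES]

# Even compatibility normalization keeps all three apart.
nfkc = {unicodedata.normalize("NFKC", ch) for ch in LOOKALIKES}

# Lowercasing maps them to three clearly different small letters.
lowered = [ch.lower() for ch in LOOKALIKES]

for ch, name, low in zip(LOOKALIKES, names, lowered):
    print(f"U+{ord(ch):04X} {name} -> U+{ord(low):04X}")
```

The lowercase forms differ in shape, which is consistent with the remark that a lowercase-only scheme such as IDNA2008 sidesteps this particular collision.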

aphillips commented 8 years ago

So... I'm aware of this. I suppose we probably should mention, for example, something like the "Paypal" bug, e.g., U+03A1, U+0420, and U+0050 (P) all look absolutely identical but are unrelated.
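A rough sketch of that case, again in Python with the standard `unicodedata` module. The helpers `script_hint` and `mixed_script_hint` are hypothetical, and taking the first word of the character name is only a crude stand-in for the Unicode Script property, which the standard library does not expose. This is not the detection mechanism described in UTS #39.

```python
import unicodedata

def script_hint(ch: str) -> str:
    """Crude script guess: the first word of the character name (an assumption)."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def mixed_script_hint(s: str) -> bool:
    """True if the letters of s appear to come from more than one script."""
    return len({script_hint(ch) for ch in s if ch.isalpha()}) > 1

HOMOGLYPHS = ["\u03A1", "\u0420", "\u0050"]  # Greek Rho, Cyrillic Er, Latin P

# No normalization form maps the Greek or Cyrillic letter to Latin P.
folded = {unicodedata.normalize("NFKC", ch) for ch in HOMOGLYPHS}

genuine = "Paypal"
spoof = "\u0420aypal"  # starts with Cyrillic Er instead of Latin P

print(sorted(script_hint(ch) for ch in HOMOGLYPHS))
print(mixed_script_hint(genuine), mixed_script_hint(spoof))
```

The name-based heuristic is deliberately simplistic; a production check would need real script data and the confusable mappings that UTS #39 maintains.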

I have changed the Section 2.2 introduction to be slightly more technically accurate. I also added a Note Well box.

@asmusf I partially copied your reply above to form part of the note. (I also added you to the acknowledgements list).

Please consider the changes and see if these address the problem.

asmusf commented 8 years ago

On 2/6/2016 5:45 PM, aphillips wrote:

You write:

Obviously, "confusable" characters like this can present spoofing and
other security risks. For more information, see [[UTR39]].

First, the example in this case is not merely about characters that are "confusable" - a term that encompasses a wide spectrum of similarity under a wide variation of circumstances and involving assumptions about human perception - but it is more precisely about characters being "homoglyphs" (in fact, strict homoglyphs, with an appearance that is identical in all practical scenarios).

The distinction matters, because those who feel that normalization "should" have addressed certain issues are fine with "mere" similarities being handled differently.

Second, the example may be "famous" but involves only a subset of homoglyphs. There are some other examples of homoglyphs that are in the same script. I don't think you need to give examples of both; but it would not be amiss to add that this effect does not require separate scripts.

My suggested replacement:

==> Similar examples of identical appearance also exist within a single script. Because these characters have an appearance that is identical for all practical purposes, they are an extreme manifestation of "confusable" characters, which can represent...

(Or you can find a way to break that sentence in two).

A./

aphillips commented 8 years ago

Please see commit https://github.com/w3c/charmod-norm/commit/c84dbe84c1c781c6a7dccb1b19ff5d3c014cb054

@asmusf: I adopted some of your sentence, but not all of it. I hope not to recreate UTS39 here. In doing this, I found that keeping the big red box in its original place was inappropriate: the warning basically came before all of the discussion of what Normalization is. So I created a new section below the others called "Limitations of Normalization". I also included a warning that some visually distinct things that one might think would be normalized (so I'm told by developers, for example) are also not "fixed" by normalization, particularly the "K" forms.

I did retain the big red box with a short warning, though, in the first section on normalization so that it can't be missed, as a kind of TL;DR warning.

@klensin and @asmusf: Please review for your satisfaction so that I can close this issue. Thanks.

asmusf commented 8 years ago

On 2/20/2016 12:43 AM, aphillips wrote:

Please see commit c84dbe8 https://github.com/w3c/charmod-norm/commit/c84dbe84c1c781c6a7dccb1b19ff5d3c014cb054

@asmusf: I adopted some of your sentence, but not all of it. I hope not to recreate UTS39 here. In doing this, I found that keeping the big red box in its original place was inappropriate: the warning basically came before all of the discussion of what Normalization is. So I created a new section below the others called "Limitations of Normalization". I also included a warning that some visually distinct things that one might think would be normalized (so I'm told by developers, for example) are also not "fixed" by normalization, particularly the "K" forms.

Seems fine for a quick overview of the issue.

r12a commented 8 years ago

I think we may still be missing John's original point, which is in some ways the opposite of the (useful) information we have so far. If you consider ø, a developer may expect that it will match the decomposed sequence, but it won't, since it doesn't have a decomposition rule.

So it's one thing that identical glyphs may not represent the same character, but it's another that identical letters may be represented by different and non-normalisable character sequences.

asmusf commented 8 years ago

On 4/4/2016 11:58 AM, r12a wrote:

I think we may still be missing John's original point, which is in some ways the opposite of the (useful) information we have so far. If you consider ø, a developer may expect that it will match the decomposed sequence, but it won't, since it doesn't have a decomposition rule.

So it's one thing that identical glyphs may not represent the same character, but it's another that identical letters may be represented by different and non-normalisable character sequences.

Never forget that the "sequences" in question can all be singletons for the same (identical) letter shape. There's no requirement for this effect to be limited to cases involving combining marks.

A./

klensin commented 7 years ago

Sorry, confirmation of this issue/fix (now more than a year old) slipped through the cracks. The text is vastly improved, but Richard is right -- part of the original point is still not covered. In the context of the current text, the last paragraph of 2.2 is still a bit dubious: "...does not bring together characters that have the same intrinsic meaning or function, but which vary in appearance or usage" doesn't cover all of those cases either and would be improved by a change to "which may vary".

More generally, while I think the current reference to UTS39 is ok and that document is a useful contribution, we should (continue to) be careful not to reference it as if it were the last word on the subject. Recent discussions in other forums demonstrate that it is neither comprehensive nor the last word on the subject. Conversely, because it is "a standard" and some people grasp at straws to "prove" the unprovable, it has been cited as part of claims that any topic it does not cover is not an issue.