w3c / charmod-norm

Character Model for the World Wide Web: String Matching and Searching
https://w3c.github.io/charmod-norm/

Matching Unicode characters that don't normalise together #216

Open r12a opened 3 years ago

r12a commented 3 years ago

Brahmi-derived and Arabic-script-based orthographies have graphemes that look the same but are encoded with different underlying code points. Some of these are precomposed and decomposed pairs for which Unicode provides canonical mappings; they are not a problem and are already covered by this document.

Unfortunately, there is another, very prevalent case, where different underlying code points produce the same visual output but are not canonically equivalent.

In some cases there is advice from the Unicode Standard about which approach is preferred, but there is no real way of enforcing that advice when users start writing their content. A simple example of this would be the Sinhala equivalence:

ආ U+0D86: SINHALA LETTER AAYANNA and අ U+0D85: SINHALA LETTER AYANNA + ා U+0DCF: SINHALA VOWEL SIGN AELA-PILLA

Unicode says that the 2-character approach should not be used, but users may still type it, and apparently often do. In such a case, it may be useful for an application that is trying to match items to apply some additional normalisation, so that the two encodings match. One could expect such normalisation based on visual similarity to have different rules per writing system, but there may even be different rules per orthography (i.e. per language).
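To make the gap concrete, here is a minimal Python sketch that checks whether two strings become identical under any of the four standard Unicode normalisation forms. The helper name `equal_under_any_form` is made up for this illustration; only the standard library's `unicodedata.normalize` is used.

```python
import unicodedata

PRECOMPOSED = "\u0D86"     # SINHALA LETTER AAYANNA
SEQUENCE = "\u0D85\u0DCF"  # SINHALA LETTER AYANNA + SINHALA VOWEL SIGN AELA-PILLA

def equal_under_any_form(a: str, b: str) -> bool:
    """True if a and b are identical after NFC, NFD, NFKC or NFKD (hypothetical helper)."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return any(
        unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in forms
    )

# A canonically equivalent pair (e-acute) is unified by normalisation...
print(equal_under_any_form("\u00E9", "e\u0301"))
# ...but the two Sinhala encodings are not.
print(equal_under_any_form(PRECOMPOSED, SEQUENCE))
```

Since no standard form unifies the pair, any matching has to come from rules added on top of normalisation.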

But there are many similar scenarios that are not warned against by the Unicode Standard, and often it can be difficult to know which character(s) to use for a given visual result. I have recently been documenting the orthography of Kashmiri and there are several examples of this, leading to different encodings in content such as Wikipedia or even script tutorials. One example is:

ۆ U+06C6: ARABIC LETTER OE vs وٚ U+0648: ARABIC LETTER WAW + U+065A: ARABIC VOWEL SIGN SMALL V ABOVE

It so happens that Wikipedia and other sources tend to use the precomposed character rather than the sequence in this case. But there are several other letters where the sequence tends to be used, rather than the precomposed character. In some texts, both are used in the same content.
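As a rough way of checking whether a given text mixes the two encodings, something like the following Python sketch could be used. The names `VARIANTS` and `count_variants` are invented for this example, and the plain substring count is an assumption: it will miss sequences where another combining mark sits between the waw and the small v.

```python
import unicodedata

# The two encodings of the Kashmiri letter discussed above.
VARIANTS = {
    "precomposed": "\u06C6",     # ARABIC LETTER OE
    "sequence": "\u0648\u065A",  # ARABIC LETTER WAW + ARABIC VOWEL SIGN SMALL V ABOVE
}

def count_variants(text: str) -> dict:
    """Count how often each encoding occurs in text (hypothetical helper)."""
    # NFC first, so canonically equivalent input is counted consistently.
    text = unicodedata.normalize("NFC", text)
    return {name: text.count(chars) for name, chars in VARIANTS.items()}

sample = "\u06C6 \u0648\u065A \u06C6"
counts = count_variants(sample)
print(counts)
print("mixed usage" if all(counts.values()) else "consistent usage")
```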

We could say that people should use the right code points, but in Kashmiri it's not even clear which are the 'right' character(s).

This seems to be a case where, on an orthography-specific basis, either: a. some standard needs to be developed that clarifies which characters should and should not be used, and fonts or input systems should police this, or b. additional tailored normalisations should be performed by an application.
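Option (b) could be sketched in Python roughly as follows. This is only an illustration: the table `TAILORINGS`, the function `match_key`, and the use of the language tags `si` and `ks` as keys are all assumptions made for the example, and the direction of each mapping is an arbitrary choice for matching purposes, not a claim about which encoding is correct.

```python
import unicodedata

# Hypothetical per-orthography folding rules, applied after standard NFC.
TAILORINGS = {
    "si": {"\u0D85\u0DCF": "\u0D86"},  # Sinhala: AYANNA + AELA-PILLA -> AAYANNA
    "ks": {"\u0648\u065A": "\u06C6"},  # Kashmiri: WAW + SMALL V ABOVE -> OE
}

def match_key(text: str, lang: str) -> str:
    """Return a comparison key: NFC plus any tailored folds for lang."""
    key = unicodedata.normalize("NFC", text)
    for source, target in TAILORINGS.get(lang, {}).items():
        key = key.replace(source, target)
    return key

# The two Kashmiri encodings compare equal only when the Kashmiri rules apply.
print(match_key("\u0648\u065A", "ks") == match_key("\u06C6", "ks"))
print(match_key("\u0648\u065A", "fa") == match_key("\u06C6", "fa"))
```

The key is meant for comparison only; the stored text is left untouched, so no encoding is imposed on authors.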

It's my expectation that, either due to de facto usage patterns, or due to simple encoding ambiguities, in some cases there will always be two different ways of writing the same thing that are not made equivalent by standard Unicode normalisation.

r12a commented 3 years ago

Should we mention the above in our string matching document? (It's clearly something for the Text Search document, but where there's real ambiguity about which character to use, it may be appropriate to take it into account for string matching too.)

r12a commented 3 years ago

There is also another, very common scenario, where it can reasonably be argued that a particular code point is inappropriate for use with a given language, but users use it anyway, often because they can't type the correct code point, but usually because the two are visually indistinguishable. Some examples of this can be found at https://r12a.github.io/scripts/arabic/kashmiri#confusables (and subsections alongside that one give other examples for Kashmiri of the things mentioned earlier).

r12a commented 3 years ago

Here's another example, which occurs in multiple orthographies. Let's consider Persian.

Persian (and Urdu, Kashmiri, etc.) uses ی [U+06CC ARABIC LETTER FARSI YEH] for 'yeh'. It doesn't use ي [U+064A ARABIC LETTER YEH], because there are differences in the glyphs for certain joining forms.

However, Persian sometimes uses a hamza diacritic above yeh. The Unicode Standard explains that a combining hamza should be used rather than a precomposed character. In practice, though, many documents use ئ [U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE].

The problem with this is that ئ [U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE] decomposes to ئ [U+064A ARABIC LETTER YEH + U+0654 ARABIC HAMZA ABOVE] which produces the Arabic yeh rather than the Persian one.

So in Persian text, the following should be treated as equivalent by an application, even though they are not equivalent in normalisation:

یٔ [U+06CC ARABIC LETTER FARSI YEH + U+0654 ARABIC HAMZA ABOVE]

ئ [U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE]
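A minimal Python sketch of such a Persian-specific fold, assuming the application builds a separate comparison key rather than rewriting the text. The function name `persian_match_key` is hypothetical. The simple substring replacement is also an assumption: it does not handle cases where canonical ordering places another combining mark between the yeh and the hamza.

```python
import unicodedata

def persian_match_key(text: str) -> str:
    """Comparison key that treats U+0626 like U+06CC + U+0654 (hypothetical helper)."""
    # NFD turns U+0626 into U+064A + U+0654.
    decomposed = unicodedata.normalize("NFD", text)
    # Swap the Arabic yeh for the Farsi yeh, but only when the hamza follows.
    folded = decomposed.replace("\u064A\u0654", "\u06CC\u0654")
    return unicodedata.normalize("NFC", folded)

precomposed = "\u0626"     # ARABIC LETTER YEH WITH HAMZA ABOVE
sequence = "\u06CC\u0654"  # ARABIC LETTER FARSI YEH + ARABIC HAMZA ABOVE

# Standard normalisation keeps the two apart...
print(unicodedata.normalize("NFC", precomposed) == unicodedata.normalize("NFC", sequence))
# ...while the tailored key makes them match.
print(persian_match_key(precomposed) == persian_match_key(sequence))
```

A bare U+064A without hamza is deliberately left alone here; whether it should also be folded to U+06CC is a separate, language-specific decision.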