w3c / charmod-norm

Character Model for the World Wide Web: String Matching and Searching
https://w3c.github.io/charmod-norm/

Arabic short vowels and Hebrew pointing in string search #78

Closed aphillips closed 7 years ago

aphillips commented 8 years ago

The presence of Arabic and Hebrew short vowels in a text can interfere with string searching. We should point this out.

asmusf commented 8 years ago

How is that different from spelling variation?

aphillips commented 8 years ago

It isn't, if you want to look at it that way.

The problem here is that users are unlikely to provide the short vowels when doing a "Find" in the browser (the most obvious example of string search). If this resembles anything, it resembles Latin-script users omitting accents when searching (particularly from their phones), or Japanese users (perhaps) expecting hiragana input to match katakana items that are "spelled the same" in the other script; both are use cases we've already called out.
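To make that concrete, here is a minimal sketch of the kind of folding such a "Find" feature might apply (an illustration only, using Python's standard unicodedata module, not anything the spec prescribes): canonically decompose, then drop combining marks (general category Mn), which strips Latin accents, Arabic short vowels (harakat), and Hebrew points alike.

```python
import unicodedata

def fold(s: str) -> str:
    """Canonically decompose, then drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

def loose_find(needle: str, haystack: str) -> bool:
    """Substring match that ignores accents, harakat, and pointing."""
    return fold(needle) in fold(haystack)

# An unvowelled Arabic search finds the vowelled text:
assert loose_find("كتب", "كَتَبَ")
# An unpointed Hebrew search finds the pointed text:
assert loose_find("שלום", "שָׁלוֹם")
# An accentless Latin search finds the accented word:
assert loose_find("cafe", "café")
```

As the rest of this thread points out, folding every Mn character is too blunt for languages where a mark creates a distinct letter, so a real implementation would need locale tailoring.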

asmusf commented 8 years ago

I think the general statement is that there is loose matching and (more or less) literal matching. Loose matching can be of many kinds. For example, for the DNS root zone we are working on a project that defines which simplified Chinese strings "match" which traditional Chinese strings. (The actual lookup in the DNS is literal, but the registration would be for bundles of matching labels, achieving an effect similar to loose matching).

In the context of charmod, I think the statement should be in its most general terms.

Loose matching can be required for some applications but it can be difficult to formulate a single, general solution that is satisfactory for all users (let alone all types of applications).

In the Arabic case, for the root zone, the project decided to not support short vowels in top-level domain names. For general text, and searches on general text, that solution isn't adequate.

One important consideration is that there are equivalences that fall outside the Unicode normalization (even NFKx). The Danish/Norwegian O with slash (U+00D8) is functionally equivalent to Swedish O with diaeresis (U+00D6), but O with slash has no decomposition.
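The asymmetry is easy to verify (an illustration using Python's unicodedata, not anything normative): U+00D6 has a canonical decomposition, U+00D8 does not, so normalization alone can never equate them.

```python
import unicodedata

# U+00D6 decomposes to 'O' plus U+0308 COMBINING DIAERESIS:
print([hex(ord(c)) for c in unicodedata.normalize("NFD", "\u00D6")])  # ['0x4f', '0x308']
# U+00D8 survives NFD unchanged; it has no decomposition mapping at all:
print([hex(ord(c)) for c in unicodedata.normalize("NFD", "\u00D8")])  # ['0xd8']
print(repr(unicodedata.decomposition("\u00D8")))                      # ''
```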

Similar loose matching rules might apply within a language's own alphabet; sometimes certain letters are pronounced the same way, and a loose matching that is "phonetic" might be needed.

Sometimes it's possible to fold all diacritics on the same base letter; sometimes a language uses a few diacritics to generate new "letters" (rather than "new forms" of letters). In those cases, in that language's context, you'd not want to fold away all diacritics (only the "optional" ones, usually of foreign origin). And so on.
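A hedged sketch of what such tailoring could look like (the per-language exception sets below are illustrative assumptions, not data from any standard): fold every combining mark except those the language treats as forming distinct letters.

```python
import unicodedata

# Marks NOT to fold, because the marked form is a separate letter in the
# language (illustrative values only):
PRESERVE = {
    "sv": {"\u0308", "\u030A"},  # Swedish: diaeresis and ring above (ä ö å)
    "es": {"\u0303"},            # Spanish: tilde (ñ)
    "en": set(),                 # English: fold everything
}

def fold_for(lang: str, s: str) -> str:
    keep = PRESERVE.get(lang, set())
    out = [c for c in unicodedata.normalize("NFD", s)
           if unicodedata.category(c) != "Mn" or c in keep]
    return unicodedata.normalize("NFC", "".join(out))

assert fold_for("en", "señor") == "senor"      # English: ñ folds to n
assert fold_for("es", "señor") == "señor"      # Spanish: ñ is preserved
assert fold_for("sv", "smörgås") == "smörgås"  # Swedish letters survive
```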

The Arabic/Hebrew case is just one example - a useful one, but only if presented in context, not if it is the only example of folding not derived from normalization.

For identifiers, the concept of using a non-folded lookup with detailed rules on how to bundle "variants" (which then all resolve to the same target so as to emulate loose matching) should be mentioned as an alternative to folding. (A specification of how to set up the rules for that is found here: http://www.ietf.org/id/draft-ietf-lager-specification-08.txt).
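A much-simplified sketch of that bundling idea (my own illustration; the LGR specification linked above defines a far richer XML format with rules and dispositions): each code point lists its interchangeable variants, registration reserves every label in the generated bundle, and lookup stays literal.

```python
from itertools import product

# Illustrative variant map, not taken from any actual LGR table:
VARIANTS = {
    "ø": ("ø", "ö"),
    "ö": ("ö", "ø"),
}

def variant_bundle(label):
    """All labels reachable by substituting variants; all resolve to one target."""
    choices = [VARIANTS.get(c, (c,)) for c in label]
    return {"".join(combo) for combo in product(*choices)}

print(variant_bundle("sørby"))  # {'sørby', 'sörby'}: registered as a unit
```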

A./

PS: from today's posts on the Unicode list, something very much on topic:

Hello Unicode,

I have been involved in a rather long discussion on the Emacs-devel mailing list[1] concerning the right way to do character folding and we've reached a point where input from Unicode experts would be welcome.

The problem is the implementation of equivalence when searching for characters. For example, if I have a buffer containing the following characters (both using the precomposed and canonical forms):

o ö ø ó n ñ

The character folding feature in Emacs allows a search for "o" to match some or even all of these characters. The discussion on the mailing list has revolved around both the fact that the correct behaviour here is locale-dependent, and the correct way to implement this matching absent any locale-specific exceptions.

An English speaker would probably expect a search for "o" to match the first four characters and a search for "n" to match the last two.

A Spanish speaker would expect n and ñ to differ, but would otherwise expect the same behaviour as the English user.

A Swedish user would definitely expect o and ö to compare differently, but ö and ø to compare the same.

I have been reading the materials on unicode.org trying to see if this has been specifically addressed anywhere by the Unicode Consortium, but my results are inconclusive at best.

What is the "correct" way to do this from Unicode's perspective? There is clearly an aspect of locale-dependence here, but how far can the Unicode data help?

In particular, as far as I can see there is no way that the Unicode charts can allow me to write an algorithm where o and ø are seen as similar (as would be expected by an English user).

[1] https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg00506.html
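For what it's worth, the closest thing to a standard answer to that email is locale-tailored collation (UTS #10 plus the CLDR tailorings) compared at primary strength. A hedged sketch, assuming the PyICU binding is installed (pip install PyICU):

```python
from icu import Collator, Locale

def same_letter(a: str, b: str, locale: str) -> bool:
    """Compare at primary strength under the given locale's tailoring."""
    coll = Collator.createInstance(Locale(locale))
    coll.setStrength(Collator.PRIMARY)  # ignore accent and case differences
    return coll.compare(a, b) == 0

print(same_letter("o", "ö", "en"))  # True: English folds the diaeresis
print(same_letter("o", "ö", "sv"))  # False: Swedish keeps ö a distinct letter
```

Whether o and ø compare equal at primary strength depends on the tailoring in effect; since ø has no decomposition, any such equivalence has to come from the collation data rather than from normalization.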

r12a commented 8 years ago

There is also a similarity here to what happens when you apply sorting algorithms. For example, sort algorithms are culturally tailored, tend to separate diacritics from base characters at a certain point, may require locale-specific preprocessing steps, etc.
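A small illustration of that cultural tailoring, again assuming PyICU (the locale choices are just examples):

```python
from icu import Collator, Locale

words = ["över", "orm", "zebra"]

# German sorts ö together with o (a secondary difference):
print(sorted(words, key=Collator.createInstance(Locale("de")).getSortKey))
# ['orm', 'över', 'zebra']

# Swedish sorts ö as a separate letter after z:
print(sorted(words, key=Collator.createInstance(Locale("sv")).getSortKey))
# ['orm', 'zebra', 'över']
```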

aphillips commented 7 years ago

This thread belongs with string-search. Need to move it there.

aphillips commented 7 years ago

Closing in favor of https://github.com/w3c/string-search/issues/3