w3c / findtext

An API spec to define how to find text in a Web document, using basic information, and return DOM ranges

i18n-ISSUE-505: Diacritic matching sentence in intro #13

Open r12a opened 9 years ago

r12a commented 9 years ago
  1. Introduction http://www.w3.org/TR/2015/WD-findtext-20151015/#introduction

Browsers do not typically match language patterns that may be found in non-Latin character sets, including collapsed Unicode character sequences, optional diacritical marks, or similar features, such as matching o to ó, ö, ø, and oe.

It's not clear to us what 'collapsed Unicode character sequences' refers to.

The 'oe' at the end should presumably be 'œ' (U+0153 LATIN SMALL LIGATURE OE) instead.

The sentence starts by referring to 'non-Latin character sets', but gives a (plausible) example from the Latin character set. Perhaps just say 'found in multilingual text'?

This is a particularly difficult problem, btw, in scripts such as Arabic, where vowel diacritics are optional, and can be mixed with additional diacritics. (For an example, see http://r12a.github.io/scripts/tutorial/part3#short-vowels.)
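To make the o / ó / ö / ø / œ example concrete, here is a minimal Python sketch of one common approach to diacritic-insensitive matching: decompose to NFD and drop nonspacing marks. The helper name `strip_marks` is made up for illustration and is not part of the findtext draft. It shows that ó and ö reduce to o, while ø and œ have no canonical decomposition and are left alone, so an implementation would need extra folding rules for them.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Hypothetical helper: decompose to NFD, then drop nonspacing marks (Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Latin examples from the quoted sentence.
for ch in ["\u00f3", "\u00f6", "\u00f8", "\u0153"]:  # ó, ö, ø, œ
    print(ch, "->", strip_marks(ch))

# Arabic: kaf, ta, ba, each followed by FATHA (U+064E), an optional vowel mark.
vowelled = "\u0643\u064e\u062a\u064e\u0628\u064e"
print(strip_marks(vowelled) == "\u0643\u062a\u0628")
```

Stripping every Mn character is a blunt instrument: in some languages and scripts the marks distinguish different words, so whether they should be ignored is a per-language decision rather than something this sketch settles.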

shepazu commented 9 years ago

@r12a Regarding oe, I was referring to the practice of "transliterating" German ae for ä, oe for ö, and ue for ü. (I don't know what this is called.) I am not sure how Unicode deals with this, if at all. Any insight to share?

In general on this issue, I'm probably just missing the point or using wrong terminology, so I welcome any concrete suggestions (like 'found in multilingual text'). Pull requests are also welcome, in addition to issues.
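For reference, the ae / oe / ue convention mentioned above is not something Unicode normalization produces by itself, as far as I can tell: NFD turns ö into o plus a combining diaeresis, never into oe. A rough Python sketch of handling it as a separate, language-specific folding step follows. The table `GERMAN_DIGRAPHS` and the function `fold_german` are hypothetical names for illustration, not taken from any spec or library.

```python
import unicodedata

# Assumed, illustrative mapping for German only; other languages differ.
GERMAN_DIGRAPHS = {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue", "\u00df": "ss"}

def fold_german(text: str) -> str:
    """Hypothetical folding: compose to NFC, lowercase, expand umlauts and eszett."""
    lowered = unicodedata.normalize("NFC", text).lower()
    return "".join(GERMAN_DIGRAPHS.get(ch, ch) for ch in lowered)

# Both spellings fold to the same key, so either one finds the other.
print(fold_german("M\u00fcller") == fold_german("Mueller"))
```

A folding like this also creates false matches (a query for "Pöt" would hit "Poet"), which is one reason it probably has to be tied to the language of the text rather than applied everywhere.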