Provide place to record the Unicode Normalization Form used

raffazizzi commented 9 years ago

TEI is about text. Sometimes we read text, sometimes we want to find text. Finding text, even ordinary text, can present problems because Unicode allows two ways to encode accented characters. "Åström" and "Åström" may look the same, but you may not be able to find one with the other, since each accented character may occur either as one character (the precomposed characters "Å" and "ö") or as two (a base character, "A" or "o", followed by a combining character, "̊" or "̈"). Similar problems occur with ligatures such as "woﬀle" versus "woffle", although ligatures of this kind are only compatibility equivalent and are folded by NFKC and NFKD, not by NFC and NFD. The accented pairs are canonically equivalent, and a Unicode-conformant application ought to treat them as identical, but in practice only a few Mac OS X applications appear to do so.

It is possible to normalize documents so that they follow one approach or the other throughout, using the Unicode Normalization Forms NFC ("C" for "composed") and NFD ("D" for "decomposed"). The command-line tool uconv can make these transformations, but very few desktop applications can. Standards-conformant XSLT 2.0 and XQuery processors are able to normalize text, since fn:normalize-unicode() is an XPath function, and indexers such as Lucene can hook into libraries like ICU4J. There is no guarantee, however, that this is actually done, or that both the text being searched and the search terms are normalized according to the same scheme.
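To make the search problem concrete, here is a minimal Python sketch using the standard library's unicodedata module. The helper name same_text is made up for this illustration. The two spellings are written with escapes so that no editor or browser can silently normalize them. The last two lines show that the ligature case is a compatibility equivalence: NFC leaves U+FB00 alone, while NFKC folds it to "ff".

```python
import unicodedata

# The same name spelled two ways (escapes avoid accidental normalization).
composed = "\u00c5str\u00f6m"        # precomposed Å and ö: 6 code points
decomposed = "A\u030astro\u0308m"    # base letters + combining marks: 8 code points


def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(composed == decomposed)                      # raw code-point comparison
print(same_text(composed, decomposed))             # both sides normalized to NFC
print(same_text("wo\ufb00le", "woffle", "NFC"))    # ligature survives NFC
print(same_text("wo\ufb00le", "woffle", "NFKC"))   # ligature folded by NFKC
```

The point of the helper is that the text and the search term go through the same normalization form before they are compared; which form is chosen matters less than using it on both sides.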

It is therefore with good reason that the Guidelines, vi. Languages and Character Sets, have the following – quite strong – recommendation:

"It is important that every Unicode-based project should agree on, consistently implement and fully document a comprehensive and coherent normalization practice. As well as ensuring data integrity within a given project, a consistently implemented and properly documented normalization policy is essential for successful document interchange."

The problem is, where to document such a crucial piece of information? I believe this calls for an obligatory element in encodingDesc (perhaps "NFDecl"?) to register which normalization practice has been followed.

Alternatively, TEI could stipulate that all TEI documents are to be normalized according to NFC. This is the recommended Internet practice, since in the early days of the introduction of Unicode this provided a way to make accented characters backwards compatible – the precomposed characters were introduced in order to secure round-trip-ability with legacy encodings.

For some purposes (such as finding every "a" character, whichever accents it may carry), NFD is easier to work with. NFD is also more general, applying one rule to all accents, whereas in NFC some accented characters have precomposed forms and others do not. This could argue for stipulating the use of NFD instead.
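As a rough illustration of why NFD is convenient for this kind of query, the following Python sketch decomposes the text first and then looks only at base letters. The function names contains_base_letter and strip_marks are hypothetical, chosen for this example, and the approach is a simplification that ignores language-specific rules about which accented letters count as distinct.

```python
import unicodedata


def contains_base_letter(word, base):
    """True if `word` contains `base`, with or without combining accents."""
    decomposed = unicodedata.normalize("NFD", word)
    return base.casefold() in decomposed.casefold()


def strip_marks(text):
    """Decompose to NFD, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)


name = "\u00c5str\u00f6m"  # precomposed spelling of the example name
print(contains_base_letter(name, "a"))
print(contains_base_letter(name, "e"))
print(strip_marks(name))
```

In NFC the same query would need a list of every precomposed form of "a"; after NFD a single comparison against the base letter is enough.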

According to Unicode any application is allowed to convert to and from these two normalization forms, just as it can switch between UTF-8 (commonly used for exchange) and UTF-16 (commonly used internally), so no assumption should be made as to which form is used, only that it is used consistently. Indeed, when posting this request, I find that the Chrome browser normalizes "Åström" and "Åström" – which is alright, but would be rather confusing here – but that Firefox does not.

A special problem remains as long as not all applications treat the two members of each pair as equivalent: a text editor must then be able to search both normalized and un-normalized text and to convert freely between the two normalization forms. oXygen has announced that it will try to make searching along these lines possible (and presumably it could offer conversion as well), but there is a severe lack of tools in this area.

The Guidelines passage makes a clear and sound recommendation, the information it requires is simple (Unicode-normalized: yes/no; Unicode normalization form: NFC/NFD), and there should be a definite place to record this information. What argues against this feature request is that this is yet another technical matter foisted upon the working TEI user and that the tools for making conversions of the required sort are few.

See http://markmail.org/thread/xov2nkg5iwas4uv3.

Original comment by: jensopetersen

raffazizzi commented 9 years ago

This issue was originally assigned to SF user: rviglianti Current user is: raffazizzi

raffazizzi commented 9 years ago

Documentation of the normalization form used might be helpful, but it's at a lower level of XML. In fact the Unicode spec says "The Unicode Character Database supplies properties that allow implementations to quickly determine whether a string x is in a particular Normalization Form": whatever the documentation says, an implementation would need to check in any case. It might be desirable to add this indication to the <?xml declaration, but that's an XML feature request, not a TEI one.
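For what such a check might look like in practice, here is a small Python sketch. It relies on unicodedata.is_normalized, which is available from Python 3.8 onwards; the wrapper normalization_forms is a hypothetical name for this example. It reports which of NFC and NFD a given string already satisfies, independently of anything a header might declare.

```python
import unicodedata


def normalization_forms(text):
    """Return the subset of ("NFC", "NFD") that `text` already conforms to."""
    return [form for form in ("NFC", "NFD")
            if unicodedata.is_normalized(form, text)]


print(normalization_forms("\u00c5str\u00f6m"))      # precomposed spelling
print(normalization_forms("A\u030astro\u0308m"))    # decomposed spelling
print(normalization_forms("\u00c5stro\u0308m"))     # mixed spelling
print(normalization_forms("woffle"))                # plain ASCII
```

Plain ASCII text satisfies both forms at once, and a document that mixes the two spellings satisfies neither, which is the case a declared normalization practice is meant to rule out.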

Original comment by: lb42

raffazizzi commented 9 years ago

Link to Unicode spec: http://unicode.org/reports/tr15/#Quick_Check_Table

Original comment by: raffazizzi

raffazizzi commented 9 years ago

assigned_to: rviglianti

Original comment by: raffazizzi

raffazizzi / TEI-TEST

Provide place to record the Unicode Normalization Form used #137