Further comments on Unicode FAQ: Unicode and the Web

Some comments on Unicode and the Web https://corp.unicode.org/%7Easmus/proposed_faq/unicode_web

[1] In https://corp.unicode.org/%7Easmus/proposed_faq/unicode_web.html the links to W3C content should not show .en.html or index.en.html. For example, https://www.w3.org/International/questions/qa-choosing-encodings has German, Spanish, Brazilian Portuguese and Swedish translations. We content negotiate access to those pages and provide translations where available, but including the extensions blocks that.
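As a rough illustration of the link clean-up, here is a minimal Python sketch that strips a trailing language tag and `.html` extension so the server can content negotiate. The helper name `negotiable_url` and the regular expression are assumptions made for this example, not anything W3C publishes, and the `example.org` URL in the usage note is hypothetical.

```python
import re

# Assumed pattern: an optional two-letter language tag (optionally with a
# region, e.g. ".pt-br") followed by ".html" at the very end of the URL.
_SUFFIX = re.compile(r"(?:\.[a-z]{2}(?:-[a-z]{2})?)?\.html$", re.IGNORECASE)


def negotiable_url(url: str) -> str:
    """Return the URL without language/extension suffixes (hypothetical helper)."""
    stripped = _SUFFIX.sub("", url)
    # "index" is implied by the directory URL, so drop it but keep the slash.
    if stripped.endswith("/index"):
        stripped = stripped[: -len("index")]
    return stripped


print(negotiable_url(
    "https://www.w3.org/International/questions/qa-choosing-encodings.en.html"
))
```

Under the same assumptions, `negotiable_url("https://example.org/docs/index.en.html")` gives the directory URL with a trailing slash.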

Also, we usually prefer not to show naked URLs in text. I suggest replacing

See https://www.w3.org/International/questions/qa-choosing-encodings.en.html.

with

See the W3C article <a href="https://www.w3.org/International/questions/qa-choosing-encodings">Choosing & applying a character encoding</a>.

etc

[2] Q: We are setting up a database for use with our web server. Does Unicode cover all the character sets we need for a web server?

I feel like this answer should start with "Yes."

Perhaps it would also be worth describing how Unicode greatly simplifies the storage of multilingual data, since most non-Unicode encoded database data will be in multiple languages and code pages, and managing or extending that is a pain in the neck. That all goes away with Unicode encoded databases.
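To make the code page problem concrete, here is a small Python sketch. The sample strings and the choice of legacy encodings are assumptions picked for illustration. It checks which encodings can store every sample, which is roughly the situation a multilingual database column is in.

```python
# Sample data chosen for illustration only.
samples = {
    "German": "Grüße",
    "Russian": "Привет",
    "Japanese": "こんにちは",
}

# A few legacy code pages, plus UTF-8 for comparison (an assumed selection).
legacy_encodings = ["cp1252", "cp1251", "shift_jis"]


def encodable(text: str, encoding: str) -> bool:
    """True if every character of text can be stored in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def covers_all(encoding: str) -> bool:
    """True if a single column in this encoding could hold all the samples."""
    return all(encodable(text, encoding) for text in samples.values())


for enc in legacy_encodings + ["utf-8"]:
    print(enc, "covers all samples:", covers_all(enc))
```

None of the legacy code pages listed can hold all three samples, so a non-Unicode database would need several columns or tables with different encodings; UTF-8 holds them all in one.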

[3] Q: What are Numerical and Named Character References?

This is not really about non-ASCII characters (especially for people working from non-ASCII keyboards). For example, many keyboards have non-ASCII § and ± which don't need to be escaped because you can type them directly. It's rather that this allows you to add the odd character to the text when you don't have a way to input it directly from the keyboard, or to clearly see invisible or ambiguous characters in the source.

I'm dubious about "not handled well by many search engines". Is that true? I'm also not particularly impressed by the other cons listed.

So here's a suggestion for a rewrite of those 3 Q&As, as a single Q&A:

Q: What are Numerical and Named Character References?

Instead of simply including a character such as an “a” in a file, you can instead write it using the Unicode code point value as a Numerical Character Reference (NCR), such as “&amp;#x61;” (using the hex code point value) or “&amp;#97;” (using the decimal code point value). For help with calculating hexadecimal and decimal NCRs, see the <a href="https://r12a.github.io/app-conversion/">Unicode code converter</a> page.
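For readers who prefer to compute NCRs themselves, here is a minimal Python sketch. The function names `hex_ncr` and `dec_ncr` are made up for this example; `html.unescape` from the standard library is used only to check the round trip.

```python
import html


def hex_ncr(ch: str) -> str:
    """Hexadecimal NCR for a single character, e.g. U+0061 -> '&#x61;'."""
    return f"&#x{ord(ch):X};"


def dec_ncr(ch: str) -> str:
    """Decimal NCR for a single character, e.g. U+0061 -> '&#97;'."""
    return f"&#{ord(ch)};"


for ch in ["a", "™", "α"]:
    h, d = hex_ncr(ch), dec_ncr(ch)
    # Both forms decode back to the original character.
    assert html.unescape(h) == ch and html.unescape(d) == ch
    print(ch, h, d)
```

Both forms refer to the same code point; the hex form is usually easier to match against Unicode code charts.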

Named character references are similar, except that they use abbreviations, such as “&amp;eacute;”, instead of numbers.
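A quick way to look up a named reference is Python's standard `html.entities` module. The sketch below assumes the HTML 4 name table (`codepoint2name`) is enough for the purpose; the helper name `named_reference` is made up for this example.

```python
import html
from html.entities import codepoint2name
from typing import Optional


def named_reference(ch: str) -> Optional[str]:
    """Return the named character reference for ch, or None if it has no name."""
    name = codepoint2name.get(ord(ch))
    return f"&{name};" if name else None


for ch in ["é", "α", "™", "a"]:
    ref = named_reference(ch)
    if ref is not None:
        # A named reference decodes back to the character it abbreviates.
        assert html.unescape(ref) == ch
    print(repr(ch), ref)
```

Characters without a name in the table, such as a plain “a”, return `None`, in which case a numerical reference is the only escaped form available.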

This can be useful when you don't have a character on your keyboard, such as a trademark sign (™) or alpha (α). It can also be useful for making invisible or visually ambiguous characters clear in your source code, such as distinguishing a no-break space (&amp;#xA0;/&amp;nbsp;) from an ordinary space, or showing where a right-to-left mark (&amp;#x200F;/&amp;rlm;) occurs.
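As a sketch of the "make invisible characters visible" use, the Python below replaces format characters and non-ASCII spaces with hex NCRs and leaves everything else alone. Which characters count as "invisible" is an assumption here (Unicode general categories `Cf` and `Zs`), and `reveal_invisibles` is a made-up name.

```python
import unicodedata


def reveal_invisibles(text: str) -> str:
    """Escape format characters and non-ASCII spaces as hex NCRs."""
    out = []
    for ch in text:
        category = unicodedata.category(ch)
        # Cf covers marks such as U+200F RIGHT-TO-LEFT MARK;
        # Zs covers spaces such as U+00A0 NO-BREAK SPACE.
        if category == "Cf" or (category == "Zs" and ch != " "):
            out.append(f"&#x{ord(ch):X};")
        else:
            out.append(ch)
    return "".join(out)


print(reveal_invisibles("10\u00A0km"))
print(reveal_invisibles("abc\u200F!"))
```

Ordinary spaces and visible characters pass through unchanged, so the source stays readable apart from the characters you could not otherwise see.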

You should avoid overuse of NCRs because they make it harder to read source text when direct character input would suffice. It also takes longer to create them.

A similar character escape mechanism can be used in CSS, but the format is slightly different.
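For comparison, a CSS escape is a backslash followed by up to six hexadecimal digits, and a single space after the digits ends the escape. Here is a minimal Python sketch that produces that form; the function name `css_escape` and the choice to always append the terminating space are assumptions made for safety in this example.

```python
def css_escape(ch: str) -> str:
    """CSS escape for a single character, e.g. U+00E9 -> backslash, E9, space."""
    # The trailing space terminates the escape, so a following hex digit
    # in the text is not read as part of the code point.
    return f"\\{ord(ch):X} "


for ch in ["é", "™"]:
    print(repr(ch), repr(css_escape(ch)))
```

The trailing space is consumed by the CSS parser and does not show up in the rendered text.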

For more information about character escapes on the Web see the W3C page <a href="https://www.w3.org/International/questions/qa-escapes">Using character escapes in markup and CSS</a>.

By the way, one of the main reasons I use NCRs is to prevent normalisation in example text. For example, to produce NFD e-acute in an editor that automatically NFC-normalises your text you can use e&amp;#x301;. It's particularly useful for examples involving nuktas and such. But I suspect that that use case might be a little esoteric for inclusion here(?)
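To show why that works, here is a small Python sketch, assuming the escaped sequence is an “e” followed by the NCR for U+0301 COMBINING ACUTE ACCENT. NFC normalisation of the source text leaves the NCR alone because it is plain ASCII, whereas the decoded text would be composed into a single code point.

```python
import html
import unicodedata

# Source text as typed in the editor: "e" plus an NCR for U+0301.
source = "e&#x301;"

# An NFC-normalising editor sees only ASCII here, so nothing changes.
normalised_source = unicodedata.normalize("NFC", source)
assert normalised_source == source

# The browser decodes the NCR, giving the decomposed (NFD) sequence.
decoded = html.unescape(source)
assert decoded == "e\u0301"

# Had the combining accent been typed directly, NFC would have composed it.
composed = unicodedata.normalize("NFC", decoded)
assert composed == "\u00e9"

print(len(decoded), len(composed))
```

The same check works for other decomposed sequences, with the NCR swapped for the relevant combining character.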

[4] And finally, does the Q&A about email really belong in a FAQ about the Web – don't we have an FAQ about email?

hope that helps

Thank you for your input. Very valuable. Here are some notes:

[1] done (in our draft version, pending review by Unicode edcom) -- please suggest phrasing for other "naked" links you'd like to see replaced;

[2] Again, specific wording would help; the tenor looks fine. About the word "Yes". Most of the FAQ used to be written with Yes/No as the first word in each answer. We are trying to remove some of the "chattiness" of the FAQ, but in some cases we may have gone too far (or would need different wording for a question).

I've taken the tail end of your comment: "Unicode greatly simplifies the storage of multilingual data, since most non-Unicode encoded database data will be in multiple languages and code pages, and managing or extending that is a pain in the neck. That all goes away with Unicode encoded databases." and will use that verbatim unless I get better instructions.

[3] Very nice rewrite. I added it to our private review draft, with a fourth paragraph: "Use of NCRs interferes with automatic normalization applied by some editors. This can be desirable in documents that discuss normalization and need to show examples, but should otherwise be avoided."

It seems to me that the interaction is worth pointing out, so it can be avoided. And while we are at it, we might mention the sole known use case in passing.

[4] We do not have a dedicated FAQ for e-mail. In fact, there used to be much more and we deleted it, because it seemed outdated. We are currently looking for input on what issues should rise to the level that we mention them here or any other FAQ page. BTW, just pointing readers to other standards is fine -- but the group isn't sure what the message should be.

At irregular intervals the contents of the link at the top of your comment will be updated so anyone can check the status of the private draft.

w3c / i18n-discuss

Further comments on Unicode FAQ: Unicode and the Web #26