w3c / i18n-glossary

Definitions of terms used in W3C Internationalization documents.
https://w3c.github.io/i18n-glossary/

Ensure INFRA and i18n-glossary are in sync #28

Open aphillips opened 1 year ago

aphillips commented 1 year ago

See #20 for some details. We need to ensure that definitions found in both Infra and i18n-glossary are compatible (the same) and that we point appropriately to Infra from i18n-glossary for these.

xfq commented 7 months ago

Here's a quick comparison:

Code point

i18n-glossary:

A code point value represents the position of a character in a coded character set. For example, the code point for the letter á in the Unicode coded character set is 225 in decimal, or 0xE1 in hexadecimal notation. Hexadecimal notation is commonly used for referring to code points. See also Unicode code point.

Infra:

A code point is a Unicode code point and is represented as "U+" followed by four-to-six ASCII upper hex digits, in the range U+0000 to U+10FFFF, inclusive. A code point’s value is its underlying number.
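For illustration only, here is a minimal Python sketch of how the two definitions line up: the glossary talks about the numeric value (225 / 0xE1), while Infra talks about the "U+" notation. The helper name `u_plus` is hypothetical and does not come from either document.

```python
def u_plus(code_point: int) -> str:
    """Format a code point in "U+" notation (hypothetical helper)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the range U+0000 to U+10FFFF")
    # At least four upper-case hex digits; larger values use five or six.
    return f"U+{code_point:04X}"


value = ord("\u00e1")     # the letter á; its value is the underlying number
print(value, hex(value))  # 225 0xe1
print(u_plus(value))      # U+00E1
print(u_plus(0x1F600))    # U+1F600
```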

Surrogate

i18n-glossary:

Unicode definition: A Unicode code point in the range U+D800..U+DFFF. Reserved for use by UTF-16, where a pair of surrogate code units (a high surrogate followed by a low surrogate) “stand in” for a supplementary code point. This term is also defined by [INFRA].

Infra:

A leading surrogate is a code point that is in the range U+D800 to U+DBFF, inclusive.

A trailing surrogate is a code point that is in the range U+DC00 to U+DFFF, inclusive.

A surrogate is a leading surrogate or a trailing surrogate.
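As a rough check that the two surrogate definitions describe the same set, the sketch below writes Infra's three terms as Python predicates (the function names are my own) and compares their union with the single range quoted in the glossary. It also shows a surrogate pair standing in for a supplementary code point.

```python
def is_leading_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDBFF


def is_trailing_surrogate(cp: int) -> bool:
    return 0xDC00 <= cp <= 0xDFFF


def is_surrogate(cp: int) -> bool:
    return is_leading_surrogate(cp) or is_trailing_surrogate(cp)


# Infra's union covers exactly the glossary's range U+D800..U+DFFF.
assert all(
    is_surrogate(cp) == (0xD800 <= cp <= 0xDFFF) for cp in range(0x110000)
)

# UTF-16 encodes U+1F600 as a leading surrogate followed by a trailing one.
units = "\U0001F600".encode("utf-16-be")
high = int.from_bytes(units[0:2], "big")
low = int.from_bytes(units[2:4], "big")
print(f"{high:04X} {low:04X}")  # D83D DE00
```

Note that "leading/trailing" (Infra) and "high/low" (Unicode) name the same two sub-ranges.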

Scalar value

i18n-glossary:

Unicode definition: Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆ inclusive. (See definition D76 in Section 3.9, Unicode Encoding Forms.)

Infra:

A scalar value is a code point that is not a surrogate.

Scalar value string

Not defined in i18n-glossary.

Infra:

A scalar value string is a string whose code points are all scalar values.
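A small Python sketch of the scalar value and scalar value string definitions, with function names of my own choosing. One assumption to flag: a Python `str` is a sequence of code points rather than 16-bit code units as in Infra, so this only approximates Infra's model, but Python does allow lone surrogates in a `str`, which is enough to show the distinction.

```python
def is_scalar_value(cp: int) -> bool:
    # Infra: a code point that is not a surrogate.
    return 0 <= cp <= 0x10FFFF and not 0xD800 <= cp <= 0xDFFF


def is_scalar_value_string(s: str) -> bool:
    # Infra: a string whose code points are all scalar values.
    return all(is_scalar_value(ord(ch)) for ch in s)


# The Unicode wording (two integer ranges) selects the same code points.
assert all(
    is_scalar_value(cp) == (cp <= 0xD7FF or 0xE000 <= cp)
    for cp in range(0x110000)
)

print(is_scalar_value_string("caf\u00e9"))  # True
print(is_scalar_value_string("a\ud800b"))   # False (lone surrogate)
```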

Noncharacter

Not defined in i18n-glossary.

Infra:

A noncharacter is a code point that is in the range U+FDD0 to U+FDEF, inclusive, or U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, or U+10FFFF.

Unicode:

A code point that is permanently reserved for internal use. Noncharacters consist of the values U+nFFFE and U+nFFFF (where n is from 0 to 10₁₆), and the values U+FDD0..U+FDEF. See the FAQ on Private-Use Characters, Noncharacters and Sentinels.
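To check that Infra's explicit list and Unicode's compact formula agree, here is a quick Python sketch. The list is transcribed from the Infra text quoted above; the variable names are just for this example.

```python
import re

# Infra's explicit list, transcribed from the quoted definition.
INFRA_LIST = (
    "U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, "
    "U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, "
    "U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, "
    "U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, "
    "U+10FFFE, U+10FFFF"
)

contiguous = set(range(0xFDD0, 0xFDEF + 1))  # U+FDD0..U+FDEF, inclusive

infra = contiguous | {
    int(digits, 16) for digits in re.findall(r"U\+([0-9A-F]{4,6})", INFRA_LIST)
}

# Unicode: U+nFFFE and U+nFFFF for n from 0 to 0x10, plus U+FDD0..U+FDEF.
unicode_formula = contiguous | {
    (n << 16) | low for n in range(0x10 + 1) for low in (0xFFFE, 0xFFFF)
}

print(infra == unicode_formula)  # True
print(len(infra))                # 66
```

So the two definitions pick out the same 66 code points; the difference is only in presentation.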

Code unit

i18n-glossary:

The units of data used by a character encoding to encode or serialize characters into a programming language or other serialized form (such as a file). Common code units are 8-, 16-, and 32-bits in size. On the Web we are mostly concerned with bytes, which are technically 8-bit code units. However, in Javascript a char is a 16-bit code unit (related to the UTF-16 encoding of Unicode).

Infra (it's UTF-16 specific):

A string is a sequence of unsigned 16-bit integers, also known as code units.
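To show why the glossary's definition is broader than Infra's, this Python sketch counts code units for the same text in three encoding forms. The helper `code_unit_count` is hypothetical; the little-endian codec names are used only so that no byte order mark is added to the output.

```python
# Size in bytes of one code unit for each encoding form.
CODE_UNIT_BYTES = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}


def code_unit_count(text: str, encoding: str) -> int:
    """Number of code units needed to encode text (hypothetical helper)."""
    return len(text.encode(encoding)) // CODE_UNIT_BYTES[encoding]


sample = "a\u00e1\U0001F600"  # three code points: a, á, and U+1F600

for encoding in CODE_UNIT_BYTES:
    print(encoding, code_unit_count(sample, encoding))
# utf-8 7
# utf-16-le 4
# utf-32-le 3
```

The UTF-16 count (4) is what a JavaScript `length` would report for this string, which is the sense in which Infra's definition is UTF-16 specific.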

xfq commented 7 months ago

Ah, it seems that we also had a comparison in https://github.com/w3c/i18n-glossary/issues/49