tumblr / docs

Tumblr's public platform documentation.
Apache License 2.0

Clarify Unicode terminology #39

Closed leo60228 closed 3 years ago

leo60228 commented 3 years ago

Unicode is complicated.

The Unicode glossary gives four definitions for "character":

  1. The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader’s understanding.
  2. Synonym for abstract character.
    • A unit of information used for the organization, control, or representation of textual data.
  3. The basic unit of encoding for the Unicode character encoding.
  4. The English name for the ideographic written elements of Chinese origin. [See ideograph (2).]

#1 and #4 are human concepts, not technical ones. #2 isn't clearly defined (hence the "abstract"). #3 is what I believe is intended here, but the word "character" alone doesn't make that clear. This is why I changed it to "code point" ("Any value in the Unicode codespace; that is, the range of integers from 0 to 0x10FFFF"), which is consistent with other usages in the docs.
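
To illustrate why the wording matters, here is a minimal Python sketch comparing three ways of measuring the same text: code points, UTF-16 code units, and UTF-8 bytes. The helper name `text_lengths` and the sample strings are made up for this example and do not come from the Tumblr docs. It relies on Python 3's `str` being a sequence of code points, so `len()` counts code points directly.

```python
def text_lengths(text: str) -> dict:
    """Measure one string in three different units (hypothetical helper)."""
    return {
        # Python 3 strings are sequences of code points.
        "code_points": len(text),
        # Each UTF-16 code unit is 2 bytes; "-le" avoids a byte order mark.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


# Illustrative samples (assumptions, not taken from the docs):
samples = {
    "e + combining acute": "e\u0301",  # renders as one accented letter
    "precomposed e-acute": "\u00e9",   # renders the same, one code point
    "emoji U+1F600": "\U0001F600",     # outside the Basic Multilingual Plane
}

for label, text in samples.items():
    print(label, text_lengths(text))

# The codespace ends at 0x10FFFF: chr() accepts that value and rejects the next.
assert ord(chr(0x10FFFF)) == 0x10FFFF
try:
    chr(0x110000)
except ValueError:
    print("0x110000 is outside the Unicode codespace")
```

The two accented samples look identical to a reader but differ in code point count, and the emoji is one code point but two UTF-16 code units. A limit stated in "characters" is therefore ambiguous, while one stated in "code points" is not.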