w3c / rdf-concepts

https://w3c.github.io/rdf-concepts/

Improve Unicode terminology and term references. #59

Closed gkellogg closed 1 year ago

gkellogg commented 1 year ago

This tries to improve our use of Unicode terminology. Note that there is no good referenceable term for "Unicode string" other than [[UNICODE]]. The RDF Literal definition is updated to exclude surrogate code points by referencing "Unicode scalar value", which is a normative change.

Most of the Unicode-specific terminology comes from i18n-glossary.

Fixes #51.



gkellogg commented 1 year ago

@aphillips said:

Note that in today's (2023-08-31) teleconference, I18N gave me an action to add XML-related string definitions to the glossary, best practices to our best practices doc, and a quotable reference for the string equality bits to charmod-norm. That will take a little time to land but might help you in the future by providing linkable stuff you can just reference.

I'd much rather reference something in i18n-glossary than define our own "string" term 😄.

pfps commented 1 year ago

It seems wrong for RDF concepts to even discuss character encodings here. Character sets or character encodings do not matter at all for RDF strings, just the Unicode code points.

As far as I can tell, well-formed Unicode code unit sequences cannot encode surrogate code points. And this is the same distinction that forbids values that are not Unicode code points at all. (From D84 in The Unicode Standard Version 15.0 - Core Specification - "Any code unit sequence that would correspond to a code point outside the defined range of Unicode scalar values would, for example, be ill-formed.")

So I don't think it is acceptable to say that an RDF string is a sequence of Unicode code points, as just about any system that handles Unicode may only accept well-formed Unicode code unit sequences. Unicode scalar values must be used instead of Unicode code points.
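An illustrative sketch of the distinction drawn in this comment (added for readers, not written by the commenter): in Python, a lone surrogate code point can exist in a `str`, but it cannot be encoded as well-formed UTF-8. The helper names `is_scalar_value` and `encodes_as_utf8` are hypothetical and exist only for this example.

```python
def is_scalar_value(cp: int) -> bool:
    """True if cp is a Unicode scalar value: any code point
    in U+0000..U+10FFFF except the surrogates U+D800..U+DFFF."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def encodes_as_utf8(cp: int) -> bool:
    """True if the code point cp (0..0x10FFFF) can be encoded as UTF-8.

    Python's strict UTF-8 codec refuses surrogate code points,
    raising UnicodeEncodeError."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for cp in (0x41, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x1F600):
        print(f"U+{cp:04X} scalar={is_scalar_value(cp)} utf8={encodes_as_utf8(cp)}")
```

Under these assumptions, the two predicates agree on every code point: exactly the surrogates are code points that are not scalar values, and exactly the surrogates fail to encode.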

gkellogg commented 1 year ago

It seems wrong for RDF concepts to even discuss character encodings here. Character sets or character encodings do not matter at all for RDF strings, just the Unicode code points.

Character encodings are discussed as they relate to concrete syntaxes. If you think it would improve the text, we could remove "... in any character encoding" from the first paragraph of Strings in RDF.

As far as I can tell, well-formed Unicode code unit sequences cannot encode surrogate code points. And this is the same distinction that forbids values that are not Unicode code points at all. (From D84 in The Unicode Standard Version 15.0 - Core Specification - "Any code unit sequence that would correspond to a code point outside the defined range of Unicode scalar values would, for example, be ill-formed.")

Note that strings are also restricted to the XML Char production, which excludes surrogates.

So I don't think it is acceptable to say that an RDF string is a sequence of Unicode code points, as just about any system that handles Unicode may only accept well-formed Unicode code unit sequences. Unicode scalar values must be used instead of Unicode code points.

What is accepted is a subject for concrete syntaxes, which are also restricted to the XML Char production. Code points seem like a reasonable representation for an abstract syntax. @aphillips discussed his thought process on using code points vs. code units above. As we're not experts in Unicode, shouldn't we defer to such expert guidance?

pfps commented 1 year ago

As far as I can tell, @aphillips did not express a preference between Unicode code point and Unicode scalar value. But, as you say, there is a further restriction of literal values (which is what I'm more interested in, but which also restricts RDF strings) to XML Char (currently the old one that excludes most control characters, but I think the reference is supposed to be updated to 1.1 which only excludes three scalar values). So I guess things are fine then. It is a bit strange to exclude these three scalar values and not the other noncharacter scalar values - but not worth worrying about.
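An illustrative sketch of the two Char productions mentioned in this comment (added for readers, not written by the commenter). It transcribes the `Char` production from XML 1.0 and from XML 1.1 as Python predicates and lists the scalar values each one excludes. The function names are hypothetical, chosen only for this example.

```python
def is_scalar_value(cp: int) -> bool:
    """Any code point in U+0000..U+10FFFF except surrogates."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def is_xml10_char(cp: int) -> bool:
    """XML 1.0: Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
    | [#xE000-#xFFFD] | [#x10000-#x10FFFF]"""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def is_xml11_char(cp: int) -> bool:
    """XML 1.1: Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD]
    | [#x10000-#x10FFFF]"""
    return (0x1 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def excluded_scalar_values(is_char) -> list[int]:
    """Scalar values that the given Char predicate rejects."""
    return [cp for cp in range(0x110000)
            if is_scalar_value(cp) and not is_char(cp)]


if __name__ == "__main__":
    print([f"U+{cp:04X}" for cp in excluded_scalar_values(is_xml11_char)])
    print(len(excluded_scalar_values(is_xml10_char)))
```

With these definitions, the XML 1.1 production rejects only U+0000, U+FFFE, and U+FFFF among scalar values, while still admitting other noncharacters such as U+FDD0 and U+1FFFE. Both productions reject every surrogate code point.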

pchampin commented 1 year ago

This was discussed during the TPAC 2023 meeting: https://www.w3.org/2023/09/12-rdf-star-minutes.html#t03