json-ld / yaml-ld

CG specification for YAML-LD and UCR
https://json-ld.github.io/yaml-ld/spec

char encoding: UTF-8 only? #15

Closed VladimirAlexiev closed 2 years ago

VladimirAlexiev commented 2 years ago

https://ietf-wg-httpapi.github.io/mediatypes/draft-ietf-httpapi-yaml-mediatypes.html#name-yaml-and-json (@ioggstream) "the following ones might have interoperability issues with JSON: non UTF-8 encoding, since YAML supports UTF-16 and UTF-32 in addition to UTF-8".

Can we demand that YAML-LD be in UTF-8 only?

Turtle allows only UTF-8 (https://www.w3.org/TR/turtle/#sec-mediaReg), so I see no loss in dropping those other encodings.
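In concrete terms, the rule being asked for would have a processor reject any byte stream that is not valid UTF-8 before parsing. A minimal sketch of such a check, assuming Python and PyYAML (neither of which the thread mandates; the function name is hypothetical):

```python
# Hypothetical illustration of a "UTF-8 only" rule for YAML-LD input:
# refuse anything that is not valid UTF-8, then parse as usual.
import yaml  # PyYAML

def load_yaml_ld_utf8_only(raw: bytes):
    # UTF-16/32 streams announce themselves with a BOM or embedded NULs;
    # neither can appear at the start of a valid UTF-8 YAML document.
    if raw[:2] in (b"\xfe\xff", b"\xff\xfe") or b"\x00" in raw[:4]:
        raise ValueError("YAML-LD input must be UTF-8, not UTF-16/32")
    try:
        text = raw.decode("utf-8")  # strict: rejects malformed byte sequences
    except UnicodeDecodeError as err:
        raise ValueError(f"YAML-LD input must be UTF-8: {err}") from None
    return yaml.safe_load(text)
```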

ioggstream commented 2 years ago

I'm ok with UTF-8 only. We should ask the YAML community for feedback later.

gkellogg commented 2 years ago

Both RDF in general and JSON in particular are UTF-8 only. YAML allows greater variation, but if we allow YAML-LD to include, e.g., UTF-16 or UTF-32, the potential corner cases become quite difficult to handle. I would say we restrict YAML-LD files to UTF-8 only.
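The corner cases start before parsing even begins: YAML 1.2 (§5.2) requires a processor to deduce the character encoding from the first bytes of the stream, using a BOM if present and the position of NUL bytes otherwise. A rough Python reconstruction of that deduction (mine, not from any implementation) shows the machinery a UTF-8-only profile would let implementations drop:

```python
import codecs

def sniff_yaml_encoding(head: bytes) -> str:
    """Deduce a YAML stream's encoding from its first bytes (YAML 1.2 §5.2)."""
    # Check longer BOMs first so UTF-32LE (FF FE 00 00) is not
    # mistaken for UTF-16LE (FF FE).
    for bom, name in [
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF8, "utf-8"),
    ]:
        if head.startswith(bom):
            return name
    # No BOM: YAML requires the first character to be ASCII,
    # so the positions of NUL bytes reveal the encoding.
    if head[:3] == b"\x00\x00\x00":
        return "utf-32-be"
    if head[1:4] == b"\x00\x00\x00":
        return "utf-32-le"
    if head[:1] == b"\x00":
        return "utf-16-be"
    if head[1:2] == b"\x00":
        return "utf-16-le"
    return "utf-8"
```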

anatoly-scherbakov commented 2 years ago

100% agreed, I have not seen any practical use cases where anything other than UTF-8 would have been necessary.

pchampin commented 2 years ago

Nitpicking, but

  • RDF as an abstract syntax is encoding agnostic; it is not UTF-8 only. It might be true that most concrete syntaxes are UTF-8 only (I didn't check), but that does not make UTF-16 or UTF-32 unsuitable for RDF in general.

That being said, I know that in practice, encoding issues can be a mess. But if the YAML ecosystem is dealing correctly with those, it may be a waste not to take advantage of it.

gkellogg commented 2 years ago

Nitpicking, but

  • RDF as an abstract syntax is encoding agnostic; it is not UTF-8 only. It might be true that most concrete syntaxes are UTF-8 only (I didn't check), but that does not make UTF-16 or UTF-32 unsuitable for RDF in general.

You're absolutely right, it's the specific encodings that restrict themselves (uniformly, I believe) to UTF-8. Allowing anything other than UTF-8 would create issues when re-serializing to something like Turtle/TriG or even N-Quads. (Actually, RDFa and possibly RDF/XML allow other encodings, but that's because of the legacy HTML/XML carrier).

pchampin commented 2 years ago

Allowing anything other than UTF-8 would create issues when re-serializing to something like Turtle/TriG or even N-Quads.

I disagree. UTF-8 is a universal encoding scheme for Unicode, so any Unicode string, regardless of its original encoding, can be serialized without any problem in Turtle, N-Quads, etc.
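A small Python illustration of this point (the sample string is mine): the encoding is just a byte-level representation, and once decoded, any string re-encodes to UTF-8 losslessly.

```python
# Any Unicode string survives a change of encoding scheme, so UTF-16
# YAML input never prevents UTF-8 Turtle/N-Quads output.
literal = "Straße ↔ Прага"            # arbitrary non-ASCII text
as_utf16 = literal.encode("utf-16")   # how it might arrive in a YAML stream
assert as_utf16.decode("utf-16") == literal
as_utf8 = literal.encode("utf-8")     # ready for re-serialization
```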

VladimirAlexiev commented 2 years ago

can be serialized without any problem

In theory yes. In practice, not always :-) (At least not so easily)

pchampin commented 2 years ago

@VladimirAlexiev I know that encoding issues can be nasty. I got bitten too; I still feel the scars. And most of my problems came from 1) the lack of explicit encoding metadata, and 2) developers making naive simplifying assumptions ("everything is ASCII", "everything is UTF-8"...).

I think we will eventually be better off by raising developers' awareness of these problems, rather than indulging their simplifying assumptions.
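To make the second failure mode concrete, a tiny Python example (mine, not from the thread) of a naive assumption corrupting data without anything failing loudly:

```python
# "Everything is Latin-1" meets UTF-8 bytes: no exception, just mojibake.
original = "café"
raw = original.encode("utf-8")    # b'caf\xc3\xa9'
misread = raw.decode("latin-1")   # the naive assumption "succeeds"
print(misread)                    # 'cafÃ©', silently corrupted
```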

ioggstream commented 2 years ago

Some chronological considerations:

  1. YAML 1.1 is UTF-8 and UTF-16 only
  2. YAML 1.2.1 is UTF-8, UTF-16, and UTF-32:

On input, a YAML processor must support the UTF-8 and UTF-16 character encodings. For JSON compatibility, the UTF-32 encodings must also be supported.

  3. JSON RFC 7159 is UTF-8, UTF-16, and UTF-32
  4. JSON RFC 8259 is UTF-8 only

I think the direction is clear: YAML added UTF-32 for interoperability with JSON, while JSON itself, 8 years later, dropped UTF-16 and UTF-32.

My opinion is then to require just UTF-8, unless implementers explicitly request UTF-16/32 support.

TallTed commented 2 years ago

Remembering Postel's Law, I'll suggest that YAML-LD processors SHOULD accept UTF-8, UTF-16, and UTF-32. It may be reasonable to make UTF-8 consumption a MUST, with the others as SHOULD or MAY.

YAML-LD generation could reasonably be restricted to UTF-8 only, or allowed to support UTF-16 and/or UTF-32 output upon explicit user request (though there doesn't seem to be any situation where a consumer would handle UTF-16 or UTF-32 but not UTF-8).

gkellogg commented 2 years ago

I agree on the SHOULD accept, but for interoperability we should probably always emit UTF-8. Of course, any given implementation may provide its own options for preserving the input character encoding.
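Putting the two halves together, a hedged sketch of the policy converging here (Python/PyYAML again, reusing the hypothetical sniff_yaml_encoding from the earlier sketch): be liberal about input encodings, but always serialize back out as UTF-8.

```python
import yaml  # PyYAML

def normalize_yaml_ld(raw: bytes) -> bytes:
    """Accept UTF-8/16/32 input; always re-emit the document as UTF-8."""
    text = raw.decode(sniff_yaml_encoding(raw[:4]))  # sketch defined earlier
    text = text.removeprefix("\ufeff")  # drop the BOM character, if any
    doc = yaml.safe_load(text)
    return yaml.safe_dump(doc, allow_unicode=True).encode("utf-8")
```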

ioggstream commented 2 years ago

Remembering Postel's Law

Some years ago I found this very interesting lecture on The Harmful Consequences of the Robustness Principle :P In general, I still see a lot of encoding issues in wide API ecosystems, so I tend to lean on UTF-8.

for interoperability, we should probably always emit UTF-8

+1.

ericprud commented 2 years ago

+1 to being strict on what you accept; find and fix errors as quickly as possible.

Every spec for an RDF textual format I've worked on stipulates UTF-8. This precedent was established in discussion with Martin Dürst (W3C i18n). RDF formats layered on a host syntax (XML, JSON) inherit encoding flexibility from their parent format.

This is a fairly consistent story, apart from the split between text/ and application/, where Turtle and CSV favor direct textual display over binary download. Also, I have no idea whether people actually include a Character Set or Encoding Type in foo.tsv headers. I guess it then falls back to the text/* default, which is probably something jingoistic like 8859-1. Maybe we can request an update from the authors: U of MN Internet Gopher Team <gopher@boombox.micro.umn.edu>.

ioggstream commented 2 years ago

Fixed by #34