Closed VladimirAlexiev closed 2 years ago
I'm ok with UTF-8 only. We should ask the YAML community for feedback later.
Both RDF in general and JSON specifically are UTF-8 only. YAML allows greater variation, but if we allow YAML-LD to include, e.g., UTF-16 or UTF-32, the potential corner cases become quite difficult to handle. I would say that we restrict YAML-LD file compatibility to UTF-8 only.
100% agreed, I have not seen any practical use cases where anything other than UTF-8 would have been necessary.
That being said, I know that in practice, encoding issues can be a mess. But if the YAML ecosystem is dealing correctly with those, it may be a waste not to take advantage of it.
Nitpicking, but
- RDF as an abstract syntax is encoding agnostic, it is not UTF-8 only. It might be true that most concrete syntaxes are UTF-8 only (I didn't check) but that does not make UTF-16 or UTF-32 unsuitable for RDF in general.
You're absolutely right, it's the specific encodings that restrict themselves (uniformly, I believe) to UTF-8. Allowing anything other than UTF-8 would create issues when re-serializing to something like Turtle/TriG or even N-Quads. (Actually, RDFa and possibly RDF/XML allow other encodings, but that's because of the legacy HTML/XML carrier).
Allowing anything other than UTF-8 would create issues when re-serializing to something like Turtle/TriG or even N-Quads.
I disagree. UTF-8 is a universal coding scheme for Unicode, so any Unicode string, regardless of its original encoding, can be serialized without any problem in Turtle, N-Quads...
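To make the claim concrete, here is a minimal sketch of transcoding UTF-16 or UTF-32 input to UTF-8. The helper name `to_utf8` and the sample string are illustrative assumptions, not part of any YAML-LD tooling.

```python
# Illustrative only: `to_utf8` and the sample text are made up for this sketch.

def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode `data` with the declared source encoding, then re-encode as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


sample = "name: D\u00fcrst \u2603"

# Python's "utf-16" and "utf-32" codecs write a BOM on encode and consume it on decode.
utf16_bytes = sample.encode("utf-16")
utf32_bytes = sample.encode("utf-32")

# Both inputs end up as the same UTF-8 byte sequence.
assert to_utf8(utf16_bytes, "utf-16") == sample.encode("utf-8")
assert to_utf8(utf32_bytes, "utf-32") == sample.encode("utf-8")
```

Note that this only works because the source encoding is passed in explicitly. When that information is missing or wrong, the decode step either fails or silently produces the wrong characters.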
can be serialized without any problem
In theory yes. In practice, not always :-) (At least not so easily)
@VladimirAlexiev I know that encoding issues can be nasty. I got bitten too, I still feel the scars. And most of my problems came from 1) the lack of explicit encoding metadata, and 2) developers making naive simplifying assumptions ("everything is ASCII", "everything is UTF-8"...).
I think we will eventually be better off raising developers' awareness of these problems, rather than indulging their simplifying assumptions.
Some chronological considerations:
On input, a YAML processor must support the UTF-8 and UTF-16 character encodings. For JSON compatibility, the UTF-32 encodings must also be supported.
I think that the direction is clear: YAML supported UTF-32 for interoperability with JSON, which 8 years later dropped UTF-16 as well.
My opinion is then to support just UTF-8 unless implementers explicitly request UTF-16/32 support.
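For reference, honoring the YAML input requirement quoted above means a processor has to work out the encoding before it can parse anything. A rough sketch of BOM-based detection follows. The helper names `sniff_encoding` and `decode_stream` are hypothetical, and the sketch is a simplification: it only looks at byte order marks and falls back to UTF-8, whereas the YAML specification also describes detection for streams without a BOM.

```python
import codecs
from typing import Tuple

# UTF-32 BOMs are checked first: the UTF-32-LE BOM (FF FE 00 00)
# begins with the same bytes as the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_encoding(data: bytes) -> Tuple[str, int]:
    """Return (encoding, bom_length). Assumes UTF-8 when no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return "utf-8", 0


def decode_stream(data: bytes) -> str:
    """Strip the BOM, if any, and decode with the detected encoding."""
    encoding, skip = sniff_encoding(data)
    return data[skip:].decode(encoding)
```

Even this small version has an ordering trap in the BOM table, which gives a feel for the corner cases that a UTF-8-only rule avoids.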
Remembering Postel's Law, I'll suggest that a YAML-LD processor SHOULD accept UTF-8, UTF-16, and UTF-32. It may be reasonable to make UTF-8 consumption a MUST with the others as SHOULD or MAY.
YAML-LD generation could reasonably be absolutely restricted to UTF-8, or allowed to support UTF-16 and/or UTF-32 output upon specific user control/request (though there doesn't seem to be any situation where a consumer would handle UTF-16 or UTF-32 but not UTF-8).
I agree on the SHOULD accept, but for interoperability, we should probably always emit UTF-8. Of course, any given implementation may provide its own options for preserving the input character encoding.
Remembering Postel's Law
Some years ago I found this very interesting lecture on The Harmful Consequences of the Robustness Principle :P In general I still see a lot of encoding issues in wide API ecosystems, so I tend to lean on UTF-8.
for interoperability, we should probably always emit UTF-8
+1.
+1 to being strict on what you accept; find and fix errors as quickly as possible.
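As a sketch of what "strict on what you accept" could look like for a UTF-8-only rule: the function below rejects anything that is not valid UTF-8 and fails early with an explicit error. The names `NotUtf8Error` and `read_utf8_only` are hypothetical, and tolerating an optional UTF-8 BOM is an assumption of this sketch, not something decided in this thread.

```python
import codecs


class NotUtf8Error(ValueError):
    """Hypothetical error type for input that is not UTF-8."""


def read_utf8_only(data: bytes) -> str:
    """Decode `data` as UTF-8 or raise NotUtf8Error."""
    # Check for UTF-16/UTF-32 BOMs first, purely to give a clearer error message.
    for bom in (codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE,
                codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE):
        if data.startswith(bom):
            raise NotUtf8Error("input starts with a UTF-16/UTF-32 byte order mark")
    try:
        # "utf-8-sig" drops an optional UTF-8 BOM; decoding errors are strict by default.
        return data.decode("utf-8-sig")
    except UnicodeDecodeError as exc:
        raise NotUtf8Error(f"input is not valid UTF-8: {exc}") from exc
```

The point of the dedicated error is that a rejected document is reported at the boundary, instead of surfacing later as garbled strings in the parsed data.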
Every spec for an RDF textual format I've worked on stipulates UTF-8. This precedent was established in discussion with Martin Dürst (W3C i18n). Non-textual formats (XML, JSON) inherit flexibility from their parent format.
encoding="utf-8"
This is a fairly consistent story apart from the split between text/ and application/, where Turtle and CSV favor direct textual display over binary download.
Also, I have no idea if people actually include a Character Set or Encoding Type in foo.tsv headers. I guess then it falls back to a text/* default, which is probably something jingoistic like 8859-1. Maybe we can request an update from the authors: U of MN Internet Gopher Team <gopher&boombox.micro.umn.edu>.
Fixed by #34
https://ietf-wg-httpapi.github.io/mediatypes/draft-ietf-httpapi-yaml-mediatypes.html#name-yaml-and-json (@ioggstream) "the following ones might have interoperability issues with JSON: non UTF-8 encoding, since YAML supports UTF-16 and UTF-32 in addition to UTF-8".
Can we demand that YAML-LD be in UTF-8 only?
Turtle allows only UTF-8 (https://www.w3.org/TR/turtle/#sec-mediaReg), so I see no loss in dropping those other encodings.