w3ctag / design-principles

A small-but-growing set of design principles collected by the TAG while reviewing specifications
https://w3ctag.github.io/design-principles

New principle: Discourage polyglot formats #239

Open hober opened 4 years ago

hober commented 4 years ago

This is a generalization of @dbaron's concern in #128.

Polyglot formats tend to lead to interoperability problems, so we should discourage defining them.

Polyglot formats are formats which are defined such that they can be processed by two or more different kinds of processors with roughly equivalent results. For instance, it's possible to write a computer program that is simultaneously valid C and valid C++. Another example is Polyglot Markup, an abandoned attempt to define a markup syntax that was simultaneously valid HTML and XHTML, and whose documents would produce roughly equivalent DOM trees when parsed with an HTML parser or an XML parser.

Authors tend to test their document with only one kind of processor, so they inadvertently introduce errors which would only be caught by the other kind of processor. In the case of Polyglot Markup, this happened when authors introduced XML errors into their document but only tested with an HTML parser. Consumers using an XML parser would, instead of seeing the document, see an XML parser error screen.
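A minimal sketch of that failure mode (example invented for illustration, not taken from the Polyglot Markup spec), using Python's standard-library parsers: a bare ampersand that an HTML parser silently error-corrects is a fatal well-formedness error for an XML parser.

    # Same bytes, two kinds of processors: HTML recovers, XML refuses.
    from html.parser import HTMLParser
    import xml.etree.ElementTree as ET

    doc = "<p>fish & chips</p>"  # bare "&" is an XML well-formedness error

    HTMLParser().feed(doc)  # HTML parsing: silently error-corrected
    try:
        ET.fromstring(doc)  # XML parsing: fatal error, no tree at all
    except ET.ParseError as err:
        print("XML parser error:", err)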

If the polyglot format contains fields that are only used by one kind of processor, such fields are likely to experience bit rot problems when authors only routinely test their documents with the other kind of processor. (For instance, if authors routinely use JSON parsers to test their JSON-LD, the @context section is likely to experience bit rot. Downstream consumers of the document who use a JSON-LD processor will start encountering bugs. They'll report it upstream, and that person will say "it works for me, it must be a bug in your software".)
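A hypothetical sketch of that bit-rot scenario using the PyLD library (the document and vocabulary IRI are invented): a JSON key drifts, plain JSON consumers still see the data, and the JSON-LD view silently loses it.

    import json
    from pyld import jsonld  # third-party: pip install PyLD

    doc = {
        "@context": {"name": "https://example.org/vocab#name"},
        "fullName": "Alice",  # key renamed from "name"; never re-tested as JSON-LD
    }

    print(json.loads(json.dumps(doc))["fullName"])  # plain JSON: works fine
    print(jsonld.expand(doc))  # JSON-LD: [] -- the unmapped term is dropped

The asymmetry is the point: nothing in an author's JSON-only test suite can notice that the JSON-LD half of the polyglot has rotted.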

gkellogg commented 4 years ago

One might consider that a value of JSON-LD over plain JSON is to provide a means of validating that the described data corresponds to some schema. JSON-LD provides a means to parse the JSON to ensure that it adheres to some normative constraints, and through an RDF interpretation, that the semantic meaning is consistent with vocabularies in use.

Providing a means to specify data without a corresponding way to validate that data, beyond pure JSON structure, does not help the cause of producers wanting to ensure that such data actually says what it means in a reproducible and platform-neutral way.

As it is, HTML, itself, is widely processed by different kinds of processors (e.g., web browsers and search engines) for different purposes, and the mere presence of the script tag in HTML specifically provides for the means of inserting non-HTML data, be it JavaScript or some other data format.

msporny commented 4 years ago

Authors tend to test their document with only one kind of processor, so they inadvertently introduce errors which would only be caught by the other kind of processor.

"Web developers tend to only test their websites with one web browser and inadvertently introduce code/markup that doesn't work on other browsers..." ... how is this assertion (and solution to the assertion) different than the one being put forward? Aren't the only solutions to this either 1) a monoculture, or 2) testing?

They'll report it upstream, and that person will say "it works for me, it must be a bug in your software".

Can you point to the data that shows that this is happening on a regular basis for JSON-LD?

Typically what happens in this case is that the developer goes "oh, my bad, I'll fix that." and they update their @context (or remove it).

If the W3C TAG would like to go down this path, and I suggest strongly that it does not -- it'll be a waste of everyone's time -- there will be a strong expectation that the W3C TAG bring a solid body of evidence wrt. how the polyglot nature of JSON-LD has resulted in what you're claiming above.

To be clear -- companies deploying JSON-LD tend to deploy BOTH JSON and JSON-LD simultaneously and don't seem to have an issue with doing so. They also interop with other organizations doing so. When we find that someone has a bad @context, they tend to be quickly informed of that fact, because it breaks interop for the data they're publishing... and they either remove the @context (fixing the bitrot) or fix the @context (enabling broader interoperability). In other words, the existence of @context enables us to do a better job at interop than just plain ol' JSON because we can detect when things go wrong... and we process as both JSON and JSON-LD in different parts of the system because the polyglot nature of the language is extremely helpful depending on the type of processing you need to do.

dbaron commented 4 years ago

Aren't the only solutions to this either 1) a monoculture, or 2) testing?

I think there's also 3) interoperability. Admittedly it requires the interoperability to be quite good, but that's the goal of Web Platform Tests.

OR13 commented 3 years ago

Does this argument also apply to any language that supports typecasting, or to OAS 3.0, which allows JSON Schema as JSON or YAML?

They'll report it upstream, and that person will say "it works for me, it must be a bug in your software".

To which the reply should obviously be:

Why did you include @context in YOUR JSON if you can't be bothered to understand how to use it properly?

Where are your unit tests?

Do you also forward __proto__ or other prototype pollution values you don't understand to downstream consumers?

If the system is truly polyglot the lazy developer can just remove the @context and stop pretending to understand JSON-LD, or they can root cause the issue, fix it and preserve interop.

Being anti-polyglot formats seems a bit like being anti "ability to do multiple things well at the same time"... I agree, most folks can't... should we lower the bar, or just make being excellent optional?

OR13 commented 3 years ago

I thought about this a bit more, and I think I see the point @hober is making.

Let's consider 2 examples of polyglot data formats in action.

Schema.org and Google Knowledge Graph Search API

https://developers.google.com/knowledge-graph/reference/rest/v1

When querying this JSON-LD API, the JSON content is returned with Content-Type: application/json; charset=UTF-8, but it contains an @context.

However, the content is valid JSON-LD... Perhaps it would be better if Google did not return this data as JSON-LD since the content type is JSON... or maybe it would be better to set the content type header correctly to JSON-LD.... Most developers unfamiliar with JSON-LD would be surprised by a different content type header, and would have no problem ignoring @context until they had a reason to care about it.... But maybe all that ignoring the context stuff would eventually cause Google Knowledge Graph Search to no longer implement schema.org or linked data knowledge representations correctly.
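For illustration, querying that API looks roughly like this (endpoint and parameters are from the linked Google documentation; the key is a placeholder):

    import requests  # third-party: pip install requests

    resp = requests.get(
        "https://kgsearch.googleapis.com/v1/entities:search",
        params={"query": "Tim Berners-Lee", "limit": 1, "key": "YOUR_API_KEY"},
    )
    print(resp.headers["Content-Type"])  # application/json; charset=UTF-8
    body = resp.json()
    # A plain-JSON consumer can ignore "@context" entirely; a JSON-LD
    # consumer can use it to interpret the very same response.
    print("@context" in body)  # True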

This specific case was discussed extensively when we were considering DID Document representations, and is the reason the implementation guide contains this section:

https://w3c.github.io/did-imp-guide/#data-model-and-representations

When writing this section and during the great debate wrt the DID Core abstract data model, I tried contacting the folks at Google a number of times, and never got a response... In the absence of a reply I advocated for handling things the same way Google does, which the working group did not accept unanimously; even though many folks agreed with the approach, some preferred to make did+json and did+ld+json incompatible... I happen to think that decision is not correct, and luckily today, did+json can be did+ld+json, just like Google Knowledge Graph search results function today.

See https://w3c.github.io/did-core/#representations and the normative requirements... you will note that did+json is allowed (but not required) to contain an @context, just like how Google returns application/json with an @context.

Mapping Smart Health Cards into the W3C VC Data Model

Per the spec https://w3c.github.io/vc-data-model/#contexts

Verifiable credentials and verifiable presentations MUST include a @context property.

However, Smart Health Cards claims to implement the standard yet does not follow this requirement.

The section on "mapping smart health cards to the VC Data Model" would not be necessary if VCI followed the normative requirements of the spec.

https://spec.smarthealth.cards/credential-modeling/#mapping-into-the-w3c-vc-data-model

This means that all Vaccination Credential JWTs are not actually standards-compliant W3C Verifiable Credentials UNTIL they are mapped to that data format... This is a bit like claiming to support USB-C because you can buy a Thunderbolt adapter that supports USB-C.

In this case, the polyglot format is even worse, since per the VCI spec:

    {
      "@vocab": "https://smarthealth.cards#",
      "fhirBundle": {
        "@id": "https://smarthealth.cards#fhirBundle",
        "@type": "@json"
      }
    }

This means that fhirBundle is JSON, but really it's some version of FHIR JSON; it's meant to be interpreted as arbitrary JSON, and all terms in the credential have definitions like this:

https://smarthealth.cards#fhirBundle https://smarthealth.cards#fhirVersion

So this is a custom COVID vaccination format using JSON, JSON-LD, FHIR, and the VC Data Model, plus a custom compression scheme for JWTs which is not described in the VC Data Model.

In this case, it would probably be better to just not claim to conform to the W3C VC Data Model standard; then there would be no need to add all this mostly incorrect JSON-LD, and you could encode the FHIR JSON directly into a JWT claim field, since JWTs are understood to carry arbitrary serialized data that has no registered semantics beyond the reserved claim names.
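A hypothetical sketch of that alternative, with the FHIR bundle carried as an ordinary private JWT claim and no JSON-LD layer (the claim name and issuer are illustrative, not registered anywhere):

    import json

    payload = {
        "iss": "https://example.org/issuer",  # standard JWT claims...
        "nbf": 1650000000,
        "fhirBundle": {                       # ...plus the FHIR JSON, directly
            "resourceType": "Bundle",
            "type": "collection",
            "entry": [],
        },
    }
    print(json.dumps(payload))  # sign with any JOSE library to produce the JWT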

This would also help clearly communicate that other VC-JWT implementations which are clearly not interoperable with smart health cards, are not supposed to be interoperable with them.

Here is some example HealthKit code that relies on this "polyglot" format...

https://developer.apple.com/documentation/healthkit/samples/accessing_data_from_a_smart_health_card

OR13 commented 3 years ago

Please refer to these notes for the context of my previous comment: https://www.w3.org/2021/09/21-did10-minutes.html

I incorrectly asserted the link had been removed initially.

Here is the link to the TAG review of DID Core, which did contain references to this design principle issue:

https://github.com/w3ctag/design-reviews/issues/556#issuecomment-763900128 Was this issue raised after the TAG review, but not shared with the WG? It's hard to tell the timing from the comment thread.

OR13 commented 3 years ago

Wondering how to move this issue forward...

Seems like the problem might be identifying and labeling when a format has become "polyglot"...

For example, the first time a spec normatively requires that the same data model be parseable by 2 independent parsers seems like a potential first time to raise the alarm bell.

Do RFC 7159 and RFC 7493 count as 2 independent parsers?

OR13 commented 1 year ago

Does this issue apply to all uses of structured suffixes?

Or only uses of multiple structured suffixes?

Or not to structured suffixes at all?

OR13 commented 1 year ago

https://datatracker.ietf.org/doc/draft-ietf-mediaman-suffixes/

It would be good to get comments on this IETF draft on the IETF list... from people in W3C who think this issue should remain open... or it would be good to close the issue.

OR13 commented 1 year ago

FWIW, I asked the lists for clarity on this, and referenced this issue: https://mailarchive.ietf.org/arch/msg/media-types/JxzT03Dhe7Nt8cPAfjbDx3WVQRM/

Hopefully IETF can clarify how this is supposed to work.

hober commented 12 months ago

I wrote down some additional thoughts the other day.

OR13 commented 11 months ago

@hober have you commented on https://datatracker.ietf.org/doc/draft-ietf-mediaman-suffixes/ ?

It would be nice to see guidance on creating media types given to the group that manages the media type registries; afaik, W3C is not the keeper of media types, even if we have expertise on how they can interact uncomfortably with the web.

pchampin commented 11 months ago

I wrote down some additional thoughts the other day.

Very interesting read. I agree with a lot of what you write... except for one important premise: I don't think that JSON-LD qualifies as a polyglot format. Here's why.

You write (emphasis is yours):

[Polyglot formats allow] processors to interpret these documents as either format A or format B.

Any JSON-LD processor needs to first parse the document as JSON, and operates on the result of that parsing. It is not a "JSON or JSON-LD" situation, but a "JSON then JSON-LD" situation.
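A minimal sketch of that two-stage pipeline with PyLD (the document is invented for illustration):

    import json
    from pyld import jsonld  # third-party: pip install PyLD

    raw = '{"@context": {"name": "https://schema.org/name"}, "name": "Alice"}'

    data = json.loads(raw)          # stage 1: parse as plain JSON
    expanded = jsonld.expand(data)  # stage 2: interpret that result as JSON-LD
    print(expanded)  # [{'https://schema.org/name': [{'@value': 'Alice'}]}]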

How does it differ from any other JSON format? All JSON formats aim to encode things beyond the objects, arrays, numbers and strings that result from JSON parsing. For example, GeoJSON is about points, lines and polygons, which are built from the result of pure-JSON parsing. And yes, I can use a generic JSON tool (such as jq), which knows nothing about points, lines, or polygons, to process my GeoJSON documents, and do some useful stuff with them. Does that make GeoJSON a polyglot?
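To make the GeoJSON comparison concrete (document invented; the jq equivalent would be jq '.features[].geometry.type'):

    import json

    geo = json.loads("""
    {"type": "FeatureCollection",
     "features": [
       {"type": "Feature",
        "geometry": {"type": "Point", "coordinates": [2.2945, 48.8584]},
        "properties": {"name": "Eiffel Tower"}}]}
    """)

    # Pure-JSON processing: no knowledge of points, lines, or polygons needed
    for feature in geo["features"]:
        print(feature["properties"]["name"], feature["geometry"]["type"])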

@dlongley and @iherman develop similar arguments here and here.

Finally, you conclude your post by

if you’re considering speccing a JSON/JSON-LD polyglot, instead define a JSON format and a mapping between it and the RDF data model.

I couldn't agree more with the last part of your sentence. And in fact, a JSON-LD context is exactly that, and nothing more: a mapping between some JSON format and an RDF data model.
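A sketch making that claim literal with PyLD (document invented; note PyLD's N-Quads output option):

    from pyld import jsonld  # third-party: pip install PyLD

    doc = {
        "@context": {"name": "https://schema.org/name"},
        "@id": "https://example.org/alice",
        "name": "Alice",
    }
    # Applying the context is exactly the mapping described above: the
    # JSON document lands in the RDF data model as a triple.
    print(jsonld.to_rdf(doc, {"format": "application/n-quads"}))
    # <https://example.org/alice> <https://schema.org/name> "Alice" .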

OR13 commented 11 months ago

I'd say the key to defining a polyglot media type is relying on multiple structured suffixes. Here is why:

type/subtype+suffix (single data type format)

type/subtype+suffix1+suffix2 (polyglot)

Why use multiple structured suffixes unless you want to signal multiple ways to process?

If there are multiple ways to process, how do you know that 2 parties using 2 different processing schemes come to the same conclusion...

Instead of solving this problem, maybe don't create it in the first place.

pchampin commented 11 months ago

@OR13, you write

type/subtype+suffix (single data type format)

Is it really a single data type? Reading Section 2 of RFC 6839, which defines structured syntax suffixes (and in particular the +json suffix):

knowing the semantics of the specific media type provides for more specific processing of the content than that afforded by generic processing of the underlying representation.

At the same time, using the suffix allows receivers of the media types to do generic processing of the underlying representation

The whole point of structured syntax suffixes, single or multiple, is to "signal multiple ways to process". Following your logic, we should abandon syntax suffixes altogether.

OR13 commented 11 months ago

I said something similar up this thread, a long time ago, and on the lists.

I do think polyglot data formats and multiple suffixes are 2 sides of the same coin... And just because a design is dangerous doesn't mean it's bad all the time.

I think there is a difference between claiming a structured suffix supports generic processing, and claiming it supports 3 or 5 or 26 equivalent representations of the same information.

Seeing +json doesn't signal anything other than JSON.

Seeing +ld+json signals RDF and JSON.

Why stop at 2, though? Is this design principle limited to only 2?

How much will it cost implementers to use the data format correctly as the number increases?

Design principles should apply generically (not just to specific technologies like XML), and they should address scale.

I'd like to see the design principle updated to cover multiple suffixes generally... JSON-LD won't be the last time this comes up.

csarven commented 11 months ago

Pardon my nitpick but re:

Seeing +ld+json signals RDF and JSON.

is one way of putting it. Others can correct me but I see it signalling JSON-LD (re "JSON then JSON-LD"), where JSON-LD is a concrete RDF syntax. It doesn't signal other concrete RDF syntaxes. RDF is a (constructed) language.

OR13 commented 11 months ago

JSON-LD is a concrete RDF syntax as described in [RDF11-CONCEPTS]. Hence, a JSON-LD document is both an RDF document and a JSON document and correspondingly represents an instance of an RDF data model.

https://github.com/w3c/json-ld-syntax/pull/415

martinthomson commented 7 months ago

Several of us on the TAG discussed this in the context of #453 and a few things popped out. Most pertinent to this was the idea that there can be confusion about which model a single format produces.

XML was defined in terms of the infoset but is often processed into a DOM. This dualism turns out to be dangerous, as it means that the same document can produce subtly different interpretations in different applications.

msporny commented 6 months ago

I suggest that the W3C TAG involve communities affected by this discussion as you deliberate, namely, any community that has specified a suffix for their media type (+jwt, +xml, +json, etc.), which, AFAICT, (arguably) defines a polyglot format.

This issue is now being misrepresented as a position of the TAG: https://github.com/ietf-wg-mediaman/suffixes/issues/23#issuecomment-2018316436

I realize that @hober published the "Polyglot formats" document as an individual, and not as a TAG member or Apple representative, but it's being represented as "a TAG thing" in mailing list discussions... and this is prompting individuals in WGs such as the VCWG, DIDWG, MEDIAMAN, and the JSON-LD WG to adopt mitigations against side-effects that the TAG would cause by moving forward with a recommendation to discourage polyglot formats.

These mitigations include abandoning suffixes, not because it's the right technical decision, but because of the uncertainty that this issue is creating across (at least) the WGs listed above.