Update guidance on use with non-JSON formats

awwright commented 1 year ago

The extent to which JSON Schema can be used to validate data structured as a non-JSON input isn't defined well enough. The spec currently says:

However, any document or memory structure that can be parsed into or processed according to the JSON Schema data model can be interpreted against a JSON Schema, including data formats like CBOR

In my personal opinion, this is an interesting fact to point out. However, this isn't enough guidance to ensure that different implementations are compatible. Additionally, it is somewhat outside the scope of JSON Schema, and so should be removed.

If this should be written into the standard, it should go into more detail about how this works technically. For any JSON-compatible format, there should be an isomorphism to JSON, or there should be guidance on how to handle the larger value space (for example, CBOR provides data tags, which applications might like to distinguish).

But I think the best option is to remove this for now, and publish guidance on handling non-JSON inputs separately.

Closes #1274

gregsdennis commented 1 year ago

Would you include YAML as a non-JSON input?

awwright commented 1 year ago

Yes, YAML also has a larger value space than JSON. For example, it supports circular references (through anchors and aliases), and there's no way to encode these to JSON and then back to YAML. However, for a certain subset of YAML, you can just convert it to JSON, then validate that. (Or do some equivalent calculation, if you want to optimize away the "conversion to JSON" step.)

Maybe this can be addressed: "Non-JSON formats may be validated if there is a single correct representation as JSON. Values without a JSON representation will either be indistinguishable, or cause an error." Maybe that's enough guidance?
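To make the proposed rule concrete, here is a minimal Python sketch of mapping an in-memory structure onto the JSON data model. The names `to_json_model` and `NotRepresentable` are hypothetical and not taken from the spec or any validator. It assumes tuples collapse into arrays (so they become indistinguishable from lists), while cycles and values with no JSON counterpart raise an error.

```python
import math


class NotRepresentable(ValueError):
    """Hypothetical error: the value has no single JSON representation."""


def to_json_model(value, _path=None):
    """Map a Python value onto the JSON data model, or raise NotRepresentable."""
    # _path holds the ids of the containers currently being visited, so only
    # true cycles are rejected; shared (alias-like) references are expanded.
    path = _path if _path is not None else set()
    if value is None or isinstance(value, (bool, int, str)):
        return value
    if isinstance(value, float):
        if not math.isfinite(value):
            raise NotRepresentable("NaN and Infinity have no JSON representation")
        return value
    if isinstance(value, (list, tuple, dict)):
        if id(value) in path:
            raise NotRepresentable("circular reference")
        path.add(id(value))
        try:
            if isinstance(value, dict):
                if not all(isinstance(key, str) for key in value):
                    raise NotRepresentable("object keys must be strings")
                return {key: to_json_model(item, path) for key, item in value.items()}
            # Lists and tuples both become JSON arrays.
            return [to_json_model(item, path) for item in value]
        finally:
            path.discard(id(value))
    raise NotRepresentable(f"no JSON representation for {type(value).__name__}")
```

With this sketch, `to_json_model((1, 2))` and `to_json_model([1, 2])` produce the same array, while a list that contains itself, or a `bytes` value, raises `NotRepresentable`. A validator could then run against the converted value.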

gregsdennis commented 1 year ago

I think that's why we state that we operate on the JSON data model. I believe there's already text that says JSON Schema can operate in any format that maps into that data model.

awwright commented 1 year ago

Well, that's the paragraph I'm proposing to remove, at least from core. (Again, that wouldn't suggest you can't pass alternate serializations to a validator, just that it's out of scope to describe in core.)

Related to this, I was thinking that "data model" could be simplified too. The data model is something I introduced to address the fact that the same value in JSON can be represented in multiple different ways. But the section is largely a paraphrase of the instance equality section, so it may be easier just to say "the data model distinguishes JSON documents by those that are not instance equal."

And then after this, we can re-examine how non-JSON formats fit into this, maybe by specifying how a non-JSON document can be compared for instance equality to a JSON document.
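As a rough illustration of what "instance equal" means in practice, here is a Python sketch operating on values produced by `json.loads`. The function name `instance_equal` is hypothetical, and the rules are a paraphrase of the instance equality section as I read it: numbers compare by mathematical value, arrays item by item, and objects by key set regardless of order.

```python
import json


def instance_equal(a, b):
    """Compare two parsed JSON values under the instance equality rules."""
    # Python treats True == 1, so booleans are handled before numbers.
    if isinstance(a, bool) or isinstance(b, bool):
        return isinstance(a, bool) and isinstance(b, bool) and a == b
    if a is None or b is None:
        return a is None and b is None
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return a == b  # 1 and 1.0 have the same mathematical value
    if isinstance(a, str) and isinstance(b, str):
        return a == b
    if isinstance(a, list) and isinstance(b, list):
        return len(a) == len(b) and all(
            instance_equal(x, y) for x, y in zip(a, b)
        )
    if isinstance(a, dict) and isinstance(b, dict):
        # Property order is not significant; only keys and values matter.
        return a.keys() == b.keys() and all(
            instance_equal(a[key], b[key]) for key in a
        )
    return False


left = json.loads('{"a": 1, "b": [1.0, "x"]}')
right = json.loads('{"b": [1, "x"], "a": 1.0}')
same = instance_equal(left, right)
```

Here `same` is `True` even though the two documents differ textually, which is the distinction the data model is meant to capture. A non-JSON input could be compared the same way once it has been mapped to these value types.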

awwright commented 1 year ago

Like I mentioned above, this issue may be a good place to consolidate "Instance Data Model" and "Instance Equality" into a single section. Each section is describing essentially the same concept just in different terms.

jdesrosiers commented 1 year ago

"this is an interesting fact to point out. However, this [...] is somewhat outside the scope of JSON Schema, and so should be removed."

I agree. In fact, I think this kind of thing happens a lot in the spec and it would be nice to clean some of these things up.

gregsdennis commented 1 year ago

@awwright What, precisely, are you proposing be removed: that whole statement, or just the bit at the end about CBOR?

gregsdennis commented 5 months ago

Action is to remove the phrase highlighting CBOR. I think the rest is pertinent.

json-schema-org / json-schema-spec

Update guidance on use with non-JSON formats #1390