Disallow even optional "content*" processing

handrews commented 2 years ago

This feature added substantial complexity and security exposure for essentially no benefit. The functionality could be trivially implemented as a library on top of a JSON Schema implementation that supports annotation collection.

Fixes #1287. Intentionally avoids topics still under discussion in #1288.

karenetheridge commented 2 years ago

I'm against this change as it is written; I've found it useful in OpenAPI validation. I understand that in some situations it might be a security risk, so it's reasonable to have its support be optional (as it already is), and to default to NOT performing validation. I've been thinking of splitting it into two vocabularies: Annotation and Validation. Like with format, it can be enabled when desired, and left disabled most of the time.

To clarify: I'm fully onboard with changing the spec to clarify that an author of a schema should never be unsure whether content* will be validated or not -- implementations should NOT validate by default, so that leaves either something in $vocabulary to indicate that validation is desired, or an out-of-band configuration variable that defaults to false (which is what I've done in my implementation, so it can be used in certain situations where appropriate decoding algorithms have been provided to the implementation).

handrews commented 2 years ago

@karenetheridge this is not about content* having validation behavior (which it does not). This is about the complexity of shoving an entire second (but totally independent) validation process into an interface that doesn't allow it.

You could still create a validating content vocabulary after this change.

handrews commented 2 years ago

@karenetheridge also, the "automatic feed-back-to-validate" process can trivially be implemented as a library on top of any JSON Schema implementation that supports annotation collection and any standard output beyond "flag". This PR is about taking something that is better handled as a separate library out of the spec. It just doesn't need to be here, and there is no other reason to add this much complexity to the basic interface of JSON Schema.

handrews commented 2 years ago

@karenetheridge since the content* keywords in the https://json-schema.org/draft/next/content vocabulary are strictly annotations, there are two ways to validate string contents while remaining in compliance with the spec:

take the output of the completed initial evaluation, and for each instance location to which content* annotations are attached
- decode the string according to contentEncoding (the spec gives no guidance on handling any errors)
- parse the string according to contentType (the spec gives no guidance on handling any errors)
- evaluate the parsed string against contentSchema as a new evaluation process
- recurse if any content* annotations are present in this new output
- store the output to return alongside but separately from the initial evaluation and any other secondary evaluations
create a content-assertion extension vocabulary analogous the the format-assertion vocabulary, which would perform the encoding/parsing/additional validation as part of the regular (single) evaluation process

Option 1 can be done in two ways: Shoved into the main JSON Schema implementation, creating a completely different set of output requirements from everything else (as currently suggested by the spec). Or it can be done as a separate library, without burdening the JSON Schema implementation's output interface that works for literally everything else in JSON Schema.

There is no functional difference whatsoever between doing it within the implementation or as a separate library. Please let me know if you do see a difference, because I would like to understand that. I am proposing this change because fundamentally, it's not a change. It just moves non-essential functionality to a library, which could even have its own little specification to ensure interoperability of this area. That would be an improvement over the current situation.

Option 2 is not impacted by this change at all. Supporting a contentSchema assertion-applicator is tricky as it is not clear how to represent instance locations within a string, so for now any such feature would not be interoperable, but we could figure that out if we wanted to make it interoperable.

karenetheridge commented 2 years ago

I don't understand your references to output requirements in Option 1. What does output have to do with anything here?

The issue I have with moving this functionality out of the spec is, if we only talk about annotations in the spec, we lose the information needed to implement these keywords as validations consistently.

But, more importantly, if we "..take the output of the completed initial evaluation..", then we can't use the results of evaluating the content* keyword(s) and use them as inputs to applicator keywords one level up in the document. So I see Option 2 as much more preferrable.

it is not clear how to represent instance locations within a string.. but we could figure that out if we wanted to make it interoperable

The string is decoded (using the specified media type) into something that matches the JSON document model, so I presumed (when implementing this) that the instance location just carries on using json pointers inside the decoded data structure. That's under-specified though, so it would be good to clarify that (assuming the use of validation stays in the spec at all).

handrews commented 2 years ago

But, more importantly, if we "..take the output of the completed initial evaluation..", then we can't use the results of evaluating the content* keyword(s) and use them as inputs to applicator keywords one level up in the document

The spec has never allowed this since contentSchema was introduced in 2019-09. It's expressly forbidden:

Implementations MAY offer the ability to decode, parse, and/or validate the string contents automatically. However, it MUST NOT perform these operations by default, and MUST provide the validation result of each string-encoded document separately from the enclosing document.

You're currently not allowed to use that result within the validation of the enclosing document. It MUST be produced as separate output results, which puts content* assertion behavior into a totally unique processed divorced from normal implementation.

This PR removes that uniqueness. If anything, it makes what you want to do easier by removing this constraint.

So I see Option 2 as much more preferrable.

That's great! This PR actually makes that easier by removing the part that more-or-less forbids that.

I don't hear any actual objection from you regarding what this PR does. If anything, this sounds like support. You're saying "no, don't merge this", but also "my preferred solution is the thing that is easier if we merge this."

so I presumed (when implementing this) that the instance location just carries on using json pointers inside the decoded data structure. That's under-specified though, so it would be good to clarify that (assuming the use of validation stays in the spec at all).

You should file an issue for that, as it is not relevant to this PR. It was underspecified before, and it will be the same level of under-specified afterwards.

karenetheridge commented 2 years ago

It was underspecified before, and it will be the same level of under-specified afterwards.

Well no, not really - if you take out the descriptions of how the content* keywords would validate, then that leaves the reader to make up their own (inconsistent) rules. ..But on looking again, I see you're not removing those paragraphs.

I don't hear any actual objection from you regarding what this PR does. If anything, this sounds like support. You're saying "no, don't merge this", but also "my preferred solution is the thing that is easier if we merge this."

I guess? I don't see any path forward to allowing in-place evaluation as assertions, other than creating a new vocabulary (akin to the format-annotation and format-validation vocabulary split in 2020-12), but I'd been pondering that myself and it seems reasonable.

So I guess my vote is "not-no" - I'm not terribly happy with the spec staying in the state after merging this, but the parts you are removing are parts that I definitely disagreed with. If that makes sense :)

handrews commented 2 years ago

@karenetheridge

the parts you are removing are parts that I definitely disagreed with. If that makes sense :)

Yes, exactly! This PR is about removing a bad solution to the problem you want to solve. It takes no position on other possible solutions, and you are quite welcome to file proposals for other solutions. I would actually suggest a slightly different approach to validation, using more specific keywords. A major problem with these a validation keywords is that's they're so open-ended. But if you specifically want to validate JSON-in-JSON, or or parse HTML-in JSON, etc., you can write a more clear spec for that behavior.

But that's a discussion for another issue (or discussion).

I'll go ahead and merge this, then. Two "yes"es and a "not-no" seem sufficient :-)

json-schema-org / json-schema-spec

Disallow even optional "content*" processing #1296