json-schema-org / json-schema-spec

The JSON Schema specification
http://json-schema.org/
Other
3.82k stars 266 forks source link

Clarify the handling of "contentSchema" #1288

Closed handrews closed 1 year ago

handrews commented 2 years ago

contentSchema is an annotation. It is never applied as part of normal evaluation (although see #1287), and shouldn't be processed at schema load time like a subschema of an applicator or location ($defs) would be. In other words, you shouldn't be able to target it with a $ref even though it looks like a schema. It's just data.

  1. We should clarify that bit about it being data, not a schema you can $ref
  2. We should note that since the schema location of the keyword is the schema-as-data's retrieval URI, if you fail to set $id then your schema-from-data will have that retrieval URI as its base URI, which means after the fragment is chopped off it is likely to have URI collisions with the parent schema. If you then try to feed it right back into the implementation to use in the same session, this will probably blow up in your face. We probably don't need to forbid it, but we should be very clear about why it is a Bad Idea™.
  3. If you don't include $schema, you might not have any idea how to process it. In a large ecosystem, this might come from a different schema resource than you started from. In a sufficiently complex system, the code asking for the validation (possibly over a network API) may not have direct access to that schema resource in order to check its $schema. We could either warn about this or we could include the dialect URI in the annotation data. Note that we need the dialect and not the specific vocabulary of the contentSchema keyword, because even if it matches the version of the core vocabulary, that doesn't tell us what other vocabularies are assumed. Presumably if we added dialect info to the annotation info, we'd also need to add it to the output format somewhere (which probably deserves its own issue).
jdesrosiers commented 2 years ago

contentSchema is an annotation. It is never applied as part of normal evaluation (although see https://github.com/json-schema-org/json-schema-spec/issues/1287), and shouldn't be processed at schema load time like a subschema of an applicator or location ($defs) would be.

Why wouldn't you treat it like a normal schema? I don't see why being an annotation or not getting evaluated means it can't be treated as a schema. The problems you bring up only apply if you don't treat it as a schema. It seems much simpler to just treat it like a schema and everything just works like normal.

handrews commented 2 years ago

@jdesrosiers contentSchema is specifically not an applicator or schema location keyword (meaning a keyword like $defs that primarily serves as a reference target). Its value is returned as an annotation, at which point it has been removed from the context that allows "treat it like a normal schema" to work.

jdesrosiers commented 2 years ago

contentSchema is specifically not an applicator or schema location keyword

The spec only says that the keyword contains a schema. Without any other guidance than that, I'd expect it to be treated like any other schema. But, I hold the intention of the author very highly, so I accept your interpretation since you were the author.

However, we are the spec authors and we can change it if we want, so why not change it to treat it like a normal schema rather than document all the ways it's weird because it's not a normal schema?

handrews commented 2 years ago

What would it even mean to treat it like a normal schema? How would that solve the problem at hand, which is using it as an annotation? Can you explain to me what "treat it like a normal schema" would accomplish other than allowing it to be used as a $ref target? (Let's set aside for now the question of whether allowing it to be used as a $ref target is a good idea- if we want to debate that, it should get its own issue as it's orthogonal to annotation usage and I only mentioned it because I didn't think it was controversial).

handrews commented 2 years ago

I realize the above may be phrased in a more demanding way than I really intended. I am honestly curious as to whether you see something I am missing. When we added contentSchema, it simply didn't occur to me that it would have these problems, which is why the wording doesn't mention it at all. I only realized the issue when I was actually trying to use annotations.

jdesrosiers commented 2 years ago

What would it even mean to treat it like a normal schema? How would that solve the problem at hand, which is using it as an annotation?

In this case, the keyword value is not useful as the annotation value for this keyword. What's needed is the location of the keyword in the schema. The location references a schema that can be used like any other to validate the content.

Here's an example using an annotation utility library I've been working on but haven't published yet. This currently wouldn't work, but would if contentSchema was redefined so that the annotation it emits is the location of the keyword (a non-relative URI) rather than the value of the keyword (a schema).

Schema

{
  "$id": "https://example.com/schema",
  "$schema": "https://json-schema.org/draft/2020-12/schema",

  "type": "object",
  "properties": {
    "json": {
      "type": "string",
      "contentMediaType": "application/json",
      "contentSchema": {
        "type": "object",
        "properties": {
          "foo": { "type": "number" }
        },
        "required": "foo"
      }
    }
  }
}

Valid Instance

{ "json": "{ \"foo\": 42 }" }

Invalid Instance

{ "json": "{ \"foo\": true }" }

Implementation

const schema = await JsonSchema.get("https://example.com/schema");
const instance = await JsonSchema.annotate(schema, { "json": "{\"foo\": 42}" });

for (const contentInstance of AnnotatedInstance.annotatedWith(instance, "contentMediaType")) {
  const [mediaType] = AnnotatedInstance.annotation(contentInstance, "contentMediaType");
  if (mediaType === "application/json") {
    const value = JSON.parse(AnnotatedInstance.value(contentInstance));
    const [contentSchemaUri] = AnnotatedInstance.annotation(contentInstance, "contentSchema"); // https://example.com/schema#/properties/json/contentSchema
    if (contentSchemaUri) {
      const contentSchema = await JsonSchema.get(contentSchemaUri);
      const result = await JsonSchema.validate(contentSchema, value, JsonSchema.BASIC);
      if (!result.valid) {
        throw new ValidationError(result);
      }
    }
  }
}

Hopefully the API here is intuitive enough to understand without documentation and hopefully it's helps show how this could be implemented.

I realize the above may be phrased in a more demanding way than I really intended.

I wasn't offended, but I appreciate the thought. I didn't answer right away because I was working on something else I wasn't willing to tear my attention away from long enough for a sufficient response.

handrews commented 2 years ago

The problem is that this does not handle the case where the schema containing contentSchema is not directly accessible to the caller. For example, if the schema evaluation happens on a server as the result of an API call, which returns the annotations. While that may seem convoluted, in government or highly regulated industries, complex access control is common. Someone may have access to the annotations but not to the original schema.

Annotations in general need to be usable without access to the original schema. The annotation information is more than just the value, as mandated by §7.7.1 (quoted here with formatting tweaks for readability and [notes] for recent terminology changes):

A collected annotation MUST include the following information:

  • The name of the keyword that produces the annotation
  • The instance location to which it is attached, as a JSON Pointer
  • The schema location path [now "evaluation path], indicating how reference keywords such as "$ref" were followed to reach the absolute schema location.
  • The absolute schema location [now just "schema location"] of the attaching keyword, as a URI. This MAY be omitted if it is the same as the schema location path from above.
  • The attached value(s)

This is enough information to determine which annotations go where, and how they relate to each other (using the evaluation path). But it's not the same as having the original schema, and it's not intended to be.

The only piece truly missing here is the dialect information (if the contentSchema value omitted $schema). If you don't have access to the original schema, you can't read its $schema to solve this problem.

jdesrosiers commented 2 years ago

I see your point, but I can't imagine a case where someone would have access to the annotations but not the schema. Just because I can't image it doesn't mean it doesn't exist, but if there is some real(-ish) example of when this could possibility be the case, I'd feel a lot better about accepting your point. The example you gave about highly regulated industries is too abstract to be convincing.

Assuming you can't have access to the schema, it could also be solved by having the annotation value be a tuple of the location and the schema resource the location is part of. This would keep things self-contained and allow the schema to work as expected.

I'd just really like to avoid the weird special cases you identified in this issue if there's any reasonable way to avoid it. Although having an application-facing annotation that is not just the value of the keyword feels like a special case as well since there are no other application-facing that behave like that.

handrews commented 2 years ago

if there is some real(-ish) example of when this could possibility be the case

There's one buried somewhere in the unevaluatedProperties but I'm not slogging through that mess to find it. I can't remember if it exactly fit, but really the point is that this is data meant to be extracted and sent somewhere. Even if the immediate caller has access to the schema, they may split off that bit of data and send it somewhere.

Assuming you can't have access to the schema, it could also be solved by having the annotation value be a tuple of the location and the schema resource the location is part of. This would keep things self-contained and allow the schema to work as expected.

So you want to return the entire containing schema resource? Why is this simpler than just adding the dialect URI to the list in §7.7.1? That's all we need to do. I don't understand the objection.

If the contentSchema lacks $id and makes JSON Pointer references to the containing resource, then the caller needs to have the containing resource. But I don't think that warrants making an entire copy of the containing resource. Either they have access to it already or it shouldn't work.

All we need to solve the problem is the dialect URI. Why make it more complex? Particularly when needing the dialect URI for errors has come up before. It's not a solution that is specific to contentSchema.

jdesrosiers commented 2 years ago

Consider this example where a value can accept an object with a given schema or a JSON string with the same schema. As horrible as this is, it's a real-life thing I've seen before.

{
  "type": "object",
  "properties": {
    "value": {
      "anyOf": [
        { "$ref": "#/defs/value" },
        {
          "type": "string",
          "contentMediaType": "application/json",
          "contentSchema": { "$ref": "#/$defs/value" }
        }
      ]
    }
  },

  "$defs": {
    "value": {
      "type": "object",
      "properties": {...}
    }
  }
}

If contentSchema was treated as just data, this wouldn't work and the schema author would have to duplicate the schema. Additionally, I think it would be surprising for the schema author that this doesn't work.

So you want to return the entire containing schema resource?

I'm still not convinced that anything other than a schema identifier is need, but if you need a schema, then yes I would expect it to include the entire containing schema resource. In most cases, I would expect that the schema should have an $id and the "entire schema resource" would identical to the value of contentSchema.

Why is this simpler than just adding the dialect URI to the list in §7.7.1?

That (or something equivalent) definitely needs to be done for other reasons, but that doesn't effect the simplicity/complexity of the trade-offs in my mind. The issue is treating $contentSchema as data vs treating it as a schema. As a user, I think my example illustrates that it's confusing if $contentSchema is just data. As an implemeter, I would find it much more challenging to explicitly exclude treating something as a schema that is in all other respects (including the meta-schema) a schema.

handrews commented 2 years ago

@jdesrosiers I definitely see the points you're making, and I agree that having the complete containing schema resource (and any information needed to fully resolve its base URI if it does not have an absolute-URI for $id) would be more useful, particularly in the case where the caller does not have access to the containing resource.

If I understand you correctly, you have offered two solutions

  1. Treating it "like a schema"
  2. Returning the whole schema resource as the annotation value

The first one, if you mean using it in-place as a schema, is not an option. The second one involves significant changes to how annotations work that I do not think are worth the cost.

Treating it "like a schema"

If contentSchema was treated as just data, this wouldn't work and the schema author would have to duplicate the schema. Additionally, I think it would be surprising for the schema author that this doesn't work.

I'm not sure what you mean here. The content* keywords MUST NOT be automatically applied. We have tests for this, including for contentSchema having a true assertion result on a string-encoded JSON Schema that is not valid against the contentSchema schema.

So if it's surprising to them, it means that they didn't read the spec or the test suite.

Perhaps I'm reading too much into grammar here, but I want to make sure we're clear on something here so we're debating the right thing: contentSchema, like the other content* keywords, is data. There are very good reasons for that, most notably that automatically parsing arbitrary media types is a gaping security hole. So that's not going to change.

That's our baseline for this discussion. We're just talking about how to make it usable.

Returning the whole schema resource as the annotation value

This would be a backwards-incompatible change that makes contentSchema a bit more complex to implement:

That wouldn't be too hard to do , but more importantly it would violate the architectural principle that annotation values come from the location described by the annotation data (schemaLocation, instanceLocation, evaluationPath). They are either derived from the keyword value at that location, or from the behavior of the keyword at that location. Producing a value that is sometimes from somewhere else entirely makes annotations more confusing to use, so I'm pretty strongly against that.

On the one hand, I do want contentSchema to be usable by callers that do not have access to the original schema resources/documents. On the other hand, I only need for it to be clear how to make contentSchema usable in such a case. It is fine with me if some possible contentSchema values are not usable in that way as long as this is communicated clearly in the spec, so that a caller that gets such a value back can go to the person who owns the schemas and point to the spec and say "hey, you need to include an $id or else I can't use this."

Looking at the original proposal again

In most cases, I would expect that the schema should have an $id and the "entire schema resource" would identical to the value of contentSchema.

Right, which is why I proposed:

  1. We should note that since the schema location of the keyword is the schema-as-data's retrieval URI, if you fail to set $id then your schema-from-data will have that retrieval URI as its base URI, which means after the fragment is chopped off it is likely to have URI collisions with the parent schema. If you then try to feed it right back into the implementation to use in the same session, this will probably blow up in your face. We probably don't need to forbid it, but we should be very clear about why it is a Bad Idea™.

Since the annotation location information contains the schema location, which identifies the containing schema resource, a caller that does have access to the containing schema can trivially get that resource anyway, and do whatever they need. And as noted above, a caller that does not have access could point to the spec and explain why the schema owner needs to add a $id.

[the dialect URI] (or something equivalent) definitely needs to be done for other reasons

Which covers what I asked for in my original point 3.

So, I would still like to go with my original proposal. What you are suggesting adds complexity to annotation usage by setting a new precedent for what kind of data can be returned. We would then have to expand annotation data to include location information for both the keyword (as we do now) and the source of the data (since it might not be the same, and if others follow this precedent their data may not be a schema resource and therefore wouldn't have an $id to provide that location information).

I definitely prefer documenting proper usage over a significant architectural change to annotations.

jdesrosiers commented 2 years ago

I'm not going to respond to most of that, because I think what I'll say next makes most of it irrelevant. Briefly, no, you haven't correctly understood the solution I was suggesting and, yes, I know the content* keywords are not to be applied during validation and I wasn't suggesting otherwise. I was pointing out that { "$ref": "#/$defs/value" } is not a valid schema outside of the context of it's containing schema resource.

This would be a backwards-incompatible change

I'm not concerned about backward incompatible changes in this case. It wouldn't surprise me if you were the one person who has has ever tried to do something with this feature. I'm much more concerned about getting it right than maintaining backward compatibility. That's not to say it necessarily needs to change to be "right". I'm just saying we shouldn't limit our options as I'd consider this feature still in it's early stages where it's still reasonable to make changes.

On the one hand, I do want contentSchema to be usable by callers that do not have access to the original schema resources/documents. On the other hand, I only need for it to be clear how to make contentSchema usable in such a case.

If I understand you correctly, I think this is a reasonable compromise. The schema would be treated as a normal schema within the schema resource, but the annotation would just be the value of the keyword. In cases where you have access to the original schema, contentSchema is always usable. If you don't have access to the original schema, you can use the annotation result, but that value isn't going to be usable unless you include an $id. You should have a $schema as well, but that's best practice, not required.

We should note that since the schema location of the keyword is the schema-as-data's retrieval URI, if you fail to set $id then your schema-from-data will have that retrieval URI as its base URI, which means after the fragment is chopped off it is likely to have URI collisions with the parent schema.

I agree that not setting an $id is a problem when the original schema is not in scope, but I don't think this is what should be expected if you don't have one. I don't think it's correct to consider the keyword's location as the retrieval URI and the weird behavior you described illustrates why. In this case, I think there isn't a retrieval URI. That can be fine if there are no relative references, but if there are an $id is necessary. Technically you should be able to have internal references without an $id, but in this case you couldn't because those references couldn't make sense in the original schema as well as the truncated schema. Considering all of these limitations, I'm starting to think the best thing is to require that contentSchema always have an $id.

So, I would still like to go with my original proposal.

If you drop part 1 (about it not being a schema) and change the way part 2 (should have $id) is explained. I'd be ok with the proposal. However, I think it would be better to just require $id, strongly encourage $schema, and discourage external references. That makes the "it's not a schema" weirdness go away, the annotation can be simple, and the schema can always be processed without the original schema in scope.

handrews commented 2 years ago

@jdesrosiers

It wouldn't surprise me if you were the one person who has has ever tried to do something with this feature.

I have never added a keyword purely for my own use case. Please do not use such an assumption to dismiss anything about a keyword or about my position regarding it. contentSchema was added in response to use cases seen in the wild, including but not limited to APIs sending JSON documents embedded in JSON strings. Which I would definitely not do, myself. There were other use cases involving non-JSON resources in an OpenAPI context.

I'm much more concerned about getting it right than maintaining backward compatibility.

I realize you are not necessarily saying that the change is necessary to "get it right", but please try to avoid implying that I am prioritizing other concerns over "getting it right." I mentioned backwards incompatibility because it was not clear to me that a.) you were actually proposing that, or b.) if you were, that you realized it was a change (because it seemed to me that you had a different impression of how the keyword works).

I think there isn't a retrieval URI.

If you can de-reference a URI to get a thing, by definition that URI is the retrieval URI. That's just what the term "retrieval URI" means, and you can't re-define it away. So any URI with a JSON Pointer fragment is the retrieval URI for that location in the JSON document, again by definition.

I don't think it's correct to consider the keyword's location as the retrieval URI and the weird behavior you described illustrates why.

The only reason there would be "weird behavior" is if someone fed the non-$id schema from contentSchema into a validator that didn't have access to the original resource and provided that as the retrieval URI without understanding what was going on with the fragments.

On the other hand, if you fed it right back into an implementation that had the original resource, and recognized that that retrieval URI was for a location within that resource, it would work perfectly well. It would recognize that any fragments are relative to that original resource, not to the decontextualized schema you sent in.

However, all of that requires a lot more understanding of URIs than most folks have, despite none of it being JSON Schema-specific aside from $id itself.

If you drop part 1 (about it not being a schema)

As I suggested doing earlier, I have filed a separate issue (#1307) to discuss that separately.

and change the way part 2 (should have $id) is explained

That wasn't proposed wording for the spec, so this is a no-op. We can debate the wording when there's a PR, but I have no intention of diving into URI minutia to get the point across.

However, I think it would be better to just require $id

That would create a new case for contentSchema as requiring special treatment to catch such an error. I'm against that. It's better to just explain that if you take it out of context (just like taking any other random schema object out of context), things that depend on the context won't work.

So we're back to where we started with point 2. I've filed #1307 for point 1. I can't tell if #1065 takes care of point 3 or if I need to file a new issue since #1065 seems focused on keyword-level identification and we only need dialect-level for this — I don't want this to get hung up on associating keywords and dialects which is a larger topic.

jdesrosiers commented 2 years ago

I have never added a keyword purely for my own use case. Please do not use such an assumption to dismiss anything about a keyword or about my position regarding it.

please try to avoid implying that I am prioritizing other concerns over "getting it right."

I find it very frustrating that this is how you interpreted my words. Clearly passions are running too high right now for this to be a productive conversation, so I'm going to leave this alone for a week or two to let us both cool off before we try again.

notEthan commented 2 years ago

contentSchema ... shouldn't be processed at schema load time like a subschema of an applicator or location ($defs) would be. In other words, you shouldn't be able to target it with a $ref even though it looks like a schema. It's just data. (@handrews)

I don't think this is the case in the current published spec - this issue is framed as a clarification, but it seems to me like a major change in how contentSchema is specified. Right now the spec says it's a schema, the metaschema says it's a schema, nothing indicates treatment any different than any other subschema like $defs values. (I'm sure there's volumes of issue comments that I could read that would lead me to understand your baseline assertion that it already is just data and not a schema, but that is sure not what's in the spec.)

I think it is a good change to make. Treating contentSchema as a subschema, it doesn't do much - no validation, no inplace application, it doesn't describe its instance (until that instance is parsed to something else). The only thing it does is what @handrews wants to prevent, $refing into it by anchor or by pointer, if it is a subschema without an $id.

I agree it should just be schema-shaped data, not a subschema. This does have other implications, though.

The metaschema currently describes contentSchema as a schema. I rely on the metaschema describing its subschemas for nearly everything that needs to know what is a schema and what isn't. If contentSchema isn't a subschema, the metaschema should not say it is. This, in my implementation, directly means it won't be recognized as a schema for anchor collection and $refing in will not find something that is a schema.

That affects validation: if the metaschema doesn't describe it as a schema, its content won't be validated as a schema if you validate against the metaschema. I think that is fine. It can be validated as a schema if/when its data are interpreted as a schema.

As just data, it should become a schema only considered as its own detached document, with the rules that apply to root schemas. This is a bit different than resource subschemas (subschemas with an $id).

handrews commented 2 years ago

@notEthan could you move the schema vs data stuff to #1307, please.

notEthan commented 2 years ago

yeah, I had started commenting here and then saw the new one after.

handrews commented 1 year ago

As documented in #1307, I have come to agree that the "contentSchema" schema should be treated as a schema.

As for other ideas in this issue, I would like to make the minimal clarification at this time, which would be to simply note that the "contentSchema" schema is interpreted normally within the larger schema resource, which might not be possible if one only examines the extracted annotation. (I'll sort out the wording in the PR).

Others are free to take any idea from this issue, such as adding SHOULD or MUST requirements around "$id", etc., or any changes in what information is included as part of the annotation result, and file it as a new issue for future work.