Should "contentSchema" have schema location behavior?

For the purpose of this issue (and consistent with #1306), "schema location behavior" means that a keyword indicates that some part(s) of its value are schemas and MUST be recognized as such by an implementation. Being "recognized" means that an implementation knows to scan it for $id, $anchor, etc. and associates the IRIs they create with the schema, along with the JSON Pointer fragment IRI (it's irrelevant whether any of this is done on load or at runtime).

- $defs has only schema location behavior
- All inline (as opposed to by-reference) applicators inherently have schema location behavior
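To make "recognized" concrete, here is a minimal Python sketch of the scanning step, under a deliberately simplified model. The function name `collect_locations` and the keyword sets are illustrative assumptions, not taken from any implementation; the sets are incomplete, and the sketch ignores $dynamicAnchor and resets JSON Pointers at every $id.

```python
from urllib.parse import urljoin

# Illustrative, incomplete keyword sets (assumption): a real implementation
# would derive these from the vocabularies in use.
SCHEMA_VALUED = {"not", "if", "then", "else", "items", "contains",
                 "additionalProperties", "propertyNames"}
SCHEMA_MAP_VALUED = {"$defs", "properties", "patternProperties", "dependentSchemas"}
SCHEMA_LIST_VALUED = {"allOf", "anyOf", "oneOf", "prefixItems"}


def escape(token):
    """Escape one JSON Pointer reference token (RFC 6901)."""
    return token.replace("~", "~0").replace("/", "~1")


def collect_locations(schema, base, pointer="", extra_schema_keywords=(), found=None):
    """Map each IRI that identifies a recognized (sub)schema to that schema."""
    found = {} if found is None else found
    if isinstance(schema, bool):
        found[f"{base}#{pointer}"] = schema
        return found
    if not isinstance(schema, dict):
        return found
    if "$id" in schema:
        # Simplification: a new resource starts here, so pointers restart.
        base = urljoin(base, schema["$id"])
        pointer = ""
    found[f"{base}#{pointer}"] = schema
    if "$anchor" in schema:
        found[f"{base}#{schema['$anchor']}"] = schema
    single = SCHEMA_VALUED | set(extra_schema_keywords)
    for key, value in schema.items():
        here = f"{pointer}/{escape(key)}"
        if key in single:
            collect_locations(value, base, here, extra_schema_keywords, found)
        elif key in SCHEMA_MAP_VALUED and isinstance(value, dict):
            for name, sub in value.items():
                collect_locations(sub, base, f"{here}/{escape(name)}",
                                  extra_schema_keywords, found)
        elif key in SCHEMA_LIST_VALUED and isinstance(value, list):
            for index, sub in enumerate(value):
                collect_locations(sub, base, f"{here}/{index}",
                                  extra_schema_keywords, found)
    return found


schema = {
    "$id": "https://example.com/schema",
    "$defs": {"name": {"$anchor": "name", "type": "string"}},
    "properties": {
        "payload": {
            "type": "string",
            "contentMediaType": "application/json",
            "contentSchema": {"$anchor": "payload", "type": "object"},
        }
    },
}

# contentSchema treated as plain annotation data (no location behavior).
without = collect_locations(schema, "https://example.com/schema")
# contentSchema treated as having schema location behavior.
with_cs = collect_locations(schema, "https://example.com/schema",
                            extra_schema_keywords={"contentSchema"})
```

In this model the whole question reduces to whether "contentSchema" is in the set of keywords the walker descends into: if it is, the "payload" anchor and the pointer into contentSchema become valid $ref targets; if not, they are never registered.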

2020-12 classifies $defs as a location keyword, but the concept of "location keyword" is somewhat muddled. Schemas located through this behavior can be targeted by $ref (or anything else that might reference schemas with an IRI). I think we generally agree that $ref-ing into applicators is a bad practice, but we don't (currently) forbid it. TBH, I wouldn't mind forbidding it, but I suspect I'm in the minority.

contentSchema is defined as an annotation, and was not intended to have schema location behavior. $ref-ing into contentSchema is definitely at least as bad a practice as $ref-ing into an inline applicator.

My preference would be to forbid it by saying that contentSchema lacks this behavior. Framing it in terms of schema location behavior would make this part of the JSON Schema system rather than a weird exception.

As noted in #1288, @jdesrosiers would prefer that contentSchema have schema location behavior.

I'd like to get more opinions on this point, which doesn't change the outcome of #1288, so it's not necessary to read through all of that issue.

An alternative would be to change contentSchema from taking a schema to taking a string containing a schema. Which I almost did when I added it to ensure it was treated as data. If anyone likes this idea please speak up, but I am not expecting it to be popular.

Note that while I definitely have an opinion on this, if there's a clear majority in favor of giving contentSchema location behavior then I'll go with that. I filed this to have a more focused discussion, not to fight to the death on it.

I'm not too sure of the distinction between this and #1288; they both seem primarily concerned with treating contentSchema's value as data vs. subschema. I commented there, and what I said seems relevant both there and here. I'll reiterate the bits that are particularly about contentSchema as a subschema that other schemas can $ref into, or as its own entity that may separately become a schema.

I think it is good to treat contentSchema purely as schema-shaped data, not at all a subschema, in the context of the schema or resource containing it. Treated as a subschema, it doesn't do much: no validation, no in-place application, and it doesn't describe its instance (until that instance is parsed into something else). The only thing it does is what @handrews wants to prevent: $ref-ing into it by anchor or by pointer, if it is a subschema without an $id.

As just data, it should become a schema only when considered as its own detached document, with the rules that apply to root schemas. This is a bit different from resource subschemas (subschemas with an $id).

- It is a resource schema, even without an $id.
  - However, lacking an $id has the problem @handrews noted that its retrieval URI is the same as its parent's, if present.
  - This implies its metaschema may be specified by $schema (regardless of $id).
- Pointing a $ref by fragment outside of the contentSchema value to the schema or resource containing the contentSchema will not work.

Some further thoughts on describing this in the metaschema if it is not a thing $ref/$anchor interact with: I mentioned on the other issue that contentSchema's value being described as a schema by the metaschema would be a problem, at least for my implementation, as I use that to determine what is a schema to collect $anchor from and what to consider valid to $ref into (though, if "location keyword" becomes a concept exclusive to $defs, that might be different).

The metaschema could almost use the content* mechanism to describe contentSchema, if we disregard for a moment that content* only apply to strings. The metaphor is almost the same: contentSchema describes instance data, but is not to be applied to the data in situ, only once some processing has been done. This also describes contentSchema's own data - the schema describing it is the metaschema, but is not to be applied in situ, only once it has been detached, given its retrieval URI, treated as a root schema.

 {
   "$schema": "https://json-schema.org/draft/2020-12/schema",
   "$id": "https://json-schema.org/draft/2020-12/meta/content",
   "$vocabulary": {"https://json-schema.org/draft/2020-12/vocab/content": true},
   "$dynamicAnchor": "meta",
   "title": "Content vocabulary meta-schema",
   "type": ["object", "boolean"],
   "properties": {
     "contentEncoding": { "type": "string" },
     "contentMediaType": { "type": "string" },
-    "contentSchema": { "$dynamicRef": "#meta" }
+    "contentSchema": {
+      "contentMediaType": "application/schema+json",
+
+      // encoding is the wrong description but I think the right metaphor; this is how
+      // the content (schema-shaped json data) is within the instance (a schema).
+      "contentEncoding": "json",
+
+      "contentSchema": {
+        "$id": "content/contentSchema",
+        // ref is not dynamic - contentSchema is independent and has new, empty dynamic scope
+        "$ref": "/draft/2020-12/schema"
+      }
+    }
   }
 }

Initially this seemed to me an interesting diversion to think about but one without practical use: a slight stretching of the metaphor, unusable because the content* keywords apply only to strings, and I am not advocating a change to that. But then @handrews said:

> An alternative would be to change contentSchema from taking a schema to taking a string containing a schema. Which I almost did when I added it to ensure it was treated as data. If anyone likes this idea please speak up, but I am not expecting it to be popular.

If contentSchema were a string, the above would basically be working. It lets the metaschema describe the instance without indicating that it is a subschema. It makes it clear to readers and authors that the value is not like a subschema. I have some negative feeling toward putting json data as a string inside other json data, but I think it does really fit better as a string. And the $ref to the metaschema is not dynamic, though I'm not sure that is more of a problem than any other way to describe contentSchema in the metaschema would be.
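As a rough illustration of the string-valued alternative, the hand-off to an application might look like the Python sketch below. Everything here is hypothetical: contentSchema does not take a string today, `detach_content_schema` is a made-up helper, and the retrieval URI and default dialect are simply whatever the caller decides to supply.

```python
import json
from urllib.parse import urljoin


def detach_content_schema(text, retrieval_uri, default_dialect):
    """Parse a string-valued contentSchema and treat it as a root schema.

    Returns (schema, base_uri, dialect). The containing schema is never
    consulted, so fragment references into it cannot be resolved from here.
    """
    schema = json.loads(text)
    if isinstance(schema, bool):
        return schema, retrieval_uri, default_dialect
    if not isinstance(schema, dict):
        raise ValueError("a schema must be an object or a boolean")
    # As a root schema, its own $id and $schema apply, if present.
    base_uri = urljoin(retrieval_uri, schema.get("$id", ""))
    dialect = schema.get("$schema", default_dialect)
    return schema, base_uri, dialect


parent = {
    "type": "string",
    "contentMediaType": "application/json",
    # Hypothetical: the schema is carried as a JSON string, i.e. as data.
    "contentSchema": json.dumps({"$id": "payload-schema", "type": "object"}),
}

schema, base_uri, dialect = detach_content_schema(
    parent["contentSchema"],
    "https://example.com/schema",  # assumed retrieval URI
    "https://json-schema.org/draft/2020-12/schema",  # assumed default dialect
)
```

The point of the sketch is only that a string value is opaque to anything scanning the parent for $id or $anchor, so it can only become a schema after being parsed on its own.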

@notEthan thanks for replying and copying over that text.

> However, lacking an $id has the problem @handrews noted that its retrieval URI is the same as its parent's, if present.

That's not quite correct: It has a different retrieval URI (differing by the fragment), but since using a URI as a base URI disregards the fragment, that means that they end up with the same base URI. Proper resolution relative to that base URI would take the context schema into account, but if the context schema is not accessible (e.g. an API request evaluated an instance, got the annotations, and sent them back without providing access to the context schema), the application would not be able to resolve relative references into the context schema.
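The fragment-dropping behavior can be seen with Python's standard `urllib.parse`, as a small sketch. The two URIs are example values; the reference "#/$defs/whatever" is a hypothetical reference that an extracted contentSchema might contain.

```python
from urllib.parse import urldefrag, urljoin

context_uri = "https://example.com/schema"
content_schema_uri = "https://example.com/schema#/properties/whatever/contentSchema"

# The two retrieval URIs differ only by fragment...
distinct = context_uri != content_schema_uri

# ...but a base URI disregards the fragment, so both yield the same base.
context_base = urldefrag(context_uri).url
content_base = urldefrag(content_schema_uri).url

# A fragment-only reference found inside the extracted contentSchema
# therefore resolves to a location in the context schema.
resolved = urljoin(content_schema_uri, "#/$defs/whatever")
```

Here `resolved` is a URI into the context schema, which is exactly the case that fails when the application only has the extracted annotation value and not the context schema itself.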

Because the annotation data (at least as of the next release which will strengthen the annotation output requirements) MUST include the schema location with a JSON Pointer fragment, references within the extracted contentSchema schema could theoretically be resolved correctly, although that would require doing additional work rather than just handing it off to the usual URI reference resolution code.

> though, if "location keyword" becomes a concept exclusive to $defs, that might be different

Yes, that's what #1306 is about, although it's not that it would be exclusive to $defs, it's that $defs is the only current keyword that has that behavior without being an applicator. So implementing #1306 would allow us to declare contentSchema to have that behavior in addition to its annotation behavior. This would support @jdesrosiers's position that it should be a normal schema without requiring special keyword-specific handling. Which is why I'm open to that outcome: while it's not my preference, if it can be described without keyword-specific hacks, I'll be OK with it.

I believe that we should accept #1306 as it is important for reasons beyond contentSchema, which is why it's a separate issue from this one, and why we could accept it and not necessarily give contentSchema schema location behavior.

I've been considering this off and on for a while, including the comments from @notEthan .

I have come to the reluctant conclusion that it is better to give "contentSchema" location behavior, as @jdesrosiers prefers (although not expressed in those words).

Using a "contentSchema" that lacks an "$id" and/or a "$schema" based on an annotation that gives its location as something like "https://example.com/schema#/properties/whatever/contentSchema" is no different than starting validation from an identical schema at something like "https://example.com/schema#/$defs/whatever". In both cases, you must:

- Understand that your starting schema is a secondary rather than primary resource
- Evaluate the schema as part of the primary resource: if you do not have access to the entire primary schema resource, you cannot evaluate it properly
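The two steps above can be sketched in Python as follows. This is a minimal illustration, not a description of any implementation: `resolve_pointer`, `locate`, and the in-memory `resources` registry are hypothetical, and only JSON Pointer fragments are handled (no anchors, no embedded resources).

```python
from urllib.parse import unquote, urldefrag


def resolve_pointer(document, pointer):
    """Evaluate a JSON Pointer (RFC 6901) against a parsed JSON document."""
    if pointer == "":
        return document
    if not pointer.startswith("/"):
        raise ValueError(f"not a JSON Pointer: {pointer!r}")
    node = document
    for token in pointer[1:].split("/"):
        token = token.replace("~1", "/").replace("~0", "~")
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


def locate(resources, location):
    """Split a schema location into its primary resource and the subschema."""
    uri, fragment = urldefrag(location)
    primary = resources[uri]  # KeyError if the primary resource is unavailable
    return primary, resolve_pointer(primary, unquote(fragment))


primary = {
    "$id": "https://example.com/schema",
    "$defs": {"whatever": {"type": "integer"}},
    "properties": {
        "whatever": {
            "type": "string",
            "contentMediaType": "application/json",
            "contentSchema": {"$ref": "#/$defs/whatever"},
        }
    },
}
resources = {"https://example.com/schema": primary}

# Starting from a $defs location and from a contentSchema location
# goes through exactly the same steps.
_, from_defs = locate(resources, "https://example.com/schema#/$defs/whatever")
doc, from_content = locate(
    resources, "https://example.com/schema#/properties/whatever/contentSchema"
)

# The $ref inside contentSchema is only resolvable with the primary resource.
target = resolve_pointer(doc, from_content["$ref"][1:])
```

Nothing in `locate` cares which keyword the pointer passes through, which is the sense in which the two cases are the same once the primary resource is available.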

The only difference between the "contentSchema" and "$defs" cases is that the "contentSchema" schema is extracted as an annotation value, and removed from its context. There are a number of other possible solutions for that, which can be discussed in #1288, including simply not extracting the schema as an annotation value and requiring that the user have access to the original schema (which would save memory as well).

At this point, I am convinced by @jdesrosiers's assertions that treating "contentSchema" "like a schema" is less confusing than treating it differently, which means giving "contentSchema" location behavior. If #1306 is accepted, then we can do that explicitly with those words, but it is already characterized as a schema, so technically we do not have to change anything.

@notEthan you're welcome to object to the PR or continue to raise arguments here, btw. I should have left those comments up for a while before making a PR, but was having a bit of an off day and going on auto-pilot.

json-schema-org / json-schema-spec

Should "contentSchema" have schema location behavior? #1307