json-schema-org / json-schema-spec

The JSON Schema specification
http://json-schema.org/

Determine behavior of $ref #66

Closed awwright closed 7 years ago

awwright commented 7 years ago

"$ref" is causing a lot of problems because it's been inconsistently implemented. Determine:

  1. Proper URI base
  2. How to validate instances that are literally {"$ref":"some string"}
  3. Support for constants/non-schema references

sgpinkus commented 7 years ago

@sam-at-github I'm proposing a behavior that falls in line with a lot of current implementations, where you can only use $ref in places where a schema is expected, meaning you can use "$ref" literally in places that expect a literal value (like "enum" and "properties")

So yeah, my general objection is you're taking something that makes sense standalone and making its behaviour dependent on JSON Schema. For example, if $ref is independent of JSON Schema one does this:

   dereferenced_schema_doc = JSONDereferencer.deref(some_doc)
   validation_results = validate(dereferenced_schema_doc, some_doc)

Step one and step two are independent. You're proposing that step one has to know about the structure of JSON Schema.
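A minimal Python sketch of that two-step model, under stated assumptions: `deref` and `lookup` are hypothetical stand-ins for the `JSONDereferencer.deref` call above, they only handle same-document JSON Pointer fragments, and they know nothing about JSON Schema structure. Recursive references are not handled (this naive version would never terminate on them).

```python
def lookup(doc, pointer):
    """Follow a same-document JSON Pointer fragment such as '#/a/b'."""
    node = doc
    for token in pointer.lstrip("#").split("/")[1:]:
        token = token.replace("~1", "/").replace("~0", "~")
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


def deref(doc, root=None):
    """Replace every {"$ref": "#/..."} object with its target.

    Generic on purpose: it does not know which positions hold schemas.
    """
    root = doc if root is None else root
    if isinstance(doc, dict):
        if isinstance(doc.get("$ref"), str):
            return deref(lookup(root, doc["$ref"]), root)
        return {key: deref(value, root) for key, value in doc.items()}
    if isinstance(doc, list):
        return [deref(value, root) for value in doc]
    return doc


# Step one: dereference, with no knowledge of JSON Schema.
some_doc = {
    "definitions": {"name": {"type": "string"}},
    "properties": {"first": {"$ref": "#/definitions/name"}},
}
dereferenced = deref(some_doc)
# Step two (validation) would then run on `dereferenced` as a separate pass.
```

Because the pass is structure-agnostic, it would also rewrite a literal `{"$ref": "..."}` value sitting inside something like "enum".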

awwright commented 7 years ago

$ref already seems to be dependent on JSON Schema behavior, because "id" sets the base URI, and it has to be late bound to support recursive schemas.

What I figure is we're defining a new media type anyways, we can get to define how we represent hyperlinks.

In any event, we'll have to write in the behavior if there's no standards-level spec to reference. The question is, what's the behavior.

sgpinkus commented 7 years ago

$ref already seems to be dependent on JSON Schema behavior, because "id" sets the base URI, and it has to be late bound to support recursive schemas.

id is supposed to establish a base URI for relative URI resolution. It's a JSON schema specific way of doing Base URI resolution. The JSON Reference spec leaves how it is done undefined, just like URI spec does:

It is beyond the scope of this specification to specify how, for each media type, a base URI can be embedded.

You're right, JSON Schema imposes an especially complex method that requires tight coupling... But note it's still a qualitatively different type of dependency on JSON Schema than restricting where a $ref can occur and what the target must be based on a JSON Schema structural definition.

If you want to talk about features that should be dropped because they are practically useless and not widely implemented, I would say id base URI resolution is at the top of the list!

handrews commented 7 years ago

I am strongly in favor of obliterating id from the specification. It's exceptionally confusing and in the course of publishing nearly 20 service definitions involving over 600 JSON Schemas we did not find any use for it whatsoever.

$ref on the other hand is extremely useful and easy to understand, completely independent of JSON Schema.

awwright commented 7 years ago

@handrews You should post this critique in the appropriate issue, and reference the part of the current 'master' draft that's too confusing.

It serves the same purpose as "base" and a rel=self link in HTML, which shouldn't be a confusing concept at all: it gives the document a base to resolve URI references against, and it lets you bundle multiple 3rd party schemas into a "definitions" section without needing to make any changes to them.

handrews commented 7 years ago

@awwright Will do. From what I've seen the current "master" is substantially less confusing. There are sensible uses but it opens up a lot of complexity if it still has the same capabilities as in v4.

sgpinkus commented 7 years ago

It serves the same purpose as "base" and a rel=self link in HTML, which shouldn't be a confusing concept at all: it gives the document a base to resolve URI references against,

Serves the same broad purpose yes, but in a much more complicated way. HTML4 BASE occurs once per document. Plus in HTML you don't actually need to dereference every reference to read a given document..

When present, the BASE element must appear in the HEAD section of an HTML document, before any element that refers to an external source. The path information specified by the BASE element only affects URIs in the document where the element appears.

awwright commented 7 years ago

@sam-at-github There's precedent for this in other technologies though, RFC3986 describes how this works in general, XML has xml:base (which works in application/xhtml+xml), and Atom I think too, and HTML has iframes.

In HTML you do have to parse the entire document starting from the top, URI references are resolved at the same time. (And if you use the DOM to change the base, they all have to be re-computed!)

Most of the issue is that JSON Schema is context-free, so we can't enforce a restriction that a keyword must only appear in a root schema, any root schema must also be a valid subschema, and behave the same way. And this is intentional, so that you can bundle together third party schemas into a "definitions" section.

sgpinkus commented 7 years ago

OK so the general argument for id is embedding schema within schema. I understand that id is providing a technical capability here. I just don't think it's a) practically useful, and b) widely implemented (TODO: survey). And as such an unnecessary barrier to conformance.

The recurring example of where this might actually be practically useful seems to be:

so that you can bundle together third party schemas into a "definitions" section.

Is that actually done anywhere in the wild though?? The alternative is to just not embed independent schema in your schema and to use absolute URLs (or paths). That is what I've been doing. Works fine. What is the argument for why this is so much less appealing than embedding schema in your schema? Something to do with efficiency right? I don't buy it.

@awwright You say that JSON Schemas are "context free". But all JSON Schema are loaded from a resource location. They have this implicit context. This is the base URI one resolves against in the absence of an id. There is also "$schema" which - "MUST be located at the root of a JSON Schema".

handrews commented 7 years ago

@awwright I think the use case for "definitions"-based re-use needs some thought, at which point the desirability of id as the mechanism for this will be more clear. I'm basing this, of course, on the 600+-schema system I mentioned which did not use id. What we did isn't an ideal solution either (combine a bunch of schemas in a JSON document which is not itself a schema - there were reasons but I'm not going into it right now). But it indicates to me that there are viable alternatives that are less problematic than id.

Actually I think multi-schema integration in general needs some thought. I will give it some and file new issues / comment on existing issues appropriately. I understand the precedents and mechanisms (I had to in order to explain to people writing schemas what it did, which was not what they thought it did, and why they should never use it). But even if/when URI resolution needs adjustment, this has never felt like a good way to accomplish it.

awwright commented 7 years ago

Is that actually done anywhere in the wild though??

I've done it on a few cases! Primarily now, though, I'm storing schemas I use (including cached 3rd party schemas) in a document database, and looking them up by "id".

They have this implicit context.

There is a context, but it only has two properties, both strings: (1) the URI base (set by "id"), and (2) the schema vocabulary (set by "$schema"). This is why JSON Schema suggests every root schema set these, so it doesn't exhibit unknown behavior.

Schemas with these keywords can be found in sub-schemas too, and it sets the new "context" in the same fashion. So it's not an issue that "$schema" MUST be present in a root schema (though this isn't the case in the current 'master' draft, it's merely SHOULD). Any root schema (following the suggested behavior) can be embedded as a sub-schema without any changes, without change in behavior.

This is kkiiiind of getting off track though. What does this have to do with $ref?

handrews commented 7 years ago

This is kkiiiind of getting off track though. What does this have to do with $ref?

@awwright you earlier justified tying $ref to JSON Schema by saying:

$ref already seems to be dependent on JSON Schema behavior, because "id" sets the base URI, and it has to be late bound to support recursive schemas.

and that seems to have prompted a side discussion about killing off id. Which I agree should not be going on in this issue and I apologize for my part in the derailment.

I'm still generally a bit puzzled by this concern over $ref as it was always one of the least problematic things about working with JSON Schema for me and the teams I worked with. At least once people understood how JSON Pointers work as URI fragments, which also wasn't hard.

epoberezkin commented 7 years ago

I think killing ID altogether is a terrible idea - you need a way to link schemas in multiple files, and within the file you want to use IDs as the base for resolution so you don't have to write the whole URI, only the file name.

At the same time the embedding argument is weird - I think it simply should be not allowed to use root schema as a subschema (and it's easy to have a meta-schema that would respect that distinction).

However much time I invested into correctly managing $ref resolution in Ajv (it seems to be the only validator that fully supports the spec), I like the idea of renaming id to $id (for consistency) and restricting it to the top level and using JSON-pointers for everything else.

@fge was expressing similar views repeatedly and the only use case which I didn't like losing at the time was "named dependencies" - where you give shorter ids to them. But I would happily trade this convenience for the simplification and the consistency of support in all validators.

At the same time, I would say that validators MUST support recursive and mutually recursive references (both within a single root schema and between root schemas) as it is the only way to define recursive data structures - trees, graphs, etc. It also means that $refs cannot be resolved in all cases and the "final"/"resolved" schema cannot be generated.

awwright commented 7 years ago

After implementing my "jsonschema" package, I'm having a hard time imagining how limiting the functionality would make implementation or usage any easier, since the "context free" paradigm is so central to schemas (that the behavior for a schema and a subschema is the same).

But you're using an entirely different approach than I am.

Performance for "jsonschema" is explicitly not the top priority like you're going for, but being reference-quality and customizable is. I recall mine being the first ECMAScript implementation to fully pass the JSON Schema Test Suite (right now it appears to pass all 824 tests from the test suite, though skipping network tests, and of course bigint/arbitrary precision tests).

Presently, I'm working on a brand new implementation from scratch that takes a JSON document as a stream, and reports errors with line numbers, that does support arbitrary precision numbers, and generally has better error reporting, especially for very large (possibly indefinitely large) JSON documents. I'll report back how that goes.


Anyways, I'm primarily trying to figure out here, since it's important that JSON Schema has no edge cases, how do we handle validation of instances that look literally like {"$ref": "some string", "$refb": "more strings, ..."}

As of right now's master branch, "$ref" is now only interpreted as such where a schema is expected. Does anyone have a list of cases in the wild where it's being used otherwise?

handrews commented 7 years ago

The only requirement for instance validation is that "$ref" be treated as a literal key name in properties, correct? Any other location, whether it expects a schema or not, is unambiguous, so it could be allowed anywhere else.

awwright commented 7 years ago

Possibly also as a value for "enum", and proposed stuff like "constant" or custom properties.

epoberezkin commented 7 years ago

Also dependencies and patternProperties. I think allowing $ref only in places where the schema is expected is a more sane approach than allowing $ref to be used for anything else.
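A rough Python sketch of what "only where a schema is expected" could look like in practice. The keyword sets below are an illustrative, partial list based on draft-04 keyword names (for instance "items" and "dependencies" are left out because their values are not always schemas), and `find_refs` is a hypothetical helper, not something taken from any validator.

```python
# Keywords whose value is one schema, a list of schemas, or a map of
# name -> schema. Illustrative and deliberately incomplete.
SCHEMA_VALUED = {"not", "additionalProperties", "additionalItems"}
SCHEMA_LISTS = {"allOf", "anyOf", "oneOf"}
SCHEMA_MAPS = {"properties", "patternProperties", "definitions"}


def find_refs(schema, path="#"):
    """Yield (location, target) for each "$ref" found in a schema position."""
    if not isinstance(schema, dict):
        return  # e.g. "additionalProperties": false
    if isinstance(schema.get("$ref"), str):
        yield path, schema["$ref"]
        return
    for key, value in schema.items():
        if key in SCHEMA_VALUED:
            yield from find_refs(value, f"{path}/{key}")
        elif key in SCHEMA_LISTS:
            for index, sub in enumerate(value):
                yield from find_refs(sub, f"{path}/{key}/{index}")
        elif key in SCHEMA_MAPS:
            # The map keys are literal names, so a property called "$ref" is fine.
            for name, sub in value.items():
                yield from find_refs(sub, f"{path}/{key}/{name}")
        # "enum" and unknown keywords hold literal data and are not descended into.


schema = {
    "properties": {
        "$ref": {"type": "string"},              # a property literally named "$ref"
        "link": {"$ref": "#/definitions/link"},  # a real reference
    },
    "enum": [{"$ref": "some string"}],           # literal value, not a reference
    "definitions": {"link": {"type": "object"}},
}
refs = list(find_refs(schema))
```

With this walk, only the "link" property is reported as a reference; the property named "$ref" and the "enum" member are treated as plain data.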

epoberezkin commented 7 years ago

@handrews I am bringing here the conversation regarding how the $ref should be treated: inclusion vs validation (from issues #85 and #98).

Treating $ref as inclusion has two problems: 1) recursive schemas 2) $ref resolution inside referenced subschemas

Problem 1: recursive schemas

@awwright wrote above:

$ref already seems to be dependent on JSON Schema behavior, because "id" sets the base URI, and it has to be late bound to support recursive schemas.

I am not sure what "late bound" means here if not "executing validation on the current part of the data instance using the referenced schema". @awwright could you explain what else an implementation could do if not validation?

The recursive data structures, and therefore recursive schemas that reference one another are very common - trees and graphs are used to represent many real world objects. I can point to some examples if necessary, but it seems quite obvious.

If we treat "$ref" as inclusion/structural manipulation, how does it work with recursion? Please bear in mind that in case you have mutual recursion between different files you cannot determine whether the $ref is recursive from the format of the URI.

I was relatively recently addressing issues with mutual recursion - see https://github.com/epoberezkin/ajv/issues/210#issuecomment-226956993 and https://github.com/epoberezkin/ajv/issues/240 . I am only posting these links as an illustration that a lot of people use recursive schemas in the wild, so we can't simply ignore this issue.

Problem 2: $ref resolution inside referenced subschemas

Another issue is reference resolution that would work differently, depending on whether you treat $ref as inclusion or as validation.

@awwright writes about it:

The only time there would be a difference is if the base URI changes. Which isn't a problem if your root schemas always have an absolute-URI "id" like JSON Schema recommends.

But it only solves the problem if you include the whole root schema that has ID. If you include the fragment, this fragment usually won't have id (or will have a relative id) to correctly change resolution scope. So if this fragment contains relative $ref to the schema from which it is included, the reference will not correctly resolve.

There is a test case in JSON-Schema-Test-Suite that illustrates this problem. If you treat $ref as inclusion the test will fail. I will post a slightly modified version here, so it is simpler to understand the problem (It is only modified to not rely on some assumptions that test-suite makes about schema IDs, the structure is the same).

Main schema:

{
    "id": "http://localhost:1234/schema.json",
    "properties": {
        "int": {
            "$ref": "definitions.json#/refToInteger"
        }
    }
}

definitions.json:

{
    "id": "http://localhost:1234/definitions.json",
    "integer": {
        "type": "integer"
    }, 
    "refToInteger": {
        "$ref": "#/integer"
    }
}

It all seems clear - property int points to "definitions.json#/refToInteger" which in its turn points to "#/integer" (that is a relative reference to "definitions.json#/integer"). If "$ref" is an instruction to validate referenced schema there is no problem. If "$ref" is an inclusion, then the main schema should be equivalent to this schema:

{
    "id": "http://localhost:1234/schema.json",
    "properties": {
        "int": {
            "$ref": "#/integer"
        }
    }
}

But the problem here is that this schema contains "$ref" that points to "#/integer" that is undefined in this schema. It was obviously present in "definitions.json", but as soon as we've included the fragment into the main schema we have lost that context.

That use case is very common in the real world. When you define a collection of schemas in some domain space, it is a common practice to group many definitions in one file, so other schemas can reference them. And some definitions are usually referring to others, like in this example. So if these definitions were simply included they would not work.
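The difference between the two readings comes down to which base URI the inner "#/integer" is resolved against. A small Python sketch using the standard library's `urljoin` and the ids from the example above (the variable names are only for illustration):

```python
from urllib.parse import urljoin

main_id = "http://localhost:1234/schema.json"
defs_id = "http://localhost:1234/definitions.json"

# First hop: the reference written in the main schema, resolved against its id.
first_hop = urljoin(main_id, "definitions.json#/refToInteger")

# Delegation: "#/integer" is resolved against the document it was written in.
delegated = urljoin(defs_id, "#/integer")

# Naive inclusion: the fragment was copied into the main schema first, so
# "#/integer" ends up being resolved against the main schema instead.
included = urljoin(main_id, "#/integer")

print(first_hop)
print(delegated)
print(included)
```

The last value points at `schema.json#/integer`, which does not exist in the main schema; that is the lost context described above.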

Conclusion

I understand that historically $ref started as a separate thing, based on another standard. But the spec, the official test suite, and usage practice have all made $ref evolve and essentially become a special validation keyword, at least in some cases that are too important to ignore...

@awwright @handrews I am looking forward to your suggestions how these problems can be addressed in any simpler way (!) than treating "$ref" as a special validation keyword.

epoberezkin commented 7 years ago

And, by the way, if we decide to acknowledge that $ref is a special validation keyword, as I believe it deserves :), we can also drop the requirement to have it as the ONLY keyword in the schema and ignore everything else. We would finally be able to stop dancing around $ref with clunky allOf to do what we need.

If usage practice is any indication, people do mix $refs with other keywords, it seems natural. When I relatively recently introduced the option in Ajv to ignore other keywords used with $ref (it will be the default behaviour in the next version as per spec, but now it's an option) and added a warning that you should not be mixing them, I immediately got an issue asking to be able to suppress the warning.

I think ignoring other keywords with "$ref" is the worst thing we can do - it's unexpected and confusing when some keywords do not apply. Also it's quite difficult to detect false positives in validation - very few people add enough fail tests to their schemas to understand that some keywords are ignored. I think we should either allow mixing (to acknowledge existing usage practice and to de-clunkify compliant schemas) or make the schemas where "$ref" has siblings invalid (to avoid confusion and surprises).
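For reference, the "allOf" workaround mentioned above can be sketched as a mechanical rewrite. `wrap_ref_siblings` is a hypothetical helper that handles a single subschema object only; it assumes the siblings are ordinary validation keywords (moving something like "definitions" would break pointers into it).

```python
def wrap_ref_siblings(schema):
    """Rewrite {"$ref": ..., other keywords} as an equivalent "allOf" pair."""
    if not isinstance(schema, dict) or "$ref" not in schema or len(schema) == 1:
        return schema  # nothing to do: not an object, no "$ref", or "$ref" alone
    siblings = {key: value for key, value in schema.items() if key != "$ref"}
    return {"allOf": [{"$ref": schema["$ref"]}, siblings]}


mixed = {"$ref": "#/definitions/x", "maxLength": 5}
wrapped = wrap_ref_siblings(mixed)
```

Under the "ignore siblings" rule the `mixed` form silently drops "maxLength", while the wrapped form applies both the referenced schema and "maxLength".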

handrews commented 7 years ago

@epoberezkin thanks for moving this here, and even more for expanding on your concerns in detail. I see what is going on now.

I am not sure what "late bound" means here if not "executing validation on the current part of the data instance using the referenced schema".

"late bound" just means that you only dereference the references as needed during the process of validation:

{
    "definitions": {
        "foo": {"properties": {"bar": {"$ref": "#/definitions/bar"}}},
        "bar": {"properties": {"foo": {"$ref": "#/definitions/foo"}}}
    },
    "type": "object",
    "properties": {"foo": {"$ref": "#/definitions/foo"}}
}

This schema validates instances like: {"foo": {"bar": {"foo": {"bar": {"foo": {}}}}}}

and so on.

It works because, as the validator checks each child value, it goes through just the one reference needed to do that. Eventually it gets down to that innermost foo, which has no properties, so it doesn't need to dereference anything else, and validation passes. Which means the whole thing passes. No recursion problems.

We had a lot of recursive or mutually recursive situations in my last project and it was not a problem- this worked just fine.

So it's not "inclusion" in the sense of the C pre-processor where all of the inclusion happens before you run validation, and it would be possible to write out an equivalent schema with no "$ref". It's only "inclusion" in the sense that, at each level, one at a time, it is as if you have included that level.

In other words, it's just a difference of how the inclusion is implemented.
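A minimal Python sketch of that late-bound behavior, assuming a toy validator that understands only "$ref", "type": "object" and "properties". The names `lookup` and `is_valid` are made up for illustration, and only same-document references are handled.

```python
def lookup(root, ref):
    """Resolve a same-document reference such as '#/definitions/foo'."""
    node = root
    for token in ref.lstrip("#").split("/")[1:]:
        node = node[token.replace("~1", "/").replace("~0", "~")]
    return node


def is_valid(instance, schema, root):
    """Validate against a tiny keyword subset, following "$ref" lazily."""
    if "$ref" in schema:
        # Late binding: follow exactly one reference, only when it is reached.
        return is_valid(instance, lookup(root, schema["$ref"]), root)
    if schema.get("type") == "object" and not isinstance(instance, dict):
        return False
    if isinstance(instance, dict):
        for name, subschema in schema.get("properties", {}).items():
            if name in instance and not is_valid(instance[name], subschema, root):
                return False
    return True


schema = {
    "definitions": {
        "foo": {"properties": {"bar": {"$ref": "#/definitions/bar"}}},
        "bar": {"properties": {"foo": {"$ref": "#/definitions/foo"}}},
    },
    "type": "object",
    "properties": {"foo": {"$ref": "#/definitions/foo"}},
}
instance = {"foo": {"bar": {"foo": {"bar": {"foo": {}}}}}}
result = is_valid(instance, schema, schema)
```

The recursion depth follows the instance, not the schema: the mutually recursive definitions are never expanded ahead of time, so the finite instance terminates the walk.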

This does, of course, involve validating against the referenced schema, but that doesn't make "$ref" a validation keyword. Here are my definitions:

To illustrate, I'll unroll the references enough to validate the same instance without further "$ref" dereferencing. This means I've applied the minimum transformations specified by the structural keyword "$ref", and I have not impacted the validation outcomes against any possible instance in any way.

This is the unrolled schema:

{
    "definitions": {
        "foo": {"properties": {"bar": {"$ref": "#/definitions/bar"}}},
        "bar": {"properties": {"foo": {"$ref": "#/definitions/foo"}}}
    },
    "type": "object",
    "properties": {
        "foo": {
            "properties": {
                "bar": {
                    "properties": {
                        "foo": {
                            "properties": {
                                "bar": {
                                    "properties": {
                                        "foo": {
                                            "properties": {
                                                "bar": {"$ref": "#/definitions/bar"}
                                            }
                                        }
                                    }   
                                }   
                            }   
                        }   
                    }   
                }   
            }   
        }   
    }   
}

which also validates {"foo": {"bar": {"foo": {"bar": {"foo": {}}}}}} in the exact same way as the reference-only one does. The validator just doesn't need to do any dereferencing here as we have done it manually.

I'll address the id / resolution problem 2 in another comment. I just wanted to handle the easy case first and see if we could agree on this part.

epoberezkin commented 7 years ago

@handrews your approach essentially means that the schema with all "$refs" included depends on the data being validated. I.e. for each data instance you will have different equivalent schema without "$refs". I specifically was asking for a simpler way than treating "$ref" as a special validation keyword. Creating a new schema for each data instance kind of solves the problem, but seems more complex - you could have avoided this issue altogether.

A validation keyword potentially changes the outcome of validation

That depends on the point of view and on the definition of what the $ref is, not the other way around. If you consider "the result of the validation of the data against referenced subschema" to be the result of the validation of $ref keyword, then $ref keyword satisfies the definition. If you consider $ref to be a structural transformation, then it would satisfy the second clause.

handrews commented 7 years ago

we can also drop the requirement to have it as the ONLY keyword in the schema and ignore everything else

I don't think this works the way you think it does. If it works just like "allOf", we're not gaining anything except a tiny streamlining of syntax, and no difference in behavior (which makes it irrelevant to #98 where overwriting is needed).

If it is not just a shorthand for "allOf", that just turns it into $merge with less clear semantics, where

{
    "$ref": "#/definitions/x",
    "properties": {"y": {"type": "boolean"}}
}

is roughly equivalent to

{
    "$merge": {
        "source": {"$ref": "#/definitions/x"},
        "with": {"properties": {"y": {"type": "boolean"}}}
    }
}

Except without the clarity of application/merge-patch+json semantics. So either you specify them (in which case they are exactly equivalent) or you're making up yet another set of merge semantics (which seems like a bad idea, we've got two already with "$merge/$patch").

So if we want $merge/$patch, and then want to declare $ref plus other keywords to be a shorthand for $merge, that's fine. But the same objections from issue #15 apply whether we spell it as $merge or as this expanded $ref.

epoberezkin commented 7 years ago

I'd rather we have a separate keyword for inclusion ($merge would do or any other) and treat $ref as validation - it would simplify things and also allow polymorphism.

awwright commented 7 years ago

@handrews Note JSON Schema already segregates keywords into classes, Core has "core keywords", JSON Schema Validation has "validation keywords" and "metadata keywords", Hyper-schema has, well, there's no name, but we can just call them "Hypermedia keywords"

epoberezkin commented 7 years ago

Let's not diverge to $merge/inclusion here, I will tomorrow post another suggestion to $merge #15.

handrews commented 7 years ago

your approach essentially means that the schema with all "$refs" included depends on the data being validated.

No, it's always the same schema. I was just illustrating the effective equivalence of the runtime behavior of a conforming validator (there are Draft 04 validators that do exactly this).

This is no different than recursion in programming. Having recursive functions doesn't mean that you copy out enough distinct copies to handle each set of input without recursion. It does mean that a compiler or interpreter can do that unrolling as an optimization, but the unrolling is a detail of compilation and does not affect either the outcome or the actual structure of the source. It just changes the steps (and hopefully performance) of whatever special case the compiler optimized.

Really, $ref schema recursion is no different than function recursion, so I really do not understand the hang-up.

handrews commented 7 years ago

@awwright what was that in response to? My definition of validation keyword? I was not trying to change the classification but rather explain what I think makes that classification what it is. The "core keywords" are kind of a miscellaneous pile. If @epoberezkin wants to move "$ref" into validation, we need to decide what makes a validation keyword (beyond "it's in the validation spec").

epoberezkin commented 7 years ago

@handrews you are contradicting yourself. If "$ref schema recursion is no different than function recursion", as you wrote, then "$ref" is essentially a validation keyword that delegates validation to another schema rather than includes it (that's what function recursion does).

handrews commented 7 years ago

@epoberezkin this is a terminology problem. Your use of the term "validation keyword" does not make any sense to me. It doesn't affect the validation outcome. That's what validation keywords do (aside from metadata keywords which are their own separate thing). They affect validation. $ref does not. Use it or don't, assuming you unroll enough recursion to handle the specific instance, the validation outcome is the same.

How can $ref change the validation outcome? Ignoring your problem 2 about scope resolution because that is a thing of its own.

epoberezkin commented 7 years ago

@awwright I don't insist on moving $ref into validation category, it can remain in the core. What I am insisting on though is:

handrews commented 7 years ago

@epoberezkin I don't think your notion of inclusion is what anyone else thinks is going on here. We're not physically executing an inclusion (copying the referenced schema into the location).

epoberezkin commented 7 years ago

How can $ref change the validation outcome?

The result of $ref keyword validation is the same as the result of validation of the current data using referenced schema. As such it participates in the validation outcome, same as other keywords.

epoberezkin commented 7 years ago

I don't think your notion of inclusion is what anyone else thinks is going on here. We're not physically executing an inclusion (copying the referenced schema into the location).

That's great, because many comments to other issues implied that it's what's going on.

handrews commented 7 years ago

The result of $ref keyword validation is the same as the result of validation of the current data using referenced schema. As such it participates in the validation outcome, same as other keywords.

That definition explicitly states that $ref does not affect the validation outcome. "participates" is not the same as "affects." The annotation and hyperschema keywords "participate" in the validation in the sense that a validator has to ignore them. But they can never change whether an instance passes or fails validation.

Similarly, by this definition you give, $ref cannot change the outcome of the validation. The outcome is entirely dependent on the referenced and referencing schemas. And is the same whether they are connected by reference or inlined (one level at a time to avoid infinite recursion).

epoberezkin commented 7 years ago

I think "participates" vs "affects" is just terminology, the meaning is the same. By using "participates" I was just referring to the fact that $ref, as well as other keywords, can be inside "oneOf", "not", etc., when the result of keyword validation can be different from the result of the whole schema validation.

handrews commented 7 years ago

No matter what terminology you use, referencing or inlining has no impact on validation from the point of view of the specification. This is a specification project, not an implementation. Therefore it does not make sense to call it a validation keyword, nor to describe its implementation as a "special validation" step.

Referencing references a thing. That's all it does. It doesn't change the validation outcome, and it doesn't physically inline the referenced schema.

Your concern seems to be that it do one of these things (validation outcome) or the other (physical inclusion) when in fact neither describes what is going on.

epoberezkin commented 7 years ago

Similarly, by this definition you give, $ref cannot change the outcome of the validation. The outcome is entirely dependent on the referenced and referencing schemas. And is the same whether they are connected by reference or inlined (one level at a time to avoid infinite recursion).

I don't follow. I think we are arguing the terminology here.

If we agree that $ref and copying in place are different things, as we seem to agree, that's a big achievement already.

If we also agree that $ref in JSON-schema world is equivalent to the function call in programming, as you say in your comment, so we seem to agree here as well, then the argument whether "$ref" is validation or not is the same as whether function call is an expression. In all languages I know function call is an expression, rather than a "structural manipulation".

Your concern seems to be that it do one of these things (validation outcome) or the other (physical inclusion) when in fact neither describes what is going on.

Ok, so can you explain what it is then? How do you define it algorithmically?

epoberezkin commented 7 years ago

How about "delegation"? :) We can define "$ref" as "a special keyword that delegates the validation of the current part of the data instance to the referenced schema and uses the validation result as the result of the current (sub)schema validation" (that is if we don't allow mixing).
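A minimal sketch of that delegation reading, in Python. Everything here is an assumption made for illustration: the `validate` and `resolve` names, the tiny keyword set ("type", "properties", "$ref"), and the restriction to local "#/..." references without JSON Pointer escaping. It is not how any particular validator works.

```python
# Hypothetical mini-validator: supports only "type", "properties", and "$ref".
TYPES = {"object": dict, "string": str, "boolean": bool}


def resolve(root, ref):
    """Follow a local reference such as "#/definitions/node" (assumed form)."""
    if not ref.startswith("#"):
        raise ValueError("only local '#...' references are handled in this sketch")
    target = root
    for token in ref[1:].split("/"):
        if token:
            target = target[token]
    return target


def validate(schema, instance, root):
    # Delegation: hand the same instance to the referenced schema and use
    # its result as the result of this (sub)schema. No mixing with siblings.
    if "$ref" in schema:
        return validate(resolve(root, schema["$ref"]), instance, root)
    if "type" in schema and not isinstance(instance, TYPES[schema["type"]]):
        return False
    if isinstance(instance, dict):
        for name, subschema in schema.get("properties", {}).items():
            if name in instance and not validate(subschema, instance[name], root):
                return False
    return True


# A recursive schema: a node may contain another node under "next".
root = {
    "definitions": {
        "node": {
            "type": "object",
            "properties": {
                "label": {"type": "string"},
                "next": {"$ref": "#/definitions/node"},
            },
        }
    },
    "$ref": "#/definitions/node",
}

ok = validate(root, {"label": "a", "next": {"label": "b"}}, root)
bad = validate(root, {"label": "a", "next": {"label": 1}}, root)
```

Because the reference is only followed when the instance actually has a "next" value to check, the recursive schema terminates on any finite instance without ever being expanded in place.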

epoberezkin commented 7 years ago

I don't think you can avoid the fact that this definition at least is not contradictory with anything, even if you don't like it. I would very much like to see the alternative definition, so far we are only agreeing on what "$ref" is NOT. We still need to agree on what it IS.

awwright commented 7 years ago

Since I filed this issue, we posted the new draft https://tools.ietf.org/html/draft-wright-json-schema-00. So as I interpret it, a schema with a "$ref" property means two things: First, set the URI base to the target of the $ref. Then, substitute the keywords of the target schema into the $ref object.

That is to say, it should always be the same as a simple substitution, except that, in addition, the URI base changes to the remote document's URI.
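To make the two steps concrete, here is a rough Python sketch. The `expand_ref` helper, the `fetch` callable, and the example.com documents are all hypothetical, and fragments are ignored; only `urllib.parse.urljoin` is a real standard-library call.

```python
from urllib.parse import urljoin


def expand_ref(ref_object, current_base, fetch):
    """Return (new_base, keywords) for a {"$ref": ...} object.

    fetch is a hypothetical callable mapping an absolute URI to a schema dict.
    """
    target_uri = urljoin(current_base, ref_object["$ref"])
    target_schema = fetch(target_uri)
    # Step 1: the URI base becomes the target of the $ref.
    # Step 2: the target's keywords stand in for the $ref object.
    return target_uri, dict(target_schema)


# Hypothetical in-memory "remote" documents, keyed by absolute URI.
DOCS = {
    "http://example.com/schemas/address.json": {"$ref": "zip.json"},
    "http://example.com/schemas/zip.json": {"type": "string"},
}

base, schema = expand_ref(
    {"$ref": "schemas/address.json"}, "http://example.com/root.json", DOCS.__getitem__
)
# The nested relative reference resolves against address.json, not root.json.
base2, schema2 = expand_ref(schema, base, DOCS.__getitem__)
```

A plain substitution that kept the base at root.json would look for "zip.json" next to root.json instead of inside "schemas/", which is the difference the base change is meant to capture.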

Are there any problems with this definition?

handrews commented 7 years ago

If we agree that $ref and copying in place are different things, as we seem to agree, that's a big achievement already.

Yup. Yay! :-D

In all languages I know function call is an expression, rather than a "structural manipulation".

Ah, I see your concern here! Yes, if that is your analogy then I follow your confusion. I wasn't thinking about it in terms of expression-ness. Perhaps programming languages aren't entirely the right analogy.

The reason I am so emphatic about excluding $ref from the set of validation keywords is due to schema algebra. I have been working on a proposal for the "additionalProperties": false re-use conundrum (it's also in that email thread from #98 ). In order to get it to work in a way that does not violate any of the principles of JSON Schema that shot down other proposals, there is a lot of schema algebra necessary to explain how it interacts with the boolean validation keywords.

A lot of the JSON Schema principles that we discuss (context-free validation, applicability, etc.) really apply only to keywords that impact validation, because they are validation principles. So if we're going to reason about those principles, then they apply to a specific set of keywords. Dumping "$ref" into that set by calling it a special validation keyword muddles that. Similarly, having a validation spec for most validation keywords but calling "$ref" a validation keyword while it is in core muddies the notion of validation.

Basically, we have a definition of what validation means, and it is tied directly to one of the three proposed standards. We should not muddy that definition or get it caught up in the other standards- that defeats the purpose of having a separate validation proposal.

It also doesn't gain us anything. Calling "$ref" a validation keyword makes things confusing without having any impact on how it is used or implemented.

We still need to agree on what it IS.

We're almost there. Set aside the word "validation" for the moment. How is it insufficient to describe what $ref does as referencing?

epoberezkin commented 7 years ago

We're almost there. Set aside the word "validation" for the moment. How is it insufficient to describe what $ref does as referencing?

It is insufficient because "referencing" means nothing in terms of "how it works". It is ambiguous and allows multiple interpretations and implementations. I see only two options here: inclusion (copy/paste) and delegation (function call). If it's delegation, then $ref essentially does validation by delegating to another schema.

You say it's neither. Saying it's "referencing" is just an attempt to avoid the issue. If you have some third explanation/implementation for "$ref" I am all ears. If not, we need to choose between the two options we have.

Trying to solve some other problem by muddling what $ref is is going to create more problems than it solves, because $ref is a fundamental thing in JSON-schema and it should be clearly defined.

handrews commented 7 years ago

It is insufficient because "referencing" means nothing in terms of "how it works". It is ambiguous and allows multiple interpretations and implementations.

This is a specification, so implementations (aside from whether they are possible) are not of concern here.

Multiple interpretations are a concern, but I do not see how there are conflicting interpretations possible. Can you provide an example where different interpretations that match what I wrote provide different outcomes? (without the problem 2 issue of resolution scope- haven't gotten to that yet!)

epoberezkin commented 7 years ago

Can you provide an example where different interpretations that match what I wrote provide different outcomes?

You still didn't write anything but "referencing" which in itself is ambiguous. Our options are inclusions (copy/paste), which we seem to agree it is not, and delegation. These interpretations are conflicting as they are producing different results.

This is a specification, so implementations (aside from whether they are possible) are not of concern here.

The specification is guidance for implementations, so if it is not clear how to implement the spec, we will get exactly the kind of mess we are in now, where there seem to be very few JavaScript validators that correctly implement the spec with regard to $ref resolution. That devalues the spec and prompts some radical suggestions.

awwright commented 7 years ago

@epoberezkin @handrews Are there any specific problems with the new definition in https://tools.ietf.org/html/draft-wright-json-schema-00, or none?

epoberezkin commented 7 years ago

@awwright I don't see any problem immediately, it is still sufficiently vague :)

Resolved against the current URI base, it identifies the URI of a schema to use.

"Use" can be interpreted as both the inclusion and as the delegation, so we are in the clear with regards to the published spec I think :)

We still need to clarify what it is as neither the spec nor examples in it address problems 1 and 2. I.e., we may explicitly say in the spec that references can be recursive (problem 1). We can also specify how the refs should be resolved in referenced subschemas (problem 2).

But in any case what we have in the spec now doesn't contradict the usage practice, test-suite etc. As long as "use" means delegation :)

handrews commented 7 years ago

Are there any specific problems with the new definition in https://tools.ietf.org/html/draft-wright-json-schema-00, or none?

I don't see any, but I still don't understand where @epoberezkin sees the problem or need for additional terminology. Aside from the problem of a naive implementation recursing infinitely (which the spec explicitly points out as something that MUST be avoided), I don't see how different implementation approaches produce different outcomes (aside, perhaps, from problem 2, which I need to look at now in detail- @awwright I think it might be addressed by your wording as well, but I haven't yet worked through it).

epoberezkin commented 7 years ago

@awwright sorry, I just noticed this definition:

Since I filed this issue, we posted the new draft https://tools.ietf.org/html/draft-wright-json-schema-00. So as I interpret it, a schema with a "$ref" property means two things: First, set the URI base to the target of the $ref. Then, substitute the keywords of the target schema into the $ref object. That is to say, it should always be the same as a simple substitution, except that, in addition, the URI base changes to the remote document's URI.

It's definitely a step in the right direction, although I think it does still have issues. I will write tomorrow, it is getting late here, sorry...

handrews commented 7 years ago

@awwright , @epoberezkin : I went back and looked at problem 2 and I think that @awwright 's "First, set the URI base to the target of the $ref" produces the correct behavior.

I have always thought of it a bit differently, but with the same outcome.

Rather than copying over the id needed to change things and then substituting the values, I have always thought of it as the validator simply "running" the validation from the referenced location. So the id and resolution stuff work just fine without needing to use id to mess with it. This is why I've never had any use for id and just find it confusing. "$ref" is just "go over there and continue validating, and when you're done come back here and keep going as if nothing unusual happened."

I have not worked through whether that conceptual view causes problems elsewhere- I'd be interested in that but not enough to push anyone else through another long discussion. As long as the specification is focused on the outcome rather than implementation I am happy.

[EDIT]: @epoberezkin while walking to and from the grocery store just now your rationale for describing "$ref" as involved in validation finally clicked. The way I think about "$ref" working is very similar to your description of "$ref" being defined to return the validation result of the thing being referenced. I still wouldn't call "$ref" a validation keyword, but I get what you meant now.

handrews commented 7 years ago

Literal "$ref" values

This was one of the original points of this issue, and I wanted to put a proposal on record even though I know there is movement to avoid the problem by restricting where "$ref" can be used (which I dislike).

Here is a proposal for how to have a literal "$ref" property name. Basically, you take the object that should have a literal "$ref" key and wrap it in another "$ref". So instead of a Reference object taking only a URI string value, it can either take a URI (current behavior) or an object (replace the reference with the literal object).

Only one level of literal escaping happens at a time. This is analogous to backslash escaping in many string formats- \ is a backslash token for escaping, \\ is a literal backslash, \\\ is a literal backslash followed by a backslash token for escaping, and so on.

The simplest form- strip off an outer "$ref" and treat the inner one as literal: {"$ref": {"$ref": "foo"}} => {"$ref": "foo"}

While proper "$ref" objects should not have other properties, the object including the literal "$ref" may: {"$ref": {"$ref": "foo", "x": "bar"}} => {"$ref": "foo", "x": "bar"}

Multiple levels evaluate one level at a time, so odd numbers leave the innermost reference as an actual reference: {"$ref": {"$ref": {"$ref": "#/foo"}}} => {"$ref": {"$ref": "#/foo"}}, where the result is an object with a literal "$ref" property, the value of which is a reference to "#/foo".
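The one-level-at-a-time rule can be sketched in Python roughly as follows. The `unwrap_ref` name and its (kind, value) return shape are made up for this illustration and are not part of any draft.

```python
def unwrap_ref(node):
    """Apply one level of the proposed literal-"$ref" escaping to one object.

    Returns (kind, value):
      ("reference", uri)  when node is an ordinary {"$ref": "<uri>"}
      ("literal", obj)    when node is {"$ref": {...}}; obj is taken literally
      ("plain", node)     when node carries no "$ref" at all
    """
    if not isinstance(node, dict) or "$ref" not in node:
        return ("plain", node)
    target = node["$ref"]
    if isinstance(target, str):
        return ("reference", target)
    if isinstance(target, dict):
        return ("literal", target)
    raise ValueError('"$ref" must be a URI string or an object in this sketch')


# The three examples from the proposal.
simple = unwrap_ref({"$ref": {"$ref": "foo"}})
with_sibling = unwrap_ref({"$ref": {"$ref": "foo", "x": "bar"}})
triple = unwrap_ref({"$ref": {"$ref": {"$ref": "#/foo"}}})

# In the triple case, the value of the now-literal "$ref" property is itself
# an ordinary reference, so processing that value yields a real reference.
inner = unwrap_ref(triple[1]["$ref"])
```

Only the outermost wrapper is consumed per call, so a processor walking the document would apply the same rule again when it reaches the values inside the literal object.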

So if you want to define a property called "$ref" you do so like this (including additionalProperties-false to show that interaction since it is so often problematic):

{
    "type": "object",
    "properties": {
        "$ref": {
            "$ref": {"type": "string"},
            "stuff": {"type": "boolean"}
        }
    },
    "additionalProperties": false,
    "required": ["$ref"]
}

Note that the "$ref" in the required array is not a problem because only objects can be references.

This would translate to (with the remaining "$ref" now considered a literal property name):

{
    "type": "object",
    "properties": {
        "$ref": {"type": "string"},
        "stuff": {"type": "boolean"}
    },
    "additionalProperties": false,
    "required": ["$ref"]
}

The following instances validate against that schema:

{"$ref": "foo"}
{"$ref": "bar", "stuff": true}

The following would not:

{"$ref": "foo", "x": 42}
{"stuff": true}
{}
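As a sanity check on those five instances, here is a small hand-rolled Python checker run against the translated schema. The `is_valid` function is a hypothetical stand-in that understands only "type", "properties", "required", and "additionalProperties": false; it deliberately treats the "$ref" key under "properties" as a plain property name, which is what the translation step is assumed to have established.

```python
TYPES = {"object": dict, "string": str, "boolean": bool}


def is_valid(schema, instance):
    """Check instance against the few keywords this sketch understands."""
    if "type" in schema and not isinstance(instance, TYPES[schema["type"]]):
        return False
    if isinstance(instance, dict):
        props = schema.get("properties", {})
        # Every name listed in "required" must be present.
        if any(name not in instance for name in schema.get("required", [])):
            return False
        # With "additionalProperties": false, unknown names are rejected.
        if schema.get("additionalProperties") is False:
            if any(name not in props for name in instance):
                return False
        for name, subschema in props.items():
            if name in instance and not is_valid(subschema, instance[name]):
                return False
    return True


# The translated schema, with "$ref" as a literal property name.
schema = {
    "type": "object",
    "properties": {
        "$ref": {"type": "string"},
        "stuff": {"type": "boolean"},
    },
    "additionalProperties": False,
    "required": ["$ref"],
}

accepted = [{"$ref": "foo"}, {"$ref": "bar", "stuff": True}]
rejected = [{"$ref": "foo", "x": 42}, {"stuff": True}, {}]

accepted_results = [is_valid(schema, instance) for instance in accepted]
rejected_results = [is_valid(schema, instance) for instance in rejected]
```

The first rejected instance fails on "additionalProperties", and the other two fail on "required".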