json-schema-org / json-schema-spec

The JSON Schema specification
http://json-schema.org/

Allow "$ref" to take a schema, to make inlining easier? #779

Closed handrews closed 5 years ago

handrews commented 5 years ago

[EDIT: See better problem statement two comments below]

Now that $ref can have adjacent keywords, dereferencing/inlining is more complex. (For the purposes of this issue, we're only talking about situations where inlining is possible, e.g. no cyclic references. Inlining such schemas is common, with numerous libraries dedicated to this operation.)

In informal discussions, we've recommended replacing $ref with an allOf containing just the referenced schema, OR if there is already an adjacent allOf, appending the referenced schema to that allOf. This is rather cumbersome.
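
For example (using a hypothetical #/$defs/name definition), a schema like this:

    {
        "title": "Name",
        "$ref": "#/$defs/name"
    }

would be inlined, assuming #/$defs/name is { "type": "string" }, as:

    {
        "title": "Name",
        "allOf": [
            { "type": "string" }
        ]
    }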

At the same time, we use runtime JSON Pointer-ish constructs that look like /properties/foo/$ref/properties/bar/$ref, etc., to record the runtime path as we traverse references.

What if we allowed replacing the $ref URI with the target schema? e.g. if {"$ref": "A"} points to {A}, then it can be replaced with {"$ref": {A}}

$ref here is effectively a no-op, it just allows inlining the target without having to re-arrange the context.
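
Concretely (again with a hypothetical definition), if #/$defs/name is { "type": "string" }, then:

    { "$ref": "#/$defs/name" }

could be inlined as:

    { "$ref": { "type": "string" } }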

Pros:

Cons:

gregsdennis commented 5 years ago

To be clear, are you suggesting that we support schema objects as a value for $ref? If so, that feels weird and counter to one of the reasons for having references (avoiding duplication).

Or are you just talking about how references are explained without any functional changes? If so, this is probably a good explanation.

handrews commented 5 years ago

@gregsdennis Here's a better problem statement:

A related thing that may or may not be considered a problem:

The first/primary problem can be solved by having a Core keyword that is a no-op in-place applicator where you can put a dereferenced schema.

If we agree that the second thing (the output changes based on whether it is reference or inline) is also a problem, then we should overload $ref.

If we believe that it is not a problem, then it doesn't matter.

If we believe that it would be a feature, with apps wanting to know the structure, then we should add a keyword such as $inline


My feeling is that dereferencing and inlining schemas is an implementation detail, and the consumer of said schemas should not be affected by it.

This is based on my experience with tools that collect large numbers of schemas and pack them into a single file before processing them. The author of the schemas is not necessarily even aware that the tooling takes this step.

Up until now, it didn't matter: there was no standardized output format for annotations or errors, and none of this changes the validation result. But now it will matter.

For a tool like Cloudflare's Doca, the expectation is that the schemas are spread across many files, but they are processed as if they were a single dereferenced file. (This is obviously a significant limitation, in that circular references are not allowed, but they've been using it to produce their API documentation for years, so it's a real use case.) Doca in fact uses json-schema-ref-parser to do this.

So if inlining a schema changes the output, doca would start to produce surprising results. It would be just that much harder to correlate errors with input, because the dereference/inline step is hidden from the schema author.

handrews commented 5 years ago

I do not think anyone would ever write a $ref with a schema object by hand. It's only for automated processing. Although processed schemas might be distributed in a single packaged file, so users might see such $refs.

(and of course, people do weird stuff so someone somewhere will no doubt write it by hand, but I don't think that's a significant factor).

johandorland commented 5 years ago

For implementors like me and @gregsdennis, schema inlining is not really on our mind. We must support the generic case, and for that inlining obviously doesn't work. I remember some people sending in feature requests to support schema dereferencing, which I always shot down as it's more a feature of a schema processor than of a validation library.

I can see the problem the new $ref mechanics will create for libraries like json-schema-ref-parser. I'm not too thrilled about allowing $ref to take a schema, as it adds yet another corner case to JSON Schema. I wouldn't mind the allOf solution, but I'm not much of a purist; having said that, it's probably the cleanest solution as it preserves the output.

Also, I'd rather overload $ref than add an $inline keyword. An extra keyword would only make things more complicated, and it looks less clean than a $ref with a schema. It's also easier to explain to someone that you can just copy and paste the content of where a $ref is pointing than that you have to change the $ref to an $inline as well. Just keep things simple and overload $ref then.

handrews commented 5 years ago

@johandorland schema processors are as much our customers as validators / code generators / whatever. That is kind of the point of having modular vocabularies: a schema processor can just depend on core and ignore everything else.

As far as the allOf solution, we need to keep future vocabularies in mind. For example, let's consider the following hypothetical new keywords:

{
    "$ref": "#/$defs/base",
    "refSemantics": "baseClass",
    "allOf": [
        {
            "$ref": "#/$defs/commonStuff"
        },
        {
            "$ref": "#/$defs/childSpecificStuff"
        }
    ]
}

In this example, the refSemantics keyword is an annotation explaining how code and doc generation tools should handle the adjacent $ref. In this case, it should be considered the base class for the schema containing the refSemantics keyword. That schema also uses an allOf to glue some re-usable bits together.

How should this be handled? A generic dereferencer won't know about refSemantics, and would produce:

{
    "refSemantics": "baseClass",
    "allOf": [
        {
            "$ref": "#/$defs/commonStuff"
        },
        {
            "$ref": "#/$defs/childSpecificStuff"
        },
        { the baseClass schema itself }
    ]
}

which no longer works because the annotation has been separated from the reference. However, if $ref can take a schema:

{
    "$ref": { the baseClass schema itself },
    "refSemantics": "baseClass",
    "allOf": [
        {
            "$ref": "#/$defs/commonStuff"
        },
        {
            "$ref": "#/$defs/childSpecificStuff"
        }
    ]
}

this continues to work just fine.

This is an interesting example because it actually depends on the ability to have a keyword adjacent to $ref. It's not just a convenience. While I just made up the refSemantics keyword on the spot, I do think that annotations adjacent to $ref will be a key part of how we build OO-related tools. Although it's worth noting that this is not an idea that anyone has yet validated.

gregsdennis commented 5 years ago

All of this sounds like an implementation problem, and not one that JSON Schema needs to deal with. If an implementation processes references by mechanically inlining them as a step, then they have to deal with how to model and report that dereference.

I think this is a non-issue for us (as JSON Schema).

awwright commented 5 years ago

This makes a little bit of sense.

Why not just jump to the logical conclusion and also say "A string found anywhere a schema is expected is a URI Reference to a schema"?

handrews commented 5 years ago

@awwright I feel like that's a lot to ask particularly with extensible vocabularies that may include their own applicators. String-or-schema is more of a challenge in strongly typed languages, and I see some value in $ref expressing the intent of referencing a schema that lives elsewhere.

If you see a $ref with a schema value, then you know that this was almost certainly processed from a URI Reference, which may be significant in some applications.

Perhaps more importantly, the annotation-adjacent-to-$ref is a significant use case. I brought up this issue on the OpenAPI slack, and @mkistler confirmed that his tooling special-cases annotations like examples when they are adjacent to $ref (even though such things are not technically allowed in OpenAPI Schema Objects either).

It's pretty straightforward to define the behavior of an annotation with respect to an adjacent $ref. Of course, allowing URI References in place of schemas for other keywords would not disallow keywords adjacent to $ref, assuming we kept $ref. But I think it encourages relatively inflexible structures. Because we are emphasizing the potential of adjacency semantics for building vocabularies for use cases such as code generators, I'm reluctant to enable another structure that could steer people away from that approach.

I do see your point with this idea, though. I think a good option would be to try it with $ref for this draft, and depending on feedback we could easily extend it to "wherever a schema appears" in the next draft (hopefully the next major draft will be the last).

handrews commented 5 years ago

@gregsdennis Validators are not the only type of implementation. And not all implementations need to support all features, as seen by many popular tools that work with only part of the spec, including dereferencing and packaging schemas into single files or single in-memory structures.

Dereferencing is a popular standalone feature. Not only is there the library that I mentioned (and several others), but I've noticed it requested as a feature on several validator repositories. It is often rejected on the argument that stand-alone dereferencing should be done as a separate implementation (@Julian has this come up with your implementation? I confess I'm too lazy to go look right now 😴 )

This really has nothing to do with how references are handled internal to a validator. This has to do with supporting implementations of the core spec as a base for other features, and implementing a tooling ecosystem around JSON Schema that includes modular utilities.

I am aware of multiple $ref implementations, as well as at least two implementations that transform schemas in various ways (one of which I wrote, and one which was written independently and I think earlier).

On our Slack workspace, you asserted that our customers are schema authors, not implementors. I disagree, particularly now that we are encouraging the development of further vocabularies, and "implementation" will have a more broad and flexible meaning.

Furthermore, schema authors may rely on tooling such as this to bridge the gap between the actual features that they wish to use in their schemas, and the features supported by specific tools. This is precisely what Cloudflare did with Doca.

We need to think about a much wider range of uses than simply writing a validation schema and validating instances with it.

handrews commented 5 years ago

@philsturgeon can probably also speak to the usefulness of utility libraries for things like transforming between OAS Schema Objects and various JSON Schema drafts, writing linters, etc.

Relequestual commented 5 years ago

I can see the argument and utility in this. As you've said, ideally we don't want to encourage transclusion, but people have their reasons for doing so, and it's clear people want to do it.

We often do not make changes to the spec where a solution already exists. In this case, "just use allOf" would be the solution.

@handrews My feeling is your example which includes refSemantics doesn't hold. A generic implementation would double wrap (allOf > allOf), and not simply add the $ref to an existing allOf (I assume). Can you check this with the existing library? The result would be insightful.
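
For illustration, here is a sketch (unverified against any actual library) of what such double wrapping might produce from the earlier refSemantics example:

    {
        "allOf": [
            {
                "refSemantics": "baseClass",
                "allOf": [
                    { "$ref": "#/$defs/commonStuff" },
                    { "$ref": "#/$defs/childSpecificStuff" }
                ]
            },
            { the baseClass schema itself }
        ]
    }

Either way, the annotation ends up separated from the dereferenced target.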

What's the primary problem we are trying to solve here?

Simply that it's cumbersome? If so, my feeling is there's little expectation that people will read these compiled schemas, and therefore preventing a cumbersome look of the schema is not enough of a reason.

If, however, there's a specific gain for implementations that do things like documentation generation, I'd like to hear the overview. (I think you have indicated there would be, but not specifically what. Generally, my feeling is that supporting evidence isn't required for assertions like this from you, but I'd like to see and consider it all the same.)


We should involve @JamesMessinger at this point, given his implementation's popularity for this purpose.

Ultimately, if such a change wouldn't be supported by json-schema-ref-parser, we have to consider if it's worth doing.

My feeling is it would be preferable to my previously axed section which was originally bundled into the "$ref is delegation" change (ugh I don't have the link offhand, please feel free to add it here).


Can anyone speak to their expectation of the cost of this change to validation implementations? @gregsdennis maybe?

I'm personally unconcerned with the issue around typed languages, but I'm also willing to listen. I feel the pros outweigh the cons overall, so far.

mkistler commented 5 years ago

Just to throw in a new perspective -- our tooling treats referenced schemas as semantically meaningful, in that these are the schemas (for the most part) that will be turned into classes in our generated client libraries. So we don't start by dereferencing all the $refs in an API doc. In fact -- we do the opposite. We create named schemas and insert $refs for things like request bodies and response bodies because we want to produce named classes for these.

philsturgeon commented 5 years ago

General comment on this sort of talk:

I remember some people sending in feature requests to support schema dereferencing, which I always shot down as it's more a feature of a schema processor than a validation library.

Spectral is a validation (well, validation plus linting) tool, which uses a schema processor called json-ref-resolver to figure all this stuff out.

We also have a mocking library, which uses a schema processor called json-ref-resolver to figure all this stuff out.

Pretty much any tool has the job of "reading JSON Schema from files or URLs, and they probably have $ref in there, so let's figure out what to do with that". So if you don't code your own schema processor, you probably need to bundle one, or your tool, whatever it is, isn't going to work.

Catching up on the rest of it.

@handrews it would be handy if you could post a refresher on why $ref can have siblings now. I think a lot of the case being made for this relies on that, so it'd be good to get that context.

Relequestual commented 5 years ago

Additionally, doing this would not avert the issues around URI shadowing when a subschema sets an $id. (This issue is not the space to discuss and debate URI shadowing.)

Relequestual commented 5 years ago

it would be handy if you could post a refresher on why $ref can have siblings now - @philsturgeon

This was mainly because people often expected it to work that way for the purposes of adding specific annotations when generic schemas were used in specific contexts. We also now had the applicability language, which made explaining HOW it works a LOT easier, and made behaviour in line with other applicability keywords.
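
For example (with a hypothetical address definition), reusing a generic schema while adding a context-specific annotation:

    {
        "properties": {
            "shipping_address": {
                "$ref": "#/$defs/address",
                "description": "Where the order should be delivered"
            }
        }
    }

Under previous drafts, the description here would have been ignored because $ref caused all sibling keywords to be disregarded; under the new behaviour it is collected as an annotation alongside the reference.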

handrews commented 5 years ago

@philsturgeon

it would be handy if you could post a refresher on why $ref can have siblings now. I think a lot of the case being made for this relies on that, so it'd be good to get that context.

Initially it was as @Relequestual says. More importantly, I gave an example above of how code generation functionality will depend on it (the refSemantics example). Do you see how that example relies on $ref-adjacent keywords? If so, is there something else missing here? If not, what do you see that example as doing?

handrews commented 5 years ago

@mkistler

So we don't start by dereferencing all the $refs in an API doc. In fact -- we do the opposite. We create named schemas and insert $refs for things like request bodies and response bodies because we want to produce named classes for these.

Right! So, if someone took a schema written for use with your tooling, and dereferenced it so that those $refs were removed and could no longer be detected as references, then your tooling would no longer be able to process it because important semantic information was lost in the $ref-removal.

That is the issue that we have right now. The way to dereference a $ref involves removing it and fiddling with an allOf. The reference is gone, and your tooling is now broken.

Obviously, people who are writing with your tooling in mind would not do this themselves. It would only happen in some sort of environment where other tools are also involved, and someone combines them in the wrong order or something like that.

But there is a key principle here (@Relequestual this is the most concise way to describe the problem so far):

Schema semantics should be robust to transclusion.

Secondarily, the transclusion process should be a function of the core vocabulary only, because only the core vocabulary is guaranteed to be supported by all implementations.

The allOf solution does not handle this.

I don't know what you mean by double-wrapping. Could you please use my example from earlier and show what a double-wrapped solution would look like?

handrews commented 5 years ago

Regarding involving @JamesMessinger, I would be thrilled for him to join the discussion.

However, I do not think that any one implementation gets veto power over a feature that has valid use cases. I'm pointing to that project because its existence shows the need for transclusion support in general. There are now new use cases around transclusion because of the adjacent annotation possibilities.

While I hope JSON Schema Ref Parser (JSRP) chooses to support those, it would also be valid for it to say that it only supports $ref in isolation. It would be quite easy to write a meta-schema that enforces that but is otherwise identical to our standard core vocabulary meta-schema, and say that the library only supports JSON Schema-ish documents (note that the library does not really depend on JSON Schema itself beyond $ref) that conform to that meta-schema. That is part of why we have vocabularies and meta-schemas in the new draft, to allow expressing those sorts of restrictions.

If JSRP decides to go that route (with or without an actual meta-schema involved), that does not invalidate the use case.

Transclusion should not break schema semantics.

I feel like I have clearly described a key use case around code generation for semantically significant $ref-adjacency.

Furthermore, @mkistler has confirmed that references are semantically significant for his tooling. While he and others making such tools could simply say "don't repackage schemas with transclusion", one problem with the allOf-plus-replace-$ref approach is that once it has been done, tooling cannot detect that it was done. So the result will be puzzling and incorrect behavior.

handrews commented 5 years ago

Relevant issues from @Julian's project:

ucarion commented 5 years ago

Perhaps it was stated by other comments here, but in re:

it would be handy if you could post a refresher on why $ref can have siblings now. I think a lot of the case being made for this relies on that, so it'd be good to get that context.

Part of the reason is that it makes $ref work like all other keywords, i.e. it "ANDs" with all of its sibling keywords. Previously, it was the sole keyword to obviate its siblings.
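
For example (hypothetical schema), under the new behaviour this:

    {
        "$ref": "#/$defs/positiveNumber",
        "maximum": 100
    }

constrains an instance exactly as if it were written:

    {
        "allOf": [
            { "$ref": "#/$defs/positiveNumber" },
            { "maximum": 100 }
        ]
    }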

ucarion commented 5 years ago

Just to throw in a new perspective -- our tooling treats referenced schema as semantically meaningful, in that these are the schema (for the most part) that will be turned into classes in our generated client libraries. So we don't start by dereferencing all the $refs in an API doc. In fact -- we do the opposite. We create named schemas and insert $refs for things like request bodies and response bodies because we want to produce named classes for these.

This is a legit point. But @mkistler, does the tooling you mention restrict itself to a subset/profile of JSON Schema? Perhaps the form of $ref in discussion here might be prohibited from the profile your tool supports? Just a thought.

handrews commented 5 years ago

@ucarion There is currently a Very Big Discussion going on in the OpenAPI steering committee about JSON Schema profiles using the forthcoming draft's vocabulary features (@mkistler's tooling is OpenAPI-related).

But the point I was making concerned the scenario where someone who understands the tooling requirement writes something using $ref, and then someone else who needs things dereferenced for some reason runs those schemas through a de-referencer, and then attempts to feed the dereferenced schemas back to @mkistler's tooling.

In the current set-up, there is no way to tell that the schema was dereferenced before sending it to the tool, because the $refs are removed entirely. It will just fail in some unexpected way. And it's not possible to catch that with a profile.

Under the proposal in this issue, the $ref keywords are still in the correct place and can provide the semantic signalling that @mkistler's tooling relies on, even if the schema has been dereferenced.

mkistler commented 5 years ago

@ucarion Re:

does the tooling you mention restrict itself to a subset/profile of JSON Schema?

We do currently only handle a subset of JSON Schema. Elsewhere we are discussing how to formalize that concept, as I think it will be a common case for tooling developers.

handrews commented 5 years ago

as I think it will be a common case for tooling developers.

Yup. Pretty much why we spent so much of the last year on vocabularies and the keyword taxonomy that lets us reason about keyword behavior more generally.

ucarion commented 5 years ago

@mkistler I failed to notice you work on OpenAPI. Thanks @handrews for cluing me in.

@handrews, I'm not entirely sure I see how the tooling in question would work. @mkistler's requirement, presumably, is that there is a string associated with the $ref that serves two purposes:

  1. Unique identification, and
  2. A name which can be used to generate a class/struct/datatype/etc

But in the case of a $ref pointing to A, A might not have an $id. So the tooling cannot dedupe/reuse what it's previously generated for A, as it doesn't have a unique key to dedupe off of. It also doesn't have any useful hint for what a human-friendly name for the generated struct might be.

ucarion commented 5 years ago

Concretely, a schema that starts like this:

{
  "definitions": {
    "user": {
      "properties": {
        "id": { "type": "string" },
        "display_name": { "type": "string" }
      }
    },
    "photo": {
      "properties": {
        "id": { "type": "string" },
        "display_name": { "type": "string" }
      }
    }
  },
  "properties": {
    "me": { "$ref": "#/definitions/user" },
    "favorite_photo": { "$ref": "#/definitions/photo" }
  }
}

Would be converted into:

{
  "definitions": {
    "user": {
      "properties": {
        "id": {
          "type": "string"
        },
        "display_name": {
          "type": "string"
        }
      }
    },
    "photo": {
      "properties": {
        "id": {
          "type": "string"
        },
        "display_name": {
          "type": "string"
        }
      }
    }
  },
  "properties": {
    "me": {
      "$ref": {
        "properties": {
          "id": {
            "type": "string"
          },
          "display_name": {
            "type": "string"
          }
        }
      }
    },
    "favorite_photo": {
      "$ref": {
        "properties": {
          "id": {
            "type": "string"
          },
          "display_name": {
            "type": "string"
          }
        }
      }
    }
  }
}

I'm not sure how you can usefully codegen off of the second case?

handrews commented 5 years ago

@ucarion thanks for the example, that is a good question.

There are really two sub-cases of the "remove references" use case.

  1. Remove all $refs so that the implementation does not have to further resolve URI references at all
  2. Pack multiple JSON Schema documents into a single JSON Schema document

Your example is related to the first sub-case, which is most likely to happen as some sort of internal processing step. This is what Cloudflare's Doca documentation generator does. It's also the sort of internal implementation detail that @gregsdennis points out is not all that relevant to schema authors. My argument against that was that there are use cases that are relevant to schema authors, and by that I meant the second sub-case.

The second sub-case is more broadly useful because managing and distributing a big tree of files can be a pain. Re-packaging the files into a single file helps with that, and in addition to JSRP you can find tools, including OpenAPI-specific tools that handle this exact use case.

In the second sub-case, you will have $ids. You should (in fact, SHOULD per the spec) have $ids at the root of each separate schema file. Of course, in OpenAPI 3.0, $id is not supported; however, as we are actively working on converging OpenAPI 3.1 (or 4.0) and the forthcoming JSON Schema draft, this will likely change, so that the bundling feature would work the same there.

And technically, even if a separate file does not have an $id, if it is bundled it should be assigned an $id from the retrieval URI (because that is how RFC 3986 requires calculating base URIs).
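
To sketch that (with hypothetical URIs, and assuming the $ref-takes-a-schema proposal from this issue), a schema living in its own file such as:

    {
        "$id": "https://example.com/schemas/user",
        "type": "object",
        "properties": {
            "name": { "type": "string" }
        }
    }

could be bundled into a referencing document without losing its identity:

    {
        "$id": "https://example.com/schemas/main",
        "properties": {
            "owner": {
                "$ref": {
                    "$id": "https://example.com/schemas/user",
                    "type": "object",
                    "properties": {
                        "name": { "type": "string" }
                    }
                }
            }
        }
    }

The $id inside the bundled schema preserves the unique identifier that code generation tooling needs.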

So the more interesting case here is the 2nd use case: bundling multiple files into a single file, without necessarily dereferencing internal references.

I hadn't really thought of this, but it is a very important distinction and explains the difference in opinion between @gregsdennis and myself. Removing internal references is kind of a weird thing: there are use cases, but they mostly involve implementations that are limited in some way and do the internal de-referencing as a workaround.

The bundling use case is far more significant (and has gotten discussed in various ways repeatedly over the years in numerous issues in this repository).

I think the bundling use case, because it would involve $ids, is the one to focus on. And in that case, you would not lose the unique identifier necessary to support code generation.

gregsdennis commented 5 years ago

I still maintain that this is a detail of one or more implementations that don't conform to the spec and that this change is trying to accommodate these implementations. I think it's a bad move that opens the door to other changes that are initiated by implementors who don't want to do the work to conform to an established (almost, but not-quite) standard.

Will existing implementations (that do schema referencing correctly) now be expected to accept a schema (as well as a pointer) as a valid value of the $ref keyword as part of this change? (Suppose we allow this, and my implementation gets one of these pre-dereferenced schemas.) If so, that's where I have an issue with this.

If file bundling is really the problem that you're trying to solve, that's a transport/storage issue, and not one that JSON Schema needs to concern itself with.

Implementations that manage references need to do so in accordance with the spec. If they want to "bundle"/dereference the files into a single file as part of their process, then that's an implementation detail and the implementor's prerogative. If they require the schema author to do so (whether manually or through some secondary tooling), then they need to clearly specify that in their documentation as a deviation from the spec (e.g. "External references are not processed by this validator...").


Additionally, my implementation operates in a similar way to @mkistler's in that it tracks each schema as a distinct object, each with its own document path. The references link two documents.

handrews commented 5 years ago

@gregsdennis

I still maintain that this is a detail of one or more implementations that don't conform to the spec and that this change is trying to accommodate these implementations.

I find this confusing. $ref is part of the spec. Prior to this draft, it had a well-defined transclusion process for dereferencing. In the JSON Reference specification, the option of literally replacing the reference object was explicitly allowed.

You have decided that transclusion is not part of the spec, which goes against various discussions where we have talked about specifying the transclusion process, particularly the allOf approach in the spec. I can dig up issues if I really have to.

As far as I am concerned, based on wording in past drafts and discussions in issues during the past year, transclusion is part of the spec. Stop trying to invalidate other people's usages.

Bundling is NOT an implementation detail. It is something that schema authors/distributors care about. It is something that I have used in my professional use of JSON Schema. You don't get to dismiss it because you don't personally have a use for it.

handrews commented 5 years ago

This has already gotten too long and confusing. I am going to re-file and consolidate the various objections and arguments.

jdesrosiers commented 5 years ago

+1 for @awwright's idea about allowing a URL anywhere a schema is allowed. If you do that, there is no need for $ref at all. More importantly, it completely solves the original problem. People were making incorrect assumptions about sibling properties of $ref, but a string can't have siblings of any kind, so there's no room for confusion.

The argument that it could be problematic having schemas be represented by more than one type is valid, but that ship has already sailed. A schema can already be represented as either an object or a boolean, so why not a string as well?

This also means that the semantics of references don't need to change. You don't need to define behavior with respect to allOf or anything else. References can work just like they always did except without the problematic bits. As a bonus, it would reduce the verbosity of JSON Schema. It's certainly worth consideration.

Here's an example of what it would look like if { "$ref": "url" } was replaced by "url".

{
  "type": "object",
  "properties": {
    "foo": "#/definitions/foo",
    "bar": {
      "title": "Bar",
      "allOf": ["#/definitions/bar"]
    }
  },
  "definitions": {
    "foo": { "type": "string" },
    "bar": { "type": "number" }
  }
}

gregsdennis commented 5 years ago

@jdesrosiers the only issue with that is @handrews's point that it's still important to know where a dereference occurred, especially with external references. Hence the "$ref as a no-op" thing.

handrews commented 5 years ago

I'm pulling together a more comprehensive proposal. This thread has shown where the vision is missing or poorly articulated, and I need to spend some time laying it out correctly.

handrews commented 5 years ago

I have included a CREF discussing the problem I'm trying to solve in this issue in #780.

If I can't get buy-in, we can ship draft-08 with that CREF rather than fix this. I think it will cause problems, but it's apparently going to take a lot to convince anyone else that this is a problem and I think I'm about at the "give up" stage. And really, when in doubt, get user feedback 😛

Relequestual commented 5 years ago

As a reference for others viewing this, I find @handrews' previous comment quite a compelling argument:

But there is a key principle here (@Relequestual this is the most concise way to describe the problem so far):

Schema semantics should be robust to transclusion.

Secondarily, the transclusion process should be a function of the core vocabulary only, because only the core vocabulary is guaranteed to be supported by all implementations.

The allOf solution does not handle this.

It is not part of the core vocabulary (it is part of the applicator vocabulary)
It breaks annotations whose semantics depend on being adjacent to a reference, as I demonstrated in my example
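As a hypothetical illustration of the second point (the URL and the documentation-generator behavior here are assumptions for the example, not taken from the spec): suppose a documentation generator treats a title adjacent to $ref as an override of the referenced schema's title. Compare the reference with its allOf-based inlining:

```json
{ "title": "Shipping address", "$ref": "https://example.com/schemas/address" }
{ "title": "Shipping address", "allOf": [{ "title": "Address", "type": "object" }] }
```

In the second form, nothing marks "Shipping address" as an override of the transcluded "Address" title; it is indistinguishable from an annotation sitting beside an ordinary allOf composition, so the adjacency-dependent semantics are lost.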
Relequestual commented 5 years ago

(also for my own reference): I agree with @gregsdennis that this change caters to implementations that need it, but also with @handrews that schema authors and tooling want it, and that it would make things easier.

I'll have to find and comment on the new associated issue or PR, but I want to hear more about the potential cost to implementers. If it's minimal, it could be good.

My inclination is that we SHOULD gather more feedback on such a change and not push it into draft-08.

jdesrosiers commented 5 years ago

I realized that this issue is more about how the changes to $ref in draft-08 change the nature of inlining. That's why this change is necessary.

Before, the following two were equivalent.

{ "$ref": "urn:uuid:ee564b8a-7a87-4125-8c96-e9f123d6766f" }
{ "$id": "urn:uuid:ee564b8a-7a87-4125-8c96-e9f123d6766f", "type": "string" }

But, with the changes to $ref, that's not possible anymore. A reference is no longer { "$ref": "..." }; a reference is the value of a keyword called $ref (the URI). That changes the dynamics significantly. Now, inlining a reference means replacing the URI, not the whole object. Therefore, the following two are equivalent.

{ "title": "foo", "$ref": "urn:uuid:ee564b8a-7a87-4125-8c96-e9f123d6766f" }
{ "title": "foo", "$ref": { "$id": "urn:uuid:ee564b8a-7a87-4125-8c96-e9f123d6766f", "type": "string" } }

Given the new definition of $ref, I think this is the only way the inlining concept makes sense. It's not about appeasing any particular implementation; it's a side effect of the changes to $ref. Otherwise it would have to be the following, and that's not inlining; that's a different schema.

{ "title": "foo", "allOf": [{ "$id": "urn:uuid:ee564b8a-7a87-4125-8c96-e9f123d6766f", "type": "string" }] }

handrews commented 5 years ago

@jdesrosiers exactly. And (to address a question likely to be asked) the reason to follow through on this change to make inlining make sense again rather than roll back the change, is that allowing annotations in the same schema object as $ref allows us to use those annotations to layer further interpretation semantics on top of $ref. Which is particularly useful in generative use cases such as code generation, documentation rendering, or UI generation.

Preserving $ref after inlining (by allowing $ref to take the schema value) ensures that we can still benefit from this sort of semantic annotation even when the schema is inlined (a.k.a. transcluded in more formal hypermedia parlance).
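As a concrete (hypothetical) example of such layered semantics, with the schema names and generator behavior being illustrative assumptions rather than spec requirements: a code generator might treat a title adjacent to $ref as the name of a generated type wrapping the referenced schema. Because $ref itself survives inlining, the reuse boundary stays visible:

```json
{
  "title": "CustomerId",
  "$ref": {
    "$id": "https://example.com/schemas/uuid",
    "type": "string",
    "format": "uuid"
  }
}
```

Had the reference been rewritten to an allOf instead, the generator could no longer distinguish "a named alias of the shared uuid schema" from an ordinary composition.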

about-code commented 5 years ago

I am sceptical about inlining dereferenced documents directly as the value of $ref.

Where I work, we use $ref a lot to reference primitive (business data) types. These data types are very common across applications and describe the required validation rules for basically every schema property. For example, we have validation rules for types such as international bank account number (IBAN) or country codes.

Business data types are subject to reuse and are thus put into a separate schema. They are referenced from the actual model schemas. One model can have multiple properties referencing the same business data type (see the country example below). Some business data types are enumerations with over a hundred possible values, e.g. there are ~192 (?) countries on earth (and many more cultures). And this is just one business data type among many quite large enumerations.

Due to issues with OpenAPI tooling handling $refs, we were forced to write a very simple script to inline these business data types into our model schema (by replacing $ref). With our naive tool and the example schemas below, inlining meant that every reference to the country type got its own inlined definition, right where it was originally referenced. Unsurprisingly, such replication of business data type definitions resulted in explosive growth. While the original model had only ~120 properties ($refs) and a few more lines of code, the single-file schema had a hundred times more (twelve thousand (!) lines), mainly due to repetitive (naive) inlining of large enumeration types.

If OpenAPI had better support for referencing primitive types, then our first attempt to reduce file size would have been to inline business type definitions just once into the definitions section and rewrite the original $refs. I know this is not going to work for the spec without resolving questions like namespacing and probably others. But just to make the point: isn't it becoming a serious scalability problem if inlining means making dereferenced schemas the value of $ref?

I wonder whether I am missing something, since no one has proposed a keyword like $includes, plus rewriting the formerly external $ref to an internal $ref, as a means of inlining.

entities.schema.json

{
  "$schema": "...",
  "type": "object",
  "description": "Customer",
  "properties": {
     "iban": { "$ref": "./business-data-types.schema.json#/definitions/iban" },
     "country_of_birth": {"$ref": "./business-data-types.schema.json#/definitions/country"},
     "residence": { "$ref": "#/definitions/Residence" },
     "...many more": "..."
   },
   "definitions": {
"Residence": {
          "properties": {
            "street": { "$ref": "./business-data-types.schema.json#/definitions/streetname"},
            "country": {"$ref": "./business-data-types.schema.json#/definitions/country"},
            "...many more": "..."
          }
       }
   }
}

business-data-type.schema.json

{
  "$schema": "...",
  "description": "Business Data Type Definitions",
  "definitions": {
    "iban": {
      "type": "string",
      "pattern": "..."
    },
    "country": {
      "type": "string",
      "oneOf": [
        { "enum": ["en_US"], "title": "USA"},
        { "enum": ["en_GB"], "title": "Great Britain"},
        { "enum": ["de_DE"], "title": "Germany"},
        { "enum": ["de_AT"], "title": "Austria"},
        { "enum": ["de_CH"], "title": "Switzerland"},
        { "...hundreds more": "..." }
      ]
    }
  }
}

Do you really want this? (only country is dereferenced in this example)

entities.single.schema.json

{
  "$schema": "...",
  "type": "object",
  "description": "Customer",
  "properties": {
    "iban": { "$ref": "./business-data-types.schema.json#/definitions/iban" },
    "country_of_birth": {
      "$ref": {
        "type": "string",
        "oneOf": [
          { "enum": ["en_US"], "title": "USA"},
          { "enum": ["en_GB"], "title": "Great Britain"},
          { "enum": ["de_DE"], "title": "Germany"},
          { "enum": ["de_AT"], "title": "Austria"},
          { "enum": ["de_CH"], "title": "Switzerland"},
          { "...hundreds more": "..." }
        ]
      }
    },
    "residence": { "$ref": "#/definitions/Residence" },
    "...many more": "..."
  },
  "definitions": {
    "Residence": {
      "properties": {
        "street": { "$ref": "./business-data-types.schema.json#/definitions/streetname"},
        "country": {
          "$ref": {
            "type": "string",
            "oneOf": [
              { "enum": ["en_US"], "title": "USA"},
              { "enum": ["en_GB"], "title": "Great Britain"},
              { "enum": ["de_DE"], "title": "Germany"},
              { "enum": ["de_AT"], "title": "Austria"},
              { "enum": ["de_CH"], "title": "Switzerland"},
              { "...hundreds more": "..." }
            ]
          }
        },
        "...many more": "..."
      }
    }
  }
}
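For contrast, here is a sketch of the bundling approach described above (an assumed transformation, not a spec feature): each external business data type is copied once into definitions, and the external $refs are rewritten to internal ones, so even an enumeration with hundreds of values appears only once no matter how many properties reference it:

```json
{
  "$schema": "...",
  "type": "object",
  "description": "Customer",
  "properties": {
    "iban": { "$ref": "#/definitions/iban" },
    "country_of_birth": { "$ref": "#/definitions/country" },
    "...many more": "..."
  },
  "definitions": {
    "iban": { "type": "string", "pattern": "..." },
    "country": {
      "type": "string",
      "oneOf": [
        { "enum": ["en_US"], "title": "USA" },
        { "...hundreds more": "..." }
      ]
    }
  }
}
```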
handrews commented 5 years ago

@about-code please take a look at #788 which I just posted last night. I think it will allay your fears, as I think I found a way to make clear that we are not trying to encourage what you are worried about.

Rather, there is a specific type of multi-file vs single-file transformation that is worth supporting.

BTW I am still thinking on this and might come up with yet another option that removes the need for a $ref with a schema value, so #788 might not be the last word after all.

handrews commented 5 years ago

@about-code OK, I was able to remove the need for a more complex $ref. It might have useful properties, but I can solve the parts that I need to solve without it for this draft. If it comes up again due to feedback, we'll reconsider (re-reconsider?).

See the latest commit added to #780 and also PR #792