Flexible re-use: deferred keywords vs schema transforms

handrews commented 6 years ago

NOTE: The goal of this is to find something resembling community consensus on a direction, or at least a notable lean in one direction or another from a large swath of the community.

We are not trying to discredit either idea, although we all tend to lurch in that direction from time to time, myself included. What we need is something that more people than the usual tiny number of participants would be willing to try out.

The discussion here can get very fast-paced. I am trying to periodically pause it to allow new folks, or people who don't have quite as much time, to catch up. Please feel free to comment requesting such a pause if you would like to contribute but are having trouble following it all.

This proposal attempts to create one or more general mechanisms, consistent with our overall approach, that will address the "additionalPropeties": false use cases that do not work well with our existing modularity and re-use features.

TL;DR: We should look to the multi-level approach of URI Templates to solve complex problems that only a subset of users require. Implementations can choose what level of functionality to provide, and vocabularies can declare what level of support they require.

Existing implementations are generally Level 3 by the following list. Draft-07 introduces annotation collections rules which are optional to implement. Implementations that do support annotation collection will be Level 4. This PR proposes Level 5 and Level 6, and also examines how competing proposals (schema transforms) impact Level 1.

EDIT: Deferred keywords are intended to make use of subschema results, and not results from parent or sibling schemas as the original write-up accidentally stated.

Level 1: Basic media type functionality. Identify and link schemas, allow for basic modularity and re-use
Level 2: Full structural access. Apply subschemas to the current location and combine the results, and/or apply subschemas to child locations
Level 3: Assertions. Evaluate the assertions within a schema object without regard to the contents of any other schema object
Level 4: Annotations. Collect all annotations that apply to a given location and combine the values as defined by each keyword
Level 5: Deferred Assertions. Evaluate these assertions across all subschemas that apply to a given location
Level 6: Deferred Annotations. Collect annotations and combine them with existing level 4 results as specified by the keyword. Deferred annotations may specify override rules for level 4 annotations when it comes to level 4 annotations collected from subschemas

A general JSON Schema processing model

With the keyword classifications developed during draft-07 (and a bit further in #512), we can lay out a conceptual processing model for a generic JSON Schema implementation.

NOTE 1: This does not mean that implementations need to actually organize their code in this manner. In particular, an implementation focusing on a specific vocabulary, e.g. validation, may want to optimize performance by taking a different approach and/or skipping steps that are not relevant to that vocabulary. A validator does not necessarily need to collect annotations. However, Hyper-Schema relies on the annotation collection step to build hyperlinks.

NOTE 2: Even if this approach is used, the steps are not executed linearly. $ref must be evaluated lazily, and it makes sense to alternate evaluation of assertions and applicability keywords to avoid evaluating subschemas that are irrelevant because of failed assertions.

Process schema linking and URI base keywords ($schema, $id, $ref, definitions as discussed in #512)
Process applicability keywords to determine the set of subschema objects relevant to the current instance location, and the logic rules for combining their assertion results
Process each subschema object's assertions, and remove any subschema objects with failed assertion from the set
Collect annotations from the remaining relevant subschemas

There is a basic example in one of the comments.

Note that (assuming #512 is accepted), step 1 is entirely determiend by the Core spec, and (if #513 is accepted) step 2 is entirely determined by either the Core spec or its own separate spec.

Every JSON Schema implementation MUST handle step 1, and all known vocabularies also require step 2.

Steps 3 and 4 are where things get more interesting.

Step 3 is required to implement validation, and AFAIK most validators stop with step 3. Step 4 was formalized in draft-07, but previously there was no guidance on what to do with the annotation keywords (if anything).

Implementations that want to implement draft-07's guidance on annotations with the annotation keywords in the validation spec would need to add step 4 (however, this is optional in draft-07).

Strictly speaking, Hyper-Schema could implement steps 1, 2, and 4, as it does not define any schema assertions to evaluate in step 3. But as a practical matter, Hyper-Schema will almost always be implemented alongside validation, so a Hyper-Schema implementation will generally include all four steps.

So far, none of this involves changing anything. It's just laying out a way to think about the things that the spec already requires (or optionally recommends).

To solve the re-use problem, there are basically two approaches, both of which can be viewed as extensions to this processing model:

Deferred processing

To solve the re-use problems I propose defining a step 5:

Process additional assertions (a.k.a. deferred assertions) that may make use of all subschemas that are relevant at the end of step 4. Note that we must already process all existing subschema keywords before we can provide the overall result for a schema object.

EDIT: The proposal was originally called unknownProperties, which produced confusion over the definition of "known" as can be seen in many later comments. This write-up has been updated to call the intended proposed behavior unevaluatedProperties instead. But that name does not otherwise appear until much later in this issue.

This easily allows a keyword to implement "ban unknown properties", among other things. We can define unevaluatedProperties to be a deferred assertion analogous to additionalProperties. Its value is a schema that is applied to all properties that are not addressed by the union over all relevant schemas of properties and patternProperties.

There is an example of how unevaluatedProperties, called unknownProperties in the example, would work in the comments. You should read the basic processing example in the previous comment first if you have not already.

We could then easily define other similar keywords if we have use cases for them. One I can think of offhand would be unevaluatedItems, which would be analogous to additionalItems except that it would apply to elements after the maximum length items array across all relevant schemas. (I don't think anyone's ever asked for this, though).

Deferred annotations would also be possible (which I suppose would be a step 6). Maybe something like deferredDefault, which would override any/all default values. And perhaps it would trigger an error if it appears in multiple relevant schemas for the same location. (I am totally making this behavior up as I write it, do not take this as a serious proposal).

Deferred keywords require collecting annotation information from subschemas, and are therefore somewhat more costly to implement in terms of memory and processing time. Therefore, it would make sense to allow implementations to opt-in to this as an additional level of functionality.

Implementations could also provide both a performance mode (that goes only to level 3) and a full-feature mode (that implements all levels).

Schema transforms

In the interest of thoroughly covering all major re-use proposals, I'll note that solutions such as $merge or $patch would be added as a step 1.5, as they are processed after $ref but before all other keywords.

These keywords introduce schema transformations, which are not present in the above processing model. All of the other remaining proposals ($spread, $use, single-level overrides) can be described as limited versions of $merge and/or $patch, so they would fit in the same place. They all still introduce schema transformations, just with a smaller set of possible transformations.

It's not clear to me how schema transform keywords work with the idea that $ref is delegation rather than inclusion (see #514 for a detailed discussion of these options and why it matters).

[EDIT: @epoberezkin has proposed a slightly different $merge syntax that avoids some of these problems, but I'm leaving this part as I originally wrote it to show the progress of the discussion]

If $ref is lazily replaced with its target (with $id and $schema adjusted accordingly), then transforms are straightforward. However, we currently forbid changing $schema while processing a schema document, and merging schema objects that use different $schema values seems impossible to do correctly in the general case.

Imposing a restriction of identical $schemas seems undesirable, given that a target schema maintainer could change their draft version indepedent of the source schema maintainer.

On the other hand, if $ref is delegation, it is handled by processing its target and "returning" the resulting assertion outcome (and optionally the collected annotation). This works fine with different $schema values but it is not at all clear to me how schema transforms would apply.

@epoberezkin, I see that you have some notes on ajv-merge-patch about this but I'm having a bit of trouble following. Could you add how you think this should work here?

Conclusions

Based on my understanding so far, I prefer deferred keywords as a solution. It does not break any aspect of the existing model, it just extends it by applying the same concepts (assertions and annotations) at a different stage of processing (after collecting the relevant subschemas, instead of processing each relevant schema on its own). It also places a lot of flexibility in the hands of vocabulary designers, which is how JSON Schema is designed to work.

Schema transforms introduces an entirely new behavior to the processing model. It does not seem to work with how we are now conceptualizing $ref, although I may well be missing something there. However, if I'm right, that would be the most compelling argument against it.

I still also dislike that arbitrary editing/transform functionality as a part of JSON Schema at all, but that's more of a philosophical thing and I still haven't figured out how to articulate it in a convincing way.

I do think that this summarizes the two possible general approaches and defines them in a generic way. Once we choose which to include in our processing model, then picking the exact keywords and behaviors will be much less controversial. Hopefully :-)

erayd commented 6 years ago

I like deferred keywords as a concept, but they do not obviate my need for schema transforms.

My primary use-case for transforms is re-use of a schema fragment, with the ability to override some of the keywords. To take a trivial example, using {"type": "integer", "maximum": 5}, but with a higher maximum, is currently impossible and requires a lot of copy / paste that reduces maintainability.

erayd commented 6 years ago

Also for the record, I think that $ref should not be related in any way to schema transforms. It should be an immutable delegation (i.e. essentially a black-box function call).

handrews commented 6 years ago

@erayd I don't see that type of transform- arbitrarily slicing up and combining schema fragments- as within the scope of JSON Schema. Although that view is certainly debatable.

To apply arbitrary transforms to JSON like that has nothing to do with JSON Schema. There is no awareness needed of the source or target being schemas or having particular keyword behavior. You're just manipulating JSON text at a raw level. That is why I see it as out of scope- there is simply nothing that requires it to be part of JSON Schema at all.

This is different from $ref where it's simply not possible to have a usable system without some mechanism for modularity and cyclic references. The media type would be useless for any non-trivial purpose without it. However, it's always possible to refactor to avoid schema transforms, and frankly if anyone submitted a PR on a schema doing "re-use" by what is essentially textual editing, I'd send it back.

The violation of the opacity of $ref (which it seems at least you, @epoberezkin, and me all prefer) means that it is inviting a huge class of unpredictable errors due to unexpected changes on the target side. Your result across a regular delegation-style $ref may change in ways that you can't see or predict, but you have established an interface contract- I am referring to whatever functionality is identified by the target URI.

With arbitrary editing, there is no contract. Your snipping a bit of JSON and doing something with it, which may or may not have anything to do with its original purpose in the target document. It still just makes no sense to me.

handrews commented 6 years ago

Hopefully others can talk about how their use cases line up with these proposals. The primary use cases that I remember (OO-style inheritance for strictly typed systems, and disambiguating multiple annotations) can both be solved by deferred keywords.

So I would be particularly interested in use cases that stop short of "I want to be able to do arbitrary transforms regardless of schema-ness" but are beyond what can be addressed with deferred keywords.

erayd commented 6 years ago

@handrews

I don't see that type of transform- arbitrarily slicing up and combining schema fragments- as within the scope of JSON Schema.

It doesn't have to be. I think it just makes more sense to define it as part of JSON schema in order for JSON schema to have a standard and consistent way of solving the problem. To my mind, this is fundamentally a preprocessing step, and could easily be defined as a separate, referenced standard (e.g. perhaps JSON schema specifies that the last step of core processing before applying $ref is to transform based on Transform Spec XYZ). That would solve the underlying problem, but without cluttering up the JSON schema spec with it.

With arbitrary editing, there is no contract. Your snipping a bit of JSON and doing something with it, which may or may not have anything to do with its original purpose in the target document. It still just makes no sense to me.

I guess I see it as forming a new contract at the point of reuse, rather than trying to preserve whatever that piece of schema may have been doing before.

As an OOP example, defining a child class and then overriding one of the parent methods does not result in a child class that is guaranteed to behave in the same manner as the parent - but it allows for multiple children that share some of their behavior without having to redefine that behavior inside every child class.

...OO-style inheritance for strictly typed systems... can be solved by deferred keywords.

Are you able to clarify that a bit? Because even in strictly typed OO inheritance, the behavior in a child class can still override the parent and break whatever behavioral assumptions you may be making based on how the parent works. The only guarantee you have is that the types are the same [Ed: and that the methods etc. exist].

In my ideal world, any reuse mechanism would be applied before $ref is processed. This enforces the '$ref-is-a-black-box` approach, and makes the outcome much easier to reason about.

erayd commented 6 years ago

Also for what it's worth, I care more about $ref being opaque than I care about having a transform mechanism. If it comes down to it, I'd rather have no transform mechanism at all than compromise $ref.

handrews commented 6 years ago

@erayd I don't consider violations of the Liskov Substitution Principle to be proper OO modeling. Once you break the parent's interface contract you're just doing random stuff and the programmer can't reason about the type system in any consistent way.

I'd like to avoid going down a rathole on this before anyone else has had a chance to weigh in. These issues rapidly get too long for most people to read, and this one is long to start with. If you want to argue about type systems let's take it to email (you can get mine off the spec, at the bottom of the document) and see if we can leave space for others to cover their own use cases here.

erayd commented 6 years ago

Fair call - let's switch to email.

Anthropic commented 6 years ago

@handrews so the TL;DR would be "I want to add a step to the theoretical processing sequence so in future we can peg new keywords to that point in execution"? You mentioned deferred Default and unknownProperties, do you have many other examples/ideas for use cases?

handrews commented 6 years ago

@Anthropic basically, yeah. LOL my TL;DRs need TL;DRs.

It's not really intended to be theoretical- we would do this to add keywords immediately. I just want to settle on a why and how because I feel like arguing over all of the concrete keywords in this area didn't get us anywhere useful. Just a huge pile of conflicting proposals that people voted for in weird patterns that didn't resolve anything.

unknownProperties is pretty compelling on its own. How much time have we spent on "additionalProperties": false + "allOf" not working the way people think it should? unknownProperties solves that. I mean, we all agreed after the vote-a-rama that solving that alone would be sufficient to publish a draft-08. It's why people were OK with deferring the discussion out of draft-07.

I did make up the deferredDefault thing as a way to think about how this would solve the $use use cases (#98). One problem with default is that if you end up applying different default values to the same property across multiple branches of an allOf, which default do you use? deferredDefault would say "ignore any regular defaults that might have been stuffed in there somewhere and use this.deferredDefaultis not a good name and the use case is not well-developed, but it's relevant. Same issue fortitleanddescription`. I can see ways to solve those without deferred keywords, but there's a possible class of things to consider there.

handrews commented 6 years ago

Here's an example of the overall process:

{
  "title": "an example",
  "description": "something that can be a number or a string",
  "anyOf": [
    {
      "description": "the number is for calculating",
      "type": "integer",
      "examples": [42]
    },
    {
      "description": "strings are fun, too!",
      "type": "string",
      "examples": ["hello"]
    }
  ]
}

NOTE: Again, this is not necessarily how an implementation would or should work in terms of step order

So for step 1, there's nothing to do b/c there are no $id or $ref keywords (nothing's changed about this step so I'm leaving it out).

Step 2 is to determine what's applicable, which means looking for keywords like anyOf. In this case, we have three schema objects that are applicable: each of the objects within the anyOf, plus the parent object containing the anyOf. If we identify these with URI fragment JSON Pointers, the set is ("#/anyOf/0", "#/anyOf/1", "#")

Step 3 is to evaluate assertions. Let's assume an instance of 100.

An integer is valid against "#/anyOf/0", so keep it
It is not valid against "#/anyOf/1", so remove that schema object from the set
anyOf ORs it's results, so overall the instance is valid against "", so keep the root object

So now our set is ("#/anyOf/0", "#")

Step 4 is to collect annotations. By default, multiple annotations are put in an unordered list, while examples values are flattened into a single list (this is all in draft-07). So if we made a JSON document out of annotations it would be something like:

{
  "title": ["an example"],
  "description": [
    "something that can be a number or a string",
    "the number is for calculating"
  ],
  "examples": [42]
}

I'll do another example showing the deferred keyword stuff next.

handrews commented 6 years ago

This example illustrates how deferred keywords work, using unknownProperties. You should read the previous comment's example first.

{
  "type": "object",
  "properties": {
    "required": ["x"],
    "x": {"type": "boolean"}
  },
  "allOf": [
    {
      "if": {
        "properties": {
          "x": {"const": true}
        }
      },
      "then": {
          "properties": {
            "required": ["y"],
            "y": {"type": "string"}
          }
      },
      "else": {
        "properties": {
          "required": ["z"],
          "z": {"type": "integer"}
        }
      }
    },
    {
      "patternProperties": {
        "^abc": true
      }
    }
  ]
}

Assuming an instance of {"x": true, "y": "stuff", "abc123": 456}, after going through our first three steps, we end up with the following schema objects in the set:

("#/allOf/0/if", "#/allOf/0/then", "#/allOf/0", "#/allOf/1", "#")

Now of course, if we put "additionalProperties": false in the root schema, the whole thing falls apart. We can't have a valid instance without "x", but depending on "x" we're also required to have either "y" or "z". But that addlProps would only 'see' property "x", so having either "y" or "z" would fail validation. So there are no valid instances if you do that. But what if we have deferred keywords and unknownProperties?

{
  "type": "object",
  "properties": {
    "required": ["x"],
    "x": {"type": "boolean"}
  },
  "allOf": [{...}, {...}],
  "unknownProperties": false
}

So now we once again consider our set that we have after step 3. There are no annotation keywords in this schema document, so there's nothing to do for step 4. But we have a deferred keyword, so we have a step 5 to consider.

Unlike immediate keywords at step 3, which can only work in each schema object separately, deferred keywords can look across the whole set of relevant schema objects.

This is because we cannot know the full relevant set until after step 3 is complete. So step 3 can't depend on knowing the set that it determines.

However, step 5 can. We go into step 5 knowing our full set of relevant schema objects. So, as specified by unknownProperties in the first comment of this issue, we take a look at the union of all properties and patternProperties:

"#/anyOf/0/if" defines "x"
"#/anyOf/0/then" defines "y"
"#/anyOf/1" defines the pattern "^abc"
"#" defines "x"

So the known properties are "x", "y", and any property matching pattern "^abc".

This means that our instance

{"x": true, "y": "stuff", "abc123": 456}

is valid, but

{"q': "THIS SHOULDN'T BE HERE", "x": true, "y": "stuff", "abc123": 456}

is not. Which is the behavior people have been asking for LITERALLY FOR YEARS.

handrews commented 6 years ago

Another idea for implementing deferred keywords is to have a core keyword, $deferred, which is an object where all deferred keywords live. I'm not sure if that actually makes implementation (including choosing an implementation level that may stop short of deferred keywords) easier or not. But I'll leave it here in case folks have thoughts on it.

{
    "allOf": [{...}, {...}],
    "$deferred": {
        "unknownKeywords": false
    }
}

instead of

{
    "allOf": [{...}, {...}],
    "unknownKeywords": false
}

handrews commented 6 years ago

With $deferred you could even use the same keyword as an immediate assertion and just apply it across all relevant schemas. This wouldn't make a difference for most keywords (e.g. maximum has the same effect whether immediate or deferred), but additionalProperties and additionalItems would have well-defined modified behavior, as explained for the proposed unknownProperties and unknownItems.

Again, not sure if this is more or less confusing. Just thinking out loud about different ways to manage this, so that folks have some more concrete options to consider.

handrews commented 6 years ago

@erayd and I have been having a fantastic side discussion about OO design, subtyping, merge/patch, and other related ideas. He'll post his own summary when he gets a chance, but I wanted to copy over some key points about why merge/patch as optional functionality is hard even though we're perfectly happy to have format be optional for validating.

TL;DR:

format is still useful to applications even when validated, and failing to validate it has no further impact on the current processing model
failing to validate format can impact deferred keywords if we add them, but it is easy to add an extensablity hook for applications to register their own handlers (and many implementations do this)
$merge/$patch cannot safely be ignored, as the document may not make any sense, and the implementation cannot make assumptions about how useful the document is even if it is otherwise a valid schema
Because of lazy evaluation and $ref, you can't make a simple callback to implement schema transforms in your application- it needs to be something more like a co-routine which is much more difficult.

Annotating Assertions

format, contentMediaType, and contentEncoding are what I call annotating assertions, where the assertion part of the functionality is optional. Since we have never had a formal specification about what to do with annotations before draft-07, that's more or less been viewed as making the whole keyword optional.

But the nature of format and content* is that even if the validator ignores them, they still convey all of the information needed to validate them up to the application. The application can choose to do its own validation. So even when validation is not implemented, these keywords are still useful.

Validating them is also somewhere between challenging and impossible (for instance, there is no perfect regex for validating email addresses). So even when format is supported it's not as strong of a guarantee as something like maxLength. And content* is even harder to validate in any general sense.

Callback Extensibiilty

Annotating assertions are handled at steps 3 (assertion) and 4 (annotation) of the processing model. Most existing implementations provide only steps 1-3. Instead of step 4 (only defined in draft-07, and still optional), most implementations assume the application will find and use annotations however it wants to.

Let's say we have this schema (yes, I know that oneOf would work and avoid at least one problem, but it doesn't illustrate my point as well, just roll with it please):

{
  "type": "string",
  "anyOf": [
    {
      "type": "string",
      "format": "email",
      "title": "email username"
    },
    {
      "pattern": "^[a-z]\\w*[\\w\\d]$",
      "title": "basic username"
    }
  ]
}

If we have handrews as the instance, then a level 3 implementation will correctly accept that as valid, whether it supports validating the "email" format or not.

A level 4 implementation that validates the "email" format will return an annotation set of

{"title": ["basic username"]}

while one that does not validate format will return an annotation set of

{"title": ["basic username", "email username"], "format": ["email"]}

(recall that format is also an annotation, and annotation values are collected as unordered arrays).

So we see that not implementing format can cause a problem in a level 4 implementation. However, this can be avoided in implementations that allow registering a callback or something similar for format. The implementation makes the callback while processing level 3, and then moves on to level 4 just fine. There are interoperability concerns, but basically this is easily managed if we want to manage it.

Extra Level Extensibility

The whole deferred keyword proposal (level 5) relies on the idea that adding a later processing step is an easy extension model. For that matter, so did defining an algorithm for collecting annotations (level 4) in draft-07. All existing level 3 (assertions) implementations are still valid without having to change anything at all. They can add support for the new levels or not, and it's easy to explain what level of support is provided.

Level 1 Extensibility Challenges

This doesn't work when you change level 1, which is what schema transforms such as $merge and $patch do. You can only process level 2 (applicability) once you have resolved references and executed any transforms. Because references require lazy evaluation, so do transforms, and you are likely to bounce back and forth between the two. Your transform almost always references at least one part by $ref, and that part may itself include another transform which uses a $ref, etc.

So you can't just ignore the transforms, because the schemas you pass to level 2 are flat out wrong. But you can't just provide a simple callback for the keyword because level 1 processing is more complex- your application-side callback would need to call back into the JSON Schema implementation when it hits $ref.

Also, real implementations will go back and forth among levels 1, 2, and 3, because you can't find all $refs without examining applicability keywords, and you can't determine which subschemas are worth recursing into without checking assertions. Inserting schema transform processing into this as an application-side extension would be very challenging.

This, in addition to conflicting with $ref-as-delegation, is why $merge and $patch are not suitable for handling as extensions. Obviously you can do it (see ajv-merge-patch), but it's complex (see ajv-merge-patch's disclaimers about evaluation context).

epoberezkin commented 6 years ago

@handrews re $merge/$patch: it is a pre-processing step, so it's not step 1.5, it's step 0 that should happen before anything else. Ignore the way it's defined in ajv-merge-patch, it uses $refs to essentially include schemas, which is not consistent with the delegation model. So if we add it it should have a different syntax.

@erayd some of the re-use ideas can be better implemented with $params (#322) than with $merge.

unknownProperties is the same idea as banUnknownProperties mode, but as a schema keyword. The presence of compound keywords (anyOf etc.) complicates the definition of "known properties" though. The way @handrews proposes it, it seems that what is known will depend on the actual data, that from my point of view leads to non-determinism, potential contradictions (I need to think about the example) and, at the very least, inefficiency. For example, if the idea is that property is known only if it passes validation by subschema where the property is defined, then ALL branches of "anyOf" should be validated, you cannot short-circuit it (probably the idea of collecting annotations suffers from the same problem).

I think that for all real use-cases it would be sufficient (and better) if "unknownProperties" operated on the predefined set of properties that does not depend on the validated data and can be obtained via the static schema analysis that would require traversal of referenced schemas as well, but would not traverse them more than once to correctly handle recursion. In this case we would avoid deferred processing at all and keep it simple while achieving the same benefits.

The example above would treat as known properties x, y, z, abc*, regardless the data, and if some additional restrictions need to be applied (e.g. make y and z mutually exclusive) it can easily be achieved by adding some extra keywords.

If we defined unknownProperties based on static schema analysis we would break shallowness principle but at least not the processing model.

Still, some pre-processing syntax I find more useful and less ambiguous than deferred data-dependent keywords and even than statically defined keywords that require deep schema traversal to define their behaviour, even though it can result in invalid schemas (e.g. from combining different drafts). It can be either generic $merge or more specialised syntax, either for extending properties of the nearest parent schema or for merging properties from the child subschema.

If my memory is correct $merge also received the most votes. I guess people like it because it is simple, clearly defined, introduces no ambiguity in the results and solves both the extension and other problems.

handrews commented 6 years ago

@epoberezkin I'm going to respond in pieces to several of your points. Feel free to lay out new points here, but for back and forth on things we've stated here let's please do that on email to avoid overwhelming everyone else (a frequent complaint we've both heard rather often). We can each come back with summaries of the offlist discussion, as @erayd and I are also doing.

handrews commented 6 years ago

@epoberezkin said:

For example, if the idea is that property is known only if it passes validation by subschema where the property is defined, then ALL branches of "anyOf" should be validated, you cannot short-circuit it (probably the idea of collecting annotations suffers from the same problem).

Yes, that limitation on short-circuiting has been in the spec explicitly for two drafts now, and has always been implicit in the definition of the meta-data keywords. We've just never required validators to collect annotations (nor do we in draft-07, we just state how to do it for implementations that wish to do so).

The no-short-circuit requirement explicitly defined for validation in draft-07 of Validation, Section 3.3: Annotations, in particular Section 3.3.2 Annotations and Short-Circuit Validation. I do hope you at least skimmed the table of contents during the full month that spec was posted for pre-publication feedback. There were at least three PRs on the topic, at least two of which were open for the standard 2 week feedback period.

In draft-06 it was in the section on [defining how hyper-schema builds on validation], which I would not particularly have expected you to pay attention to as you don't implement Hyper-Schema. But it's really always been there for annotations. For example, if you want to find a possible default, you have to look everywhere for it.

Validation has never been required to do this and still is not required. That is the point of the opt-in multi-level proposal. A Level 3 validator such as Ajv can be much faster than a Level 4 annotation-gathering validator. That's great! Many people would rather have speed. The set of people who need complex annotation gathering is relatively small, and implementation requirements for validation should not be constrained by their use cases.

However, all hyper-schema implementations need to be Level 4. Or else they just don't work. I can go into this in more detail, but static analysis produces incorrect results. While I'm generally willing to defer to you on validation itself, you do not implement hyper-schema and have never expressed any interest in doing so. I have put a lot of thought into that. So if you want to convince me that static analysis is sufficient, you are going to have to dig deep into Hyper-Schema (which, essentially, is just a rather complex annotation) and demonstrate how it could work statically.

But I only have a link if the instances matches the relevant schema. That's been part of Hyper-Schema since the beginning. I'm just making the implications more clear.

handrews commented 6 years ago

@epoberezkin

I think that for all real use-cases it would be sufficient (and better) if "unknownProperties" operated on the predefined set of properties that does not depend on the validated data and can be obtained via the static schema analysis

You are asserting that this would be better without explaining any such benefit to the end user. Every non-trivial example I have ever seen (or wished worked myself) requires dynamic evaluation. That's the whole point. People show up all the time saying things like "how do I get X to be legal with Y but illegal with Z" in some complex arrangement that prevents them from just writing that out directly.

If you want to propose an alternative, that's fine, but you need to solve the problems that people have. And we have a lot of data on that.

handrews commented 6 years ago

@epoberezkin asserts

$merge/$patch: it is a pre-processing step, so it's not step 1.5, it's step 0 that should happen before anything else. Ignore the way it's defined in ajv-merge-patch, it uses $refs to essentially include schemas, which is not consistent with the delegation model. So if we add it it should have a different syntax.

Please provide an algorithm for how this would work. @erayd and I have spent some time trying to make it work and have not been able to find an approach that is consistent with lazily-evaluated $ref-as-delegation. I'm not going debate a vague assertion that will work with "a different syntax." You need to provide the details.

If my memory is correct $merge also received the most votes.

It was both strongly popular and strongly anti-popular: in other words, sharply divisive. Additionally, I was clear up front that the vote was not a binding majority rule vote. The goal of voting was to determine which were viable and which were not, and it succeeded in removing several non-viable proposals.

handrews commented 6 years ago

@epoberezkin I want to emphasize again that the multi-level opt-in approach is specifically designed to allow performance-sensitive validators to ignore all of this by simply implementing up to Level 3, and noting that expensive Level 4 and 5 features do not fit with that implementations goals.

Similarly, validators that wish to implement Level 5 can note that they will be slower than Level 3 validators, with the tradeoff of offering additional functionality.

Users who already need to incur costs for Level 4 (notably Hyper-Schema) will find that additional cost to implement Level 5 is negligible.

Users who like to have some extra checks during development but want speed in production can use a Level 5 validator while testing, and a Level 3 validator in production.

This approach provides an easily-describable range of tradeoffs for both users and implementors, which will allow a variety of implementation choices WITHOUT fragmenting the specification in incompatible ways. We see this working with URI Templates.

We use Level 4 URI Templates. JSON Home uses Level 3 as it does not have a real use case for the 4th level's features. Other place use Level 2, or the very easy and fast to implement Level 1. But it works well as a cohesive standard because the tradeoffs are clear at each level.

I'll stop here, as I think I've responded to the key points.

epoberezkin commented 6 years ago

The simple syntax for $merge

{
  "properties": {
    "foo": true
  },
  "additionalProperties": false,
  "$merge": ["#/definitions/bar", "#/definitions/baz"],
  "definitions": {
    "bar": {
      "properties": {
        "bar": true
      }
    },
    "baz": {
      "properties": {
        "baz": true
      }
    }
  }
}

The above schema after processing $merge will become:

{
  "properties": {
    "foo": true,
    "bar": true,
    "baz": true
  },
  "additionalProperties": false,

}

No $refs will be processed during merge. The decision to make is whether the references inside inserted blocks should be changed to full uris (to keep them pointing to the same locations) or if they should be left as they are (to allow different definitions used with the same schema). Maybe both options can be supported.

That solves the problem without introducing extra complexity on top of level 4.

handrews commented 6 years ago

@epoberezkin thanks, that's helpful. I need to think it through but I do think I see what you're getting at here.

handrews commented 6 years ago

@erayd would @epoberezkin's new $merge proposal above work with your use case as a strict preprocessor/build step as we tried (but failed) to get the original $merge proposal to do? That would be a very useful real data point either way.

epoberezkin commented 6 years ago

The alternative simple syntax for "x&y or x&z" problem:

{
  "baseProperties": {
    "x": true
  },
  "anyOf": [
    {
      "extraProperties": {
        "y": true
      },
      "additionalProperties": false
    },
    {
      "extraProperties": {
        "z": true
      },
      "additionalProperties": false
    }
  ]
}

where "baseProperties" is not doing anything on its own and "extraProperties" includes properties for whatever purposes (validation, annotation or whatever) from "baseProperties" in the parent schema.

erayd commented 6 years ago

@handrews Provided the following points are both true, then yes:

$merge runs as a preprocessing step when a schema document is loaded, including external $ref targets. This means that it must run within the implementation, rather than as an external step, otherwise there is a risk of $ref targets then containing an unhandled $merge.
Overriding properties is allowed - e.g. if a later $merge conflicts with an earlier one, or with the base being merged onto, then the later one prevails.

@epoberezkin I'm not sure whether this is what you meant or not, as the example you provided is ambiguous on those points - are you able to clarify?

epoberezkin commented 6 years ago

$merge runs as a preprocessing step when a schema document is loaded, including external $ref targets.

Yes (assuming you mean that $merge in referenced schemas will be processed).

Overriding properties is allowed

I thought the schemas of the same property will be deep-merged too. Although shallow merge is fine too - it solves all real problems.

Alternative syntax for merge:

{
  "$merge": {
    "schema": <schema or uri-reference>,
    "with": <uri-reference or array of uri-references>
  }
}

The main problem with the original proposal for $merge is that it needed $ref as part of its syntax and it made it very confusing.

handrews commented 6 years ago

@epoberezkin baseProperties doesn't scale well with arbitrarily complex schemas where you have multiple levels with the possibility of additionalProperties anywhere. I examined this extensively in #119 $combine and it was horrific to implement.

handrews commented 6 years ago

@epoberezkin to clarify, your real argument here is that you don't like the way deferred properties have to be implemented correct? Or do you see merge as solving real, commonly needed (e.g. can't just be refactored) problems that deferred properties cannot address?

I want to keep some clear focus on which concerns are about implementation and which concerns are about functionality.

epoberezkin commented 6 years ago

Both. Implementation is doable, but unnecessary complex (+2 complex steps) and makes simple validation impossible. $merge is more generic and solves more problems by adding a simple pre-processing step (+1 simple step).

handrews commented 6 years ago

@epoberezkin It only adds 1 not-all-that complex steps for implementations that collect annotations.

"solves more problems" needs to be more clear. "Can arbitrarily scramble json into any random garbage" is not, to me, a feature or a problem that needs solving.

erayd commented 6 years ago

As @handrews mentioned above, we've been having a bit of a side discussion on this via email - here's a summary of some of the major points that came out of that discussion.

Most cases that would warrant a fully-fledged merge / patch implementation are more sensibly solved via refactoring. There are actually very few real-world cases that actually require it, and those issues may be political in origin, rather than due to technical limitations.
Adding merge / patch breaks some OO principles of being able to treat the child as a parent instance and assuming behavioral equivalence. I personally feel that being able to occasionally violate principles in favour of easier maintainability is a handy thing to have, but I've changed my mind about its necessity. After some fairly thorough discussion, my position now is that, while useful, I don't see it as badly needed enough to push hard for its inclusion, and if JSON schema were to continue without it I would be OK with that outcome.
Merge / patch, if implemented, cannot be strictly preprocessed - it needs to be lazily evaluated, because $ref is lazily evaluated. This means it needs to be part of the implementation, rather than something the user can do beforehand. The closest it would be possible to come to a preprocessor-type implementation would be to preprocess each schema document at load-time (i.e. when an external $ref target is loaded, it is then preprocessed for merge / patch). This has more overhead than handling each merge instance where it's encountered, but is easier to reason about - @handrews, does this solve some of the concern you had regarding that?
Deferred keywords can handily solve the "additionalProperties": false problem, without requiring merge / patch at all (i.e. by limiting unhandled properties).
Merge / patch cannot degrade gracefully - unlike format, it can't be delegated to the user as an annotation, and it can't be ignored without breaking what is likely to be critical schema functionality. Realistically, this means that if it becomes part of the spec, then the choice is either to implement it, or hard fail upon encountering it.
If separately declared vocabularies were to become a thing, then merge / patch could be specified as a separate vocabulary, which would make things much clearer in terms of what is required. A schema containing merge / patch would need to declare upfront that it relied on that vocabulary.

@handrews, if I've missed (or misunderstood) anything significant from our discussion in the points above, please feel free to jump in with that.

epoberezkin commented 6 years ago

for implementations that collect annotations

which is none of the validators need to do for validation.

"solves more problems" needs to be more clear

@erayd wrote about replacing keywords - essentially parametrisation. I wrote above about the ability to use the same schema with different definitions.

Can arbitrarily scramble json into any random garbage

That reminds me an argument about why Function constructor should never be used in JavaScript. As long as you know what you are doing, $merge will produce the results you expect. It is definitely easier to understand and predict the behaviour of the schema generated by $merge than to predict which properties are "known" in the new paradigm.

epoberezkin commented 6 years ago

@erayd: Merge / patch, if implemented, cannot be strictly preprocessed

You answered yourself how it can be:

to preprocess each schema document at load-time (i.e. when an external $ref target is loaded, it is then preprocessed for merge / patch)

erayd commented 6 years ago

@epoberezkin

Yes (assuming you mean that $merge in referenced schemas will be processed).

Yes, this is what I meant.

I thought the schemas of the same property will be deep-merged too. Although shallow merge is fine too - it solves all real problems.

In order to actually be useful for me, merge / patch would need to be capable of either overriding an existing property definition (rather than deep-merging it), or provide some kind of whiteout mechanism. Basically, any approach that allows me to say "ignore the original definition of propertyX and use this one instead".

erayd commented 6 years ago

You answered yourself how it can be

That's not strict preprocessing; it's dependent preprocessing. If it were strictly a preprocessing step, the user would be able to preprocess it before invoking the validator on it. Because of $ref, that's impossible.

epoberezkin commented 6 years ago

Because of $ref, that's impossible.

There are two possibilities here:

You know all your schemas in advance so you can preprocess them all before validation.
You analyse all $refs in the schema before validation, then load all schemas that are needed, then preprocess them all, again, before validation

handrews commented 6 years ago

for implementations that collect annotations

which is none of the validators need to do for validation.

Yes, @epoberezkin, we're aware of your hostility to other uses of JSON Schema. Do recall that some of us do things other than validation. The reason for separating this into levels is to allow you to continue to do the parts you care about while allowing the rest of us to move forward.

epoberezkin commented 6 years ago

That's not a hostility. I've just pointed out that it is not one, but two more steps, because validators neither do nor need to collect annotations.

handrews commented 6 years ago

It is definitely easier to understand and predict the behaviour of the schema generated by $merge than to predict which properties are "known" in the new paradigm.

That runs counter to all of the people who have asked about or supported solving this in exactly the way that the "known" approach works. This is just the first time anyone's managed to lay out an algorithm for what it means. But people have been demanding it for years. You can also fix it with a buzz saw but that doesn't make it better.

erayd commented 6 years ago

You know all your schemas in advance so you can preprocess them all before validation.

You analyse all $refs in the schema before validation, then load all schemas that are needed, then preprocess them all, again, before validation

Both of those are well into the realm of dependent preprocessing, and as such logically belong inside the implementation, rather than something that we can reasonably expect the user to do. If it were easily doable as pure preprocessing, none of us would be having this argument in the first place!

epoberezkin commented 6 years ago

@erayd yes, I agree it should be inside implementation,

handrews commented 6 years ago

Paging some other maintainers who have implemented draft-06: @erosb, @santhosh-tekuri, @gregsdennis, @korzio

handrews commented 6 years ago

As a reminder to everyone (myself very much included), the goal is not to bludgeon one idea or the other into submission, it's to find a clear community preference for one direction or the other (or some combination, or some new idea). We're all basically locked in this "room" (a.k.a. draft-08) until that happens :-)

As part of that, those of us who post a lot need to try to show some restraint and allow other folks to catch up and comment. I think I've made my points for this round- @epoberezkin and @erayd I encourage/request that y'all make at most one more go of it today and then let's let others consider it for a few days before we pick back up again. We can throw more arguments at each other over email in the meantime, of course!

gregsdennis commented 6 years ago

@erayd

To take a trivial example, using {"type": "integer", "maximum": 5}, but with a higher maximum, is currently impossible and requires a lot of copy / paste that reduces maintainability.

Can we not put the $ref keyword alongside any of the other keywords? I seem to recall seeing some tests that do that. (I may be lying.) If that's the case, then wouldn't defining maximum alongside a $ref to your original definition act as an override?

erayd commented 6 years ago

@gregsdennis Pretty sure the spec says that if $ref is present, then all its siblings must be ignored.

handrews commented 6 years ago

@erayd @gregsdennis let's hold off on $ref syntax stuff, or put it over in #514. If #514 resolves in favor of delegation, then there's no longer any reason to ban adjacent properties, but please let's keep these issues focused. They are complex as it is.

handrews commented 6 years ago

@gregsdennis: @erayd and I discussed that case a lot offlist. It can be solved with refactoring in 95% or so of cases. This particular example involved legal limitations, which I argue are rare enough and case-by-case unique enough to be outside of what the spec really needs to address.

gregsdennis commented 6 years ago

@handrews

This proposal attempts to create one or more general mechanisms, consistent with our overall approach, that will address the "additionalPropeties": false use cases that do not work well with our existing modularity and re-use features.

Do you have some examples of how additionalProperties does not work well? I'm having trouble understanding what the problem actually is and, therefore, how a tiered approach resolves it. There was a little light shed with your discussion of unknownProperties, but a bit more problem description (or links to where that description is) would be appreciated.

handrews commented 6 years ago

@gregsdennis LOL. The additionalProperties thing is the single most common complaint that we get here so I forget that there is anyone in the JSON Schema universe that isn't plagued by this.

This is the "problem" (it actually works exactly as intended, but many users just don't like it):

{
    "type": "object",
    "allOf": [
        {"properties": {"foo": true}},
        {"properties": {"bar": true}}
    ],
    "properties": {"baz": true},
    "additionalProperties": false
}

This instance is valid: {"baz": 1234} This instance is not: {"foo": "a", "bar", "xyz", "baz": 1234}

This is because current schema assertions (which I call Level 3 in my processing model) only take their immediate schema object into account.

This is irrelevant to most assertions: {"minLength": 10} doesn't interact with any other keyword. You can evaluate it without knowing anything else. It doesn't even interact with "maxLength". You evaluate them independently. If you set "maxLength" to be less than "minLength" then all strings will fail validation, but you can calculate that by checking each on its own. You don't first analyze the schema and see that they are inverted- you just check them against the instance separately.

However, "additionalProperties" is defined in terms of "properties" and "patternProperties". (Likewise, "additionalItems" is defined in terms of "items", but it's hardly ever used so it's rare to get complaints about it).

In the example above, that "additionalProperties": false can only "see" the "properties" keyword in its immediate schema object, and that keyword only defines "baz". It cannot "see" the "properties" keywords in the "allOf" subschemas.

People tend to write schemas like the above because they want validation to fail if there are misspelled or unexpected properties. They want to use "allOf" as something like OO inheritance (which is not actually what it does- it ANDs constraints and therefore reduces the set of valid schemas, while OO inheritance usually adds functionality). And in strongly typed languages they want to nail down the set of properties ahead of time.

"additionalProperties" is really intended to describe objects allowing any property names but having a uniform constraint on the values. When that constraint is the false boolean schema, the result is that no properties are allowed. Its interaction with "properties" and "patternProperties" allows special-casing specific properties or properties matching a particular name, and applying a uniform constraint on all other properties.

It's very rare that people actually use "properties" + "patternProperties" + "additionalProperties" for that general-constraint-with-special-casing that it was actually designed to do.

So the issue here is that in order to do what people often want, we need to be able to "see" all of the other applicable schema objects and define a keyword in relation to the contents of all of those schema objects.

Level 4 (collection annotations) already defines a way to find all of those relevant schema objects, so the proposal of a Level 5 builds on that by adding assertion keywords that work across all of those objects (regardless of which of the objects contains the Level 5 keyword). These keywords are "deferred" in the sense that they are not evaluated as assertions until after the (optional) Level 4 work of collecting relevant schemas. I'm not attached to the "deferred" terminology, it just made more sense than the other names I could come up with.

This is attempting to apply the Rule of Least Power to the problem. There are some related problems around disambiguating annotations (if multiple default values are relevant, but are not the same, which is really the default behavior?)

The alternative solution approach is schema transforms, of which there have been numerous proposals. $merge and $patch are the original ones. If you're feeling brave, you can read #15 for that backstory (you'll see that I was at one point in favor, but then @awwright convinced me otherwise, in part by citing the Rule of Least Power.

Schema transforms let you slice out any bit of a schema and combine it with any other bit of schema. They are potentially infinitely powerful, depending the exact proposal ($patch relies on the application/json-patch+json media type processing rules and is the most powerful variation- $merge, at least as originally proposed, relied on application/merge-patch+json). Their interaction with $ref has always been problematic, going all the way back to https://github.com/json-schema/json-schema/issues/120 which preceded #15 (that repo became locked when, largely as a result of arguing over $merge/$patch, the previous group of editors abandoned the project.

Let me know if the problem is still unclear.

json-schema-org / json-schema-spec