json-schema-org / json-schema-spec

The JSON Schema specification
http://json-schema.org/

Flexible re-use: deferred keywords vs schema transforms #515

Closed handrews closed 6 years ago

handrews commented 6 years ago

NOTE: The goal of this is to find something resembling community consensus on a direction, or at least a notable lean in one direction or another from a large swath of the community.

We are not trying to discredit either idea, although we all tend to lurch in that direction from time to time, myself included. What we need is something that more people than the usual tiny number of participants would be willing to try out.

The discussion here can get very fast-paced. I am trying to periodically pause it to allow new folks, or people who don't have quite as much time, to catch up. Please feel free to comment requesting such a pause if you would like to contribute but are having trouble following it all.


This proposal attempts to create one or more general mechanisms, consistent with our overall approach, that will address the "additionalProperties": false use cases that do not work well with our existing modularity and re-use features.


TL;DR: We should look to the multi-level approach of URI Templates to solve complex problems that only a subset of users require. Implementations can choose what level of functionality to provide, and vocabularies can declare what level of support they require.

Existing implementations are generally Level 3 by the list below. Draft-07 introduces annotation collection rules, which are optional to implement. Implementations that do support annotation collection will be Level 4. This issue proposes Level 5 and Level 6, and also examines how competing proposals (schema transforms) impact Level 1.

EDIT: Deferred keywords are intended to make use of subschema results, and not results from parent or sibling schemas as the original write-up accidentally stated.

A general JSON Schema processing model

With the keyword classifications developed during draft-07 (and a bit further in #512), we can lay out a conceptual processing model for a generic JSON Schema implementation.

NOTE 1: This does not mean that implementations need to actually organize their code in this manner. In particular, an implementation focusing on a specific vocabulary, e.g. validation, may want to optimize performance by taking a different approach and/or skipping steps that are not relevant to that vocabulary. A validator does not necessarily need to collect annotations. However, Hyper-Schema relies on the annotation collection step to build hyperlinks.

NOTE 2: Even if this approach is used, the steps are not executed linearly. $ref must be evaluated lazily, and it makes sense to alternate evaluation of assertions and applicability keywords to avoid evaluating subschemas that are irrelevant because of failed assertions.

  1. Process schema linking and URI base keywords ($schema, $id, $ref, definitions as discussed in #512)
  2. Process applicability keywords to determine the set of subschema objects relevant to the current instance location, and the logic rules for combining their assertion results
  3. Process each subschema object's assertions, and remove any subschema objects with failed assertions from the set
  4. Collect annotations from the remaining relevant subschemas
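
As a rough illustration (my own sketch, not from the spec), consider how the keywords in this small schema map onto those steps:

{
    "$id": "https://example.com/vehicle",
    "type": "object",
    "required": ["wheels"],
    "title": "Vehicle",
    "allOf": [
        {"$ref": "#/definitions/wheeled"}
    ],
    "definitions": {
        "wheeled": {
            "properties": {"wheels": {"type": "integer"}}
        }
    }
}

Here $id, $ref, and definitions are handled in step 1; allOf and properties determine in step 2 which subschema objects apply to which instance locations; type, required, and the {"type": "integer"} assertion are evaluated in step 3; and title is collected as an annotation in step 4.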

There is a basic example in one of the comments.

Note that (assuming #512 is accepted), step 1 is entirely determined by the Core spec, and (if #513 is accepted) step 2 is entirely determined by either the Core spec or its own separate spec.

Every JSON Schema implementation MUST handle step 1, and all known vocabularies also require step 2.

Steps 3 and 4 are where things get more interesting.

Step 3 is required to implement validation, and AFAIK most validators stop with step 3. Step 4 was formalized in draft-07, but previously there was no guidance on what to do with the annotation keywords (if anything).

Implementations that want to implement draft-07's guidance on annotations with the annotation keywords in the validation spec would need to add step 4 (however, this is optional in draft-07).

Strictly speaking, Hyper-Schema could implement steps 1, 2, and 4, as it does not define any schema assertions to evaluate in step 3. But as a practical matter, Hyper-Schema will almost always be implemented alongside validation, so a Hyper-Schema implementation will generally include all four steps.

So far, none of this involves changing anything. It's just laying out a way to think about the things that the spec already requires (or optionally recommends).

To solve the re-use problem, there are basically two approaches, both of which can be viewed as extensions to this processing model:

Deferred processing

To solve the re-use problems I propose defining a step 5: process deferred assertions, which behave like ordinary assertions except that they are evaluated against the collected results of all relevant subschemas (the output of steps 2-4) rather than within a single schema object.

EDIT: The proposal was originally called unknownProperties, which produced confusion over the definition of "known" as can be seen in many later comments. This write-up has been updated to call the intended proposed behavior unevaluatedProperties instead. But that name does not otherwise appear until much later in this issue.

This easily allows a keyword to implement "ban unknown properties", among other things. We can define unevaluatedProperties to be a deferred assertion analogous to additionalProperties. Its value is a schema that is applied to all properties that are not addressed by the union, across all relevant schemas, of properties and patternProperties.
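
As a minimal sketch of the intended behavior (using the later name unevaluatedProperties; the exact semantics were still being worked out):

{
    "allOf": [
        {"properties": {"foo": true}},
        {"properties": {"bar": true}}
    ],
    "properties": {"baz": true},
    "unevaluatedProperties": false
}

An instance like {"foo": 1, "baz": 2} passes, because foo and baz are addressed by properties in relevant schemas, while {"qux": 1} fails, because no relevant schema addresses qux. Replacing unevaluatedProperties with additionalProperties would reject foo and bar as well, since additionalProperties only sees the properties of its own schema object.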

There is an example of how unevaluatedProperties, called unknownProperties in the example, would work in the comments. You should read the basic processing example in the previous comment first if you have not already.

We could then easily define other similar keywords if we have use cases for them. One I can think of offhand would be unevaluatedItems, which would be analogous to additionalItems except that it would apply to elements beyond the longest items array across all relevant schemas. (I don't think anyone's ever asked for this, though.)

Deferred annotations would also be possible (which I suppose would be a step 6). Maybe something like deferredDefault, which would override any/all default values. And perhaps it would trigger an error if it appears in multiple relevant schemas for the same location. (I am totally making this behavior up as I write it, do not take this as a serious proposal).


Deferred keywords require collecting annotation information from subschemas, and are therefore somewhat more costly to implement in terms of memory and processing time. Therefore, it would make sense to allow implementations to opt-in to this as an additional level of functionality.

Implementations could also provide both a performance mode (that goes only to level 3) and a full-feature mode (that implements all levels).

Schema transforms

In the interest of thoroughly covering all major re-use proposals, I'll note that solutions such as $merge or $patch would be added as a step 1.5, as they are processed after $ref but before all other keywords.

These keywords introduce schema transformations, which are not present in the above processing model. All of the other remaining proposals ($spread, $use, single-level overrides) can be described as limited versions of $merge and/or $patch, so they would fit in the same place. They all still introduce schema transformations, just with a smaller set of possible transformations.
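
For concreteness, here is a $merge sketch along the lines of the ajv-merge-patch plugin (the syntax shown is an assumption for illustration, since the exact keyword design was still under debate):

{
    "$merge": {
        "source": {"$ref": "http://reference.to/base"},
        "with": {
            "properties": {"baz": true}
        }
    }
}

The source schema is resolved, the with fragment is merged into it (JSON Merge Patch semantics), and the result replaces the $merge object before any other keywords are processed; hence "step 1.5".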


It's not clear to me how schema transform keywords work with the idea that $ref is delegation rather than inclusion (see #514 for a detailed discussion of these options and why it matters).

[EDIT: @epoberezkin has proposed a slightly different $merge syntax that avoids some of these problems, but I'm leaving this part as I originally wrote it to show the progress of the discussion]

If $ref is lazily replaced with its target (with $id and $schema adjusted accordingly), then transforms are straightforward. However, we currently forbid changing $schema while processing a schema document, and merging schema objects that use different $schema values seems impossible to do correctly in the general case.

Imposing a restriction of identical $schemas seems undesirable, given that a target schema maintainer could change their draft version independently of the source schema maintainer.
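
As a hypothetical illustration of the problem: exclusiveMinimum was a boolean modifier of minimum in draft-04 but became a standalone number in draft-06. Merging this draft-04 target:

{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "minimum": 0,
    "exclusiveMinimum": true
}

with a fragment written against draft-07, such as {"exclusiveMinimum": 5}, produces {"minimum": 0, "exclusiveMinimum": 5}, which is invalid against the declared draft-04 $schema and means something different if re-interpreted under draft-07.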

On the other hand, if $ref is delegation, it is handled by processing its target and "returning" the resulting assertion outcome (and optionally the collected annotation). This works fine with different $schema values but it is not at all clear to me how schema transforms would apply.

@epoberezkin, I see that you have some notes on ajv-merge-patch about this but I'm having a bit of trouble following. Could you add how you think this should work here?

Conclusions

Based on my understanding so far, I prefer deferred keywords as a solution. This approach does not break any aspect of the existing model; it just extends it by applying the same concepts (assertions and annotations) at a different stage of processing (after collecting the relevant subschemas, instead of processing each relevant schema on its own). It also places a lot of flexibility in the hands of vocabulary designers, which is how JSON Schema is designed to work.

Schema transforms introduce an entirely new behavior to the processing model. They do not seem to work with how we are now conceptualizing $ref, although I may well be missing something there. However, if I'm right, that would be the most compelling argument against them.

I also still dislike having arbitrary editing/transform functionality as part of JSON Schema at all, but that's more of a philosophical thing and I still haven't figured out how to articulate it in a convincing way.

I do think that this summarizes the two possible general approaches and defines them in a generic way. Once we choose which to include in our processing model, then picking the exact keywords and behaviors will be much less controversial. Hopefully :-)

gregsdennis commented 6 years ago

@handrews thank you, I understand what people are complaining about, but now that I do, I wonder if maybe they're just doing it wrong.

If I wanted to write a schema that ensured that

  1. foo and bar were existing properties with specific definitions
  2. baz can be a property with a specific definition
  3. No other properties are allowed

I'd write my schema differently:

{
    "properties" : {
        "foo" : true,
        "bar" : true,
        "baz" : true
    },
    "required" : [
        "foo",
        "bar"
    ]
}

Why are we trying to redefine schema when it already supports what people want?

To me, as an implementor, the intended behavior of this keyword is sufficient. Maybe it needs to be clarified in the spec, but I don't see a reason to change or amend it.

gregsdennis commented 6 years ago

I just realized two things that make my version wrong:

First, your example doesn't require foo and bar, it simply declares requirements on them if they do exist.

Second, this won't work if the allOf requirements are defined in separate schemas and referenced into this one.

{
    "type": "object",
    "allOf": [
        {"$ref": "http://reference.to/foo"},
        {"$ref": "http://reference.to/bar"}
    ],
    "properties": {"baz": true},
    "additionalProperties": false
}

In this case, the schema cannot be collapsed as I suggested.

handrews commented 6 years ago

@gregsdennis assume that the "allOf" subschemas are $ref'd in from elsewhere. It's just tedious to write three schema examples and ref them all together. If all of your schemas are inline and not intended for re-use, you can do anything you want with them.

However, think of the allOf subschemas as re-usable components located in a shared schema definitions file of some sort. The first project I used JSON Schema on had several engineering departments building a complex configuration system that involved about 20 schema files, which averaged 10-ish schema objects that were assembled in various ways with heavy use of anyOf/allOf/oneOf.

Each re-usable schema object is an encapsulated re-usable validation unit. For instance, it tells you whether something is a proper representation of a car or not. This is then used in 15 other schemas spread across four or five engineering teams.

So you can't do the kind of refactoring you suggest without massive and unmaintainable duplication.


As for me personally, I don't actually care about any of this- I never use "additionalProperties": false with re-usable schema objects. But I'm very much in the minority. There are probably 300+ github issue comments on this topic. #15 and #98 alone have about 100 each, and there are quite a few other issues related to it in this repo and the old one. [Edit- and this issue is already over 50 comments, so it's probably closer to 400 or even 500 comments and counting]

So... why are we doing it? Because people will seriously not stop complaining about it, and it's often cited as a reason to not use JSON Schema.

This is related to people using JSON Schema's Validation vocabulary for code generation. It is not suited to that task, so that's a huge challenge (as the OpenAPI folks can tell you). My solution to that is to modularize the pieces of the vocabulary (see #512 and #513), resolve complicated arguments around $ref and meta-schemas (#514), and have separate vocabularies for validation and code generation. There's even a proposal for code generation, but I have not had the bandwidth to engage with it properly (and neither has anyone else).

handrews commented 6 years ago

@gregsdennis LOL looks like you figured it out 30 seconds before I posted the explanation :-D

gregsdennis commented 6 years ago

I do see what you mean about the OO inheritance, though. Supposing we had these schemas:

{
    "$id" : "http://reference.to/foo",
    "properties" : {
        "common" : true,
        "foo" : true
    }
}

{
    "$id" : "http://reference.to/bar",
    "properties" : {
        "common" : true,
        "bar" : true
    }
}

Then, if we combined them with the referenced version I made above, only an instance with the properties common and baz would pass, whereas OO inheritance behavior would say that all four properties are allowed.

While unknownProperties would permit all four properties to exist, how far are we willing to take this? What if one of the subschemas also defines additionalProperties:false or unknownProperties:false? Does unknownProperties work on the document level, or only at the current level and those nested within it?

handrews commented 6 years ago

Deferred keywords in general work across the set of applicable schemas, which does not care about document boundaries. There are quite a few details to work out but they are, I think, manageable.

The purpose of this issue is not to work out all of the details (either with deferred keywords or with schema transforms), the purpose is to pick a direction.

To sketch out how some of those details would be resolved, multiple instances of unknownProperties in the same applicable schema set are fine, as they all work across the same set. It's like if you have patternProperties defined in multiple applicable schemas with the same patterns. The difference is that for Level 3 keywords, their outcomes are combined according to allOf/anyOf/oneOf/etc. rules, while Level 5 keywords are applied after those rules are processed. So the results of two different unknownProperties schemas in the same applicability set would be ANDed (this is the default behavior of JSON Schema result combination throughout the existing processing model).

I would expect schema systems that use "unknownProperties": false to avoid "additionalProperties": false, as they are really used for different things. Most uses of "additionalProperties": false would likely go away.

For something like "additionalProperties": {"type": "integer"}, "unknownProperties": false, we'd have to decide whether "additionalProperties" is ANDed with "unknownProperties" (because remember, "unknownProperties" takes a schema so it's not necessarily false), or whether the presence of "additionalProperties" means that all properties are now "known" and "unknownProperties" would be ignored.
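
Purely to illustrate the two readings (not to argue for either), take this hypothetical schema:

{
    "properties": {"foo": true},
    "additionalProperties": {"type": "integer"},
    "unknownProperties": false
}

Under the ANDed reading, a property bar would have to satisfy both {"type": "integer"} and the false schema, so it always fails. Under the "all properties are now known" reading, additionalProperties addresses bar, unknownProperties is ignored, and bar merely has to be an integer.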

PLEASE DO NOT DEBATE THIS DETAIL IN THIS ISSUE

If we go with deferred keywords, we'll sort all of this out. If we go with schema transforms or decide that we will never implement either of these options, it's a moot point.

gregsdennis commented 6 years ago

One aspect of this that I haven't seen commented on is which concept (deferred keywords or transformations) affords the least abuse.

It seems that a transformation (at least one based on basic JSON transformations like JSON Patch) can be easily abused to deform a schema into something that's not even a valid schema. This would be extremely bad.

Because deferred keywords work within the context of the schema without modifying it, there is little chance that it will be used in a way that degrades the purpose of the schema.

handrews commented 6 years ago

which concept (deferred keywords or transformations) affords the least abuse.

That's basically a different (but very useful) perspective on the Principle of Least Power- the reason you go for least power is that it is the least abusable. Usually :-)

It seems that a transformation (at least one based on basic JSON transformations like JSON Patch) can be easily abused to deform a schema into something that's not even a valid schema. This would be extremely bad.

This is pretty much my entire argument against schema transforms, but I took like 40 pages to say it. In my defense, I (and others) have said what you're saying here before and the responses tend to be "why does that matter" or "I don't care" or "but I actually prefer that".

@erayd identified the only use case that I know of that would truly require schema transforms, and that is when there are legal/ownership concerns that prevent refactoring. The last time we discussed that use case, we agreed it was not sufficient on its own to motivate adding schema transforms to the spec. We've kicked around various attempts at a pre-processing or extension model, but schema transforms are not really conducive to that, as discussed above.

Because deferred keywords work within the context of the schema without modifying it, there is little chance that it will be used in a way that degrades the purpose of the schema.

Yes, that was my goal :-)

It is definitely more expensive, but unlike schema transforms it is also possible to do as an opt-in extension, just as URI Template implementations may be anything from Level 1 (simple replacement of variables with simple values with no awareness of URI syntax) to Level 4 (expanding compound values in various ways that fit with URI syntax).

There's a definite implementation and performance cost to deferred keywords (and to collecting annotations for that matter). This is a large part of @epoberezkin's concerns about deferred keywords (and I believe this concern is independent of his reasons for supporting $merge/$patch but correct me if I'm wrong). It's a very valid concern, but it may be the best tradeoff.

epoberezkin commented 6 years ago

Another interesting discussion...

@handrews while OO inheritance usually adds functionality ... And in strongly typed languages they want to nail down the set of properties ahead of time.

That is consistent with what I wrote above - people want something like OO inheritance, i.e. some way to extend the properties defined in the inherited (=referenced) schema with additional (!) properties without triggering an additionalProperties error, regardless of where this additionalProperties: false is defined, in the parent (EDIT: that can be closed for changes) or in the child. The key difference between what people want and what you suggest is that they want a static way to extend properties (which is what OO is), independent of the actual data.

@gregsdennis: While unknownProperties would permit all four properties to exist, how far are we willing to take this? What if one of the subschemas also defines additionalProperties:false or unknownProperties:false? Does unknownProperties work on the document level, or only at the current level and those nested within it?

@handrews: Deferred keywords in general work across the set of applicable schemas, which does not care about document boundaries. There are quite a few details to work out but they are, I think, manageable.

@handrews: For something like "additionalProperties": {"type": "integer"}, "unknownProperties": false, we'd have to decide whether "additionalProperties" is ANDed with "unknownProperties" (because remember, "unknownProperties" takes a schema so it's not necessarily false), or whether the presence of "additionalProperties" means that all properties are now "known" and "unknownProperties" would be ignored. ... If we go with deferred keywords, we'll sort all of this out. If we go with schema transforms or decide that we will never implement either of these options, it's a moot point.

Without going into details of how: exactly the fact that we would have to work all of this out, and that it is highly unlikely that we will, makes me very unwilling to go in that direction. Unfortunately we cannot decide that "deferred keywords" are a good thing without working out all these details. So somebody has to work them out and present how it's going to work before the decision can be made.

@handrews "Can arbitrarily scramble json into any random garbage" @gregsdennis It seems that a transformation (at least one based on basic JSON transformations like JSON Patch) can be easily abused to deform a schema into something that's not even a valid schema. This would be extremely bad. @handrews This is pretty much my entire argument against schema transforms, but I took like 40 pages to say it. In my defense, I (and others) have said what you're saying here before and the responses tend to be "why does that matter" or "I don't care" or "but I actually prefer that".

I don't understand why the problems that are only relevant for $patch idea are automatically associated with $merge. This argument makes little sense to me because:

  1. I don't think it is possible to extend a valid JSON schema with a fragment of another JSON schema and have an invalid JSON schema as the result. @handrews can you provide an example when it can happen?
  2. Even if it is possible in some theoretical cases, I am still in the "but I actually prefer that" camp. It is much better to fail early because the schema is invalid than to have your validation incorrectly fail (or even worse, incorrectly pass) for some specific data where the dynamic collection of "known" properties does not behave as you expect. The number of tests you have to write for schemas would substantially increase, and it would be very difficult to debug and maintain.

All the arguments against the switch keyword, where a schema in a declarative document becomes an imperative one, apply here as well. unknownProperties: false is not a declaration, unfortunately; it's a very imperative instruction to collect properties from all downstream schemas based on whether the (sub-)instance fails or passes and make a decision based on this dynamic list of properties.

@handrews It is definitely more expensive, but unlike schema transforms it is also possible to do as an opt-in extension, just as URI Template implementations may be anything from Level 1 (simple replacement of variables with simple values with no awareness of URI syntax) to Level 4 (expanding compound values in various ways that fit with URI syntax). ... It's a very valid concern, but it may be the best tradeoff.

@handrews The analogy is an invalid one. This statement contradicts what you wrote before and has a single purpose to support the agenda that you drive. If it is true that a lot of people ask for it (and I believe it is true), then providing this solution on an opt-in basis is unacceptable. This should be part of the validation spec and it MUST be supported by all draft-0X compliant validators without requiring them to do dynamic stage-4 collection of anything. Otherwise it creates more problems than it solves.

@handrews: This is attempting to apply the Rule of Least Power to the problem.

I think it's the opposite. Deferred data-dependent keywords have more power and are fundamentally more complex (as their behaviour cannot be determined without the data instance) than $merge, which is defined statically and is consistent for all data instances.

To summarise, I think we are trying to limit the choices of the solution to the inheritance problem. The real choice at this point is not between unknownProperties and $merge, it is between "deferred keywords" (dynamic, data dependent) and "static extension". I do not understand why a simple $data, which is highly predictable and has a lot of usage and support, was debated for such a long time on the grounds that it is data-dependent, while at the same time a hugely ambiguous and potentially non-deterministic dynamic data-dependent concept of "deferred keywords" has to be decided on here and now.

"deferred keywords" is an interesting idea to consider, we can work on it and even maybe adopt at some point in the future, but it is absolutely not a solution to the problem of extension of "defined" properties. At the same time $merge is simple, static, deterministic, already used and solves this problem and some others. I don't see why the spec cannot follow implementations and instead, as @sam-at-github wrote in another issue, tries to drive them.

handrews commented 6 years ago

@epoberezkin Thank you for a really useful and detailed response. Before I address the contents I would like to note:

has a single purpose to support the agenda that you drive

I'm puzzled and somewhat distressed by this accusatory statement. I've put a great deal of effort into surveying as much of the user base as possible and recruiting people, particularly implementors, into this discussion.

There are basically two approaches dating back to the immediate post-draft-04 work, one of which (schema transforms in the form of $merge and $patch) was fairly clearly specified, and another ("ban unknown properties mode") was quite vague, at least as I read it. So I spent a great deal of effort coming up with a workable specification for how to implement it so we can choose between them.

Yes, I have a direction that I prefer. So do you. So do most people who comment on these issues. I don't appreciate the implication that I'm trying to force something on anyone. I originally wrote this up purely with deferred keywords, but decided that the only fair approach was to put the two major options in the same issue and focus the debate. So I did that to the best of my ability, knowing that you and other schema transform fans would be here to bolster the opposing view.

Mediating these discussions is a giant headache and I would dearly prefer to not do it. If I were dead set on pushing an "agenda" I would not be putting myself through this.

epoberezkin commented 6 years ago

I got carried away :) It was a contradiction in reasoning - "everybody wants it but let's provide it as opt-in". Sorry.

handrews commented 6 years ago

I got carried away :)

Thanks- I'm not really in a position to throw stones there, I suppose :)

handrews commented 6 years ago

contradiction in reasoning - "everybody wants it but let's provide it as opt-in"

I see your point. What I really mean here is that a great many people want something like this, but quite a few people are indifferent. And as we saw in the draft-07 vote-a-rama, some would actually prefer that this not be "solved".

In my view, both solutions have significant costs, although the nature of the cost is different for each. A solution that allows those who want a solution to opt-in to the cost while allowing those who do not care for it (such as myself) to opt-out seems ideal. This was my rationale for defining annotation collection and making it optional- some uses, such as Hyper-Schema, require annotation collection in some form, but users purely interested in validation would no doubt prefer faster validation over mandatory annotation collection.

There are a lot of strong and conflicting views here, driven by various use cases and best-practices concerns. I'm trying to balance the different positives and negatives without fracturing the community, which is what I believe would happen if you and I flipped a coin or something and just went with one or the other direction.

I don't see why the spec cannot follow implementations

To the best of my knowledge, yours is the only implementation that has added $merge and $patch, so that directive does not hold up. Ajv is popular, but it is not the only validator, or even the only JavaScript validator. In the .NET world, @gregsdennis's Manatee.Json package apparently averages 17 downloads a day, and he does not appear to have been plagued by users demanding the feature (although I don't know how many users of the overall package are using the JSON Schema part- @gregsdennis do you have any data on that?)

Also, of course Geraint's tv4 validator implemented "ban unknown properties mode" instead. While it doesn't look like he's updating it anymore, it was quite popular at one time.

Furthermore, important downstream systems, such as OpenAPI, seem unlikely to pick up something like schema transforms given the restrictions they already place on JSON Schema. I am hoping to get some of their steering group to comment here- I believe we should be trying to converge with them rather than diverge, so an optional feature would make it easier for groups like OpenAPI to have real compatibility rather than having to issue caveats about unimplemented required features.

In the other discussion you referenced (about following implementations), multiple implementations had already implemented the proposed change (and I suspect quite a few more implementations out there do so as well, whether they really thought about it or not). This is a very different situation. It's been controversial for years, and there is not an existing consensus among implementations about how to solve it.

erayd commented 6 years ago

@handrews

...some would actually prefer that this not be "solved"...

I must admit I'm also trending in that direction, as the more I think about it, the more the costs of either path look unacceptably high.

After that email discussion with you, I'm fairly firmly of the opinion that schema transforms are unnecessary. Making this even more of an issue, and something that I don't think has been raised yet (although with the amount of comment on this I may well have missed it), is that it's not possible to validate the resulting schema as a naive JSON instance. Depending on the rules around what may be modified and how, that may be mitigated to some extent (for example, if both the transform source and target can be independently validated), but even if that were the case, it still places significant constraints on the future development of the spec. For example, if the spec later decides to require all or none of the keywords in an interdependent set (e.g. if / else), that's not a rule we could specify, because any schema which defines those keywords on different merge branches can no longer be validated as a normal JSON instance.

On the other hand, deferred properties such as unknownProperties require expensive deep evaluation of the schema, which may be considerably more expensive than the complexity of the instance being validated really warrants, and requires implementations to have a deep collection stage even if all they care about is validation.

Of the two, the deferred properties approach seems to me to carry significantly less risk and causes fewer problems, but the cost factor worries me enough I'm hesitant to endorse that route without more thought about how that load might be reduced.

gregsdennis commented 6 years ago

@handrews do you have any data on that?

I have no data on who's using the current versions. Manatee.Trello is still pre-.Net-Standard, and as such it's relying on an older version. That usage doesn't involve schema at all, and that version only supports draft 4.


Is the desire for an OO inheritance usage the only thing we're trying to resolve here? If that's the case, maybe we can go about it differently.

Take into consideration my comment in #514 about the $ref keyword. If a reference schema could be clearly defined as a mere link to a separate, self-contained schema document, then allowing overrides on the other properties within a ref-schema could be a solution.

{
  "$schema" : "[draft 7]",
  "$ref" : "http://my.base/schema",
  "properties" : {
    "newProperty" : { "type" : "integer" }
  }
}

In this model, the $ref acts as a base, or starting point, and the other properties either augment or override those in the base.

The one caveat to this that I can think of is that every keyword would have to define its behavior when adjacent to a $ref. For this example, $schema would be overwritten, while properties would be augmented.

Further, we could add a couple of additional special keywords, such as override and augment (discussed below), that only activate when adjacent to $ref.

I think that this fits better in a declarative syntax; in many cases it is simpler and more intuitive, and it provides the functionality that people apparently desire and have been trying to bend the rules to achieve.

As a real-world example, the nonNegativeIntegerDefault0 definition from draft 7 could be rewritten from

{
  "allOf": [
    { "$ref": "#/definitions/nonNegativeInteger" },
    { "default": 0 }
  ]
}

to

{
  "$ref": "#/definitions/nonNegativeInteger",
  "default": 0
}

I do submit, however, that this does not address the "multiple inheritance"-like schema that @handrews provided earlier.

handrews commented 6 years ago

@erayd I pretty much agree with everything you're saying (in https://github.com/json-schema-org/json-schema-spec/issues/515#issuecomment-349764011)

Regarding the cost factor of deferred properties, I think there are two main cases:

  1. Use cases that otherwise only need the kind of validation available now. For these, the difference in cost is substantial.
  2. Use cases that already require collection of annotations (including hyper-schema and UI generation). With these, the difference in cost is negligible for the general case. If you are dealing with annotations that may be scattered across multiple schema objects, you have to find them somehow- the cost may be structured differently, but it's still there.

One thing to consider is that the vast majority of schemas aren't that complicated. Yes, I've seen some schemas where collecting all relevant schema objects would be quite expensive due to a dizzying array of nested *Of / conditional / not / whatever techniques that select among alternatives and add fields to each alternative, etc. But most schemas are pretty simple- select one of the following and add a field to it. Or maybe a short "inheritance"-ish chain of three schema references, each adding a field or two.

So I think it's important to make a distinction between the worst-case performance and the typical case. We don't have hard data on this, of course, but anecdotally I think the typical case is not that bad.

handrews commented 6 years ago

@gregsdennis if $ref is delegation then adjacent keywords don't require any special handling. It's always been the case that all assertions in a schema object are evaluated, and their results are ANDed together. If $ref is a delegation then it is an assertion like any other keyword and is handled the same way. There's no overriding, though. I don't want to get too deep into this as I'm really trying to keep #514 and #515 separate given the complexities of each.

As for override or augment or similar things, if you go through all of the various proposals linked to this one you'll find pretty much every conceivable variation on this sort of thing.

All of the proposals boil down to either schema transforms or something resembling deferred keywords. Most are schema transforms with varying degrees of power or limitations.

This is why I'm focusing this issue on picking one general approach over the other. Having a 10-way discussion of minor variations, each of which is someone's pet syntax, was... not working out well.

gregsdennis commented 6 years ago

@handrews my point is that there could be overriding/augmenting as a new feature to support the functionality desired. Maybe we need a new issue specifically to address the OO inheritance concept, but I thought we were trying to resolve that here.

You opened this issue with:

This proposal attempts to create one or more general mechanisms, consistent with our overall approach, that will address the "additionalProperties": false use cases that do not work well with our existing modularity and re-use features.

Neither potentially destructive transformations nor deferred keywords strike me as particularly "consistent with our overall approach."

We need to work on a solution to the presented problem as a priority. These continued discussions over the presented solutions reveal that neither of these options is desired, and we need to start looking at something else.

If we can devise a solution that fits within the existing validation paradigm, then I assert that we don't need deferred keywords or potentially destructive transformations, and we can avoid (or at least defer [free-of-charge dad joke]) having to have the discussion of which is better at all. Maybe neither is better; maybe they're just different.

Granted, my $ref solution sides a bit toward transformations, but I don't think that it allows a schema author to corrupt schema in any way, which was the primary argument against transformations.

I'm not saying that my solution/syntax is perfect, and I'm not necessarily attached to it. However, I'd prefer a solution that accomplishes the goal of providing this functionality within the existing understanding of how schema validation works, as was stated in the opening comment. If none can be found, I concede that we need to explore options outside of the existing paradigm; but transformations and deferred keywords are the only options that we've seen so far, and I'd like to see if we can come up with a few more.

I also agree with @epoberezkin in that we need to work out a solution before we can agree on it.

... we cannot make the decision that "deferred keywords" is a good thing without working out all these details.

erayd commented 6 years ago

@gregsdennis Unfortunately overrides don't actually solve the whole issue. The biggest roadblock to that kind of mechanism is the problem of catching instance properties which are not addressed in some way by the schema. Deferred keywords manage that, at significant cost. If the only issue is extension, then I'd argue that allowing siblings to a delegated $ref is enough, and doesn't need new keywords.

gregsdennis commented 6 years ago

@erayd you are correct in saying that my solution doesn't address finding all of the properties that could be in a JSON instance. However, it raises the question of whether, with a feature like this, we really need to perform that property aggregation to validate it.

As I read it, "additionalProperties" : false means that if a property is not listed in properties, it's not allowed. Period. There's no way to sneak in some kind of conditional property without also listing it in properties.

Currently, to conditionally add a property, an author may expect to be able to do something like

{
  "properties" : {
    "x" : { "type" : "string" }
  },
  "if" : {
    "properties" : {
      "x" : { "minLength" : 5 }
    }
  },
  "then" : {
    "properties" : {
      "y" : { "type" : "integer" }
    }
  },
  "else" : {
    "properties" : {
      "y" : { "type" : "boolean" }
    }
  },
  "additionalProperties" : false
}

But this would disallow property y because it's not listed in properties. However, simply adding "y" : true to the property list yields the expected behavior without having to perform any kind of deferred analysis. Yes, it's another property listing, but why it's required makes complete sense to me, so I don't complain about it.
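
A sketch of that fix, applied to the schema above:

{
  "properties" : {
    "x" : { "type" : "string" },
    "y" : true
  },
  "if" : {
    "properties" : {
      "x" : { "minLength" : 5 }
    }
  },
  "then" : {
    "properties" : {
      "y" : { "type" : "integer" }
    }
  },
  "else" : {
    "properties" : {
      "y" : { "type" : "boolean" }
    }
  },
  "additionalProperties" : false
}

Now additionalProperties sees y as listed, while the if/then/else branches still constrain its type.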

In the classic example,

{
    "type": "object",
    "allOf": [
        {"properties": {"foo": true}},
        {"properties": {"bar": true}}
    ],
    "properties": {"baz": true},
    "additionalProperties": false
}

you can add foo and bar to the property list in the same way. additionalProperties will then recognize their existence, and the allOf restrictions (assuming they're not actually true) would perform further validations.

I'm just saying that there may be alternatives to enable desired functionality that don't require the schema to bend over backward to support said functionality.

erayd commented 6 years ago

@gregsdennis patternProperties is a thing. However, I agree with the general point you're making, and personally I'm quite happy with additionalProperties continuing to work exactly as it does now. However, I know a number of people feel very strongly that the current behavior is problematic - so it's important to bear that in mind when considering this issue.

I'm still not sure why you're suggesting new keywords though. Can you explain what is conceptually different between your suggestion, and simply allowing siblings to $ref?

gregsdennis commented 6 years ago

override and augment were just a thought to consider keyword collision scenarios between the $ref'd schema and the siblings.

And, yes, patternProperties could also be used in place of explicitly listing them.

gregsdennis commented 6 years ago

@handrews what do you think of the idea of adding a possible value type to additionalProperties or propertyNames?

Currently, additionalProperties has to be a valid schema, including the true and false schemas. But prior to boolean schemas (i.e. drafts 3 & 4), the keyword was defined to either be a boolean or a valid schema, so maybe this idea isn't too much of a stretch.

If one of these keywords could also support an array of strings, those strings could be the names of the extra properties that would otherwise have to be discovered.
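
Under that hypothetical syntax (my sketch, not a worked-out proposal), the classic example might become:

{
    "type": "object",
    "allOf": [
        {"$ref": "http://reference.to/foo"},
        {"$ref": "http://reference.to/bar"}
    ],
    "properties": {"baz": true},
    "additionalProperties": ["foo", "bar"]
}

Here foo and bar are permitted by name, without the validator having to discover them from the referenced schemas.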

(If you can't tell, I'm pushing for a non-discovery/-deferred way to resolve this.)

epoberezkin commented 6 years ago

@handrews: A solution that allows those who want a solution to opt-in to the cost while allowing those who do not care for it (such as myself) to opt-out seems ideal.

Unfortunately it would be not for the users, but for validator implementers to decide whether they opt in or not. And the cost here is not only performance, but also implementing and maintaining this feature. As ajv will not have it, none of its users will be able to opt in.

@handrews: ...some would actually prefer that this not be "solved"...

@erayd: I must admit I'm also trending in that direction, as the more I think about it, the more the costs of either path look unacceptably high.

That is very different from the community sentiment. The problem presented by the inability to extend properties and the unexpected behaviour of additionalProperties is the number one source of issues for ajv (74 out of 440 issues mention additionalProperties; the FAQ is of little help) and the major subject of SO questions (search for the json-schema tag). So maybe some people here don't want it solved, but for users it seems to be the number one deterrent from using JSON Schema.

@erayd: it's not possible to validate the resulting schema [of merge] as a naive JSON instance.

That is not true. It is always possible to validate; in some cases the validation may fail, though, because of incompatible changes between specifications. And as I wrote, failing early is better than incorrectly passing validation, which would be the biggest problem with deferred keywords.

@handrews: In the .NET world, @gregsdennis's Manatee.Json package apparently averages 17 downloads a day and he does not appear to have been plagued by users demanding the feature

I don't know whether 17 is many or few for .NET, ajv has a few more, ajv-merge-patch (the plugin that defines keywords $merge and $patch) has 100 downloads a day.

@handrews: I believe we should be trying to converge with them rather than diverge, so an optional feature would make it easier for groups like OpenAPI to have real compatibility rather than having to issue caveats about unimplemented required features.

It would be good if they converged, but unless they want it, it won't happen. OpenAPI uses JSON Schema, so it's for them to decide which subset they implement. For some reason they don't seem to be interested and instead define keywords like "nullable" (which is just syntactic sugar for "type") and "discriminator" (which is very unusual). Version 3 of their spec has converged towards JSON Schema, though.

@gregsdennis: If a reference schema could be clearly defined as a mere link to a separate, self-contained schema document, then allowing overrides on the other properties within a ref-schema could be a solution.

It is much cleaner not to overload $ref with the second meaning of allowing overrides and instead have another keyword for exactly this purpose. E.g. $merge or $include. But I agree with the approach in principle, as long as it's not $ref.

Granted, my $ref solution sides a bit toward transformations, but I don't think that it allows a schema author to corrupt schema in any way, which was the primary argument against transformations.

It is a transformation. It does not allow corrupting the schema, in the same way $merge does not. They are the same, really.

@erayd: Unfortunately overrides don't actually solve the whole issue. The biggest roadblock to that kind of mechanism is the problem of catching instance properties which are not addressed in some way by the schema.

Why is that? To me it seems that it perfectly does, because it changes the list of known properties inside "properties" and therefore changes which properties will be handled by "additionalProperties" keyword.

@gregsdennis: We need to work on a solution to the presented problem as a priority. These continued discussions over the presented solutions reveal that neither of these options is desired, and we need to start looking at something else.

@gregsdennis: I'm not saying that my solution/syntax is perfect, and I'm not necessarily attached to it. However, I'd prefer a solution that accomplishes the goal of providing this functionality within the existing understanding of how schema validation works, as was stated in the opening comment.

I agree. @handrews should stop driving the deferred agenda! ;)

Can we maybe already bury the "deferred keywords" idea and focus on the problem indeed and how we solve it? Maybe even try to define it first, to begin with, in a static rather than dynamic way (i.e. "known" properties should not depend on the data instance)?

So, what do people want? I believe it is "the ability to extend the properties in such a way that properties from both the current and from the inherited (=referenced, included, whatever) schemas are taken into account when additionalProperties: false is defined either in the current or in the inherited schema (which can be closed for changes)."

What we want (not sure about @handrews, but people creating validators seem to want it):

  1. ideally change nothing
  2. if something changes, at least stay within the current paradigm (because changing it means re-writing a large part of the validator and nobody would want it, would they?):
    • do not make "known" property list dynamic / data-dependent
    • ideally limit the traversals to the current object
    • if not limited to the current object (as that wouldn't work for the users), then have it as a pre-processing step, because that way the changes are self-contained and don't require re-writing the whole validator.

I honestly don't think there exists a solution that would satisfy everybody (excluding @handrews maybe;), that does not involve some sort of special syntax for transformation/inclusion/extension that would be processed independent of $ref and before validation starts.

gregsdennis commented 6 years ago

@epoberezkin

It is a transformation. It does not allow corrupting the schema, in the same way $merge does not. They are the same, really.

That may be the case. I haven't had the opportunity to read into your proposal as yet, so it's entirely possible that you and I independently developed the same solution (like calculus).

@handrews if two implementors can arrive at the same (or similar) solution completely independently, wouldn't that indicate a closer look is required? (Not to say that it hasn't been explored; I'm just unaware of the results of that exploration.) I think we can nail down this transformation thing in such a way that you're satisfied with it. And to put it out there, I really wouldn't have to change much to support this.

It is much cleaner not to overload $ref with the second meaning of allowing overrides and instead have another keyword for exactly this purpose. E.g. $merge or $include. But I agree with the approach in principle, as long as it's not $ref.

I disagree that $ref is being overloaded with a second meaning. I think the existing definition would be changed to allow this. Any siblings of a $ref further define/constrict the referenced schema, while an "only-child" $ref would use the referenced schema in its native state (the current behavior).

If we define $merge to mean "reference and modify," and the author doesn't modify, then it's no different than a $ref, and you have two keywords that function identically.

epoberezkin commented 6 years ago

@gregsdennis

I disagree that $ref is being overloaded with a second meaning. I think the existing definition would be changed to allow this.

That's exactly what I am against - changing the existing definition. It would just add to the confusion around $ref given that there are validators and schemas that allow siblings of $ref to mean just additional assertions without modifying the referenced schema.

Also, the current meaning is to delegate the validation to another schema. Your proposed meaning is "to transform the referenced schema and use it to validate against". Hence "the second meaning" of $ref.

The main distinction is whether to include the fragment into the current context (that's the semantics of $merge) or to bring the extension from the current into the target context (the semantics you propose). I wrote about this distinction above:

The decision to make is whether the references inside inserted blocks should be changed to full uris (to keep them pointing to the same locations) or if they should be left as they are (to allow different definitions used with the same schema). Maybe both options can be supported.
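
A sketch of that decision (hypothetical schemas, reusing the $merge-style syntax from earlier): suppose the source schema, http://reference.to/base, internally contains {"$ref": "#/definitions/positiveInteger"}, and it is merged into this document:

{
    "$merge": {
        "source": {"$ref": "http://reference.to/base"},
        "with": {
            "properties": {"baz": true}
        }
    },
    "definitions": {
        "positiveInteger": {"type": "integer", "minimum": 1}
    }
}

Does the inserted reference get rewritten to http://reference.to/base#/definitions/positiveInteger (keeping it pointing at the same location), or is it left as-is, so that it now resolves against the current document's definitions (allowing them to be swapped out)?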

Whether we want only one of two options or both doesn't change the fact that it is cleaner with a separate dedicated keyword, whatever it is: $extend, $include or $merge.

epoberezkin commented 6 years ago

If we define $merge to mean "reference and modify," and the author doesn't modify, then it's no different than a $ref, and you have two keywords that function identically.

The fact that you can use different keywords to express the same meaning is neither new, nor does it mean you don't need different keywords - the question is what their core meaning is.

handrews commented 6 years ago

@gregsdennis

These continued discussions over the presented solutions reveal that neither of these options is desired, and we need to start looking at something else.

This is where we have been for five years now. If you truly have a "something else" that does not reduce down to one of these options, I would love to hear it. But everything so far is either a schema transform or works with the larger set of properties across all relevant schemas. We've gone through 5 or 10 proposals over the years (there are six linked above, plus other things scattered all over the old repo and various forums). These are the options:

There are minor variations on these but I've never seen anything that does not boil down to one of the three.

@epoberezkin

Can we maybe already bury the "deferred keywords" idea and focus on the problem indeed and how we solve it? Maybe even try to define it first, to begin with, in a static rather than dynamic way (i.e. "known" properties should not depend on the data instance)?

I do not appreciate these repeated attempts to silence my proposals. I am not doing any such thing with your proposals. If no one wants to talk about deferred keywords then that will happen naturally enough.

handrews commented 6 years ago

@gregsdennis

I think we can nail down this transformation thing in such a way that you're satisfied with it.

That's exceedingly unlikely. If there is a clear enough desire for it in the community (not just Ajv's community) then I will put it in the spec despite thinking that it is a horrible idea that will have significant negative impacts on JSON Schema as a system. Because I'm editing the spec, not dictating it.

@epoberezkin

As ajv will not have it, none of its users will be able to opt-in.

Which means that if people value the keyword, they'll move to another implementation that supports it. And if they don't, they won't, and if no one else uses it we'll drop it from a subsequent draft. We dropped quite a few things that no one used from Hyper-Schema after draft-04.

For now, several of the main voices against schema transforms have not had a chance to weigh in yet, so this discussion is far from done.

gregsdennis commented 6 years ago

@handrews

@epoberezkin The main distinction is whether to include the fragment into the current context.

I understand and appreciate your efforts to separate the inclusion/delegation issue from this one, and it makes sense for the deferred keywords argument. But I think for transforms, they are inextricably linked.


This is where we have been for five years now. If you truly have a "something else" that does not reduce down to one of these options, I would love to hear it. But everything so far is either a schema transform or works with the larger set of properties across all relevant schemas.

You've invited several people to this discussion who may or may not (me) be familiar with the previous arguments. You have to expect there'd be some rehash of the discussion.

Also, I proposed a solution that can be implemented within the current definition of schema as well as an alternate solution that doesn't fall into your three categories, but only @erayd has bothered to respond, and then only in passing.

handrews commented 6 years ago

@gregsdennis

You've invited several people to this discussion who may or may not (me) be familiar with the previous arguments. You have to expect there'd be some rehash of the discussion.

True, although I did link all of the previous issues. Perhaps I should have summarized them all here as well.

Also, I proposed a solution that can be implemented within the current definition of schema as well as an alternate solution that doesn't fall into your three categories, but only @erayd has bothered to respond, and then only in passing.

If you mean the part where you re-state various properties with a true schema to make additionalProperties aware of them, that's just because it didn't register as new to me- I apologize for missing it. #119 explores that approach in detail. It either becomes unmaintainable very quickly, or if you want to automatically make that happen (which is what #119 was about) you get into very complex schema algebra issues which are far more complicated to implement than deferred keywords.

@gregsdennis you say that you want "all details" worked out for deferred keywords. What gaps do you see? The point of deferred keywords is to provide a mechanism for defining keywords across all relevant schemas. We can define those keywords to work however we want within that restriction. What part is too unclear to you to allow further consideration?

handrews commented 6 years ago

@gregsdennis note that if you're going to duplicate property names to placate "additionalProperties": false, then it's much easier to just use "propertyNames" with an enum.

So this example:

{
    "type": "object",
    "allOf": [
        {"properties": {"foo": true}},
        {"properties": {"bar": true}}
    ],
    "properties": {"baz": true},
    "additionalProperties": false
}

is often better expressed as:

{
    "type": "object",
    "propertyNames": {"enum": ["foo", "bar", "baz"]},
    "allOf": [
        {"properties": {"foo": true}},
        {"properties": {"bar": true}}
    ],
    "properties": {"baz": true},
}

(this makes more sense when the various property definitions are scattered over files, but that's the case with the "additionalProperties" version as well)
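
For instance (a hypothetical sketch; base.json and the property names are made up), the shared definitions can live in one file while the constraining schema lists the full set of names:

base.json:

{
    "properties": {
        "foo": true,
        "bar": true
    }
}

The constraining schema:

{
    "type": "object",
    "allOf": [{"$ref": "base.json"}],
    "properties": {"baz": true},
    "propertyNames": {"enum": ["foo", "bar", "baz"]}
}

The enum still has to be kept in sync by hand as the referenced definitions change, but that is the same maintenance burden the "additionalProperties" version carries.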

gregsdennis commented 6 years ago

@handrews I agree that using propertyNames is better than amending the definition of additionalProperties, and it fits within the current implementation.

epoberezkin commented 6 years ago

@handrews Transforms and deferred keywords are not the only two options mentioned.

I wrote it several times in this issue and I can repeat it again: any syntax that would collect known properties based on the schema only, without taking data into account, is much more acceptable than dynamic deferred keywords.

I understand the concerns about schema transforms, but deferred keywords are a much bigger evil. No validators implemented the collection of annotations that you added to the validation spec without ever discussing it in the issues; actually, not too many implemented even draft-06. So making the solution to the biggest request over the years dependent on this collection is very wrong. It should be independent.

For statically defined (not data dependent) keywords there are two possible approaches.

  1. An additional keyword unknownProperties: <schema> that applies the schema to all properties not used in the current schema, its sub-schemas, or its referenced schemas. It would take into account properties used in the "properties" and "required" keywords but ignore "patternProperties", to keep things manageable and predictable (see the sketch below).
  2. An additional keyword inheritProperties: true that would change additionalProperties in the current schema to ignore properties defined in sub-/referenced schemas.

In both cases the list of known properties would be determined independent of the data being validated. If you compare it with your "deferred" approach, the list of known properties that this approach produces is always either wider than or the same as the list produced by the "deferred" approach, which is fine, as the problem we want to solve is making the schema less restrictive. There are a lot of ways to make it more restrictive again if necessary.
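
To illustrate option 1 (a sketch only - neither proposed keyword exists in any published draft, and base.json is a made-up reference):

{
    "allOf": [
        {"$ref": "base.json"},
        {"properties": {"baz": true}}
    ],
    "unknownProperties": false
}

If base.json names foo and bar in its properties or required keywords, the statically collected set of known properties is foo, bar, and baz; any other property is validated against the false schema and fails, all without looking at the data.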

epoberezkin commented 6 years ago

My preferred approach would be unknownProperties as defined above, for two reasons: 1) it keeps additionalProperties unchanged; 2) it matches the meaning of the words known/unknown - whatever is named directly, either in properties or in required, is known. Properties that merely match some patterns are not yet known, strictly speaking.

With this approach we would both get the keyword users have wanted for ages and also get better performance than additionalProperties, as it won't have to match every single property against all possible patterns.

handrews commented 6 years ago

you added to the validation spec without ever discussing it in the issues

I am very upset at this allegation. I did put it in issues. It was developed through multiple PRs. I repeatedly sent out requests for people to review them. I begged and pleaded for input, and you and others ignored me.

Draft-07 sat in feature-frozen final review for a solid month. Again, I sent out emails, chat messages, every possible thing I could think of to get people to pay attention. I drew attention to specific areas.

I made a point not to change validation too much in particular (and to some degree core) so that people would not be overwhelmed by those while most of the activity was on Hyper-Schema.

To have you accuse me of being insufficiently inclusive and sneaking features in is beyond infuriating. You are now accusing me, directly and unambiguously, of working in bad faith as an editor.

I did nearly everything aside from reviews for draft-07. I wrote the PRs, I managed the issues, I created milestones so others could track the release scope. I announced key stages in the process on the mailing list. @dlax was amazing with the reviews, providing the vast majority of review feedback. And of course others did jump in as well.

But I have done every possible thing I can think of to have a more inclusive process. A community process. I've done every possible thing I can think of to make sure that there is a check on my ideas since I am both the primary author and editor for the past few months (which is not an arrangement that I like at all).


I don't know how to handle this allegation. I am literally sitting stunned in my seat that you think this of me. It may be best for me to step away from the project if this view is shared.

I will definitely step away for at least the next few days. I do not know how to process this. I've poured so much into this project and I just get accusations of acting in bad faith. I need to re-evaluate my involvement.

Julian commented 6 years ago

This issue has so much activity on it (and so many words) that for me it's been impossible to follow.

Worse yet, every random message that I do catch out of the corner of my eye happens to have weird ad hominem attacks in it...

I think you all need to take a bunch of breaths to be honest :)

I would love to say I have an honest, informed opinion here -- I don't personally yet, other than that at this point, from reading as much as I can, I can understand some of the reasons not to think $merge is the obvious solution. I don't yet see whether deferred keywords are an obvious one either....

Will still need to put in some more time, just thought I'd at least put out that I'm not totally ignoring this...

awwright commented 6 years ago

I concur with @Julian, I keep re-reading this but I'm not sure where to attack first.

I'd like to see new issues opened proposing specific language changes, letting us close this issue in favor of talking about concrete effects.

epoberezkin commented 6 years ago

With the proposed unknownProperties keyword the schema:

{
  "required": ["x"],
  "if": {
    "properties": {
      "x": {"const": true}
    }
  },
  "then": {"required": ["y"]},
  "else": {"required": ["z"]},
  "unknownProperties": false
}

will allow objects {x, y} or {x, z} depending on the value of x, and will always allow {x, y, z}. With the "deferred" approach the latter is never allowed, which is kind of illogical - they are all known to the schema author, so why should it fail? And there are actually use cases that need all three properties to be allowed, which the "deferred" approach does not allow for.

If the schema author indeed wants to prohibit {x, y, z}, that is easily achieved by adding not: {"required": ["y", "z"]} to the schema. As a side note, quite a few people have asked for "present" as an alias for "required", to make not/required read closer to what it actually means (not present at the same time).
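
Spelled out (a sketch combining the proposed keyword with that addition):

{
  "required": ["x"],
  "if": {
    "properties": {
      "x": {"const": true}
    }
  },
  "then": {"required": ["y"]},
  "else": {"required": ["z"]},
  "not": {"required": ["y", "z"]},
  "unknownProperties": false
}

The not/required pair fails exactly when y and z are both present, restoring the stricter behaviour only where it is actually wanted.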

gregsdennis commented 6 years ago

@awwright I think separate issues have been tried, and @handrews' intent in creating this issue was to consolidate them. Unfortunately, the topic has degraded into a "my solution is better because it is" argument that I don't think anyone will win.

To be honest, I wonder if it's really the implementors who need to be commenting on this. I think we should have the authors and consumers weigh in, since they're the ones really using the features. As implementors, we're merely supplying a mechanism for them to use.

Julian commented 6 years ago

FWIW I don't know what concrete step I'd be recommending we take next here even at the minute, but I can tell you that it is extremely difficult to answer even some basic questions like "what pros and cons are we discussing on each side" without needing to follow a chain back and forth with 500+ word comments each round :)

I do think though that the ad hominem stuff really has no place here, if we're all feeling heated, we just need to cool it down and come back to it in a few days. AFAIK, there really shouldn't be any reason to pressure this decision -- it's a big problem for a lot of people, but it's been a big problem for the past ~X years, a few more days and people will live.

erayd commented 6 years ago

I agree that ad-hominem attacks have no place here. And however people feel about others' contributions to the spec, please remember that a lot of work has gone into it and sometimes things get lost in conversation - engaging in good faith is always better than letting loose with outright accusations, especially in a public forum like this.

This discussion is becoming far too heated - I think that we should all step back and leave this thread the hell alone for a few days and let things cool off before continuing the discussion. There is no rush.

epoberezkin commented 6 years ago

@handrews no need to take it personally. If it was discussed and I missed it, that's clearly my problem; no need to develop a statement you believe to be wrong into accusations I didn't make. The way I saw the issues that mentioned annotations, it was all hyper-schema related.

Relequestual commented 6 years ago

Given the volume of chatter on this issue, I'm going to lock it (probably for 7 days) to give others (and myself) a chance to read it and catch up. If you take issue with this, please tweet me or DM me on Slack (for those who are in the JSON Schema Slack... it will be open to all soon!)

(Those with write access can still comment, but I ask that you refrain unless it's imperative to people's comprehension of the issue.)

Relequestual commented 6 years ago

Unlocking this issue after a reasonable amount of time to allow @philsturgeon to post his summary, which I think says all that needs to be said. I have a few more discussions to have before I know where I totally stand on possible solutions, but I feel that further discussion of individual potential solutions can happen in Slack and be summarised here, with some limited small discussions on new issues.

philsturgeon commented 6 years ago

Thanks @Relequestual.

This issue starts off with the intent of picking an approach, with a great writeup on some of the competing ideas and attempts. Deferred keywords and Schema Transformations (and some of the many variations thereof) were given a fair outline in the introduction, and in the comments that came shortly after. Good, fair content, with a few understandable omissions from @handrews on things that were expected to be common knowledge. Descriptions and opinions were clearly separated, and corrections were accepted from @handrews, meaning this tricky situation of "I'm going to mediate but also I have a preference" was started very well. Let's all remember to be grateful for that. :)

Sadly, like many decisions about competing approaches, it ran off the rails (through no one person's fault).

As somebody who has spent most of the last two days reading through this thread, I would like to summarize some things to get us heading in a constructive direction.

Conversational Points

1. $merge could be a deep merge but author is open to shallow merge too

Fantastic. We're not sure which approach is being recommended, but the author is open to changing it based on feedback. Maybe let's figure that out; a sketch of the syntax follows.
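
For readers who have not seen the plugin, ajv-merge-patch spells the keyword roughly like this (a sketch based on that plugin's documented source/with syntax; the example schemas are made up):

{
  "$merge": {
    "source": {
      "type": "object",
      "properties": {"p": {"type": "string"}},
      "additionalProperties": false
    },
    "with": {
      "properties": {"q": {"type": "number"}}
    }
  }
}

Under a deep merge the two "properties" objects combine, so the result allows both p and q; under a shallow merge the "with" value of "properties" would replace the "source" value wholesale, dropping p. That is the open question in this point.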

2. "Merge / patch, if implemented, cannot be strictly preprocessed"

A mega post from @erayd gets us started, with this strong point being just one of many. There is some back and forth on whether merge/patch is "preprocessing" or "dependent preprocessing".

If it were strictly a preprocessing step, the user would be able to preprocess it before invoking the validator on it. Because of $ref, that's impossible.

@epoberezkin suggests two possibilities

  • You know all your schemas in advance so you can preprocess them all before validation.
  • You analyse all $refs in the schema before validation, then load all schemas that are needed, then preprocess them all, again, before validation

The reply to that being:

Both of those are well into the realm of dependent preprocessing, and as such logically belong inside the implementation, rather than something that we can reasonably expect the user to do. If it were easily doable as pure preprocessing, none of us would be having this argument in the first place!

This thread was never resolved. "Dependent preprocessing" is not a fun topic, but it needs discussion. I think @epoberezkin is coming at this discussion like someone who is only interested in validation, and seems to focus on functionality that makes validation quick and efficient.

It's not bad to want speed, but things like Hyper-Schema already work in a similar way, and deferred annotations do not add much on top of what Hyper-Schema is doing. There are many more use cases in this world than just making a really fast validator, and that's where the other folks are getting their concerns.

3.) Nuclear vs Surgical

Adding merge / patch breaks some OO principles of being able to treat the child as a parent instance and assuming behavioral equivalence. I personally feel that being able to occasionally violate principles in favour of easier maintainability is a handy thing to have, but I've changed my mind about its necessity. After some fairly thorough discussion, my position now is that, while useful, I don't see it as badly needed enough to push hard for its inclusion, and if JSON schema were to continue without it I would be OK with that outcome.

The consideration here is that whilst merge/patch is a fantastically effective solution to some use cases, it could also be considered fairly nuclear, or a "buzz saw" as @handrews later suggests. This is not meant to be disrespectful; it's just about taking care of the use cases with the Rule of Least Power, instead of handing everyone a shotgun and hoping they use it carefully.

Fixing your parents' desktop computer by reinstalling Windows will certainly work, but it's not the ideal solution. If it's the only solution you have available to you then you'll do it, but that's not how we want to handle the design of JSON Schema.

4.) "Merge / patch cannot degrade gracefully"

Another great point from @erayd:

Merge / patch cannot degrade gracefully - unlike format, it can't be delegated to the user as an annotation, and it can't be ignored without breaking what is likely to be critical schema functionality. Realistically, this means that if it becomes part of the spec, then the choice is either to implement it, or hard fail upon encountering it.

This valid early point was never resolved as far as I can tell. The response from @epoberezkin was so minimal that it seemed like small parts were cherry picked, and the conversation flopped off in other directions.

We pick this back up much later, with @gregsdennis realizing that deferred keywords "afford the least abuse":

Because deferred keywords work within the context of the schema without modifying it, there is little chance that it will be used in a way that degrades the purpose of the schema.

Henry goes on to explain this a lot more in this issue, but here is the most important paragraph:

It is definitely more expensive, but unlike schema transforms it is also possible to do as an opt-in extension, just as URI Template implementations may be anything from Level 1 (simple replacement of variables with simple values with no awareness of URI syntax) to Level 4 (expanding compound values in various ways that fit with URI syntax).

This point needs more discussion. Right now it seems like deferred keywords, as an optional later step, can be considered to degrade gracefully, whilst merge/patch cannot.

This seems important. We're just gonna have this keyword sat there, making the schema 100% useless for any implementation that doesn't support it. Alternatives have been offered that add levels of logic later, but messing with earlier levels will cause a lot of complications for everyone. If the benefits offered by those complications were worthwhile, that'd be one thing, but points 1 and 4 make it feel like merge/patch is not the worthwhile complication you think it is...

5.) Cherry-picking and saltiness

When a giant thoughtful post covering multiple points is made, it takes time (👆 & 👇), so when a response is incredibly short, it can be off-putting and derailing for the conversation. Not to play the blame game, but to evaluate a problem, here is the response to @erayd's six-point initial summary.

    for implementations that collect annotations

which none of the validators need to do for validation.

    “solves more problems” needs to be more clear

@erayd wrote about replacing keywords - essentially parametrisation.
I wrote above about the ability to use the same schema with different definitions.

    Can arbitrarily scramble json into any random garbage

That reminds me of an argument about why the Function constructor should never be used in JavaScript. As long as you know what you are doing, $merge will produce the results you expect. It is definitely easier to understand and predict the behaviour of the schema generated by $merge than to predict which properties are "known" in the new paradigm.

These are a few very short responses which (whilst probably intended to be concise and save spilling ink) skip a lot of the points being made. This can appear disrespectful to the others in the conversation. Consider a face-to-face conversation: if I'm talking to you and you ignore a lot of what I say, hand-wave off one or two points, and walk off, I'm not going to feel very engaged in that conversation.

Let's try not to do this. If typing is tough, and you've got some serious one-on-one to do, grab a Skype/Hangout/Slack/IRC whatever and hash it out with a keyboard, or maybe face-to-face through the interpipes. Conversations are more constructive when they're respectful, and seeing somebody's face generally makes you more respectful.

6.) WHAT COLOUR SHOULD THE BIKESHED CARPET BE

There is some trouble in this thread whilst @gregsdennis points out a few concerns with @handrews' proposals, and @handrews outlines some of the potential solutions that could be hashed out further later. The idea of resolving some details later seems to concern @epoberezkin.

Without going into details of how exactly, the fact that we will have to work it out (and it is highly unlikely that we will) makes me very unwilling to go in that direction. Unfortunately we cannot make the decision that "deferred keywords" is a good thing without working out all these details. So somebody has to work them out and present how it's going to work before the decision can be made.

Sure, we should not just handwave fundamental concerns with a direction away before committing to it, but that's not what's happening here.

Let's decide whether the bikeshed floor should be wood or carpet before worrying about what colour of carpet or type of wood...

7.) optional for implementers == optional for users

When talking about deferred keywords being an optional later step for folks to use if they want it, there was a question from @epoberezkin

Unfortunately it would be not for the users, but for validators implementers to decide whether they opt-in or not. And the cost here is not only performance, but also implementing and maintaining this feature. As ajv will not have it, none of its users will be able to opt-in.

They're the same thing. Implementers can choose what level of support they want to add for these various steps, and diligent implementers may even make it configurable.

Either way, seeing as there is a multitude of JSON Schema implementations, there will be a scale of interests among these validators, ranging from "the quickest validator in the world" to "feature complete, but doing more stuff takes longer". The user gets to pick which covers their needs.

Let's keep this in mind when discussing things, especially if you have an implementation.

8.) Popularity is a data point, not a decider

There is talk on number of downloads, initiated by @handrews:

To the best of my knowledge, yours is the only implementation that has added $merge and $patch, so that directive does not hold up. Ajv is popular, but it is not the only validator, or even the only JavaScript validator. In the .NET world, @gregsdennis's Manatee.Json package apparently averages 17 downloads a day and he does not appear to have been plagued by users demanding the feature

This appeared to be misunderstood by @epoberezkin as a popularity contest:

I don't know whether 17 is many or few for .NET, ajv has a few more, ajv-merge-patch (the plugin that defines keywords $merge and $patch) has 100 downloads a day.

What I believe @handrews meant was more: If Example Implementation A has 17 downloads a day and they've not had their door kicked in with feature requests for feature A, then it might be that the specific feature is not in high demand.

When surveying users for feature requests, you ask what the experiences are, what problems they have, then use that data to make decisions about improvements. You then ask a control group and see if those experiences are improved or worsened. That's what is being described with us picking an approach, seeing how it goes, then responding.

Using the fact that ajv gets a lot of usage of that plugin is basically like testing to see how many of your website's users use JavaScript, using only JavaScript to detect it. That'll be a 100% success ratio. 😅

Making decisions based on the popularity of AJV seems a little like running an A/B test where you literally only run A.

Offering a list of plugins to people then having them use them does not confirm that it is a good solution, let alone "the best", or one we definitely need to go with. It does show that people will use the tools you give them, but again that was point 3 ("the buzz saw"). If people only have the chainsaw they're gonna use it to make ice statues, but if we give them a chisel they might find that more appropriate to the job.

9.) Changes to $ref

The conversation gets a bit confuddled with #514 which discusses changes/clarification of the implementation and meaning of $ref. This has been nicely summarized and a vote has been requested.

10.) Ad-hominanations

I know everyone is sad about the ad-hominem attacks, but to be very clear on this point: whilst there have been assertions that Henry was too quick to move, or did not get sufficient help, I remember seeing multiple emails, DMs, GitHub mentions, and messages in at least three Slack communities across five or six different channels. Any suggestions that Henry did anything other than literally beg for feedback are patently false, so let's move on.

Next

Let us remember that we're not trying to crown a winning proposal in this thread.

We should consider ourselves on the hunt for a direction which a majority can get behind trying out.

Schema transformations have been trialed in a plugin for AJV, but deferred keywords have not been attempted yet. This should not be a reason for picking the only one that's been trialed; instead we should be asking: do we like the idea of deferred keywords enough to try them out in an upcoming draft? If they are a complete flop, taking them back out is not particularly devastating, especially as they come towards the end of the implementation logic and are considered optional.

If we are not keen on trying out deferred keywords, what are the actual concerns with them, other than "I like Foo better"? This isn't the US Presidential vote, where every campaign is "Vote A because B sucks!" 😭

As this is not a two-party system, we should absolutely consider alternatives to deferred keywords and schema transformations. There have been a lot of attempts at schema transformations, with the merge/patch stuff outlined in the intro being just one. If anyone would care to recap those attempts with links and summaries that would be fantastic.

We also need to be very clear on the use cases. Deferred keywords have had their use cases explained a fair bit, and would solve the vast majority of people's concerns about additionalProperties not allowing properties which they think should be allowed... This is one of the biggest complaints from many people not using JSON Schema, as what they want to do is literally not possible. We need to give that some weight.

So we have use cases for deferred keywords, but what are the use cases that actually require schema transformation? There were some older ones mentioned in #98 and in the vote, but they can be done in other ways, meaning they're not use cases that actually require schema transformation. To be even and fair we should understand the use cases that proposals are trying to solve, and confirm they can't already be handled viably through current functionality.

To recap:

epoberezkin commented 6 years ago

@handrews I didn't intend to make any accusations, btw. I'm sorry you took my statements in this way, they were intended to represent my position based on my memory of the events. I will try to be more precise.

epoberezkin commented 6 years ago

@philsturgeon I agree with the concerns about $merge. The strongest is probably the one that @erayd raised:

breaks some OO principles of being able to treat the child as a parent instance

While $merge "patches" the problem with additionalProperties, it doesn't really seem to solve it. Users seem to want some inheritance-based approach indeed.

What I think you've missed is the main distinction in how "unknownProperties" and the list of "known" properties can be defined:

  1. based on the validated data & schema(s), can be determined only in the process of validation
  2. based on the schema(s) only, can be determined before the validation starts.

OO inheritance implies the 2nd approach, i.e. the list of properties is defined only by the schema(s), independent of the data. In this case "known" means "known to the schema author". With the 2nd approach the list of known properties can be determined independently of the annotation collection process, and it is:

  1. more deterministic, easier to test and debug
  2. more performant (can be done once per schema)
  3. does not require any deferred processing

gregsdennis commented 6 years ago

@epoberezkin in #514, we all agreed that $ref is delegated, which means that you have to perform validation on an instance in order to "dereference" it.

How do you intend to determine the known properties before validation starts if you don't know what properties the $refs define? As mentioned before, the problem is cyclical schemas.

Sure, an implementation in a language which understands object references (such as yours and mine) could handle this by various mechanisms supported by that language, but we (as participants in schema spec design) can't assume that a given implementation will be written in a language that has this kind of support.

We need to separate ourselves from our implementation and look at JSON Schema for what it is, which (admittedly) is easier said than done.
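
As a concrete illustration (a minimal sketch, not taken from the thread), consider a self-referential schema:

{
  "properties": {
    "name": {"type": "string"},
    "children": {
      "type": "array",
      "items": {"$ref": "#"}
    }
  }
}

Statically expanding the $ref never terminates; evaluation only bottoms out when a given instance runs out of nested children - that is, when there is data.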


An analog to this problem is the classic factorial method used to teach recursion:

int Factorial(int x)
{
    // Base case: only reachable once a concrete value is supplied.
    if (x <= 1) return 1;
    // Recursive case: cannot be expanded without a value for x.
    else return x * Factorial(x - 1);
}

If we try to determine an expansion without a value, we'll be replacing Factorial() calls indefinitely because we never hit the base case. We only hit the base case when we have a value.

Similarly, we need to be able to hit a base case for cyclical $refs.

(I understand that for this case there is a mathematical transformation that yields a non-recursive solution, but that can't necessarily be done for JSON schemas.)

handrews commented 6 years ago

@epoberezkin thanks, I appreciate it.

Regarding static analysis, it does not solve the right problem, even aside from the concerns raised by @gregsdennis. There is no possible valid way to statically account for oneOf and anyOf. If I want to error on properties that are not accounted for, given this schema:

{
  "type": "object",
  "oneOf": [
    {
      "properties": {
        "foo": {"const": true},
        "bar": {"type": "integer"}
      }
    },
    {
      "properties": {
        "foo": {"const": false},
        "baz": {"type": "string"}
      }
    }
  ]
}

Then I want the instances {"foo": true, "bar": 10} and {"foo": false, "baz": "stuff"} to be valid, but {"foo": true, "bar": 10, "baz": "stuff"} to be invalid.

You might be able to work this out because of the very simple use of a const property to select the branch, but you can't do that in the general case.

Static analysis does not address the primary use case.
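
By contrast, a deferred keyword (sketched here under the placeholder name "unevaluatedProperties", which exists in no draft) would consult the subschema results from whichever oneOf branch succeeded:

{
  "type": "object",
  "oneOf": [
    {
      "properties": {
        "foo": {"const": true},
        "bar": {"type": "integer"}
      }
    },
    {
      "properties": {
        "foo": {"const": false},
        "baz": {"type": "string"}
      }
    }
  ],
  "unevaluatedProperties": false
}

For {"foo": true, "bar": 10}, only the first branch succeeds, so foo and bar are the evaluated properties and the instance passes; {"foo": true, "bar": 10, "baz": "stuff"} fails, because baz was not evaluated by the successful branch. That is exactly the behaviour the example above calls for, and it can only be determined with the data in hand.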