json-schema-org / json-schema-spec

The JSON Schema specification
http://json-schema.org/

Reducing JSON Schema's Complexity #710

Closed ucarion closed 5 years ago

ucarion commented 5 years ago

Note: Opinions expressed herein are entirely my own and not the views of my employer.

Whereas previous drafts of JSON Schema have focused on extending, bugfixing, or generalizing JSON Schema, I would like to propose that the focus of the next iteration of JSON Schema be on reducing complexity.

Why simplify

On the current track, it would take a nontrivial amount of time for JSON Schema to reach the high bar of formality and clarity that the IETF RFC process requires. But the industry needs JSON Schema now. This is a testament to the importance of what this project is working on today.

By focusing on simplifying JSON Schema, and focusing on those problems we know we can solve for users, we will be able to make something people really need. Consider the following:

  1. Almost all people who will use JSON Schema have had everything they need since draft-04. For most people, we could have stopped there.

  2. Very few people need a sophisticated, extensible, hypermedia-driven validation framework that's IETF-standardized. Lots of people need a standardized, reliable schema language that works on all of their different platforms and systems identically.

  3. Most implementations of JSON Schema are out of date and buggy. For example, almost none of them support $ref 100% correctly. format is super unreliable. A ton of implementations are stuck on draft-04.

  4. It's very hard to create a new implementation of JSON Schema. The spec, when read from A to Z, is confusing -- and takes a very long time to read, since it is now a multi-thousand-line formalization spanning three documents.

Time is not on our side here. JSON Schema is nine years old. With each passing draft, we are creating a new generation of divergent, out-of-date implementations. As time passes, those implementations will ossify and require a new generation of deprecations and re-writes.

Many contributors to this project note, aptly, that this project is a volunteer effort, and that it's impossible to punctually achieve our ambitious aims entirely in our spare time. The solution is not to take another few years to get this project done. The solution is to focus on what's already out there, formalize that, and wrap this thing up.

One alternative approach

Note: this suggested approach is merely illustrative. It is not formally part of what I'm proposing in this issue, but does prove a point.

I have implemented a simplified approach to JSON Schema through the form of a test suite, and two implementations which pass it:

For a detailed overview of what the differences in this approach are, see:

The above document focuses on differences in test suites. But JSON Schema has many details which it does not concretize in tests. Under the approach I've implemented, we could take the following actions to make the spec simpler:

  1. Remove Hyper-Schema entirely.
  2. Remove annotations entirely. You can still have annotations; they're just not a standardized thing.
  3. Have a single suggested output format for errors.
  4. Unify the "core" and "validator" documents.
  5. Remove $id outside of root documents.
  6. Stop having $ref disable its sibling keywords.
  7. Remove format, contentMediaType, and contentEncoding.

Doing so would leave us with something that's backwards-compatible with what most people are using JSON Schema for today. The biggest pain-point will be for people who use $id inside sub-schemas -- they will have to spread their schemas across multiple documents.
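For illustration, here is a hypothetical sketch (the URIs and property names are mine, not from the proposal) of what that migration would look like. A schema that today embeds an $id in a sub-schema would instead reference a separate root document:

```json
{
  "$id": "https://example.com/user.json",
  "type": "object",
  "properties": {
    "address": { "$ref": "https://example.com/address.json" }
  }
}
```

Here `https://example.com/address.json` would be its own root document declaring its own `$id`, rather than an object nested inside `user.json`; only root-level `$id` remains.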

This is just an illustration of the idea, which I've complemented with working code, because we reject kings and presidents. The point here is that simplification can be achieved, and it can be done in a way that doesn't unduly harm our core constituency.

Nobody can ever be forced to change. But under my proposal, those who elect to upgrade will likely find that not much of anything has changed. And those upgraders will be joined by a new generation of enterprise users, who cannot use JSON Schema today for lack of formalization and off-the-shelf implementations.

Conclusion

This ticket is not an open-ended diatribe. This ticket asks the following question: shall we change the overarching objective of this project to be cutting scope and simplifying? Shall we make our prime directive be to have, by the next draft, something that can be accepted as an IETF RFC?


Afterword

In summary, the answer to the above question is "no", to the extent that anything based on rough consensus can ever be decided. In more detail:

  1. The intention of this issue was to discuss whether JSON Schema should make IETF standardization its prime directive, and focus on simplification as the instrumental means of achieving that end.

  2. JSON Schema remains ultimately a project run on the basis of rough consensus. And there do not today exist many people on this project with enthusiasm for wrestling with standards bodies.

  3. Nor is it evident that JSON Schema can or ought to dramatically cut scope. Though there are many people who could live with just a small subset of JSON Schema that the project has long supported, there are also many people who want everything that's in the spec: present, imminent, and future.

  4. Therefore, JSON Schema shall not change its focus. The current trajectory -- of making a sophisticated, generalizable, extensible system for validating and annotating JSON-like data -- shall remain the course.

Relequestual commented 5 years ago

I'm locking this issue to avoid speculative discussion before I've had the chance to form a proper response.

Relequestual commented 5 years ago

Thank you @ucarion for your thoughts and previous contributions to JSON Schema.

I want to start off by using an illustration. Most of the issues you have raised show that you are seeing things only from one perspective, and not the whole picture.

The illustration: https://twitter.com/semestasains/status/1081106334634786816

Consider the following:...

  1. Almost all people who will use JSON Schema have had everything they need since draft-04. For most people, we could have stopped there.

Almost all people? What are you basing this on? If that were true, we needn't have created and released draft-5 through 7, or put in a ton of work on draft-8. Given that I've spent the last few years following people's questions on StackOverflow tagged with JSON Schema, and I set up and monitor the official Slack, I'm pretty sure your assertion is wrong. I could give many examples... It's our responsibility as editors to look for consensus based on the community.

  1. Very few people need a sophisticated, extensible, hypermedia-driven validation framework that's IETF-standardized. Lots of people need a standardized, reliable schema language that works on all of their different platforms and systems identically.

Very few people...? Again, what is your evidence of this? I have been engaging with the community for the past 5 years on a nearly daily basis, and haven't seen this. Often the questions that come in present the more complex and extensible schemas.

  1. Most implementations of JSON Schema are out of date and buggy...

Evidence? Here's a list of implementations that support at least draft-5, and usually draft-7: http://json-schema.org/implementations.html

  1. It's very hard to create a new implementation of JSON Schema. The spec, when read from A-to-Z, is confusing -- and takes very long to read, since the spec is now a multi-thousand-line formalization spanning three documents.

I've watched 3 people create a new implementation on and off over the past 6 months or so. It is hard because it's complex, and it's complex because data is complex, and data is complex because that's real life sometimes.

Before addressing each of your possible suggestions...

What we are hearing is, "[You've made this too complex since draft-4, and haven't listened to what the community needs.]"

What are you basing this statement on? Have you looked at how much discussion and listening has gone on to make some really key decisions lately? Have you looked at how we listen to the community, validate feedback, and make changes to the specification documents?

Your statement is hurtful to the team, given the amount of work they have put into carefully listening to the community, which has sometimes resulted in adding things they are not individually happy with. SO MUCH of the work we do is a direct result of community needs based on real people asking real questions.

Your justification for raising your issue doesn't present any evidence, and makes several wide-sweeping statements about the community and implementations.

This ticket asks the following question: shall we change the overarching objective of this project to be cutting scope and simplifying? Shall we make our prime directive be to have, by the next draft, something that can be accepted as an IETF RFC?

Simply, no. There are still issues that the community needs resolved, and we cannot address all of them in one draft. In addition, draft-8 is "nearly done"(tm) and adds a whole load of new things (such as vocabularies) which are going to be really important and interesting for the community.

It's a much-needed change, evident in what we've seen in the community. @handrews has done an epically fantastic job of putting it together, and it's going to solve many problems across multiple current use cases.

We will need people to test and feedback on this to make any required alterations in draft-9.


Now, let's look at your suggestions:

  1. Remove Hyper-Schema entirely.

Hyper Schema is a separate specification, which is why you find it in a separate document. Anyone creating tools to work with JSON Schema may never read Hyper Schema. Heck, their JSON may never be connected to the internet (and there are several in-production use cases).

  1. Remove annotation entirely. You can still have annotations, it's just not a standardized thing.

Why? Annotations are useful to many. That's like saying "Remove the ability to have comments from [programming language]".

  1. Have a single suggested output format for errors.

We have done this for draft-8, but provide several different formats, all of which have valid use cases; not all of them are required for compliance.
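As a rough sketch of what such a structured output might look like (based on the draft-8 proposals under discussion at the time; the exact field names and error text here are illustrative, not normative):

```json
{
  "valid": false,
  "errors": [
    {
      "keywordLocation": "#/properties/age/minimum",
      "instanceLocation": "#/age",
      "error": "3 is less than the minimum of 5"
    }
  ]
}
```

The idea is that each error carries a pointer into the schema (`keywordLocation`) and a pointer into the instance (`instanceLocation`), so tooling can localize failures without parsing free-form messages.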

  1. Unify the "core" and "validator" documents.

The issue here goes back to your perception and limited use case (that illustration link at the top). Core provides many applicator keywords, which can be used by other specifications (like Validation and Hyper Schema) to provide additional functionality.

This is where understanding vocabularies is really important. Your use case is validation... Well, what about the use case of creating forms / UI? Personally, I'm not so interested in form generation, but the community is doing this, at large, and has many different additional keywords to support various things. We get a question on form generation at least 2-3 times a week (see, listening to real community needs).

Say an organisation or group wants to form a JSON Schema Form standard, which extends JSON Schema, and is uninterested in Validation. If you had a unified Core and Validation spec, they would have to unpick the bits they required from it for applicators and annotations. Yuck.

Wouldn't it be better if they could extend a... core part of how JSON Schema works, in terms of applicability, and could ignore all the Validation keywords?

Now of course, you might not care about form generation, but the community does.

  1. Remove $id outside of root documents.

We've talked about this a lot recently, but it still makes sense to allow this, and there are many in-production use cases.

It CAN be and often IS quite confusing. The easiest way to think about it is, if a non root schema has an $id, that whole object is like an iframe in HTML.

Consider a set of schemas which combine together using $refs to create effectively a single schema file.

If I want to automagically transclude schemas into a single schema, I need to include the $ids from any child schemas, because their internal references might rely on them. I might want a single schema, to, as you mention elsewhere, avoid having to request schemas from multiple locations, some of which I might not be able to reach when I'm ready to do the validation (previously mentioned offline validation).
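As a hypothetical illustration of that transclusion point (the URIs and properties here are mine, not from the thread), a bundled schema might embed a child schema together with its original $id:

```json
{
  "$id": "https://example.com/bundle.json",
  "definitions": {
    "address": {
      "$id": "https://example.com/address.json",
      "type": "object",
      "properties": { "city": { "type": "string" } },
      "required": ["city"]
    }
  },
  "properties": {
    "shipping": { "$ref": "https://example.com/address.json" }
  }
}
```

Inside the embedded address schema, `#` and any relative references resolve against `https://example.com/address.json`, not against the bundle's URI, which is the "iframe" behaviour described above, and why the child's $id must be carried along when transcluding.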

  1. Stop having $ref disable its sibling keywords.

There were a number of legacy reasons $ref behaved this way, including the original idea that it was a replacement for the current object. It was designed to be a pointer to another object. Some people wrote transcluders or compilers to transclude referenced schemas.

For draft-7, you can work around this by wrapping $ref and other schema objects in an allOf.
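A minimal sketch of that draft-7 workaround (the referenced definition and the extra keywords are hypothetical):

```json
{
  "allOf": [
    { "$ref": "#/definitions/name" },
    { "maxLength": 40, "description": "A display name" }
  ]
}
```

If `maxLength` and `description` sat directly alongside `$ref` in draft-7, implementations would ignore them; wrapped in `allOf`, both subschemas apply.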

For draft-8, we HAVE actually made this change, because it was a frequent issue, specifically when you wanted to add to the annotations of a generic type you were referencing.

See https://github.com/json-schema-org/json-schema-spec/issues/523 for details. I worked on this myself.

  1. Remove format, contentMediaType, and contentEncoding.

Why? I might want to encode an image or other file-type data in a JSON document. I should be able to express what the format is.

format is an annotation with optional validation. It's not required to support it; most libraries make a "best effort". I do think we may need to carefully consider the future of format, so that is kind of open for discussion.
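For instance, the draft-7 content keywords can be combined along these lines (a sketch of the use case described above):

```json
{
  "type": "string",
  "contentEncoding": "base64",
  "contentMediaType": "image/png"
}
```

This annotates that the string holds a base64-encoded PNG; validators are not obliged to decode or verify the content.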

I hope you're starting to see a little bit of the "other side of the picture" here.


Finally, I think we should look at the document you linked to regarding your implementation. Specifically, the things you avoid doing for various reasons.

json-schema-go strives to avoid insecure, poorly-defined, or confusing behavior.

Therefore, json-schema-go avoids:

Auto-fetching schemas (insecure),

Wrong. JSON Schema does not assume you should be able to access the internet at all, nor that a URI is a network locator, meaning "accessible on the network / internet".

The URI is not a network locator, only an identifier. A schema need not be downloadable from the address if it is a network-addressable URL, and implementations SHOULD NOT assume they should perform a network operation when they encounter a network-addressable URI.

$ref: https://tools.ietf.org/html/draft-handrews-json-schema-01#section-8.3

Auto-assigning IDs to schemas (insecure),

Not sure what you mean, as you don't elaborate.

Changing base URIs (poorly-defined),

You've conflated two issues here. We've talked about the use of $id outside of the root schema, so I shan't revisit. The issue you've linked to (https://github.com/json-schema-org/json-schema-spec/issues/687), I agree, we need to resolve this, and I have a few ideas. It's still an open issue. We are listening.

Having $ref disable sibling keywords (confusing),

As mentioned, this is changed for draft-8. For draft-7 there is a workaround.

By implementing the specification incorrectly, you can expect to be causing problems for other people using your library.

Just because you don't know or understand the use case, it doesn't mean there isn't one. https://xkcd.com/1172/ - May be amusing, but also true.

Please don't claim you support draft-7 and yet deliberately implement things incorrectly. I've rejected adding implementations to the site's listing for this reason on numerous occasions. People will come to us and/or you with bug reports. Please, just do not do this.

The format keyword (poorly-defined),
The contentMediaType and contentEncoding keywords (poorly-defined),

Explain what you mean by "poorly-defined". It's optional anyway.

Auto-promoting numbers to Bignum (confusing),

Sounds like something implementation-specific to me. Eh, *shrugs*.

Emulating ECMAScript regexes (insecure),

That's up to you, but again, schemas should be portable, and if you can't or choose not to support a keyword, you should throw an error, not silently ignore or "fix" things.


I'd like to also address some other comments...

Doing so would leave us with something that's backwards-compatible with what most people are using JSON Schema for today.

I hope I've shown you why that assertion is simply false. You do not have visibility on "most people using JSON Schema today".

The point here is that simplification can be achieved, and it can be done in a way that doesn't unduly harm our core constituency.

As above.


On the current track, it would take a nontrivial amount of time for JSON Schema to reach the high bar of formality and clarity that the IETF RFC process requires. But the industry needs JSON Schema now. This is a testament to the importance of what this project is working on today.

You're not the first person to say "we need this as an RFC now please". We get that. What experience do you have of the IETF and the road to RFC? Have you done this with another specification before? What is the "high bar" you talk about that you feel is preventing this spec getting to RFC? (These are not rhetorical questions, please let us know.)

There are likely multiple GitHub issues relating to RFC status. It's not a priority for us now. We tried to start it, and were kicked back due to huge misunderstandings. None of us have the energy to start that again. We have a spec with issues that need fixing.


I hope you feel my response is balanced and fair. If you have any evidence to provide supporting various statements, that would be great.

As I've mentioned, some of the issues you've raised have merit, or have already been worked on, so that's not to say that nothing here is useful!

awwright commented 5 years ago

Sometimes standards efforts find themselves on the wrong track, and in these cases it's appropriate to submit a long-form response to reconsider the problem being solved and how to solve it. But you need to be very careful: where we are right now is the result of many years of effort trying to create a solution that's well-defined and works for the largest number of cases.

@Relequestual has good answers to most of the issues you raise. I've got just a few footnotes:

It's perfectly appropriate to be complicated in some cases; the whole point of JSON Schema is that applications don't have to implement their own validation routines. If authoring documents is complicated, then try to identify a specific problem, research whether it's been brought up before, and file an issue if not.

Deliberately varying from the specified behavior is in poor taste. If you think there's a problem, raise an issue; or else specify that your implementation is for research purposes and not for use in production.

Finally, half the point of this GitHub organization is to be a forum for implementors so we can converge on behavior. Make sure you're part of the discussion when the discussions happen!

(The other half is to get to RFC, but there's not much we'd actually get out of an RFC number besides a registered media type.)

ucarion commented 5 years ago

Hi @Relequestual,

I'll address your points one-by-one, and then finish by pressing the issue once more. I'll remind you that my central argument is that:

  1. JSON Schema is not IETF RFC-ready because it is too complex,
  2. JSON Schema ought to seek formalization quickly,
  3. JSON Schema is currently becoming more complex, not less, and ergo
  4. JSON Schema ought to change course, and focus on simplifying.

I'll avoid collapsing this point, because it's important:

Your statement is hurtful to the team [...]

Your justification for raising your issue doesn't present any evidence

I am not here to be hurtful. I have at all times been tactful yet forthright. We are here discussing technical ideas, not insulting one another.

Indeed, in an effort to be pithy, I did not weigh down my opening remarks with tomes of evidence. But this reply does contain such evidence, which I hope we can engage with and that you will find apt and satisfactory.

To your point, however, asides such as these:

(See, listening to real community needs)

Are wholly unnecessary, and perhaps a bit unprofessional.

In Reply

Now, to address your comments:

On your retort to "Why Simplify", point-by-point

You open by stating that I don't furnish proof for my claims in "Why simplify", and by way of retort, appeal to your experience through Slack and StackOverflow. Consider:

  1. You mostly hear from people who aren't satisfied with what's already out there. Many companies, let alone individuals, build on top of JSON Schema, but will never engage with the community, for the same reason they don't engage with the makers of the programming language they use. JSON Schema is just another tool for them.

  2. Slack and StackOverflow are poor examples of why JSON Schema needs to add a bunch of stuff. Where in the highest-voted questions (https://stackoverflow.com/questions/tagged/jsonschema?sort=votes&pageSize=15) do you see something we haven't solved since draft-04?

Almost all people? What are you basing this on? If that were true, we needn't have created and released draft-5 through 7, or put in a ton of work on draft-8.

I am indeed questioning the direction we are taking with draft-08. There are many people who don't understand why JSON Schema is still a spec, or why it's going where it is. To name a few places where people are complaining:

  * https://github.com/whosonfirst/whosonfirst-json-schema
  * https://news.ycombinator.com/item?id=16407001
  * https://www.tbray.org/ongoing/When/201x/2016/04/30/JSON-Schema-funnies

I don't think I need to go further here. Between StackOverflow (linked above), and just Googling for people with gripes about JSON Schema, far more people are complaining about complexity than about lacking functionality.

Evidence? Here's a list that support at least draft-5, and usually draft-7.

Check out the GitHub issues for the projects listed there. Almost all of them have open tickets about bugs related to $ref.

I've watched 3 people create a new implementation on and off over the past 6 months or so. It is hard because it's complex, and it's complex because data is complex, and data is complex because that's real life sometimes.

The real world is complex, but JSON Schema is not helping. Citing an article listed above:

Actually, I could be wrong; the spec is really hard to read; and I say that as one with much more experience in spec-reading and schemas than most. (Source: https://www.tbray.org/ongoing/When/201x/2016/04/30/JSON-Schema-funnies)

Moving on, you state:

What we are hearing is, "[You've made this too complex since draft-4, and haven't listened to what the community needs.]"

I am not here to say you're not listening to people. More correct would be to say that I think JSON Schema is trying to be all things to all people. Consider Vonnegut:

Write to please just one person. If you open a window and make love to the world, so to speak, your story will get pneumonia.

On that metaphor, I'm asking if we could shut the window. There are lots of problems to solve out there -- how about we solve just one, but really, really well?

On your retort to "One alternative approach", point-by-point

Now, to address your comments on my proposed way forward:

Hyper Schema is a separate specification, which is why you find it in a separate document.

Yes. But my suggestion is that it be divorced from this project, so that it does not interfere with JSON Schema validation, which is the crown jewel of this project.

Why? Annotations are useful to many. That's like saying "Remove the ability to have comments from [programming language]".

Again -- you can still have annotations. Just like comments, they don't do anything. Most programming languages don't start off natively supporting special behavior in comments. They're just ignored. That's what I'm suggesting we do. I don't see anything in the annotations that requires formalism within the spec. There isn't anything concrete we can formalize about readOnly or writeOnly, for example.

We have done this for draft-8, but provide several different formats, all of which have valid use cases, and are not all required to be compliant.

Yes. I'm saying it would be better to have one output format that actually is widely supported, instead of four formats which everyone will do a desultory job of implementing.

Your use case is validation... Well what about the use case of creating forms / UI?

I have not forgotten about UI generation from JSON Schemas. I have colleagues who are doing exactly this as part of their job. My contention is that we don't need to formalize keywords like description or title any further than perhaps noting their common use in contexts beyond validation, such as generating UIs. There's no need to attempt to formalize how annotation works beyond that. It's acceptable if the annotation use-case is achieved in an ad-hoc fashion, as it typically ends up needing to be closely integrated with things outside of JSON Schema's purview, like external data sources or particular UI technologies.

I'll therefore ignore comments suggesting that my proposal would regress or abandon UI generation. I believe it does not.

Say an organisation or group want to form a JSON Schema Form standard, which extends JSON Schema, and is uninterested in Validation. If you had a unified Core and Validation spec, they would have to unpick the bits they required from it for applicators and annotations. Yuck.

"Yuck" would be to muddle JSON Schema in order to solve problems nobody has yet. Let's fix real problems that people today have. As Oakeshott would say: let's prefer the familiar to the unknown, the sufficient to the superabundant, present laughter to utopian bliss.

It CAN be and often IS quite confusing. The easiest way to think about it is, if a non root schema has an $id, that whole object is like an iframe in HTML.

It's more than merely confusing. On its present definition, $id is ill-defined in many cases. See: https://github.com/json-schema-org/json-schema-spec/issues/687

Our intention is to fix this by formalizing what is and is not a sub-schema, a solution at odds with our attempts to make JSON Schema generalizable, because we'll end up locking down all possible "applicator" keyword forms. My suggestion is that we instead cut this infernal Gordian knot.

If I want to automagically transclude schemas into a single schema [...]

I believe this is another instance of a problem nobody really has. It is not a terrible burden upon implementations to support taking a list of schema objects, instead of only supporting a single object.

For draft-8, we HAVE actually made this change, because it was a frequent issue, specifically when you wanted to add to the annotations of a generic type you were referencing.

I'm aware -- I was making a point about draft-07, since draft-08 remains a moving target. I should have clarified this; my bad.

Why? I might want to encode an image or other file type data in a JSON.

This strikes me as another instance of being everything to all people. Do you expect all validators to support all MIME types and content encodings?

On your retort to json-schema-spec-comparison, point-by-point

Wrong. JSON Schema does not assume you should be able to access the internet at all, nor that a URI is a network locator, which means "accessible on the network / internet".

Indeed, the spec says that. But the test suite, which is where the rubber meets the road, disagrees. It expects that validators somehow know how to assign an $id to schemas which lack one -- and that $id happens to be the network location of the schema. The test suite therefore presumes that implementations auto-assign $ids to schemas, and do so on the basis of where they fetch the schemas. The test suite therefore requires that validators do precisely what the spec suggests they should not do.

Not sure what you mean, as you don't elaborate.

I do elaborate, here: https://github.com/ucarion/json-schema-spec-comparison#automatically-fetching-schemas. But the reply above explicates this as well.

Explain what you mean by "poorly". It's optional anyway.

A "validation" which is "optional" is a poor validation. In that sense, it is defined poorly.

That's up to you, but again, schemas should be portable, and if you can't or choose not to support a key word, you should throw an error doing so, and not silently ignore or "fix" things.

Our comments on a recommended subset of regular expressions are where the real benefits lie in practice. But that's not my point there. I'm saying that the spec should avoid requiring insecure behavior, such as emulating the behavior of a regular expression language that is susceptible to ReDoS.

Prior Art

Finally to address:

What is the "high bar" you talk about that you feel is preventing this spec getting to RFC?

Simplicity is the high bar this project presently fails to meet. This has been stated many times to the authors of this project:

Most obvious: There are multiple pieces of software out there that claim to implement JSON Schema, and their behavior is really inconsistent, in my experience. [...]

One area where I observe inconsistencies is in the handling of the “$ref” construct. Irritated, I decided to go check the official spec. “$ref” is not defined (but is used) in JSON Schema Core. Same for JSON Schema Validation. Same for JSON Hyper-Schema. Same for the Core/Validation Meta-Schema and the Hyper Meta-Schema.

Actually, I could be wrong; the spec is really hard to read; and I say that as one with much more experience in spec-reading and schemas than most.

[...]

I had a horrible experience with JSON Schema [...] the implementations were inconsistent with each other, the error messages out of the validators were simultaneously verbose and unhelpful, and I was looking at it trying to figure how I'd get better-quality messages, it seemed sort of intrinsically difficult. I would want to see existence proof of a validator that produced high-quality error messages before I could really get behind a design.

[...]

I don’t actually hate JSON schema, was just disappointed with the specification and the tooling.

That's @timbray (among other things, one of the co-authors of the original XML spec)

JSON Schema is an attempt to provide a general purpose schema language for JSON, but it is still work in progress, and the formal specification has not yet been agreed upon. Why this could be a problem becomes evident when examining the behaviour of numerous tools for validating JSON documents against this initial schema proposal: although they agree on most general cases, when presented with the greyer areas of the specification they tend to differ significantly.

That's Pezoa et al, "Foundations of JSON Schema": http://gdac.uqam.ca/WWW2016-Proceedings/proceedings/p263.pdf

This proposal is about simplifying the spec, removing theoretic and unnecessary abstractions from it. Less is more.

After I implemented it all in Ajv I can guarantee you that however you change the language, there is very little chance that the existing logic, however simple and logical it may seem to you, will be consistently supported - it is VERY complex to implement. All other authors I was discussing the issue with had the same opinion. That's the area of Ajv code that I stopped understanding long time ago; it's quite convoluted and I only rely on the many test cases when I need to improve/fix it :).

That's @epoberezkin (in #160 -- that entire thread is damning, though.)

I hope that the fact that I know of all these examples might serve to alleviate your concerns that I might not be appropriately informed. I've come late, but I've done my homework.

In Conclusion

To conclude:

I believe it would be highly ill-advised for the maintainers of this project to continue to ignore concerns that @timbray, a most preeminent IETF editor, and @epoberezkin, the author of the most popular JSON Schema implementation, both seem to have independently arrived at.

I therefore press my case again:

Are we sure we don't want to pursue simplification?

Julian commented 5 years ago

I'm not one for the long discussions that we tend to have these days, with lots of back and forth and a huge number of comments to follow, but just on one little tiny piece here (and I will probably then unsubscribe to be honest, because I don't really find this issue helpful):

Indeed, the spec says that. But the test suite, which is where the rubber meets the road, disagrees. It expects that validators somehow know how to assign an $id to schemas which lack one -- and that $id happens to be the network location of the schema.

The test suite therefore presumes that implementations auto-assign $ids to schemas, and do so on the basis of where they fetch the schemas.

The test suite therefore requires that validators do precisely what the spec suggests they should not do.

I think as the maintainer of the test suite I can say I agree with neither your premise nor your conclusion there :) -- the test suite is not where the rubber meets the road, it simply represents what my brain translated the spec into as an executable format, and if it has bugs, we fix them.

If you think the spec says something different from the suite, that's a bug, please file a PR, say what's different from the spec, and it will be merged.

Not sure what your point here is though; the test suite makes no such assumptions, so if you believe it does, please elaborate on exactly which part of the suite does that.

awwright commented 5 years ago

Here's the central point I'd like to get at:

Are we sure we don't want to pursue simplification?

I don't necessarily disagree, but you're going to have to come up with something specific and actionable. Saying "JSON Schema is too complex" is not, by itself, actionable.

I'm familiar with most of the arguments you presented. For example, iirc, Tim Bray was talking about draft-4, to which I spent a great amount of time addressing with draft-5. (Also, XML isn't really a bastion of simplicity, either. see: billion laughs attack; see: downloadable DTDs; see: escaping CDATA sections inside CDATA sections; see: literally any time you want whitespace to be significant)

First, which specific problems are there for schema authors?

Second, which changes can fix these problems?

Finally, for each of the problems, consider finding the relevant issues in the tracker, or filing new ones.

I see a handful of specifics, but it's awkward to talk about all of them in a single issue. Pick one that's important to you and let's work through it in a new issue, or an email/Slack thread.

philsturgeon commented 5 years ago

On mobile so this one will be brief.

@ucarion, I’m noticing a few trends in your assertions which are making this conversation tricky. You make claims without evidence, claims which are in fact contrary to the reality we have seen daily in the community for years, and then you dispute the fact that we see these things come up on a regular basis, as though we were purely making an appeal to authority and not offering the anecdotal evidence it was presented as.

You: “Nobody wears hats.” Ben: “I work in a hat shop and I have a lot of customers coming in to buy hats.” You: “I have seen some articles and comments saying that these high profile people do not like hats.”

Another complication is that you are defining everything in terms of simplicity. Simplicity is a vague term, and you seem to be defining it as “features I want and need for my use cases”, so anything outside of what you want and need is seen as unusable cruft that should be removed, ignored, or deferred. This is of course not the same definition of simplicity the contributor team should use for this project, or it wouldn’t be widely applicable to many people.

Bugs existing in older implementations is not evidence of failings in current JSON Schema, especially seeing as the older tools are not being actively maintained (that’s why they’re stuck on draft 04).

Newer versions of JSON Schema (7 and 8) have done a great job of making the language clearer, more concise, and understandable to the layperson.

Anyway, onto your examples of prior art. These were the best part of your post.

@timbray had concerns about $ref and error output. As Ben has already said, $ref works the way you want it to work in draft 8, and the language has been simplified since 4/5. Error outputs are also a thing now, which Ben also already said. Tim should be content with the more recent changes, and should be happy to know that BC breaks are stabilising in later versions, as most of the work has been clarification and addition, so the discrepancies between drafts should be shrinking.

Re: The Foundations of JSON Schema: yep, you have had answers explaining that an RFC number would be nice to have, but also that having our very own v1.0 would be equally nice, and we’re working on that. Their concerns will be resolved when there are no longer drafts. Drafts are required to flesh out ideas; otherwise you’re just flopping stuff out on the public, and that’s no good for anyone.

Then yes, there’s #160 and a lot of related issues. If you think that fella has been ignored then you’ve not spent much time on the issues here. We’ve dedicated months to trying to resolve discussions with that person.

So, your concern is that other famous people have concerns, and those concerns are:

A call to action: can you define simplification in a more useful fashion? Suggesting the contributors do not want to pursue simplification is to assert we’re trying to make this unnecessarily complex, and that couldn’t be further from the truth.

Maybe you could help identify some wording in draft 8 that could be simplified, and improve it in the form of a pull request? All without ripping out keywords that people actively use, because that would cause some fairly major discrepancies between implementations, and that’s something we strive to avoid unless absolutely necessary. 👍🏼

handrews commented 5 years ago

@ucarion you've gone to some effort to isolate and frame negative comments on JSON Schema, but as validation of your view that post-draft-04 work is too complex, it doesn't hold up.

As @awwright noted, that post by Tim Bray is years old and specifically referring to draft-04. @awwright did a fantastic job of making things a lot more straightforward to read and understand, and we have continued to build on that as we've clarified and added examples based on feedback.

Indeed, Tim was one of the more encouraging voices in our otherwise dismal discussion with the JSON mailing list. (Other people who have published RFCs have also been encouraging- that thread is not our only discussion of the topic). Note that one reason for creating an output format in draft-08 is Tim's comments in that discussion:

[Tim Bray] I would want to see existence proof of a validator that produced high-quality error messages before I could really get behind a design.

This was useful feedback which we have acted upon. Because when someone who knows how to write an RFC gives us specific, constructive feedback (even though it was by his own admission outdated), we pay attention to that.


As for your comments regarding Stack Overflow, popularity is in part a function of time, and duplicate questions get closed. The most popular questions are about draft-04 because that is what has been around long enough to get popular.

@Relequestual's experience monitoring Stack Overflow reflects what people ask about now, and what implementations they are using, regardless of whether their question is specific to a newer draft or not. Your dismissal of that experience in favor of a metric that has more to do with time than features is not convincing.


Regarding the Hacker News link, you extracted a single comment from a much larger thread that was started by someone praising JSON Schema. While the full thread has some people debating the roles of JSON Schema vs TypeScript and the like, it also includes a very long sub-thread of people saying that yes, they use JSON Schema, and what they use it for. While I'm sure some of them use draft-04, there is a notable absence of complaints over there being newer drafts.

The one post that you specifically referenced is only confused over Hyper-Schema. As has been stated many times in many places, Hyper-Schema is a separate specification and is not impacting or blocking the Core or Validation specifications in any way. People are welcome to find Hyper-Schema irrelevant. But what people think of Hyper-Schema has nothing to do with Core or Validation. The last draft of Hyper-Schema was not even published at the same time as the last drafts of Core and Validation.

You later say:

Yes. But my suggestion is that it be divorced from this project, so that it does not interfere with JSON Schema validation, which is the crown jewel of this project.

It does not interfere in any way with JSON Schema validation. How many times do we have to say this to you?


More correct would be to say that I think JSON Schema is trying to be all things to all people.

You clearly have not looked at our history and all of the things we have turned down, or shunted to other projects. This is hilariously off-base.

I don't see anything in the annotations that require formalism within the spec. There isn't anything concrete we can formalize about readOnly or writeOnly, for example.

You have clearly not tried to implement something that relies on these (and similar) keywords heavily. Other people have. Your dismissal of their use cases is unconvincing.

Yes. I'm saying better would be to have one output format that actually is widely-supported, instead of four formats which everyone will do a desultory job of implementing.

@gregsdennis (who wrote and maintains an implementation) put in a heroic amount of work to gather feedback and incorporate input from a wide variety of people, primarily other implementors, to produce those formats. You, on the other hand, are just asserting that it's all wrong.

We will use the extensive work done to produce that proposal and get feedback on it. If we get feedback that not all four are used, we will act on that feedback. This is what drafts are for.

If I want to automagically transclude schemas into a single schema [...]

I believe this is another instance of a problem nobody really has. It is not a terrible burden upon implementations to support taking a list of schema objects, instead of only supporting a single object.

This is an incredibly common problem. There are tools out there that just do this, and nothing else. The most popular JavaScript one gets nearly 400,000 downloads weekly from npm. Why do you think you can just assert all of these other use cases away? Your arguments are completely counter to measurable reality.


Say an organisation or group want to form a JSON Schema Form standard, which extends JSON Schema, and is uninterested in Validation. If you had a unified Core and Validation spec, they would have to unpick the bits they required from it for applicators and annotations. Yuck.

"Yuck" would be to muddle JSON Schema in order to solve for problems nobody has yet. Let's fix real problems, that people today have.

Formal extension is a problem that many people have. That is why we have spent so much time and effort on it. In particular, UI generation, code generation, and API documentation generation are not well served by the current feature set. And since we actually do refuse to make JSON Schema everything to everyone, we have drawn hard lines against adding features to support those use cases.

However, there is a great deal of demand for using JSON Schema as a base for these sorts of things. There are numerous very popular web form libraries (Mozilla maintains one, and there are others for both Angular and React). API documentation and code generation are major use cases for OpenAPI, which I hope you are aware is very popular.

I work directly with the OpenAPI Technical Steering Committee on converging their use of JSON Schema with future drafts. The alternativeSchema keyword tentatively slated for 3.1 exists primarily to answer the many requests OAS gets for being able to use more recent drafts of JSON Schema. They are also very interested in draft-08 specifically because of extensible vocabularies which promise to solve many of the problems that caused OAS to use a restricted subset of JSON Schema in the first place. One TSC member told me that he expects OAS 3.1 tooling vendors to just focus on draft-08 once alternativeSchema is out there because it is substantially more compelling than draft-06 or draft-07. And there is also a proposal up as a PR for making their subset more compatible.


Our intention is to fix this by formalizing what is and is not a sub-schema, a solution at odds with our attempts to make JSON Schema generalizable, because we'll end up locking down all possible "applicator" keyword forms.

I'm not sure who "our" is supposed to refer to? I also have no idea what you mean in general. There is nothing that locks down "applicator" keyword forms. There's no solution to this yet published, so I have no idea how you can know that it will cause such a problem.

Are you saying you want to make JSON Schema generalizable? But you also want to throw out all of the work done to establish patterns of keyword behavior which is explicitly done to make it generalizable?


This strikes as another instance of being everything to all people. Do you expect all validators to support all MIME types and content encodings?

No, and the spec notes that such a thing would be prohibitively difficult. Doing any sort of validation with contentMediaType and contentEncoding is optional (MAY). The main use cases for any implementation are embedding JSON inside of JSON strings (which I think is bizarre, but have seen done in several completely separate contexts- this is the use case for which contentSchema is being added in draft-08), and handing XML processing including XSD validation off to an appropriate parser.

This has the same problem as optional format behavior, but as has been noted repeatedly, this problem is a major focus of draft-09.


If you and Evgeny want to go have a pure validation with no $id club, go right ahead. The spec is, in fact, carefully written to ensure that pure validators are in full conformance (annotation collection and the output specification, for example, are optional, specifically to allow a pure validator to be optimized).

As for the rest, your assertions of what is and is not a real-world use case reveal a profound ignorance of what people actually ask for. None of it is convincing.

timbray commented 5 years ago

Hi, I notice my name being bandied about quite a bit here. I would like to be able to use JSON Schema, but have disliked earlier drafts. Is now a good time to take a look at the current draft?

BTW I'm super glad to hear that you're working with the OpenAPI people. Having them keep up with your progress is A Good Thing. BTW, my current project is all about high-volume heterogeneous event processing, and we really need some sort of schema formalism, but don't want to invent one. As I've mentioned before, it's characteristic of events that they have a high-level "type" field, or a combination like "source"/"type", that governs the schema used for the payload. Being able to express this declaratively in the schema, in a reasonably concise and readable way, would be a big deal for us. But I'm starting to think that this is an intractable problem at the declarative-schema level, and I may have to select from a repertory of schemas by procedurally looking at the field values before deciding which schema to use.

epoberezkin commented 5 years ago

If you and Evgeny want to go have a pure validation with no $id club, go right ahead. The spec is, in fact, carefully written to ensure that pure validators are in full conformance (annotation collection and the output specification, for example, are optional, specifically to allow a pure validator to be optimized).

My name is bandied about here too; good club, it seems. The language could have been a bit nicer, but I guess it was all deserved.

On the core subject of $ids: I can confirm that very few JSON Schema users, outside of some small silos of really advanced users, use IDs inside schemas (i.e. not in the root) and/or understand how the base URI change works and why it is needed.

pure validators are in full conformance

I’d really like to see which validators would pass Ajv tests for all $ref scenarios - no JS validator was passing them (it’s not to say that Ajv is any better because of that - it’s just to support that the current $ref spec is very complex to fully implement, and I gave up on fixing some of the rare edge cases): https://github.com/epoberezkin/test-validators

formats

They are indeed optional, but the problem is that ALL users expect them to work in a certain way, often in a different way from the spec (particularly when it comes to uri and email). Supporting formats is a constant source of learning for me (for example, I would not have known that 23:59:60 is a valid time otherwise).

But I don’t think removing them is a viable option at this point, even though they cause more contention than all other keywords. As an idea, maybe it is possible to include very formal and simplified/permissive definitions in the spec that can be expressed as simple regular expressions (instead of relying on complex definitions in specific RFCs that most validators implement as convoluted regular expressions anyway, but inconsistently) and let end users either redefine them if they need more restrictive validation or to capture the errors outside of JSON schema.
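A sketch of what that idea might look like in practice (the regular expression below is illustrative only, not a proposed standard definition): a schema author overrides a format like `email` with an explicit, deliberately permissive `pattern`, keeping the check purely structural.

```json
{
  "type": "string",
  "format": "email",
  "pattern": "^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$"
}
```

Here `pattern` performs a simple structural check that validators already support consistently, while `format` remains available as an annotation for applications that want stricter, RFC-level semantic validation outside the schema.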

handrews commented 5 years ago

@timbray thanks for commenting! We are currently wrapping up the latest draft, which focuses on establishing a consistent processing model and classifications of schema keyword behaviors, so that JSON Schema as a system can be extensible.

I need to do a read-through of the whole thing again (now that all of the major changes are in), and then we'll probably do some re-working of the sections for better logical flow and readability. For example, the "Overview" section has gotten far too long to qualify as an overview!

After that (hopefully in the next week-ish now that I've finally had time to focus on this again), we will put the result up for final review for a couple of weeks, and then publish it as the next I-D. If you're interested in taking a look during the pre-publication review period we would love to get your feedback, or you can wait until it's published and comment then. The final pre-publication review is mostly for ensuring readability- unless someone spots an egregious problem, any substantive changes will be deferred to the next draft (this one is already months late due to personal life interfering).

I consider this upcoming draft of the Core and Validation specifications to be nearly feature-complete. There are some glaring unresolved issues around extensibility, but we decided that the best way to address those was to get feedback on the parts we have worked out. There's a lot to it already and while we've had good participation here there is no substitute for people trying it out in Real Life (tm). The other major unresolved thing is providing some predictability around what are now very unpredictable "optional" validation behaviors (format and content*). The lack of consistency or control around these is a very common complaint, so we want to resolve that before moving ahead with a more formal standardization process.

This draft that we are about to publish addressed the really big questions that caused the project to stall several years ago, so we are hoping that we are over the hump and now just tying up known loose ends.

As @ucarion notes, all of this has introduced complexity, but we believe that a.) if you still want to implement a plain validator, you can do that, and b.) the most complex aspects only impact people designing extension vocabularies, or writing a full-featured extensible implementation. And there are major use cases, such as OpenAPI, who are interested in that extensibility.

handrews commented 5 years ago

@timbray regarding the usage pattern (switching schemas on a type field), the idiom that is the most straightforward (although verbose, and see below for discussion of declarativeness) is something like:

{
  "type": "object",
  "oneOf": [
    {   
      "if": {"properties": {"schemaType": {"const": "foo"}}},
      "then": {"$ref": "#/$defs/foo"},
      "else": false
    },  
    {   
      "if": {"properties": {"schemaType": {"const": "bar"}}},
      "then": {"$ref": "#/$defs/bar"},
      "else": false
    },  
    {   
      "if": {"properties": {"schemaType": {"const": "baz"}}},
      "then": {"$ref": "#/$defs/baz"},
      "else": false
    }   
  ],
  "$defs": {
    "foo": {...},
    "bar": {...},
    "baz": {...}
  }
}

There was much debate over whether the "if", "then" and "else" keywords can be considered declarative, involving discussions over the nature of the material conditional. I held out against this for a long time, but ultimately was persuaded by a.) an unusually large number of people showing up and supporting the proposal, and b.) the fact that, as we have specified them, "if"/"then"/"else" can always be rewritten in terms of "anyOf"/"allOf"/"oneOf"/"not". So it is no more or less declarative than it was before adding the keywords. And finally c.) it often results in much more understandable error messages.

The above example can be written as:

{ 
  "type": "object",
  "oneOf": [
    { 
      "allOf": [
        {"properties": {"schemaType": {"const": "foo"}}},
        {"$ref": "#/$defs/foo"}
      ]
    },
    { 
      "allOf": [
        {"properties": {"schemaType": {"const": "bar"}}},
        {"$ref": "#/$defs/bar"}
      ]
    },
    { 
      "allOf": [
        {"properties": {"schemaType": {"const": "baz"}}},
        {"$ref": "#/$defs/baz"}
      ]
    }   
  ],
  "$defs": {
    "foo": {...},
    "bar": {...},
    "baz": {...}
  }
}

Which idiom you prefer is probably a matter of stylistic preferences plus the error reporting behavior of your implementation.

As noted earlier in this thread, the forthcoming draft also proposes standardized error reporting behavior, which we hope will improve the quality and consistency of error reporting in implementations.

handrews commented 5 years ago

@epoberezkin thanks for commenting! Good to hear from you.

pure validators are in full conformance

I’d really like to see which validators would pass Ajv tests for all $ref scenarios

It would have been more accurate for me to say "it is possible for pure validators to be in full conformance, despite non-validator features having been added".

As an idea, maybe it is possible to include very formal and simplified/permissive definitions in the spec that can be expressed as simple regular expressions

I filed #54 for this idea back in 2016 :-D

and let end users either redefine them if they need more restrictive validation or to capture the errors outside of JSON schema.

This is exactly what we mean when we talk about just making format an annotation, so that applications have a standardized, well-defined way to check for formats and perform whatever additional validation or handling that they want. I think this is covered in #563 but I don't really recommend slogging through that whole thread. After we get this draft out the door I'll summarize that issue and re-file it clearly.

I am pretty sure that one or the other or both of these ideas will be a feature of the draft after this one.

We are also looking at improving the extensible vocabulary support enough to let meta-schema authors control optional behavior, e.g. "if you can't guarantee full validation of this, then refuse to process this schema at all". The worst part of format is that it's completely unpredictable whether or how any given implementation will handle it.

handrews commented 5 years ago

@epoberezkin

On the core subject of $id’s - I can confirm that very few JSON Schema users outside of some small silos of really advanced users use IDs inside schemas (I.e. not in the root) and/or understand how base URI change works and why it is needed.

Yeah I definitely agree that in terms of humans writing schemas, that feature is very, very rarely used.

I do want to point out that the "advanced users" case includes automated tooling, which I would argue is what the feature is really for. I have used that feature to package schemas into a single easily distributable file, even though the schemas are developed in many much smaller files which are easier to work with by hand and in version control.

One thing we might want to do is make those use cases more clear. We have put more information into the spec about when and why you would use these things, but perhaps if it was explicitly clear that base URI changing is primarily for programmatic tools, people who are just writing schemas by hand would be less confused over it. And people writing tools and implementations would understand why it's there a bit better.
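To illustrate that tooling use case (the URIs below are hypothetical), a build step might combine several small schema files into one distributable file by embedding each under `$defs` while preserving its original `$id`, so existing cross-references keep resolving unchanged:

```json
{
  "$id": "https://example.com/bundle",
  "$defs": {
    "person": {
      "$id": "https://example.com/person",
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "address": { "$ref": "https://example.com/address" }
      }
    },
    "address": {
      "$id": "https://example.com/address",
      "type": "object",
      "properties": {
        "street": { "type": "string" }
      }
    }
  }
}
```

Each embedded `$id` establishes a new base URI, which is why the `$ref` from `person` to `address` resolves without any rewriting. That is exactly the base-URI-change behavior under discussion.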

This is something we can look for during the final wording review for this draft.

handrews commented 5 years ago

@jdesrosiers had some excellent ideas on $id on slack today. I have written them up as #719, and am seriously considering this for draft-08. It removes some of the most confusing and least useful features, while preserving (and in fact clarifying) the embedded document use cases.

Relequestual commented 5 years ago

To your point, however, asides such as these:

(See, listening to real community needs)

Are wholly unnecessary, and perhaps a bit unprofessional. @ucarion

Apologies for the tone I assume was implied by this aside. Apologies I did not give it full consideration. I was attempting to demonstrate that we did in fact take a lot of care to listen to community needs.

You're correct, we are here to discuss technical issues. You're right to call that out. Whatever our disagreements, we want to be welcoming of open discussion.


To mention one specific comment...

Say an organisation or group want to form a JSON Schema Form standard, which extends JSON Schema, and is uninterested in Validation. If you had a unified Core and Validation spec, they would have to unpick the bits they required from it for applicators and annotations. Yuck.

"Yuck" would be to muddle JSON Schema in order to solve for problems nobody has yet. Let's fix real problems, that people today have. As Oakeshott would say: let's prefer the familiar to the unknown, the sufficient to the superabundant, present laughter to utopian bliss.

This issue I present is actually EXACTLY what OpenAPI (formerly Swagger) did, creating a sub/superset of JSON Schema because they wanted to exclude from, redefine, and add to JSON Schema. It created no end of problems with OpenAPI implementations using JSON Schema implementations "as is".


I'm hoping that a lot of the discussion here has been helpful. I would suggest that, as @awwright said, you now proceed to raise individual GitHub issues to discuss specific points, as there are too many to address repeatedly in this single thread.

Assuming you do create new issues for each point, copy relevant discussion over, and create a comment here which links to those issues.

Once you've done this a few times, I'll go ahead and close this issue.

I'll leave it unlocked, on the basis that discussion on specific issues will be discussed in OTHER new or existing github issues. No one has the headspace for multi issue mega threads.

Realistically, because of the phase draft-8 is in, we aren't going to make any grand changes now.

I feel that, and this is an attempt to summarise others' comments as well, the evidence you present either doesn't stack up or is outdated, relating to previous drafts of JSON Schema.

Relequestual commented 5 years ago

@timbray Great to see you here. I'm hoping @handrews's example was helpful. Should you have any questions on the current draft, the slack is often the best place to go. Open invite link is the "discussion" link on http://json-schema.org

bobfoster commented 5 years ago

I see you got the expected reaction. "You don't understand! There's a Good Reason for all this complexity!"

handrews commented 5 years ago

@bobfoster do you have any more convincing arguments, or are you just here to tell us that we don't know what we're doing?

ucarion commented 5 years ago

I think this issue can be safely closed out. But for posterity, I think it could be good to agree on why we're closing it. @handrews, @awwright: would you be ok with closing this ticket? I'll add my summary to the top of the ticket.

On my reading, in summary:

  1. The intention of this issue was to discuss whether JSON Schema should make IETF standardization its prime directive, and focus on simplification as the instrumental means of achieving that end.
  2. JSON Schema remains, ultimately, a project run on rough consensus. And there are not many people on this project today with enthusiasm for wrestling with standards bodies.
  3. Nor is it evident that JSON Schema can or ought to dramatically cut scope. Though there are many people who could live with just the small subset of JSON Schema that the project has long supported, there are also many people who want everything that's in the spec: present, imminent, and future.
  4. Therefore, JSON Schema shall not change its focus. The current trajectory -- of making a sophisticated, generalizable, extensible system for validating and annotating JSON-like data -- shall remain the course.

I have no doubt this approach will work, but it will take time -- when you do more, there's more to get right. There perhaps exists room for a far more modest variant of JSON Schema, more aligned with the aims I've proposed in this ticket.

Sorry this ticket had to take the form of a sort of FOSS ninety-five theses! But my goal here was to change the zeitgeist, not shave off things at the peripheries. That's why this wasn't a specific issue about a particular feature, but instead a proposal about how we think about all features.

epoberezkin commented 5 years ago

@ucarion There are several variants of XML schemas; I don’t see why there should be only one JSON schema. In my experience, an absolute majority (feels like 99%) of current users needs a small subset of features well aligned with what was suggested in this ticket, with the exception of formats (on which I commented too: I believe formats in a schema should not refer to other RFCs; instead they should be shorthand for some agreed simple regular expression that is much more permissive and simply does structural rather than semantic validation of strings, in a way similar to how JSON Schema does structural rather than semantic validation of data structures).

The only viable reason I see for base URI changes and JSON Pointers to exist is the desire to bundle multiple schemas into a single file. Unfortunately, many users believe that it is possible to bundle multiple schemas into a single schema by substituting refs; that is not possible to do statically in the general case. A simple bundle that is an array of schemas is possible, though, and is already supported by several validators. Defining it in the spec and making its support mandatory would eliminate the need for any other bundling, base URI changes, and JSON Pointers.
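A sketch of the array-style bundle being proposed here (hypothetical URIs; this is the suggestion, not current spec behavior): the bundle file is just a JSON array of otherwise independent root schemas, each identified by its root `$id`:

```json
[
  {
    "$id": "https://example.com/person",
    "type": "object",
    "properties": {
      "address": { "$ref": "https://example.com/address" }
    }
  },
  {
    "$id": "https://example.com/address",
    "type": "object",
    "properties": {
      "street": { "type": "string" }
    }
  }
]
```

A validator loading such a file would register each element under its root `$id`, so references resolve across the bundle without the base URI ever changing mid-document.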

I agree that a radically simplified spec is long due, whether it happens in the current group or outside. While there is a growing number of JSON schema users, there is a much bigger number of developers who do not use it. The current level of complexity is a serious blocker for JSON schema adoption.

handrews commented 5 years ago

Closing per @ucarion's last comment.