json-schema-org / json-schema-spec

The JSON Schema specification
http://json-schema.org/
Other
3.43k stars 251 forks source link

Proposal: Make `format` validate by default #1520

Open gregsdennis opened 2 weeks ago

gregsdennis commented 2 weeks ago

There's a long and sticky history around format.

  1. Going back all the way to Draft 01, format has never required validation.
  2. Whether to support format validation has always been the decision of the implementation.
  3. The extent to which formats are validated has also been the decision of the implementation.

The result of all of this is that implementation support for validation has been spotty at best. Despite the JSON Schema specs referencing very concretely defined formats (by referencing other specs), implementations that do support validation don't all support each format equally. This has been the primary driving force behind keeping format as an opt-in validation.

With 2019-09, we decided that it was time to give the option of format validation to the schema author. They could enable validation by using a meta-schema which listed the Format Vocabulary with a true value, which meant, "format validation is required to process this schema."

In 2020-12, we further refined this by offering two separate vocabularies, one that treats the keyword as an annotation and one that treats it as an assertion. The argument was that the behavior of a keyword shouldn't change based on whether the vocabulary was required or not.

However, the fact remains that our users consistently report (via questions in Slack, GitHub, and StackOverflow) that they expect format to validate. (The most recent case I can think of was only last week, in .Net's effort to build a short-term solution for schema generation from types.)

This consistency in user expectations leads me to believe that we should officially make format an assertion keyword and strictly enforce it by moving the appropriate tests into the required section of the Test Suite.

(Personally, I'm not passing all of the optional format tests, so I'll have to do some work to get there or document why they're not supported.)

mwadams commented 2 weeks ago

FWIW I am of like mind.

(Corvus.JsonSchema does pass all the optional format tests, so that's easy for me to say!)

jdesrosiers commented 2 weeks ago

I agree that users expect that format will assert and it's probably best that the expected behavior is the default. I say that reluctantly considering that my implementation currently doesn't support format-assertion at all and I'm not excited about the effect that a requirement will have on my bundle size.

jviotti commented 2 weeks ago

I also agree users expect format to perform assertions and are almost always confused when they first see they don't. However, if we bring back format as an assertion, how do we prevent the issues that caused it to become an annotation by default?

As described in https://json-schema.org/draft/2019-09/release-notes#format-vocabulary, it was very hard to make implementations interoperable on the validation they should perform and at which degree, and these inconsistencies sometimes ended up being more painful and confusing than format not being an assertion. Keep in mind that some formats are specifications on their own that can be very tricky to validate (like URIs?).

Sounds like if we go in this direction, we should not only make the currently optional tests required, but also extend them a lot more, which might be tricky given that some formats don't even have an "official" test suite of their own (again like URIs?)

Maybe a middle ground would be to make format an assertion while also being a bit more prescriptive about i.e. the expected regular expressions that implementations should use as a minimum bar?

gregsdennis commented 2 weeks ago

how do we prevent the issues that caused it to become an annotation by default

It was always an annotation (or before annotations were a thing, it just wasn't validated).

gregsdennis commented 2 weeks ago

I agree having some minimal requirement set for each format is probably warranted.

gregsdennis commented 2 weeks ago

Another question this raises (as highlighted by @jdesrosiers' https://github.com/json-schema-org/json-schema-spec/pull/1510#discussion_r1646587826) is how we want implementations to handle formats they don't understand. Should implementations automatically pass validation for these or fail them?

I think an implementation failing validation (or maybe even refusing to process the schema) gives a more expected outcome when compared with an implementation that does support the format. Getting a pass from an implementation that doesn't know the format when one that does fails feel more wrong to me.

Thoughts?

jdesrosiers commented 2 weeks ago

Sounds like if we go in this direction, we should not only make the currently optional tests required, but also extend them a lot more

Agreed. I think the biggest concern was the inconsistency between implementations. The test suite can address that issue. Maybe it's not perfect, but it can get better over time as we find edge cases. I think the only way making format assert by default is a viable option is if we're fairly strict about what that validation behavior has to be. If we keep the lax requirements of the past, we'll just end up with the same problems.

jdesrosiers commented 2 weeks ago

Should implementations automatically pass validation for these or fail them?

I have different opinions depending on how we end up defining what asserting on format requires. If we're strict about how each format must validate (within reason), then I think implementations should refuse to process a schema with a format it doesn't know. To be clear, I mean it should be considered an invalid schema, it should not evaluate to false. But, if we stick with the current anything-goes requirements, then an unknown format should return true.

SorinGFS commented 2 weeks ago

IMO format should exist in both states all the time, not enabled by some sort of config or vocabulary. The reason for this is because there are plenty of situations when we need both in the same schema! As it stands now, draft 2020 has format as annotation and schemas may use pattern to validate a format, so it may use both states of format, but if you are going to make format to validate by default this option goes away! When I analyzed about 700 schemas from schemastore I noticed that surprisingly format was rarely used, probably due to ambiguity around this keyword. Making format to validate by default would rise the ambiguity around the keyword to highest level. Moreover, making format to validate by default would be a massive breaking change for plenty of schemas that didn't expect the format to be validated. And, json-schema has defined patterns for format validation, but those formats are not the same across various languages which are supporting different standards. Since json-schema advertises that is a language agnostic standard I consider that json-schema should not decide which patterns should be used to validate formats. All these being said, I consider that format-validation should be completly removed, format keyword should stay as annotation and users should be encouraged to use pattern! This approach would eliminate the ambiguity for ever. But, if you won't accept the above solution I think at least you should create a second keyword for format:

gregsdennis commented 2 weeks ago

if you are going to make format to validate by default this option goes away

No, it doesn't. People can still use x-format (or any x- keyword) if they must have an annotation.

Making format to validate by default would rise the ambiguity around the keyword to highest level.

Making it always behave either way would definitively remove ambiguity. Making it validate would align with users' expectations.

If you're referring to the level of support for each format offered by different implementations, other comments in this issue address that by saying we need more rigorous (and non-optional) testing.

making format to validate by default would be a massive breaking change for plenty of schemas

It probably would be a breaking change for many schemas that don't declare $schema, yes. This is why we recommend using it.

On the other hand, it will fix the multitude of schemas that exist in the wild which expect format validation.

It's also going to be a burden for many tooling maintainers because a lot of them don't support these formats fully. (I'm one of them.)

those formats are not the same across various languages which are supporting different standards. Since json-schema advertises that is a language agnostic standard I consider that json-schema should not decide which patterns should be used to validate formats.

JSON Schema very clearly defines the specifications for each format, and they're all language-agnostic (except maybe regex which declares ECMA-262). If tools are implementing other specifications, then they are not conforming to JSON Schema and need to be fixed anyway.

format keyword should stay as annotation and users should be encouraged to use pattern

Not all formats can be well-represented with regex, which is the only validation pattern provides. Further, not everyone agrees on the regexes to use for each format. IMO forcing people to use pattern doesn't fix anything and may actually make the problem worse.

gregsdennis commented 2 weeks ago

@Julian what would it take to get Bowtie to report on formats, maybe even just locally, so we can get some rough numbers?

awwright commented 2 weeks ago

Indeed there's a great need to have a "format" keyword that validates. The difficulty of going about with this before has been a couple things:

Even when unknown keywords are ignored, validation keywords typically cause errors when their value is outside the permitted values, so I'd expect an error. Similarly, a validating "format" is not too different from a $ref that has predefined names—you're referencing some external, arbitrary validator, and if you don't know what that is, that's an error.

SorinGFS commented 2 weeks ago

@gregsdennis

Not all formats can be well-represented with regex, which is the only validation pattern provides. Further, not everyone agrees on the regexes to use for each format. IMO forcing people to use pattern doesn't fix anything and may actually make the problem worse.

Can you give some examples? And if a format wouldn't be well-represented with regex how would that format be validated by implementers?

gregsdennis commented 2 weeks ago

A validating "format" rejects all types of values, instead of within a single type (like most keywords). - @awwright

Could you clarify this? Are you saying that a validating "email" format would reject a number? I can't find such a requirement. In fact the (optional) test suite verifies that format ignore value types to which the format doesn't apply.

Maybe I'm misunderstanding you.

Can you give some examples? - @SorinGFS

email & uri can't be completely represented by a regex. Some rather large regexes get close, but I haven't seen a perfect one.

And if a format wouldn't be well-represented with regex how would that format be validated by implementers? - @SorinGFS

These generally need parsers. The spec says that it's expected that an implementation will rely on established functionality to perform the validation.

SorinGFS commented 1 week ago

@gregsdennis

These generally need parsers. The spec says that it's expected that an implementation will rely on established functionality to perform the validation.

Well, I think this is precisely the point where this concern goes into ambiguity: the same format: "uri" could be interpreted differently from one implementer to another, and from user's perspective I think this shouldn't be acceptable. I think json-schema should stick to things that are giving similar results in any context.

email & uri can't be completely represented by a regex. Some rather large regexes get close, but I haven't seen a perfect one.

As for the email part using the pattern is exactly the perfect solution, because the schema author has the authority to decide which pattern is expected by own application.

awwright commented 1 week ago

Could you clarify this? Are you saying that a validating "email" format would reject a number? I can't find such a requirement. In fact the (optional) test suite verifies that format ignore value types to which the format doesn't apply.

So you're right, most of the formats won't reject e.g. numbers. There's still a related problem when if I have a schema like { "type": ["number", "string"] }: Defining separate formats for the numbers and strings. Suppose I want to say numbers are unix timestamps, and strings are RFC3339 dates. While if/then (or oneOf) is a solution, selecting on "type" is supposed to have another solution: multiple keywords that don't cross "type" boundaries.

More specifically: if it's possible to break down a keyword into multiple parts, because someone might want one to reject and not the other, then separate keywords are justified. "type" is the most common instance of this.

jdesrosiers commented 1 week ago

making format to validate by default would be a massive breaking change

I think everyone has recognized this is a breaking change, but I'm glad this was brought up because we haven't stated explicitly the consequences of that fact. After the stable release, breaking changes won't be allowed, so if we want to make this change, it would have to be before the release.

format keyword should stay as annotation and users should be encouraged to use pattern

In addition to Greg's response to this, there are other problems with relying on pattern. Regular expressions can be hard to write and hard to read. That makes them error prone, hard to maintain, and impossible to produce useful error messaging for. Being able to use "format": "email" rather than an incomprehensible pattern is a better experience for schema authors as well as schema consumers... as long as there's a strict and enforced definition of how `"format": "email" should validate.

SorinGFS commented 1 week ago

Being able to use "format": "email" rather than an incomprehensible pattern is a better experience for schema authors

I totally dissagree on this one, and I think others should express their opinion on this matter. As for me, I will always prefer pattern as schema author because it gives me the oportunity to twick the pattern until I get the desired result, and from that on I will have the guarantee that I will get precisely what I intended in my app... while the format keyword is fix! And despite the fact that format is fix... the results across implementers would vary! In fact, I think this should be a wider discussion, not about the format keyword itself, but about up to where shoud go the json-schema concern. IMHO json-shema should respect a simple rule in this matter: no concern for matters that cannot produce similar results among implementers under any circumstances! All those concerns should go to application level where the schema author has the ability to require exactly what he needs in his application. The results of the implementers for the same schema must be identical by design, meaning that json-schema should only provide capabilities that cannot be interpreted differently.

SorinGFS commented 1 week ago

@jdesrosiers

I think validating uri doesn't even belong to format. In general format validation is required on user inputs, and I think nobody expects to validate uris in a frontend. That would be terribly wrong! They may expect to validate a uri just for being a string instead of a number, but actual validation of the string provided in frontend would be made in backend under application's control.

And if we think about validating $refs keyword definition that part also doesn't require validation because the actual place where the validation occures is where accessing $ref in a schema would succed or not.

SorinGFS commented 1 week ago

@jdesrosiers

similarly, if we talk about date formats IMO date values should always travel the network as number. Probably that's why json doesn't have a type: date or variations. Data interchange with date as number will always guarantee the precision, then is up to applications how they choose to display or collect date values.

SorinGFS commented 1 week ago

@jdesrosiers

I totally understand the need for some sort of specifications about how imputs should look like in a frontend. But I think format validation is not the answer. There is a missing piece of puzzle there: inputType. Having inputType schema authors may provide a hint about the expected type of input, but this wouldn't have any validation role, the validation would be made by other keywords provided in the schema location.

jdesrosiers commented 1 week ago

@SorinGFS I think we're talking about two different things. If I understand you correctly, you understand format to be a kind of semantic identifier. "format": "date" would indicate that the value represents a date. One schema might represent a date as a string and another might represent a date as a number, but they're both semantically a date. Given that interpretation, your arguments makes perfect sense.

But, that's not the way format is defined in the spec. Every format has a very specific expectations defined by a standard. For example, "date" is defined as a string that conforms to RFC 3339, section 5.6. If you use "format": "date" where you expect a timestamp, you're using format in a way that isn't correct according to the spec. All the defined formats are the same. They aren't open-ended. They all have very specific definitions that aren't open to interpretation by schema authors.

SorinGFS commented 1 week ago

But, that's not the way format is defined in the spec.

I didn't say that using format: date would be wrong acording to the spec. I say that the very existence of format: date (and date-time, and time, and duration) in the spec which are strings interpreted as date values is wrong! Wrong acording to the logic! Here json-schema leaves the precision area by introducing an element of ambiguity, offering the user the possibility to transport date in the form of a string while the data type for date is number! Data interchange for date values should only be made as number, and how that date value transported as number is displayed should be completely another concern.

gregsdennis commented 1 week ago

data type for date is number

This is an incorrect assumption. Different systems handle dates differently. Some (like many spreadsheets) internally represent dates as numbers.

.Net uses the DateTime struct, an object with multiple fields. When serializing a DateTime to JSON, .Net chose to represent the date as an ISO 8601 string.

Data interchange for date values should only be made as number, and how that date value transported as number is displayed should be completely another concern.

This is a valid opinion, but it is not an industry standard or best practice.

JSON Schema (long ago) made a decision that the ideal way to represent dates in JSON should be strings per RFC 3339. My guess is that, at the time, JSON Schema was published under IETF, and they wanted to use an IETF standard for date representation. Since then, the majority of the internet (based on my experience, which is admittedly .Net heavy) seems to have decided they like ISO 8601 better. (Happy to be proven wrong.) So if anything, we should be changing format: date to check for ISO 8601 dates.


But that's not what this discussion is about.

We're not discussing any format specifically. We're discussing whether the format keyword in general should validate. I'm marking any of your comments that do not address this specifically as off-topic.

In the future, please keep discussions on topic. If you have something new to discuss, please open a new discussion. If it's related, add a link.

SorinGFS commented 1 week ago

@jdesrosiers @gregsdennis

IMHO, as a rule of tomb, wherever json-schema requires an external library to function is a clear indication that it has exceeded its competence area. Json-schema must work using the basic capabilities of a language, with zero dependencies. Which is not the case with format: date,date-time,time,duration, which needs an external library to work.

SorinGFS commented 1 week ago

JSON Schema (long ago) made a decision that the ideal way to represent dates in JSON should be strings per RFC 3339. My guess is that, at the time, JSON Schema was published under IETF, and they wanted to use an IETF standard for date representation. Since then, the majority of the internet (based on my experience, which is admittedly .Net heavy) seems to have decided they like ISO 8601 better. (Happy to be proven wrong.) So if anything, we should be changing format: date to check for ISO 8601 dates.

that is another indication that json-schema exceeded its competence area by entering in a land of 'choices and preferences' against precision! 😄 Since json-schema as you said 'long ago' choosed to use RFC 3389 things are changed: that was just a 'proposed standard' and... now is updated with 9557

SorinGFS commented 1 week ago

We're not discussing any format specifically. We're discussing whether the format keyword in general should validate. I'm marking any of your comments that do not address this specifically as off-topic.

No, I don't think this is off-topic, because it matters which formats are in discussion to be validated by default. If you ask me this whole discussion should be ... off-topic, since is a wrong thing to do in my opinion.

SorinGFS commented 1 week ago

@gregsdennis

This is an incorrect assumption. Different systems handle dates differently. Some (like many spreadsheets) internally represent dates as numbers. .Net uses the DateTime struct, an object with multiple fields. When serializing a DateTime to JSON, .Net chose to represent the date as an ISO 8601 string.

I'm not talking about ancient languages. Even your .Net struct example object has all inner properties as number, and then comes the concern of conversion to string (which again, is a separate concern even in your example). In database you are probably (you should be) storing the dateTime as a 64bit number.

Julian commented 1 week ago

Just to have my personal opinion documented here (which I've shared elsewhere although probably piecemeal):

So I'd be -1 personally for my vote, and would instead recommend inventing new keywords which can be richer (and express more nuance in how they allow for different "levels" of compliance).

Julian commented 1 week ago

@Julian what would it take to get Bowtie to report on formats, maybe even just locally, so we can get some rough numbers?

It's a bit of work, since I'd have to go enable format validation in each implementation, but nothing too crazy? If it's really helpful I can try to get to it this week at some point.

gregsdennis commented 1 week ago

There's still a related problem when if I have a schema like { "type": ["number", "string"] }: Defining separate formats for the numbers and strings. Suppose I want to say numbers are unix timestamps, and strings are RFC3339 dates. While if/then (or oneOf) is a solution, selecting on "type" is supposed to have another solution: multiple keywords that don't cross "type" boundaries. - @awwright

I see what you're getting at (I think, correct me if I'm wrong), but format already doesn't support multiple values, e.g. format: ["timestamp", "datetime"], where timestamp applies to numbers as you suggest. I like the idea, actually, but it's not something we currently do.

Users have to do something like this currently:

{
  "type": ["number", "string"],
  "allOf/anyOf/oneOf": [
    { "format": "timestamp" },
    { "format": "datetime" }
  ]
}

(You can use allOf in this case because format ignores instances that aren't the right type. You only have to use oneOf/anyOf if you have two formats that validate the same type; then they probably aren't both valid.)

Still, I'm not sure how this applies to whether we have format validate by default or not. I think most users would expect both format keywords to validate in this case.

... because someone might want one to reject and not the other

This also isn't really something that the spec currently or historically supports. It's been all-or-nothing: either format validates or it doesn't. While an implementation technically has the right to allow its users to select which formats are validated and which aren't, that behavior isn't and hasn't been supported by the specification.

Supposing that a schema author did want only annotation behavior, yes, I agree with the rest of that statement: that they'd need to use different keywords. And in this case, I'd recommend using format, which validates, and x-format which doesn't.

awwright commented 1 week ago

Still, I'm not sure how this applies to whether we have format validate by default or not. I think most users would expect both format keywords to validate in this case.

To bring it back around, it's because the primary motivation for https://github.com/json-schema-org/json-schema-spec/issues/1391 was this issue, to introduce "format" keywords that are definitively validation keywords. Since validation keywords often apply only to a single type, typed keywords e.g. "stringFormat" would be a convenient way of adding format validation.

gregsdennis commented 1 week ago

That makes sense. Thanks for the connection.

karenetheridge commented 1 week ago

if you are going to make format to validate by default this option goes away

No, it doesn't. People can still use x-format (or any x- keyword) if they must have an annotation.

...or they can define their own metaschema, with the format-assertion vocabulary removed and format-annotation added. (Or, we can formally publish a metaschema variant with this change in it already. You may recall that we've previously discussed publishing a draft2020-12 metaschema with format-assertion enabled; and then with this change, for the next spec release we would have the hing only in reverse: the "main" schema has format-assertion turned on, and the "secondary" schema has it turned off).

gregsdennis commented 1 week ago

or they can define their own metaschema, with the format-assertion vocabulary removed and format-annotation added

This isn't an option anymore. We're removing vocabularies. They're being extracted into the new Feature Life Cycle as a proposal. That means the specification can't rely on vocabs anymore.

There is no format-assertion or format-annotation vocab. There's only the format keyword now.

karenetheridge commented 1 week ago

A validating "format" [potentially] rejects all types of values, instead of within a single type (like most keywords). This makes use with multi-type schemas (e.g. "type": ["integer", "string"]) more difficult.

We could explicitly state in the spec that while a format may choose to only match strings, it MUST NOT reject other data types for which it is not possible to have valid data (that is: you can have a format that matches against both string and number values, but it cannot produce a "valid":false result for objects, arrays etc. Or to rephrase, a "valid":false result may only be produced for a particular data type so long as there is at least one instance of that type for which "valid":true would be produced).

karenetheridge commented 1 week ago

This isn't an option anymore. We're removing vocabularies.

I did not have the impression that this had been decided yet.

gregsdennis commented 1 week ago

This isn't an option anymore. We're removing vocabularies.

I did not have the impression that this had been decided yet.

It's been in PR for a month, and it was part of an ADR at the beginning of May.