Type specific validating formats (stringFormat, numberFormat)

awwright commented 1 year ago

The "format" keyword has historically changed functionality, it's gone back and forth from being a validation keyword, to annotation, back to a validation keyword if you specify (out-of-band) that it's a validation keyword. The fact this is specified out-of-band makes it impossible to determine the right way to "upgrade" the keyword between dialects of a schema.

This ambiguous behavior doesn't make a lot of sense; it seems to me there ought to be one keyword for annotation and another for validation.

When defining a validation "format" keyword, usually validation only happens when the instance is of one type or another (e.g. "minimum" doesn't do anything if the input is a boolean). If I'm using "format" as a validation keyword, and I want it to apply only to strings, I have to use more sophisticated logic. This won't work as expected:

{ "type": ["number", "string"], "format": "date" }

Number inputs will always fail. Instead, I have to write:

{ "oneOf": [ {"type": "number"}, { "type": "string", "format": "date" }] }

But this is complicated. I should be able to do something like:

{ "type": ["number", "string"], "stringFormat": "date" }

Then, these new type-specific formats could be validation keywords, leaving "format" to be an annotation-only keyword:

There would be "stringFormat". I believe all of the existing formats are string formats.
"numberFormat" would accept "integer" (no decimal point allowed, even 1.0), "float", or "exponential" (e.g. 1e5)
Potentially "objectFormat" and "arrayFormat" because it might be useful for some niche applications
(null/boolean don't need formats, being very small value spaces)

Some additional features:

The keyword would only be defined for values that the validator knows. That is, if an unknown value for the typed-format keywords was provided, it would fail the same way unknown keyword would.
A URI could be provided, to allow for one-off, user-defined formats that bypass standardization. For example, if I want to represent an ISO 8601 period, I could write down {"stringFormat": "http://example.com/format/period"}
Formats could refer not just to standard syntaxes, but also references to outside validators, or nonstandard sets. e.g. I could write {"numberFormat": "https://example.org/numberFormat/A000045"} to refer to all numbers that are in the Fibonacci sequence.
(as an idea) In the event a format is renamed, or a URI format is standardized, the typed-format keywords could accept a space-delimited list of format names or URI names; this would mean "all of these formats are the same, use any one that you understand." e.g. if the above period format gets standardized as "period", then you could write { "stringFormat": "period http://example.com/format/period" } to indicate using either definition is OK, they're the same thing.
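
To make the semantics listed above concrete, here is a minimal Python sketch. It is an illustration only and not part of any specification or validator: `check_string_format`, `STRING_FORMATS`, and `UnknownFormatError` are hypothetical names, and the "date" checker is a simplified RFC 3339 full-date test. The sketch assumes three things taken from the list: the keyword ignores non-string instances, an unknown format name is an error rather than a silent pass, and a space-delimited value means "use the first name you understand".

```python
import re
from datetime import date


class UnknownFormatError(Exception):
    """Raised when none of the listed format names is known (hypothetical)."""


def _is_full_date(value: str) -> bool:
    # Simplified RFC 3339 full-date: YYYY-MM-DD that is a real calendar date.
    if not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", value):
        return False
    try:
        date(int(value[0:4]), int(value[5:7]), int(value[8:10]))
    except ValueError:
        return False
    return True


# Hypothetical registry: format name or URI -> checker function.
STRING_FORMATS = {"date": _is_full_date}


def check_string_format(names: str, instance) -> bool:
    # Like other type-specific keywords, non-strings are ignored (they pass).
    if not isinstance(instance, str):
        return True
    # Space-delimited alternatives: use the first name this validator knows.
    for name in names.split():
        checker = STRING_FORMATS.get(name)
        if checker is not None:
            return checker(instance)
    # No known name: fail loudly instead of degrading to an annotation.
    raise UnknownFormatError(names)
```

Under these assumptions, `check_string_format("date", 42)` passes because the instance is not a string, while a value naming only an unregistered URI raises `UnknownFormatError` instead of passing silently.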

Blockers: This depends on "unknown keywords prohibited" being a feature, otherwise these proposed keywords will just be annotation keywords.

Related: #1383, #1284

Julian commented 1 year ago

This is unnecessary API churn, and additionally churn which sends a poor message about the vocabulary system, in my opinion.

awwright commented 1 year ago

Can you elaborate? I believe I addressed all of those points... specifically that the existing "format" keyword is not reliable enough for many uses. It doesn't deprecate any existing functionality, so I'm not sure how "churn" is a concern.

Julian commented 1 year ago

There would be "stringFormat". I believe all of the existing formats are string formats.

This isn't correct, as I've mentioned previously when you brought this up, and Karen did recently as well. The only core formats are defined over strings, but we explicitly allow extra formats to be defined outside the spec, and for them to be defined on any primitive type. Folks almost certainly have done so.

By churn I mean "you are proposing changing the way something is done in a way that might be clearer but isn't sufficiently better to overcome the cost of retraining". I don't know what more to elaborate on unfortunately, I simply don't see value in this kind of change personally. I think format is not good, but not defective enough we should introduce something to replace it.

Blockers: This depends on "unknown keywords prohibited" being a feature, otherwise these proposed keywords will just be annotation keywords.

I don't see any relationship FWIW -- if somehow people agree this is useful, of course if the spec defines keywords they're not unknown and wouldn't be annotation keywords.

awwright commented 1 year ago

The only core formats are defined over strings, but we explicitly allow extra formats to be defined outside the spec

Right, I was just pointing out there are no core formats except string formats—and if others have defined custom non-string formats, this proposal doesn't affect those. And if you want to move to a typed format, you just have to pick the correct type.

you are proposing changing the way something is done in a way that might be clearer but isn't sufficiently better to overcome the cost of retraining

Not quite. I believe what I'm proposing has no current equivalent: there is no way to require "format" to validate. It is allowed to be inconsistent and there's no way to tell if an annotation was intended, or if validation was intended. typed-format neatly solves this.

I don't see any relationship FWIW -- if somehow people agree this is useful, of course if the spec defines keywords they're not unknown and wouldn't be annotation keywords.

The benefit of the typed formats is they validate, and won't be ignored if the format is unknown, which is the current situation. But this depends on "unknown keywords prohibited" first being written in.

Julian commented 1 year ago

there is no way to require "format" to validate.

The current way is you declare your schema to use a dialect which uses the format assertion vocabulary (rather than the annotation vocabulary which the default dialect uses).

gregsdennis commented 1 year ago

"numberFormat" would accept "integer" (no decimal point allowed, even 1.0), "float", or "exponential" (e.g. 1e5)

JSON Schema cannot validate the text format in which a value has been encoded because it operates on the JSON data model. The text has already been parsed into a number by the time JSON Schema gets it. The best we can do is detect if the number has a fractional part.

This is akin to trying to get JSON Schema to validate that the input JSON had no line breaks or extra whitespace (minified).

gregsdennis commented 1 year ago

Blockers: This depends on "unknown keywords prohibited" being a feature, otherwise these proposed keywords will just be annotation keywords.

This isn't a blocker anymore. We've decided to do this. It's just not in the document yet.

awwright commented 1 year ago

JSON Schema cannot validate the text format in which a value has been encoded because it operates on the JSON data model. The text has already been parsed into a number by the time JSON Schema gets it. The best we can do is detect if the number has a fractional part.

When it comes to "numberFormat" what I'm proposing may be a bit of a change to the data model... I don't think it's completely unreasonable, there are parsers that distinguish 1 and 1.0 and this would support those. And JSON is ambiguous as to which numbers are supposed to be equal to each other. Supporting this would be the same situation as supporting bigints in multipleOf, most validators won't be able to support it, but there's some that might.
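
As an illustration of how parser-dependent this is, Python's standard `json` module is one parser that keeps part of the distinction: it returns `int` for `1` and `float` for `1.0`, although the two still compare equal. The exponential spelling is not preserved, since `1e5` simply becomes a float. `number_kinds` is a hypothetical helper written for this sketch.

```python
import json


def number_kinds(text: str) -> list[str]:
    """Return the Python type name of each element of a JSON array."""
    return [type(value).__name__ for value in json.loads(text)]


print(number_kinds("[1, 1.0, 1e5]"))         # ['int', 'float', 'float']
print(json.loads("1") == json.loads("1.0"))  # True: same mathematical value
print(json.loads("1e5"))                     # 100000.0: the notation is gone
```

So with this particular parser, "integer" versus "float" notation could be observed after parsing, but "exponential" could not.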

It's definitely not the same as whitespace, which is unambiguously not important.

And if this actually is unpopular or a bad idea, then we can omit integer/float/exponential from "numberFormat" and only permit numberFormat to distinguish the mathematical (scalar) values.

This isn't a blocker anymore. We've decided to do this. It's just not in the document yet.

The devil may be in the details... and what I mean is, this can't be written in until "prohibit unknown keywords" is written in.

gregsdennis commented 1 year ago

there are parsers that distinguish 1 and 1.0 and this would support those

Unless all parsers/models do this, we can't require it. Supporting only the subset of parsers that result in a data model that makes this distinction is not interoperable.

Supporting this would be the same situation as supporting bigints in multipleOf, most validators won't be able to support it, but there's some that might.

And the tests we have for this support are optional for this reason.

we can omit integer/float/exponential from "numberFormat"

I'm not opposed to these formats. I'm opposed to validating textual encoding when this is not within the capability of all parsers, many of which are built into the language/framework. We can't require the ability to differentiate between the text encodings 1 and 1.0 when they both render in the data model as 1 (it's a number, the encoding is irrelevant); but we can require differentiating between 1 and 1.1, so this can still be supported.

Essentially, if we want

{
  "type": "number",
  "format": "integer"
}

that's fine, but it needs to pass validation for both 1 and 1.0 while failing validation for 1.1.
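
A minimal Python sketch of that data-model reading of "integer", assuming the check looks only at the parsed value and never at the original text. `check_integer_format` is a hypothetical name, not an API of any validator.

```python
import json


def check_integer_format(instance) -> bool:
    """Pass numbers with no fractional part; ignore every other type."""
    # bool is a subclass of int in Python, but a JSON boolean is not a number.
    if isinstance(instance, bool):
        return True
    if isinstance(instance, int):
        return True
    if isinstance(instance, float):
        return instance.is_integer()
    return True  # strings, arrays, objects, null: not this keyword's concern


for text in ("1", "1.0", "1.1"):
    print(text, check_integer_format(json.loads(text)))
# 1 True
# 1.0 True
# 1.1 False
```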

This won't work as expected:
{ "type": ["number", "string"], "format": "date" }

This does work as expected. We have tests that ensure this (and I had to make code changes to pass those tests).

Formats are already typed in that they only respond to a particular type, like other validator keywords. However, to allow multiple formats, you do have to do the anyOf thing.
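
The behavior described here can be sketched in a few lines of Python. This is a toy evaluator that understands only "type" and an asserting "format": "date"; `validate`, `json_type`, and the simplified, syntax-only `DATE_PATTERN` are assumptions made for the sketch, not the logic of any real implementation.

```python
import re

# Syntax-only approximation of a date; a real checker would do more.
DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def json_type(instance) -> str:
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(instance, bool):
        return "boolean"
    if isinstance(instance, (int, float)):
        return "number"
    if isinstance(instance, str):
        return "string"
    return "other"


def validate(schema: dict, instance) -> bool:
    allowed = schema.get("type")
    if isinstance(allowed, str):
        allowed = [allowed]
    if allowed is not None and json_type(instance) not in allowed:
        return False
    # "format": "date" asserts only on strings; other types are ignored.
    if schema.get("format") == "date" and isinstance(instance, str):
        return DATE_PATTERN.fullmatch(instance) is not None
    return True


schema = {"type": ["number", "string"], "format": "date"}
print(validate(schema, 5))             # True: format ignores numbers
print(validate(schema, "2020-04-05"))  # True
print(validate(schema, "not a date"))  # False
```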

awwright commented 1 year ago

I'm opposed to validating textual encoding when this is not within the capability of all parsers, many of which are built into the language/framework.

I'm not suggesting this should be required, the same way that supporting big numbers isn't required... but for the validators that do make a distinction between 1 and 1.0, there should be a standard way to indicate this. ("type": "integer" is mathematical/scalar, "numberFormat": "integer" is syntactical).

This does work as expected.

Hm, I remember this now, I'm going to have to re-think this then.

awwright commented 1 year ago

there is no way to require "format" to validate.

The current way is you declare your schema to use a dialect which uses the format assertion vocabulary (rather than the annotation vocabulary which the default dialect uses).

Yes you're right, though this is much extra work, I don't think I've seen this in the wild. You have to ship two schemas as separate documents, instead of one. It seems to be much more overhead than what most people are willing to accept.

Julian commented 1 year ago

Yes you're right, though this is much extra work

I don't know what you mean, it's not any work on behalf of the schema author once someone (one person, undoubtedly someone has already done this) publishes some dialect with ...format-assertion: true in it, at which point now the person authoring the schema has no additional work whatsoever to do.

As I say, I'm pretty -1 on this kind of idea personally, but you may find someone else who sympathizes obviously.

(I'm ignoring all the discussion above on integer formats, I don't agree with some of the back and forth, but I don't think it's central to what you're proposing anyhow).

awwright commented 1 year ago

I don't know what you mean

Let's try an exercise, I have an object like { "isbn": "978-4-04-893705-4", "published": "2020-04-05" }

I want "isbn" to annotate and I want "published" to validate against an RFC3339 date. How do I do that? If you can't do that without looking up the right keywords, then I'd like to suggest it's too complicated.

Julian commented 1 year ago

That's different than what you previously said, but also not an issue today, you use either:

{
  "$schema" : "somedialectwithformatassertiontrue",
  "properties": {
    "isbn": {"formatNoValidate": "isbn"},
    "published": {"format": "date"}
  }
}

or

{
  "properties": {
    "isbn": {"format": "isbn"},
    "published": {
      "$schema": "somedialectwithformatassertiontrue",
      "format": "date"
    }
  }
}

Or an allOf.

awwright commented 1 year ago

@Julian Neither of these are standard solutions that will work across validators, since you're talking about a $schema value that doesn't even exist yet, and has to be written.

Or if the custom schema does work across validators, then presumably, you omitted the contents of it because it's lengthy or difficult to write. Right?

Julian commented 1 year ago

What I wrote will work in any implementation with support for vocabularies (and the format assertion vocabulary). Yes, I didn't Google for who has previously written the trivial metaschema enabling assertion for format; as I say, undoubtedly someone has done so and published it at a URI anyone can use.

I also honestly think with all due respect that I've both spent enough time thinking about this idea and also have explained why I don't see value in it personally, so I'll probably bow out from the issue at this point.

Relequestual commented 1 year ago

Chiming in at this point, I've speculated, and others have agreed it should be viable, that you can bundle the meta-schema with the schema. If the vocabularies are known (such as the format assertion vocabulary) and supported by the implementation, that should be enough.

Do implementations support that today? I've not verified it. Should we verify if this works today? Absolutely. Should we try to sidestep the vocabulary system because there aren't many implementations that support it properly? Categorically no. Can we provide better documentation, explanations, and examples, to enable better and more complete implementations? Very much so. Can we even incentivise implementers to support what's needed? YES, and we very much should.

awwright commented 1 year ago

Is that really the best solution to this problem though? So far nobody has been able to provide a schema that would work reliably across validators. If you can't post it here, nobody's going to understand it on Stack Overflow.

Is the argument seriously that separate annotation and validation keywords is inferior?

gregsdennis commented 1 year ago

I think we also need to consider the reason we all hate format and why we made it an annotation in the first place: it's open-ended. That aspect alone makes it very hard to ensure interoperability for custom formats. Even for spec-defined formats, the validation support is "best effort."

Personally, I'd prefer a different solution that doesn't allow custom values, but the only thing I can think of is a bunch of (e.g.) date-time: true keywords, and that doesn't seem very user-friendly.

Regarding the format-assertion meta-schema, we had a discussion somewhere about creating and publishing one, but we decided against doing so since it was trivial to make one. (Take the standard meta-schema and change "format-annotation" to "format-assertion"... and probably change the meta-schema $id.)
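
A rough Python sketch of that derivation, assuming the 2020-12 vocabulary URIs. `to_format_assertion` is a hypothetical helper, the `standard` dict is an abbreviated stand-in for the real meta-schema, and the `example.com` `$id` is a placeholder. One detail the sketch has to handle: relative `$ref` values resolve against `$id`, so they are made absolute before the `$id` is changed.

```python
import copy
from urllib.parse import urljoin

ANNOTATION = "https://json-schema.org/draft/2020-12/vocab/format-annotation"
ASSERTION = "https://json-schema.org/draft/2020-12/vocab/format-assertion"


def to_format_assertion(metaschema: dict, new_id: str) -> dict:
    """Return a copy of a meta-schema that requires format assertion."""
    derived = copy.deepcopy(metaschema)
    old_id = derived["$id"]
    # Swap the vocabulary declaration, keeping it required (true).
    derived["$vocabulary"].pop(ANNOTATION, None)
    derived["$vocabulary"][ASSERTION] = True
    # Relative $refs resolve against $id, so make them absolute first,
    # then point the format reference at the assertion meta-schema.
    for sub in derived.get("allOf", []):
        if "$ref" in sub:
            ref = urljoin(old_id, sub["$ref"])
            sub["$ref"] = ref.replace(
                "/meta/format-annotation", "/meta/format-assertion"
            )
    derived["$id"] = new_id
    return derived


# Abbreviated stand-in for the standard meta-schema (not the full document).
standard = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": "https://json-schema.org/draft/2020-12/schema",
    "$vocabulary": {
        "https://json-schema.org/draft/2020-12/vocab/core": True,
        "https://json-schema.org/draft/2020-12/vocab/validation": True,
        ANNOTATION: True,
    },
    "$dynamicAnchor": "meta",
    "allOf": [
        {"$ref": "meta/core"},
        {"$ref": "meta/validation"},
        {"$ref": "meta/format-annotation"},
    ],
}

derived = to_format_assertion(
    standard, "https://example.com/format-assertion-dialect"
)
print(sorted(derived["$vocabulary"]))
```

A schema would then opt in by using the derived `$id` as its `$schema`, provided the implementation can load that meta-schema and supports the vocabulary.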

awwright commented 1 year ago

That aspect alone makes it very hard to ensure interoperability for custom formats. Even for spec-defined formats, the validation support is "best effort."

So that I understand, this is the only (major) problem; you think having separate keywords for annotation format and validation format could work, if not for this?

Afaik, using format-assertion: true doesn't guarantee that the validator will understand the format name you're using: it only guarantees that the formats it does understand will assert and be used for validation. The spec is not clear.

My solution would make this guarantee, by treating the keyword as unrecognized, when the format name is unrecognized. (More specifically: Only known format names would be in the range of valid values for the keyword.)

gregsdennis commented 1 year ago

Afaik, using format-assertion: true doesn't guarantee that the validator will understand the format name you're using: it only guarantees that the formats it does understand will assert and be used for validation. The spec is not clear.

The spec is very clear on what format-assertion: true means for implementations:

When the Format-Assertion vocabulary is declared with a value of true, implementations MUST provide full validation support for all of the formats defined by this specification. Implementations that cannot provide full validation support MUST refuse to process the schema.

and (for custom formats):

When the Format-Assertion vocabulary is specified, implementations MUST fail upon encountering unknown formats.

With those two requirements, it can be understood that if an implementation processes the schema, it supports validation of any formats present within it.

awwright commented 1 year ago

it supports validation of any formats present within it.

Ok, this is essentially what I'm proposing for the typed/assertion "format" keywords. You said you'd prefer something like date-time: true, is that better than treating an unknown format name as an error?

karenetheridge commented 1 year ago

FWIW - I'm using a pure_integer format in my @work code, to ensure compatibility with rust data structures that require data to contain integers with no .0, but I acknowledge that this is only possible to validate under certain architectures, and I would never expect it to be a standard part of json schema (at best it could be entered into a registry of optional formats, like OpenAPI is doing). This is a format that only applies to the number type, which is already fully accommodated by the spec; we don't need an extra numberFormat keyword.

I'm pretty happy with the format keyword as it is today -- by default it is an annotation, but the format_assertion vocabulary exists to be included in dialects that wish to be more strict with formats -- and accepting that format validation is imperfect and subject to a lot of edge cases with various library implementations (the test suite has several edge cases that are tricky or impossible to implement perfectly without a great deal of pain), so anyone using formats as assertions needs to be aware of the tradeoffs involved.

awwright commented 1 year ago

@karenetheridge This is a good insight, but I think that typed format keywords would still be an improvement, for situations where you may support multiple types, and you want to specify a different format for each, e.g. to specify an RFC3339 datetime, or a unix timestamp, you could write { "stringFormat": "datetime", "numberFormat": "integer" }

karenetheridge commented 1 year ago

e.g. to specify an RFC3339 datetime, or a unix timestamp, you could write { "stringFormat": "datetime", "numberFormat": "integer" }

I don't see how that's any better than what we'd do today:

format: date-time
type: [string, integer]

Or even, if we deprecated integer as a type and made it a format:

type: [string, number]
anyOf:
- format: date-time
- format: integer

or

anyOf:
- type: string
  format: date-time
- type: number
  format: integer

awwright commented 1 year ago

I don't see how that's any better than what we'd do today:

"type": "integer" permits 1.0 when "numberFormat": "integer" would not. "numberFormat" would mean, literally, the format used to notate the number.

I would expect usage to be somewhat esoteric (as most validators use a parser that can't make that distinction), but according to JSON, it doesn't appear to be illegal to make that distinction in general.

Or even, if we deprecated integer as a type and made it a format:

Necessitating "anyOf" is what I'm trying to avoid.

gregsdennis commented 1 year ago

I agree with @karenetheridge. I don't think this is a significant improvement. You might be able to make a case for an array-form format, but that's probably a hard sell since we're trying to get away from multi-form keywords.

{
  "type": [ "string", "number" ],
  "format": [ "date-time", "integer" ]
}

jdesrosiers commented 1 year ago

My assessment so far is that the {type}Format keywords don't provide anything that the format-assertion format keyword doesn't already provide. However, I find the arguments for having separate format keywords for annotating and validating compelling.

The vocabulary system works, but it isn't well designed to be used by schema authors in their everyday work. It's better designed for organizations (like OpenAPI) to create a custom dialect that will be used as part of a domain specific system. So, I don't think relying on the vocabulary system for the everyday decision of whether or not format should validate is a good enough solution.

While I'm not in favor of introducing type-specific format keywords, I would be in favor of introducing one new keyword that would allow users to use annotation-format and assertion-format in the same schema using the standard dialect.

Personally, I think format is overloaded anyway and splitting it would make some sense. I think of format as an annotation for things like marshallers to use. Like when converting a JSON document to a Java object, it can use the format to do things like marshal a JSON string into a Java Date object. The format isn't for validating the string, but it can be nice to have that validation to give users a warning that your string isn't going to marshal into a Date object. I think it makes sense to have separate keywords for each use-case: one strictly for validation and one for identifying an external type.

gregsdennis commented 1 year ago

I would be open to creating a single new validation-dedicated "format" keyword, leaving format as annotation (which is currently the default behavior).

I looked up synonyms for "format" and "form" to see what we could use, and there's not much. Maybe "model," but that's not quite right.

I'm going to split off the format: integer discussion since that also impacts type. This discussion (though a bit related) should stay on splitting format and whatever that entails.

gregsdennis commented 3 months ago

As mentioned in https://github.com/json-schema-org/json-schema-spec/issues/1520#issuecomment-2264068827, I'd like to move forward with this by creating a proposal document for several [type]Format keywords.

json-schema-org / json-schema-spec

Type specific validating formats (stringFormat, numberFormat) #1391