URI/IRI "Normalization" and Compatibility

watuwo commented 2 years ago

The current core specification (2020-12) requires that "$schema" (and some other) URIs "MUST be normalized". While RFC 3986 makes suggestions for "normalization" steps, it does not define a normal form for URIs (and the same holds for IRIs).

"normalized"/"normal"/"normalization"/... (of URIs/IRIs) is not well-defined. I propose that these terms are either eliminated from the specification or defined.

My current personal preference: Eliminate the "normalization" constraint.

A thought about the switch from URIs to IRIs:

An implementation could/should reject a schema if

the resolution of "$id" yields an IRI that is not also a URI

AND the schema's meta schema declares

...
"$vocabulary": {
...
"https://json-schema.org/draft/2020-12/vocab/core": true
...

Is this true now? Can this still occur in the future? Is rejection still possible/desired? What about this:

Meta-schemas that do not use "$vocabulary" MUST be considered to require the Core vocabulary as if its URI were present with a value of true.

handrews commented 2 years ago

We specify normalization to improve the likelihood that a URI or IRI is recognized as a known resource. Comparison and equivalence is the context in which normalization is defined in RFC 3986.

Normalization is a well-understood industry term, and while it is hard to be exact about it, JSON Schema should not try to step in on behalf of other specifications. The one place where we can do this is to note that for documents of media type application/schema+json, a URI with a trailing empty fragment can be normalized to a URI without a fragment (because that is under the control of the media type specification).

Otherwise, the point of the normalization requirement is to avoid requiring JSON Schema implementations to implement normalization themselves. There is nothing in JSON Schema that requires or encourages an implementation to detect an insufficiently normalized URI and error on it. Failing to normalize your "$schema" URI, for example, just means that an implementation might not recognize it and might therefore refuse to process your schema.
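To make the consequence concrete, here is a minimal sketch of an implementation that does no normalization of its own apart from the empty-fragment rule mentioned above. The lookup table and the function names are hypothetical; only the 2020-12 meta-schema URI is real.

```python
from typing import Optional

# Hypothetical table of the dialects this implementation knows about.
KNOWN_DIALECTS = {
    "https://json-schema.org/draft/2020-12/schema": "2020-12",
}

def strip_empty_fragment(uri: str) -> str:
    """Drop a trailing empty fragment ("...#") and change nothing else."""
    if uri.endswith("#") and uri.count("#") == 1:
        return uri[:-1]
    return uri

def recognize_dialect(schema_uri: str) -> Optional[str]:
    """Exact string lookup; no case, percent-encoding, or path normalization."""
    return KNOWN_DIALECTS.get(strip_empty_fragment(schema_uri))
```

With this sketch, `"https://json-schema.org/draft/2020-12/schema#"` is recognized, while `"HTTPS://JSON-SCHEMA.ORG/draft/2020-12/schema"` or `"https://json-schema.org/draft/2020-12/./schema"` is not, even though a normalizing comparison would treat them as the same resource.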

the resolution of "$id" yields an IRI that is not also a URI

All IRIs can be mapped to URIs.
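A rough sketch of that mapping, assuming the IRI is already a sequence of Unicode characters: every non-ASCII character is replaced by the percent-encoded octets of its UTF-8 encoding. The function name and the example IRI are made up for illustration, and the sketch ignores the alternative of converting the host with IDNA.

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode every non-ASCII character as UTF-8 octets (uppercase hex)."""
    out = []
    for ch in iri:
        if ord(ch) < 128:
            out.append(ch)  # ASCII characters are copied through unchanged
        else:
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)

# "\u00e9" is a precomposed e with an acute accent.
example_uri = iri_to_uri("https://example.com/sch\u00e9mas/person")
```

Here `example_uri` is `"https://example.com/sch%C3%A9mas/person"`; an IRI that is already a URI comes back unchanged.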

watuwo commented 2 years ago

I agree with everything you say about "normalization". The question is whether the phrase "MUST be normalized" expresses well what you wrote in your reply.

I am aware that IRIs can be mapped to URIs. If that is all, then you could have just stuck with URIs :). The question was more whether schemas written for the 2020-12 specification are "future proof". I believe the answer is: technically no (but hopefully mostly kind of).

handrews commented 2 years ago

The question is whether the phrase "MUST be normalized" expresses well what you wrote in your reply.

I'm open to suggestions on better ways to word this. The RFC 2119 usage in the core spec is not great; some of it is old and might have been written in a different context as we changed a lot of the wording over the last few drafts.

What we want is to convey that non-normalized URIs (or IRIs in the future) will likely not behave correctly. We do not need (or in my opinion want, although someone may disagree) to require JSON Schema implementations to enforce normalization. I admit I don't know the best way of communicating that with proper formal language.

The question was more whether schemas written for the 2020-12 specification are "future proof". I believe the answer is: technically no (but hopefully mostly kind of).

Can you elaborate on that? 2020-12 only allows URIs, the next release will allow IRIs. All URIs are valid IRIs, so what is the expected breakage?

The only thing that will be at all breaking is that "$id" no longer allows an empty fragment (or any fragment at all, but it has only been allowed an empty fragment for the last several drafts). And while that is technically a breaking change, it doesn't reduce the functionality at all and we've been warning about it in a CREF for three draft publications now. So it's definitely not a surprising change.

watuwo commented 2 years ago

My resolution: I have chosen to use the JSON schema specification (for myself) AS IF...

... the "MUST be normalized" phrases were not there at all
... the specification required that URIs/IRIs are to be considered equal exactly if their String/string definition/representation are equal
... implementations were required to "modify URIs/IRIs as little as possible"

Challenges about the third point (apart from language):

The third point is sensitive to the switch from URIs to IRIs ("http://example.com#/properties/%F0%9F%90%87" vs "http://example.com#/properties/🐇").
Accepting the third point means accepting that an implementation can possibly dereference "http://example.com#/properties/foo" while failing to dereference "http://example.com#/properties/%66oo" or "http://example.com#%2Fproperties%2Ffoo". Some people might read the JSON pointer specification differently (see examples here).
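The three fragment spellings above can be compared under different rules. The sketch below is illustrative only: `decode_unreserved` follows the percent-encoding normalization described in RFC 3986 (decode only unreserved characters), and `pointer_from_fragment` reflects one reading of the JSON Pointer specification, namely that the whole fragment is percent-decoded before it is evaluated as a pointer. Both helper names are hypothetical.

```python
import re
from urllib.parse import unquote, urlsplit

UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)

def decode_unreserved(uri: str) -> str:
    """Decode %XX only when it encodes an unreserved character; uppercase the rest."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", fix, uri)

def pointer_from_fragment(uri: str) -> str:
    """Percent-decode the entire fragment to obtain a JSON Pointer string."""
    return unquote(urlsplit(uri).fragment)

plain = "http://example.com#/properties/foo"
encoded_letter = "http://example.com#/properties/%66oo"
encoded_slash = "http://example.com#%2Fproperties%2Ffoo"
```

Under plain string comparison all three differ. After `decode_unreserved`, `encoded_letter` equals `plain`, but `encoded_slash` stays distinct because "/" is a reserved character. `pointer_from_fragment` yields "/properties/foo" for all three, which is where readings can diverge.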

All URIs are valid IRIs, so what is the expected breakage?

I do not expect any practically relevant problems. Technically the upcoming change in the output structure could be considered "more breaking" (and I definitely do not want to stand in the way of such evolution). I am excited to see that a concept of stability is emerging (https://json-schema.org/blog/posts/future-of-json-schema).

awwright commented 2 years ago

While, practically speaking, this only impacts how people choose URIs for their meta-schemas/vocabularies (not validators that merely consume schemas), I'd like to point out the process is fairly well defined, and RFC 3986 actually has lots to say on exactly how you normalize a URI:

Syntax-Based Normalization — "Syntax-based normalization includes such techniques as case normalization, percent-encoding normalization, and removal of dot-segments"
Case Normalization — "use uppercase letters for the digits A-F"
Percent-Encoding Normalization — "[decode] any percent-encoded octet that corresponds to an unreserved character"
Path Segment Normalization — "remove dot-segments by applying the remove_dot_segments algorithm to the path"
... and additional scheme-specific normalization as defined by the registration for the scheme.
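As an illustration, two of the steps quoted above (case normalization and path segment normalization) might be sketched as follows for an absolute URI. This is not taken from any JSON Schema implementation; the function names are assumptions, decoding of unreserved characters is left out to keep the example short, and scheme-specific rules are not attempted.

```python
import re
from urllib.parse import urlsplit, urlunsplit

def remove_dot_segments(path: str) -> str:
    """The remove_dot_segments algorithm of RFC 3986, section 5.2.4."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]              # "/./" becomes "/"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]              # "/../" becomes "/"
            if output:
                output.pop()             # and the previous segment is dropped
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/") to the output.
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:end])
                path = path[end:]
    return "".join(output)

def normalize_case_and_path(uri: str) -> str:
    """Lowercase scheme and host, remove dot-segments, uppercase %XX hex digits."""
    parts = urlsplit(uri)                # urlsplit already lowercases the scheme
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()   # userinfo stays case-sensitive
    path = remove_dot_segments(parts.path)
    result = urlunsplit((parts.scheme, netloc, path, parts.query, parts.fragment))
    return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), result)
```

For example, `normalize_case_and_path("HTTP://Example.COM/a/./b/../c%2fd")` gives `"http://example.com/a/c%2Fd"`.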

And JSON Schema Core normatively references RFC 3986, so it means the same as if we incorporated the text directly.

watuwo commented 2 years ago

It is just not the intention of RFC 3986 to define a normal form (your examples come from a section called "Comparison Ladder"). I know this is an unfortunate decision by the RFC 3986 authors (who probably had their reasons) but the JSON schema specifications should acknowledge this. It seems like the JSON schema specification makes unintended use of RFC 3986 without even a comment.

fairly well defined

... exactly :)

You are right that the phrase "MUST be normalized" only occurs in the "$schema" and "$vocabulary" sections. There seems to be only one more mention of "normalized". The phrase "This URI-reference SHOULD be normalized" occurs in the "$id" section (which is even more dubious: Do you really claim that RFC 3986 defines a normal form for unresolved URI-references? Do you really believe that we should apply remove_dot_segments to an unresolved URI-reference?).

this only impacts how people choose URIs for their meta-schemas/vocabularies (not validators that merely consume schemas)

I do not think that the "normalized" "restrictions" in the JSON schema specification have a clear meaning. Removing them (or rephrasing them as non-normative comments of some sort) probably has no real effect (just a bit of polish).

awwright commented 2 years ago

It is just not the intention of RFC 3986 to define a normal form

Please elaborate on this point... I think it's fair to assume when a spec talks about something, it is conveying intent. Here it provides a specific definition for "normalized" that's exactly the meaning we're looking for: Use uppercase pct-encoded sequences, remove unnecessary dot components, etc.

You'll have to explain how it's possible to interpret this in any other way, with a specific example.

Do you really believe that we should apply remove_dot_segments to an unresolved URI-reference

RFC 3986 has this to say:

Note that dot-segments are intended for use in URI references to express an identifier relative to the hierarchy of names in the base URI. The remove_dot_segments algorithm respects that hierarchy by removing extra dot-segments rather than treat them as an error or leaving them to be misinterpreted by dereference implementations.

The effect of this is you need to preserve extra leading dot segments when applying them to relative references, instead of removing them. The meaning isn't ambiguous, just a little bit buried.

I do not think that the "normalized" "restrictions" in the JSON schema specification have a clear meaning.

We're using the BCP 14 "SHOULD" and "MUST" language, which imposes an interoperability requirement... in this case, it's imposing a requirement on how you write schemas (rather than how you parse them or use them in validation). Practically speaking, it says if you don't follow this requirement, then interoperability with other implementations won't necessarily be guaranteed. It might still work fine for now, but it might break sometime in the distant future, we don't know.

watuwo commented 1 year ago

Surely next you are going to tell me that you can also do scheme based normalization on an unresolved URI-reference without a scheme :).

My wording in the original issue might have been too harsh / misleading. RFC 3986 uses the term "normalization" in a consistent, understandable way. Nonetheless I still believe:

Whether a URI or IRI has been "normalized" is not a question that is decidable on the basis of RFC 3986 or RFC 3987 and the JSON schema specification should not include a "MUST be normalized" constraint.
The JSON schema specification should not even suggest the application of "normalization" to unresolved URI-references. RFC 3986 and RFC 3987 present "normalization" options as a way of broadening equivalence and both specifications (RFC 3986 and RFC 3987) explicitly discourage comparing unresolved URI/IRI-references.

jdesrosiers commented 1 year ago

I agree with @awwright that normalization is sufficiently well defined. I don't think I've ever seen a URI library that doesn't normalize URIs and I've definitely never heard of normalization implementations that normalize in a way that is incompatible with other implementations. There can be slight variations such as whether uppercase or lowercase letters are used for percent encoded characters, but that doesn't matter as long as you normalize both URIs you're comparing using the same library.

Technically, the spec requires schema authors rather than implementations to normalize URIs, which I've always found to be awkward. Implementations still need to normalize before comparison to account for variations in normalization, so asking schema authors to normalize their URIs doesn't really seem necessary.

I'd argue that the spec doesn't need to say anything about normalization, but not because normalization isn't well defined. It's the opposite really. It doesn't need to be mentioned because normalization is just part of the processes for comparing URIs as defined in RFC 3986. We don't need to say any more, just point to RFC 3986.

watuwo commented 1 year ago

normalization is just part of the processes for comparing URIs

Yes ... and (from RFC 3986)

URI comparison is performed for some particular purpose. Protocols or implementations that compare URIs for different purposes will often be subject to differing design trade-offs in regards to how much effort should be spent in reducing aliased identifiers. This section describes various methods that may be used to compare URIs, the trade-offs between them, and the types of applications that might use them.

... and (from RFC 3986)

In testing for equivalence, applications should not directly compare relative references; the references should be converted to their respective target URIs before comparison.

Whereas the JSON schema specification says

This URI-reference SHOULD be normalized

Before we get into a nonsense discussion: I am not saying that the JSON schema specification demands comparing unresolved URI-references. I am saying that RFC 3986 and RFC 3987 present normalization as a way of broadening equivalence while explicitly discouraging equivalence testing on unresolved URI-references. "Normalizing" unresolved URI-references is certainly not something that the URI/IRI RFCs encourage. It is also not so clear how to even do it (some people might not do scheme based normalization if there is no scheme; some might find a creative, custom way of doing it anyways, similar to remove_dot_segments above).
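A small sketch of why comparing (or rewriting) unresolved references is fragile, using Python's standard `urljoin` for resolution. The base URI, the reference strings, and the helper name are invented for the example.

```python
from urllib.parse import urljoin

def same_target(base: str, ref_a: str, ref_b: str) -> bool:
    """Resolve both references against the base first, then compare the targets."""
    return urljoin(base, ref_a) == urljoin(base, ref_b)

base = "https://example.com/schemas/a/root.json"

# Different strings, same target once resolved.
spelled_differently = same_target(base, "item.json", "./item.json")

# Stripping the leading "../" from an unresolved reference changes the target.
leading_dots_matter = not same_target(base, "../item.json", "item.json")
```

Both `spelled_differently` and `leading_dots_matter` are true here: "item.json" and "./item.json" resolve to the same URI, whereas "../item.json" resolves to "https://example.com/schemas/item.json", so its leading dot-segment cannot be dropped before resolution.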

I do not believe that the question whether a URI has been normalized is decidable based on RFC 3986 (or RFC 3987). At least nobody has added a "uri-normalized" format, or even a "uri-reference-normalized" format (future meta-schema authors might just want to do that). Would you feel comfortable implementing it (as an assertion)?

jdesrosiers commented 1 year ago

This URI-reference SHOULD be normalized

I completely agree that that statement doesn't make sense. $id is a URI-Reference, which is a (non-relative) URI or a relative reference. Normalizing makes sense for a URI, but not a relative reference. That line at least needs clarification that it doesn't apply if it's a relative reference, but I'd rather see that requirement removed altogether or reworded in a way that requires implementations to normalize URIs when comparing rather than requiring schema authors to normalize. I very much do think implementations should be normalizing when comparing URIs. I just don't think it makes sense as a requirement for schema authors.
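One way to picture the "implementations normalize when comparing" approach is a schema registry that normalizes keys on both registration and lookup. Everything here is a hypothetical sketch: `SchemaRegistry` and `basic_normalize` are invented names, and the normalizer only lowercases the scheme and host as a stand-in for whatever URI library an implementation actually uses.

```python
from typing import Callable, Dict, Optional
from urllib.parse import urlsplit, urlunsplit

def basic_normalize(uri: str) -> str:
    """Stand-in normalizer: lowercase the scheme and host, leave the rest alone."""
    parts = urlsplit(uri)  # urlsplit already lowercases the scheme
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))

class SchemaRegistry:
    """Stores schemas by absolute URI, applying one normalizer to every key."""

    def __init__(self, normalize: Callable[[str], str]) -> None:
        self._normalize = normalize
        self._schemas: Dict[str, dict] = {}

    def add(self, uri: str, schema: dict) -> None:
        self._schemas[self._normalize(uri)] = schema

    def get(self, uri: str) -> Optional[dict]:
        return self._schemas.get(self._normalize(uri))

registry = SchemaRegistry(basic_normalize)
registry.add("https://Example.com/schemas/person", {"type": "object"})
found = registry.get("HTTPS://example.COM/schemas/person")
```

Because the same function is applied on both sides, it does not matter which spelling the schema author chose; `found` is the registered schema. A lookup with a different path case, such as "/Schemas/person", still misses, since paths are case-sensitive.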

gregsdennis commented 5 months ago

The action here is to clean up the text a bit.

I do somewhat agree with @jdesrosiers' statement in the last comment about how the requirement should be on implementations not authors.

gregsdennis commented 1 month ago

@watuwo would you please have a look at #1537 to see if that addresses the concerns you listed above?

json-schema-org / json-schema-spec

URI/IRI "Normalization" and Compatibility #1349