python-jsonschema / jsonschema

An implementation of the JSON Schema specification for Python
https://python-jsonschema.readthedocs.io
MIT License

Inconsistent behaviour of validator_for depending on http vs https #1182

Closed berislavlopac closed 11 months ago

berislavlopac commented 11 months ago

There seems to be a problem when selecting a validator based on the $schema URL with the validator_for function: some metaschemas can't be located, depending on whether the URL uses http or https.

After some research, it looks like the current implementation assumes the following:

This assumption is incorrect: in practice all http URLs are redirected (with a 301 response code) to their https counterparts, and https works for all of them. This short script shows what happens, with either protocol, both when calling validator_for and when retrieving the schema from the URL:

from jsonschema.validators import validator_for
import httpx

metaschemas = [
    "//json-schema.org/draft-04/schema#",
    "//json-schema.org/draft-06/schema#",
    "//json-schema.org/draft-07/schema#",
    "//json-schema.org/draft/2019-09/schema#",
    "//json-schema.org/draft/2020-12/schema#",
]

print("== schemas with http:")
for metaschema in metaschemas:
    url = f"http:{metaschema}"
    validator_http = validator_for({"$schema": url})
    remote_schema = httpx.get(url)
    print(url, remote_schema.status_code, validator_http)

print()

print("== schemas with https:")
for metaschema in metaschemas:
    url = f"https:{metaschema}"
    validator_https = validator_for({"$schema": url})
    remote_schema = httpx.get(url)
    print(url, remote_schema.status_code, validator_https)

This is the output of that script:

== schemas with http:
http://json-schema.org/draft-04/schema# 301 <class 'jsonschema.validators.Draft4Validator'>
http://json-schema.org/draft-06/schema# 301 <class 'jsonschema.validators.Draft6Validator'>
http://json-schema.org/draft-07/schema# 301 <class 'jsonschema.validators.Draft7Validator'>
/Users/berislavlopac/Documents/Development/personal/schematalog/jstest.py:15: DeprecationWarning: The metaschema specified by $schema was not found. Using the latest draft to validate, but this will raise an error in the future.
  validator_http = validator_for({"$schema": url})
http://json-schema.org/draft/2019-09/schema# 301 <class 'jsonschema.validators.Draft202012Validator'>
http://json-schema.org/draft/2020-12/schema# 301 <class 'jsonschema.validators.Draft202012Validator'>

== schemas with https:
/Users/berislavlopac/Documents/Development/personal/schematalog/jstest.py:24: DeprecationWarning: The metaschema specified by $schema was not found. Using the latest draft to validate, but this will raise an error in the future.
  validator_https = validator_for({"$schema": url})
https://json-schema.org/draft-04/schema# 200 <class 'jsonschema.validators.Draft202012Validator'>
https://json-schema.org/draft-06/schema# 200 <class 'jsonschema.validators.Draft202012Validator'>
https://json-schema.org/draft-07/schema# 200 <class 'jsonschema.validators.Draft202012Validator'>
https://json-schema.org/draft/2019-09/schema# 200 <class 'jsonschema.validators.Draft201909Validator'>
https://json-schema.org/draft/2020-12/schema# 200 <class 'jsonschema.validators.Draft202012Validator'>

This behaviour means that a schema with the "wrong" HTTP(S) protocol in the $schema URL will be treated as the default metaschema, potentially failing validation.

Julian commented 11 months ago

The current behavior is correct.

The URIs in $schema are just that -- URIs. They are identifiers for the JSON Schema versions. Regardless of the current website behavior (which has to do more with convenience), up until draft 7 the identifiers were indeed HTTP (even if they were retrievable over HTTPS). And the current ones are HTTPS. The same is true about whether they contain fragments or not.

Essentially, you are supposed to use the exact identifier, and it's irrelevant whether the metaschema is even retrievable from that URL at all.
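To illustrate the point about exact identifiers, here is a minimal sketch of a lookup that treats $schema as an opaque string rather than a URL to fetch. The table and the function name dialect_for are hypothetical and are not part of jsonschema. Dropping an empty trailing "#" is an assumption chosen to mirror the script output earlier in the thread; the scheme is deliberately not normalized.

```python
# Hypothetical lookup table: the keys are the published $schema identifiers,
# but the table itself and the function below are illustrative only.
KNOWN_DIALECTS = {
    "http://json-schema.org/draft-04/schema": "draft-04",
    "http://json-schema.org/draft-06/schema": "draft-06",
    "http://json-schema.org/draft-07/schema": "draft-07",
    "https://json-schema.org/draft/2019-09/schema": "2019-09",
    "https://json-schema.org/draft/2020-12/schema": "2020-12",
}


def dialect_for(schema_uri):
    """Return the dialect name for an exact identifier, or None if unknown."""
    # Assumption: tolerate an empty trailing fragment, nothing else.
    # No network request is made and http/https are never interchanged.
    key = schema_uri[:-1] if schema_uri.endswith("#") else schema_uri
    return KNOWN_DIALECTS.get(key)


print(dialect_for("http://json-schema.org/draft-07/schema#"))
print(dialect_for("https://json-schema.org/draft-07/schema#"))
print(dialect_for("https://json-schema.org/draft/2019-09/schema"))
print(dialect_for("http://json-schema.org/draft/2019-09/schema"))
```

Under these assumptions, the http form of draft-07 and the https form of 2019-09 are found, while the swapped-scheme variants are not, regardless of what the website serves or redirects.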

karenetheridge commented 11 months ago

a schema with the "wrong" HTTP(S) protocol in the $schema URL will be treated as the default metaschema

This bit seems wrong -- if the $schema URI does not match one of the known metaschemas (whether a mismatch of http/https or something else), the implementation shouldn't fall back to the default, but rather it should error out entirely.
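A rough sketch of the stricter behaviour described here, where an unrecognized $schema raises instead of silently falling back to a default. Everything in it is hypothetical: the exception class, the identifier set, and the function name are invented for illustration and are not jsonschema's API. How a schema with no $schema at all should be handled is left to the caller.

```python
class UnknownMetaschemaError(ValueError):
    """Hypothetical error for a $schema URI that matches no known metaschema."""


# Hypothetical set of known identifiers, stored without the empty fragment.
KNOWN_METASCHEMA_IDS = frozenset({
    "http://json-schema.org/draft-04/schema",
    "http://json-schema.org/draft-06/schema",
    "http://json-schema.org/draft-07/schema",
    "https://json-schema.org/draft/2019-09/schema",
    "https://json-schema.org/draft/2020-12/schema",
})


def require_known_metaschema(schema):
    """Return the declared $schema URI, or None if the schema declares none.

    Raises UnknownMetaschemaError when $schema is present but unrecognized,
    rather than falling back to a default dialect.
    """
    if "$schema" not in schema:
        return None  # the caller decides which default applies
    uri = schema["$schema"]
    # Assumption: an empty trailing fragment is tolerated; the scheme is not.
    key = uri[:-1] if uri.endswith("#") else uri
    if key not in KNOWN_METASCHEMA_IDS:
        raise UnknownMetaschemaError(f"unknown metaschema identifier: {uri!r}")
    return uri


try:
    require_known_metaschema({"$schema": "https://json-schema.org/draft-07/schema#"})
except UnknownMetaschemaError as error:
    print("rejected:", error)
```

With jsonschema itself, validator_for accepts a default argument, so passing a sentinel and checking for it may give a similar effect; that should be verified against the installed version.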