python-jsonschema / jsonschema

An implementation of the JSON Schema specification for Python
https://python-jsonschema.readthedocs.io
MIT License
4.6k stars 578 forks source link

Support custom Registry for meta schema in check_schema #1228

Closed sahilshahpatel closed 7 months ago

sahilshahpatel commented 7 months ago

Hi, I am fairly new to jsonschema in general, but I am trying to use python-jsonschema to validate some custom vocabularies and I ran into an issue which seems to have a trivial fix within the python-jsonschema library.

Since I'm new to this, if I am using this all wrong please let me know! This issue seems like it would be pretty common with custom vocabularies, so it's possible there is another intended path, but I couldn't find it.

Essentially, python-jsonschema allows you to pass in a Registry to lookup custom IDs that are $ref-ed within a schema. However, it doesn't do the same for the meta schema. I am attempting to create a vocabulary following the general guidance of json-everything.net -- admittedly I feel this is a part I don't understand well, but I believe my problem would still exist most places.

(I am aware that $refs are supposed to be URIs, but the rest of python-jsonschema doesn't care about this as long as you set up a registry to look up your custom ID scheme. For my purposes I need these to be local files with relative paths, not URI-based)

This leaves me with a meta schema something like the following where vocab.schema.json maps to a local file which I want to reference via a Registry.

{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": "meta.schema.json",
    "$vocabulary": {
        "https://json-schema.org/draft/2020-12/vocab/core": true,
        "https://json-schema.org/draft/2020-12/vocab/applicator": true,
        "https://json-schema.org/draft/2020-12/vocab/validation": true,
        "https://json-schema.org/draft/2020-12/vocab/meta-data": true,
        "https://json-schema.org/draft/2020-12/vocab/format-annotation": true,
        "https://json-schema.org/draft/2020-12/vocab/content": true,
        "https://json-schema.org/draft/2020-12/vocab/unevaluated": true
    },
    "allOf": [
        { "$ref": "https://json-schema.org/draft/2020-12/schema" },
        { "$ref": "vocab.schema.json" }
    ]
}

If I set the contents of this file (via json.load) as my META_SCHEMA, however, check_schema will now fail due to a referencing error. A simple change to jsonschema/validators.py::create adding a registry argument to the check_schema function will allow it to succeed. This is because check_schema creates a new Validator with no option to pass a registry here. To be honest, I also don't understand why a new validator is created rather than using self (or cls in this case).

The change was very simple and easy when calling check_schema manually. When using jsonschema.validate directly, something else might be needed because the current kwargs go into validating the schema, and you might not want the same args for the meta schema.

I see another possible solution as well. Instead of adding a registry parameter to check_schema, it would be possibly more transparent to allow creators of custom validators to "bake in" required registry entries. That way a user would need to do nothing as the registry would be set to include the required entries on creation of the Validator.


Here's my minimal example. First, start with the JSON contents from above in a file called meta.schema.json. In the same folder, create vocab.schema.json with the following contents:

{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": "vocab.schema.json",
    "title": "Meta schema for my vocab",
    "properties": {
        "mykeyword": { "type": "string" }
    }
}

In the same folder again create test.schema.json with the following:

{
    "$schema": "meta.schema.json",
    "mykeyword": "this is a test"
}

And finally create mytest.py in the same folder:

import json
import jsonschema
from referencing import Registry, Resource

def my_callable(validator, value, instance, schema):
    pass

def get_validator():
    with open("meta.schema.json") as meta_schema_file:
        meta_schema = json.load(meta_schema_file)

        # Here the first argument to jsonschema.validators.extend is a class. I could see a version
        # where you pass in an actual Validator object (with registry included) and that makes the
        # returned class operate with that registry by default. It doesn't currently work that way
        MyValidator: type[jsonschema.protocols.Validator] = jsonschema.validators.extend(
            jsonschema.Draft202012Validator, validators={
                "mykeyword": my_callable
            })
        MyValidator.META_SCHEMA = meta_schema
        return jsonschema.validators.validates(meta_schema["$id"])(MyValidator)

def main():
    with open("test.schema.json") as schema_file:
        schema = json.load(schema_file)

        # Set up the jsonschema validator to load local files
        references = []
        with open("vocab.schema.json") as file:
            contents = json.load(file)
            references.append((contents['$id'], Resource.from_contents(contents)))
        registry = Registry().with_resources(references)

        MyValidator = get_validator()
        validator = MyValidator(schema, registry)

        # This will fail with jsonschema.exceptions._WrappedReferencingError: Unresolvable: vocab.schema.json
        validator.check_schema(schema)

        # This would work with my modifications to check_schema
        # validator.check_schema(schema, registry=registry)

        # Note that Validator.validate() allows you to pass in a registry through kwargs and so does
        # jsonschema.validate, but Validator.check_schema does not! This is the crux of the issue

if __name__ == "__main__":
    main()

My proposed modification to check_schema (referenced in the myscript.py file) is to jsonschema's validators.py:

# Note the addition of the registry param. None is NOT a good default value and causes issues,
# but you get the point
@classmethod
def check_schema(cls, schema, format_checker=_UNSET, registry=None):
    Validator = validator_for(cls.META_SCHEMA, default=cls)
    if format_checker is _UNSET:
        format_checker = Validator.FORMAT_CHECKER
    validator = Validator(
        schema=cls.META_SCHEMA,
        format_checker=format_checker,
        registry=registry
    )
    for error in validator.iter_errors(schema):
        raise exceptions.SchemaError.create_from(error)

I have worked around this for now by not using $ref in my meta schemas and simply including the JSON fragments myself. But this isn't ideal as the project gets more complicated, and anyways this seems very possible with the current framework.

Looking forward to hearing feedback, either on how else I should be approaching this or on the pros and cons of my suggested fix.

Julian commented 7 months ago

Thanks for the detailed issue! I haven't read it carefully yet but skimming it certainly seems to make sense. Will probably be a few days (even not including the weekend) before I look, but just ACKing that I saw it.

sahilshahpatel commented 7 months ago

Hey @Julian, don't want to bother too much but let me know if you've had a chance to take a look at this!

Julian commented 7 months ago

So here's a self contained example, just because I tidied it slightly as I was reading your comment:

```python meta_schema = { "$schema": "https://json-schema.org/draft/2020-12/schema", "$id": "urn:meta.schema.json", "$vocabulary": { "https://json-schema.org/draft/2020-12/vocab/core": True, "https://json-schema.org/draft/2020-12/vocab/applicator": True, "https://json-schema.org/draft/2020-12/vocab/validation": True, "https://json-schema.org/draft/2020-12/vocab/meta-data": True, "https://json-schema.org/draft/2020-12/vocab/format-annotation": True, "https://json-schema.org/draft/2020-12/vocab/content": True, "https://json-schema.org/draft/2020-12/vocab/unevaluated": True }, "allOf": [ { "$ref": "https://json-schema.org/draft/2020-12/schema" }, { "$ref": "urn:vocab.schema.json" } ] } vocab = { "$schema": "https://json-schema.org/draft/2020-12/schema", "$id": "urn:vocab.schema.json", "title": "Meta schema for my vocab", "properties": { "mykeyword": { "type": "string" } } } schema = { "$schema": "urn:meta.schema.json", "mykeyword": "this is a test" } from jsonschema import exceptions, validators from referencing import Registry, Resource import json import jsonschema def my_callable(validator, value, instance, schema): if instance == 37: yield exceptions.ValidationError("NO!") def get_validator(): MyValidator = validators.extend( validators.Draft202012Validator, validators={"mykeyword": my_callable}, ) MyValidator.META_SCHEMA = meta_schema return validators.validates(meta_schema["$id"])(MyValidator) # Set up the jsonschema validator to load local files registry = Resource.from_contents(vocab) @ Registry() MyValidator = get_validator() ```

This is because check_schema creates a new Validator with no option to pass a registry here. To be honest, I also don't understand why a new validator is created rather than using self (or cls in this case).

It's a classmethod, it indeed doesn't create a new Validator and does exactly use cls, you don't call it on an instance! That it's a classmethod is documented at the bottom here or in the API documentation, but if there's some other place you looked and didn't find that let me know.

To now jump directly to answering how you can provide a registry, you can simply create a validator instance for the meta schema if you'd like, and then pass it whatever you want (a registry) i.e.:

MetaschemaValidator = validators.validator_for(MyValidator.META_SCHEMA)  # or directly use Draft202012Validator
meta_schema_validator = MetaschemaValidator(MyValidator.META_SCHEMA, registry=registry)

print("valid schema:", meta_schema_validator.is_valid(schema))
print("invalid schema:", meta_schema_validator.is_valid({"mykeyword": 37}))

validator = MyValidator(schema, registry=registry)
print("valid instace:", validator.is_valid(73))
print("invalid instace:", validator.is_valid(37))

Yes check_schema could take all the arguments that a Validator does, and if it were written today it probably would, but there are a few small reasons not to add it right now:

The 2 above things basically mean I don't really want to make tweaks that aren't completely necessary (like the referencing changes in the last few versions) -- and since what you're doing is easily doable using existing means (creating a normal validator with your meta schema) I'm inclined to say "just use that for now" I think rather than changing it too much.

Let me know if the above makes sense / helps you get what you're after.

sahilshahpatel commented 7 months ago

Thanks for the response @Julian! This suggestions is basically when I'm doing now as a workaround. The downside is that users of the custom validator must know what registry to use with it's meta validator if they want to check it. Of course since I'm both the user and creator that's doable, as you suggest.

Also I missed that check_schema is a class method, thanks for pointing that out!

I understand the points about general overhauls based on vocabularies, so I think I agree this doesn't warrant a change now since there is a relatively simple work around.

Thanks for the discussion!