Invalid extended schema is validated as correct

asargento commented 11 years ago

Lastest version: 1.4.1 Error explanation:

Define a schema1 that is an extension from draftv3 and that add a new property, 'indexed', that is declared as boolean.
Define an new schema using schema1, schema2. The value for property 'indexed' is 111.
The validator should give an error, since for property 'indexed' only a boolean value should be allowed.

Code:

import java.io.File;
import java.io.IOException;

import com.fasterxml.jackson.databind.JsonNode;
import org.eel.kitchen.jsonschema.main.JsonSchema;
import org.eel.kitchen.jsonschema.main.JsonSchemaFactory;
import org.eel.kitchen.jsonschema.report.ValidationReport;
import org.eel.kitchen.jsonschema.util.JsonLoader;

public class JsonSchemaTest {

  public static void main(String[] args) {
    try {
        JsonNode draftv3 = JsonLoader.fromResource("/draftv3/schema");
        JsonNode schema1 = JsonLoader.fromFile(new File("schema1.json"));
        JsonNode schema2 = JsonLoader.fromFile(new File("schema2.json"));

        ValidationReport report;

        JsonSchemaFactory factory = JsonSchemaFactory.defaultFactory();
        JsonSchema draftv3Schema = factory.fromSchema(draftv3);
        report = draftv3Schema.validate(schema1);
        if (!report.isSuccess())
            System.out.println("Schema1: " + report.asJsonObject());

        JsonSchema schema1Schema = factory.fromSchema(schema1);
        report = schema1Schema.validate(schema2);
        if (!report.isSuccess())
            System.out.println("Schema2: " + report.asJsonObject());
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
  }
}

The schema1.json file is

{
 "extends" : {
    "$ref" : "http://json-schema.org/draft-03/schema#"
   },
  "properties" : {
  "indexed" : {
     "type" :  "boolean",
     "default" : false
   }
 }
}

The schema2.json is

{
 "type" : "object",
 "properties" : {
   "from" :  {
      "type" : "string",
      "required" : true,
      "indexed" : 111
   }
  }
}

fge commented 11 years ago

OK, I can reproduce the bug easily, and on the other hand, I didn't mean the API to be used this way (if you are to define other keywords, you are meant to be defining your own syntax validators and keyword validators).

I am still chasing the bug here, but I have one question in the meanwhile: what do you mean "from" to be? An instance keyword or a schema keyword?

fge commented 11 years ago

More generally, there is one issue that I am aware of: I do not do recursive schema syntax validation, which means, for instance, with such a schema:

{
    "properties": {
        "p": { "type": null }
    }
}

I would only trigger a schema validation failure if the instance to validate is an object and has a property with name p.

HOWEVER. You define a new schema keyword here, so the API pretty much expects that you define it, see example 9.

I clearly must do recursive schema validation, but on the other hand you can define a new keyword for your needs -- and I just noticed that your extended schema has no definition for from.

I'd appreciate if you expressed your needs a little further? There is probably a way to do what you mean to do using 1.4.x.

fge commented 11 years ago

I did a little experience based on example 9. I declared the "indexed" keyword syntax-wise only and schema2 indeed fails to validate:

https://gist.github.com/4393138

I believe there are two problems in general:

I don't validate schemas recursively (mentioned above);
you don't declare the new keyword in an appropriate KeywordRegistry (which is done in the gist above).

I do agree that the current API is ungainly; I don't know what to do of this bug, since I cannot "autodetect" new keywords, and especially this from gets in the way. Which is why I'd like to know what your aim is.

asargento commented 11 years ago

The main idea behind this test is to create schemas that extends the draft v3 by adding new keywords/properties. In the test, I was extending the draft v3 by adding a new property, 'indexed', in schema1. Schema2 is the schema that define the instance structure and actually schema2 should look like

{
 "$schema" : "schema1",
 "type" : "object",
 "properties" : {
   "from" :  {
      "type" : "string",
      "required" : true,
      "indexed" : true
   },
   "to" : { "type" : "string", "indexed" : true },
   "subject" : { "type" : "string" }
  }
}

And an instance should look like

{
  "from" : "john.smith",
  "to" : "joe.todd",
  "subject" : "test"
}

So, what I need is the capacity of defining schemas with new properties (extending the drafts) using only the schema definitions.

fge commented 11 years ago

OK, so:

as you can see from the gist, and from Example9, what you want to achieve is definitely possible,
but not the way you want it.

There are two main reasons for it:

syntax validation is entirely separated from keyword validation (ie, checking a schema syntax is a separate process from ensuring an instance is valid against the schema);
syntax validation can do more than keyword validation alone when it comes to schema validation proper (for instance, when it comes to patternProperties, syntax validation can check whether keys of patternProperties are actually regexes: the core meta schema cannot guarantee that).

What you want, basically, is that a JSON file, when injected one way or another, automatically add a new keyword. The problem is that in this case, this can be only added at the syntax level.

Let's take your "indexed" keyword as an example: its value is a boolean, OK. But it plays no role in instance validation, right?

I had in the plans to make a nicer "schema extension" API, but did not plan to extend it using JSON input -- that would require blurring, and ultimately eliminating, the frontier between syntax and keyword validation.

Why not... I just cannot wrap my head around this yet ;)

fge commented 11 years ago

Manipulation error... Refer to above comment

fge commented 11 years ago

Wrapping up: you say

what I need is the capacity of defining schemas with new properties (extending the drafts) using only the schema definitions

And by "schema defintions" I understand a JSON file, right?

Now, if a programmatical way of doing that, more simple than the gist, were proposed? Say:

// New schema definition: must provide new URI; add keywords to it, build it.
// The only possible keyword additions are simple or multiple type syntax checking.
// More complex keywords, and in particular all keywords implying instance checking,
// would have to go the Keyword way.
MetaSchema mySchema = MetaSchema.basedOn(BuiltinSchemas.DRAFTV3_HYPERSCHEMA)
    .withURI("your://schema/uri#").withNewKeyword("indexed", NodeType.BOOLEAN).build();
// Create JsonSchemaFactory with this new schema, make it the default (second argument
// to .addMetaSchema() is true)
JsonSchemaFactory factory = new JsonSchemaFactory.Builder()
    .addMetaSchema(mySchema, true).build();

would that be OK?

As of yet, I am much more at ease with implementing the above than to "heuristically parse" a JSON schema file...

asargento commented 11 years ago

Unfortunately, I need that on the JSON files. Programmatically won't work for me. But if it is too hard to implement this, I can continue to use the version 0.5.0 beta4. It work like I need.

fge commented 11 years ago

I am sorry, but:

Unfortunately, I need that on the JSON files. Programmatically won't work for me.

Why? It does not really make any sense. And:

I can continue to use the version 0.5.0 beta4. It work like I need.

Please don't. Update to 1.4.x, it fixes quite a few bugs and is MUCH faster than 0.5.x.

fge commented 11 years ago

I also insist: please tell me your exact need. I think there is a better way to achieve what you want.

Also, since at least 1.0.x, a JsonSchemaFactory can be private static final, and so can a JsonSchema instance if you use only one. So, please expand upon your needs, so that we can devise a solution.

asargento commented 11 years ago

What I need is the following. First, I need to validate json instances. Image that I have a json like the following

{
  "from" : "john.smith",
  "to" : "santa.claus",
  "subject" : "Presents",
  "date" : "2012/12/21"
}

I can easily use the draft v3 or v4 to validate this instance with schema based on draft v3 (or v4). That schema would be

{
    "$schema" : "http://json-schema.org/draft-03/schema#",
    "id" : "http://example.com/message",
    "description" : "e-mail message template",
    "type" : "object",
    "properties" : {
        "date" : { "type" : "string", "format" : "date-time", "required" : true },
        "from" : { "type" : "string", "format" : "email", "required" : true },
        "to" :  { "type" : "string", "format" : "email" }
    }
}

This is fine. But now, I want to add some new keywords, like 'indexed' for instance. Then, I will extend the draft-03 schema.

{
    "$schema" : "http://json-schema.org/draft-03/schema#",
    "id":"http://example.com/my-schema",
    "description":"My Extended schema",
    "properties":{"indexed":{"type": "boolean"}},
    "extends":{"$ref":"http://json-schema.org/draft-03/schema#"}
}

And now, I can change the message schema

{
    "$schema" : "http://example.com/my-schema",
    "id" : "http://example.com/message",
    "description" : "e-mail message template",
    "type" : "object",
    "properties" : {
    "date" : { "type" : "string", "format" : "date-time", "required" : true, "indexed" : true },
    "from" : { "type" : "string", "format" : "email", "required" : true, "indexed" : true },
    "to" : { "type" : "string", "format" : "email" }
    }
}

That changes on schemas (adding new keywords) are made dynamically. I think that I was clear explaining my needs but if some parts still obscure please feel free to ask.

fge commented 11 years ago

I fail to understand how that worked with 0.5.x at all: new schema keywords also had to be added programatically...

fge commented 11 years ago

I have just understood why you saw your behaviour.

Let me explain. In the draft v3 metaschema, the properties keyword is validated like this:

{
    "properties": { "additionalProperties": { "$ref": "#" } }
}

The { "$ref": "#" } here refers to the schema itself -- that is, the draft v3 core schema and not your extended schema.

As a result, when you try to validate a schema such as:

{
    "properties": { "from": { "indexed": "meh" } }
}

with your extended schema, values of the properties keyword validate against the properties keyword as defined in the draft v3 core schema.

You therefore want to define your extended schema like this:

{
    "$schema" : "http://json-schema.org/draft-03/schema#",
    "id":"http://example.com/my-schema",
    "description":"My Extended schema",
    "extends":{"$ref":"http://json-schema.org/draft-03/schema#"},
    "properties": {
        "indexed": { "type": "boolean" },
        "properties": { "additionalProperties": { "$ref": "#" } }
    }
}

And unfortunately, this only overrides properties. All other keywords which refer to the metaschema also have to be redefined... And this is why adding new keywords programatically is by far the better choice.

This, or write your extended schema fully, ie take the entire draft v3 schema json file and add your keywords in it.

fge commented 11 years ago

Does my explanation make any sense?

asargento commented 11 years ago

Well, your explanation was not what I was expected. I was expecting a behaviour more like it is defined on the draft for v3, that is, "the behavior of extends can be seen as validating an instance against all constraints in the extending schema as well as the extended schema(s)".

And the example on that draft is rather similar to mine... Am I misunderstanding how the extends works?

fge commented 11 years ago

There are two things:

JSON References are resolved relatively to the URI of the JSON document in which they appear;
extends does not fuse constraints.

The first point means that the { "$ref": "#" } in the core metaschema will always refer to the core schema itself.

The second point means that having, for instance:

{
    "id": "ref://schema1#",
    "properties": { "p": { "type": "boolean" } },
    "additionalProperties": false
}

and:

{
    "id": "ref://schema2#",
    "extends": { "$ref": "ref://schema1#" },
    "properties": { "q": { "type": "boolean" } }
}

effectively means that the target instance has to obey both schemas at URIs ref://schema1# and ref://schema2#. It does not mean that schemas 1 and 2 are "fused". Ie, it does not mean schema2 above is equivalent to:

{
    "properties": {
        "p": { "type": "boolean" },
        "q": { "type": "boolean" }
    },
    "additionalProperties": false
}

This is, unfortunately, a very common misinterpretation of extends.

asargento commented 11 years ago

I forget to mention that extends definition on draft v3 also talks about in inheritance. In that case, the schema2 should inherit the definition of schema1 and a target instance

{
  "p" : true,
  "q" : false
}

should be valid and

{
  "p" : 123,
  "q" : false
}

should be invalid. Final result it should be the same, the last target instance is invalid for schema1 but is valid for schema2

fge commented 11 years ago

It talks about inheritance, but not the way you think. It explicitly says that it should be validating instances "against all constraints in the extending schema as well as the extended schema".

And schema1 says that there should be no additional property defined than p. As a result, both instances are invalid.

asargento commented 11 years ago

I see. Thanks for your clarifications. One more question: if schema1 didn't have additional property as false (additionalProperties : true), both instances still invalid? Or the first becomes valid?

fge commented 11 years ago

If additionalProperties were set to true (or were absent, since it equates to the same thing) then the first instance would be valid against both schemas, while the second would be invalid against both schemas since p is not a boolean.

asargento commented 11 years ago

How the target instances are validated against schema2? What is the validation procedure?

fge commented 11 years ago

OK, I'll get into the gory details there, so I beg you to ask questions/source references if there is an obscure point to you.

There are three elements:

the validation context,
the schema registry,
the validator cache.

First, you register your schema (a JSON document) against the registry; this returns a SchemaContainer.

Then the process is as follows (this is step 1):

Resolve JSON References to the actual schema to validate (this yields a SchemaNode, which is attached to a SchemaContainer). Note that during reference resolution, the container may change.

Once you get hold of the schema node, you go through the validator cache which does two things:

it validates the schema syntax; if it fails to validate, it returns a validator which always fails;
it computes the list of keyword validators, and returns an instance validator.

The actual validation happens in the instance validator:

it goes through keyword validators one by one and validates the instance against each keyword;
if the keyword is bound to validate against other schemas (which is the case for extends), it goes through step 1 again;
if the instance is valid at that point, and the instance is a container object with at least one element, a specialized validator is triggered for this type of container, with the current schema, to validate children; for each child, it is back to step 1 again.

Note that the order of keyword evaluation is completely random. And note that for your particular example, extends points to a JSON Reference: as the current schema (your schema) does not "contain" that reference, the container changes, and therefore the context.

And there is yet more. Believe me, I've had a hard time coming with a fully coherent implementation of all that ;) Reference resolving in particular is quite the devil's job. If you are interested into what happens at this point, you can have a look here:

https://github.com/fge/json-schema-validator/blob/master/src/main/java/org/eel/kitchen/jsonschema/validator/JsonResolver.java#L58

asargento commented 11 years ago

Thanks for the explanation, and believe on you about the complexity of the implementation. I was trying only to understand a little better the validation procedure and your explanation helps a lot. One more question: in draft 4, the required keyword became a list of the keywords that are required. Any particular reason for that change? (I am asking that, because my 'indexed' is rather similar and I am considering to change that my schemas).

fge commented 11 years ago

The reason for the change in required is that it was, for instance, valid to write a schema such as:

{
    "type": "integer",
    "required": true
}

Except that required only really made sense if it was in a subschema in properties. As a result, it could not be validated independently.

The new definition of required pulls it back at the instance level proper: it validates that, if the instance is an object, then this object must have members with the given names. Personally, I also find it more pleasing aesthetically, since the list of required members in an object instance is now gathered in one place instead of being spread around in properties.

(an interesting detail about the above: you will note that if you have to validate required against anything other than an object, then validation always succeeds)

Hope this helps!

asargento commented 11 years ago

I agree with you. Thanks for the help.

java-json-tools / json-schema-validator

Invalid extended schema is validated as correct #28