java-json-tools / json-schema-validator

A JSON Schema validation implementation in pure Java, which aims for correctness and performance, in that order
http://json-schema-validator.herokuapp.com/

Is there a way to print a schema with references resolved? #41

Closed kelvinpho closed 11 years ago

kelvinpho commented 11 years ago

Hi,

I'm not having any issues with SchemaNode and validation, but I would like a way to print my SchemaNode with $ref resolved. An example of this is on http://www.jsonschema.net/, where they can pretty-print the JSON Schema in JSON format.

Is there any way to do this with json-schema-validator's SchemaNode or SchemaTree?

Thanks!

fge commented 11 years ago

If I understand correctly, say you have:

{
    "type": "array",
    "items": { "$ref": "foo://bar#" }
}

and at JSON Reference foo://bar# you have:

{
    "type": "string",
    "minLength": 2
}

then you would like to have:

{
    "type": "array",
    "items": {
        "type": "string",
        "minLength": 2
    }
}

and this, recursively?
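Such a recursive expansion can be sketched with plain Java collections (maps and lists standing in for Jackson's JsonNode trees; the RefExpander name and the ref-lookup map are mine for illustration, not part of this library's API):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public final class RefExpander {
    private final Map<String, Object> refs; // ref URI -> resolved schema

    public RefExpander(final Map<String, Object> refs) {
        this.refs = refs;
    }

    public Object expand(final Object schema) {
        return expand(schema, new HashSet<String>());
    }

    @SuppressWarnings("unchecked")
    private Object expand(final Object node, final Set<String> inProgress) {
        if (node instanceof Map) {
            final Map<String, Object> map = (Map<String, Object>) node;
            final Object ref = map.get("$ref");
            if (ref instanceof String) {
                final Object target = refs.get(ref);
                // unknown ref, or a ref already being expanded (a cycle):
                // leave the "$ref" member untouched
                if (target == null || !inProgress.add((String) ref))
                    return node;
                final Object expanded = expand(target, inProgress);
                inProgress.remove(ref);
                return expanded;
            }
            final Map<String, Object> copy = new LinkedHashMap<String, Object>();
            // a full implementation would skip "enum" values here
            for (final Map.Entry<String, Object> e : map.entrySet())
                copy.put(e.getKey(), expand(e.getValue(), inProgress));
            return copy;
        }
        if (node instanceof List) {
            final List<Object> copy = new ArrayList<Object>();
            for (final Object element : (List<Object>) node)
                copy.add(expand(element, inProgress));
            return copy;
        }
        return node;
    }
}
```

The inProgress set is what keeps a self-referencing schema (such as "$ref": "#") from looping forever: a reference that is already being expanded is simply left in place.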

kelvinpho commented 11 years ago

Yes, that is exactly right. Is this possible with the current API? (I'm using version 2.0.0)

fge commented 11 years ago

Well, by writing a processor, yes it is. However, it is quite delicate. Consider this schema:

{
    "type": "array",
    "minItems": 2,
    "maxItems": 2,
    "items": { "oneOf": [ { "type": "integer" }, { "$ref": "#" } ] }
}

This is, by essence, a recursive schema -- it would lead to an infinite loop on expansion.

More generally, such a processor should fail if a resolved JSON Reference is contained within the schema, and the pointer in the schema at which this reference was found is a child of the pointer of the reference. Oh, and there is the case of enum for which we must not expand references.

And that is not all: if we expand a draft v4 schema, we don't want a draft v3 schema to come into the picture either.

As to other cases:

Such a processor is writable, but not easy!

fge commented 11 years ago

Hmm, no, this is not quite right: SchemaTreeEquivalence doesn't fit the bill, since it checks the current resolution context, whereas what is needed here is equivalence of the loading URI and base node. So another equivalence needs to be written, but this is the easy part.

fge commented 11 years ago

Anyway, I think this is an interesting thing to have, so I'll give a go at it -- but not immediately: I have an Avro converter to write!

fge commented 11 years ago

There is also the solution that you give a go at it, of course. In this case, if you have questions, do not hesitate to ask ;)

kelvinpho commented 11 years ago

Thanks! This API is new to me, but I'll give it a shot.

fge commented 11 years ago

You will need to have a good shot at the API of json-schema-core as well since this is what you will use to build your chains:

http://fge.github.com/json-schema-core/stable/index.html

In particular, look at the processing package.

kelvinpho commented 11 years ago

fge,

Perhaps I'm going about this the wrong way, but I basically need to "walk" the schema the same way that ValidationProcessor "walks" the schema. I see that you use the ArraySchemaSelector and ObjectSchemaSelector as helpers to cache/lookup things. Is there a reason that ObjectSchemaSelector, ObjectSchemaDigester, and ArraySchemaDigester are public, but ArraySchemaSelector has package access?

fge commented 11 years ago

Uhm, that is a visibility bug... The initial plan was to have them all package visible!

But now that you mention it, maybe they should not... Your need here, and another one I will have in the near future, will require that I walk a JSON Schema as well (note that syntax validation also walks it in some way), so maybe this could be factorized away and reused for both syntax and instance validation, and for your use case.

Out of curiosity, how did you plan to use these selectors? And would you mind sending a pull request making ArraySchemaSelector public? For the general case I'll have to think it out some more.

kelvinpho commented 11 years ago

Basically, I am doing deserialization/serialization of proprietary formats. Internally, we want our data representation and parsing to be very flexible.

For example, if I have a JSON schema:

{
    "type": "object",
    "properties": {
        "name": { "type": "string", "required": true },
        "age": { "type": "number", "required": false }
    }
}

I can write a generic serializer/deserializer from this schema. I will "walk" this schema, then I can know that the first field is called "name" and the 2nd field is called "age", and also validate that the data received is in the appropriate format (i.e. required, string, date, number etc). Of course, any property can be a reference to another schema if the format is very complex.

When we need to adjust the schema, all it will take is extending/modifying the JSON schema, and our parsing logic can remain intact. Doing this with POJOs is too difficult to maintain. The way you are walking the schema and JSON value during validation is exactly what I need to do. But instead of a JSON value, I will be walking a proprietary serialization format.
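The schema-driven checking described above can be sketched minimally (plain maps stand in for the parsed schema and record; the class name, method name, and error strings are mine, for illustration only):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public final class SchemaDrivenChecker {
    // walk the "properties" of a (draft v3-style) schema and check a record
    // against it: presence of required fields, and a few primitive types
    @SuppressWarnings("unchecked")
    public static List<String> check(final Map<String, Object> schema,
            final Map<String, Object> data) {
        final List<String> errors = new ArrayList<String>();
        final Map<String, Object> props
            = (Map<String, Object>) schema.get("properties");
        for (final Map.Entry<String, Object> e : props.entrySet()) {
            final Map<String, Object> field = (Map<String, Object>) e.getValue();
            final Object value = data.get(e.getKey());
            if (value == null) {
                if (Boolean.TRUE.equals(field.get("required")))
                    errors.add("missing required field: " + e.getKey());
                continue;
            }
            final String type = (String) field.get("type");
            final boolean ok = "string".equals(type) ? value instanceof String
                : "number".equals(type) ? value instanceof Number
                : true; // other types omitted from this sketch
            if (!ok)
                errors.add("wrong type for field: " + e.getKey());
        }
        return errors;
    }
}
```

Adjusting the schema then changes what is checked without touching this code, which is the flexibility being described.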

fge commented 11 years ago

What do you mean by proprietary format? A "proprietary JSON Schema"?

If yes, there may be another solution: write a processor which converts this to a JSON Schema, and if some constraints are not enforceable by existing keywords, you can create your own.

This is what I am currently doing with Avro Schema (I am writing an Avro to JSON Schema translator at the moment).

kelvinpho commented 11 years ago

No, this is not a proprietary json schema, this is a format such as CSV, XML, fixed width, flat file, or some other proprietary format based on raw bytes.

fge commented 11 years ago

OK, and you need to "flatten" JSON Schemas for your particular use case?

The more I think of it, the more I think what is needed is this:

Comments?

kelvinpho commented 11 years ago

My original work around was to create a flattened/resolved json schema, and stuff it into a JsonNode that I could walk.

After looking at the way ValidationProcessor is written, I think that your approach is much better. Making the walking logic generic would essentially remove the need for me to create a flattened json schema. I would basically take a set of bytes, and break it up into sub segments as I traversed down into the subfields of the JSON schema. Of course, as I walked through the bytes and schema, I would be building up a value to return to the user of my custom processor.

Theoretically, if the behavior was pulled out of ValidationProcessor, I could choose to walk the schema and build the flattened version, or walk the schema and parse my payload directly.

fge commented 11 years ago

OK, there is something to account for: as you may have seen, the logic in ValidationProcessor is driven by the data -- and, specifically, JSON, and even more specifically, Jackson's JsonNode.

But if I understand correctly, you want other types than JSON to be handled? Or do you convert your data to JSON before processing?

kelvinpho commented 11 years ago

I did notice that it was driven by the data. I was able to put together a Processor that mimics the behavior of the ValidationProcessor, but instead of walking the data, it walks the schema while resolving references. I have two major things to figure out now.

1) If I'm walking the schema, is it possible to "break up" my data in the appropriate way for the recursive call?
2) How do I generate a JsonNode and build it up while I walk my schema?

kelvinpho commented 11 years ago

The more I think about it, the more difficult I think it will be to break the coupling between walking the schema and walking my data. The knowledge of how to "walk" the schema lives in the process, processArray, and processObject methods. These methods also need to contain customized logic on how to break down a chunk of data into the appropriate sub-components. The decision on how to break up the chunks can be anything from commas (CSV files) to a field in the payload telling you how many chunks follow and how long each one is.

fge commented 11 years ago

As to your first question, consider this schema:

{
    "properties": { "p9": {} },
    "patternProperties": { "^p": {}, "\\d+$": {} }
}

If you have a member with name "p9", it will have to be valid against all three schemas. Which means you will end up breaking the data into its individual components anyway, and be driven by the data...
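Collecting every subschema that applies to a given member name, as in the example above, might look like this (plain maps stand in for JsonNode; the class and method names are mine):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public final class MemberSchemas {
    // gather the subschemas a member of the given name must validate against:
    // its entry in "properties", plus every "patternProperties" entry whose
    // regex matches the name (JSON Schema regexes are unanchored, hence find())
    @SuppressWarnings("unchecked")
    public static List<Object> schemasFor(final Map<String, Object> schema,
            final String member) {
        final List<Object> result = new ArrayList<Object>();
        final Object props = schema.get("properties");
        if (props instanceof Map && ((Map<String, Object>) props).containsKey(member))
            result.add(((Map<String, Object>) props).get(member));
        final Object patterns = schema.get("patternProperties");
        if (patterns instanceof Map)
            for (final Map.Entry<String, Object> e
                    : ((Map<String, Object>) patterns).entrySet())
                if (Pattern.compile(e.getKey()).matcher(member).find())
                    result.add(e.getValue());
        return result;
    }
}
```

For the member "p9" of the example, this yields all three subschemas: the "p9" property plus both pattern matches.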

The main difference with, say, JsonTree and SchemaTree is that where both of these return a JsonNode, in MutableTree they return an ObjectNode -- and as such all mutation methods of ObjectNode are available.

With future Jackson, it will be possible to write something like this:

final JsonNode newNode = node.thaw().put("foo", "bar").etc().etc().freeze();

I plan to extend MutableTree, probably making a Frozen/Thawed pair (not unlike the code above, in fact!), because I think it can be very useful. That would go in -core, of course.

Now, to your second comment: yes, I detected that difficulty as well. What would be needed is a generic way to walk the data. Not impossible, mind. After all, this is part of the plan for Jackson as well, with a beefed up TreeNode. But basically, it means being able to break up data the way JSON is "broken up": null, boolean, string, number, array and object.

(why does github mess numbered list items?)

fge commented 11 years ago

Hello again,

Is the source code of your processor available somewhere, or does it "touch private matters" already? I'd like to see how you did it, I must say I lack inspiration to get started for generalizing ValidationProcessor.

kelvinpho commented 11 years ago

Here is a link to the basic stripped down Validator. I called it ParserProcessor.

https://github.com/kalgreenman/json-schema-validator/blob/walker/src/main/java/com/github/fge/jsonschema/processors/parser/ParsingProcessor.java

1) This version walks the schema and not the data
2) I haven't figured out the best way to break down the data yet
3) I haven't figured out the best way to collect a return value yet

This works for the ObjectNodes, but I haven't tested ArrayNodes yet.

Let me know if you have any suggestions or if I'm making any heinous mistakes here.

Thanks!

fge commented 11 years ago

OK, a couple of remarks. First, if you have a schema such as:

{
    "additionalProperties": { "$ref": "some://where" }
}

the digester will not tell you about that reference with the current code. In fact, you not only need to walk properties but also additionalProperties and patternProperties.

Anyway: as I need such a walker for what I am going to do next (JSON Schema to Avro), I'm going to have a go at it too. And I have thought about it more again: maybe what is more suited is the way SyntaxProcessor works: SyntaxCheckers already recursively scan schemas and only schemas.

fge commented 11 years ago

OK, look at the commit referenced above: it contains the general idea.

If you look at, for instance, DraftV4PropertiesSyntaxChecker, you will see that it collects pointers in the Collection<JsonPointer> so that the SyntaxProcessor which called it can process these pointers in the tree afterwards.

And this is the general idea of walking here: all syntax checkers have the logic in place (and tested), that just needs to be made more generic. The KeywordWalkers will collect pointers so that the SchemaWalker can process them, and one possible processing is to substitute $ref ;)

There can be many other uses for this. I just have to think a little more about how to make this really generic.

kelvinpho commented 11 years ago

Thanks! I'll take a look and try the SyntaxProcessor/SchemaWalker way of doing things.

fge commented 11 years ago

In fact I'm already on it ;)

I have transformed it so as to have a pure walker at first. Hold your breath for a couple of minutes and I should have a first version of it in working order real soon.

kelvinpho commented 11 years ago

OK, I'll hang on ;)

fge commented 11 years ago

Mind helping me a little? It won't be that hard, but it needs to be done ;) If you are willing, I'll explain what to do. It is really not hard.

kelvinpho commented 11 years ago

Sure, what do you need?

fge commented 11 years ago

Do you see the commits I have done so far testing collectors? The principle is as follows: all keywords which can contain subschemas need to be tested. The way of recognizing such keywords is to see, in src/test/resources/syntax, all JSON test files where there are pointerTests entries: these are the keywords which need to be tested.

The process is as follows:

It is really fast, since all the test infrastructure/code infrastructure etc. is already created. Note how AbstractPointerCollector is done: you can use getNode(tree) to get the node for this keyword, and basePointer contains a JsonPointer to which you only need to .append() to build the pointer to add.

If you wish, you can attack draft v4, I do draft v3 ;) Note: I have just finished dependencies, which is in fact in common.

fge commented 11 years ago

Note: I'll do items since it is also common to both drafts.

kelvinpho commented 11 years ago

OK, I will give it a shot for v4

kelvinpho commented 11 years ago

Some of the files did not contain any pointerTests. Should I delete these files? or leave them as empty?

The files currently look like: { }

fge commented 11 years ago

Oh, that's true.

Just ignore them.

By the way, "properties" also goes in the common section, so no need to worry about it either.

kelvinpho commented 11 years ago

It looks like you have dependencies in common. Should I ignore this one too? Additionally, your common properties file does not have the "pointerTests" key and just starts with the array. Should I do the same? My .json files look like...

{
    "pointerTests": [
        {
            "schema": { "dependencies": { "b": {}, "a++": {}, "c": null } },
            "pointers": [ "/dependencies/a++", "/dependencies/b" ]
        }
    ]
}

fge commented 11 years ago

Ah, yes, it is only the array which needs to be copied over, not the object.

And yes, "dependencies" is in common/.

In fact, the only keywords you need to care about now are "anyOf", "allOf", "oneOf", "not" and "definitions". The first three can share a common base class (for instance SchemaArrayPointerCollector) and the last one can reuse SchemaMapPointerCollector which is also used for patternProperties and properties.

Right now I am testing the core mechanics of SchemaWalker itself -- this also needs to be done ;)
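The shared collector for the schema-array keywords could look roughly like this (plain maps and strings stand in for the library's SchemaTree and JsonPointer types; proper JSON Pointer escaping of "~" and "/" is omitted):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public final class SchemaArrayPointers {
    // collect one pointer per element of a schema-array keyword such as
    // "anyOf": the walker can then process each subschema in turn
    public static List<String> collect(final Map<String, Object> schema,
            final String keyword, final String basePointer) {
        final List<String> pointers = new ArrayList<String>();
        final Object value = schema.get(keyword);
        if (value instanceof List) {
            final int size = ((List<?>) value).size();
            for (int i = 0; i < size; i++)
                pointers.add(basePointer + "/" + keyword + "/" + i);
        }
        return pointers;
    }
}
```

The same shape reused for "anyOf", "allOf", and "oneOf" is exactly why a single SchemaArrayPointerCollector base class suffices for all three.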

kelvinpho commented 11 years ago

OK,

I have all the files fixed, and I'm going through the PointerCollector collect method logic. I have definitions working, but I'm struggling with "not".

According to the json-schema.org "latest specification":

5.5.6. not

5.5.6.1. Valid values

This keyword's value MUST be an object. This object MUST be a valid JSON Schema.

So I should check whether the tree node is an object before adding the base pointer. However, this fails your first test case, since "inMyLife" is a string, not an object.

[
    {
        "schema": { "not": "inMyLife" },
        "pointers": [ "/not" ]
    },
    {
        "schema": { "not": {} },
        "pointers": [ "/not" ]
    }
]

kelvinpho commented 11 years ago

All the tests now pass. Once you let me know what you want to do with "not", I'll get the pull request ready.

Note: I did not make Dependencies or PatternProperties extend SchemaMapPointerCollector. I can do that also, but I didn't want to stomp on any changes you were making in the common package.

fge commented 11 years ago

As to not, it is quite simple:

pointers.add(basePointer);

Its argument is just a schema.

fge commented 11 years ago

Oh, I see what you mean wrt not.

As I wrote these tests for syntax validation to begin with, the pointer was always appended: it was up to syntax validation (when entering validate()) not to go any further, since it tested that the node was an object.

But for SchemaWalker, the schema will be valid -- that is a prerequisite. You can just remove the inMyLife test.

fge commented 11 years ago

OK, I am doing a first version of RefExpander.

Note: I think I'll move the code into -core, since it will ultimately be a general-purpose walking mechanism -- and you can update the -core dependency independently. In the meanwhile, if you can test with your own source code, this will be branch walk of my repo.

I may also change some things after I have written the first version. For instance, I think the .doProcess() method will change for it to be more generic. But, as I said, right now, a first version.

fge commented 11 years ago

OK, I have a first version. But it is butt ugly. It works. But it's ugly. And I know how I can make it break quite easily.

I need to have a mutable tree that I fill on the go, this version is real crap. But... Well... For simple cases like the one I wrote, it works OK...

fge commented 11 years ago

OK, I need to think about it some more.

The problem is not with the logic of SchemaWalker and PointerCollector, it is pretty sound. The problem is plugging in whatever work is needed. And I think a Processor is not the way to do it.

I'll think about it some more, right now I need sleep ;) But basically it is needed that we pass a mutable object to process all along the chain, and process it when we walk. .walk() can stay, but .processCurrent certainly needs to be given the boot for something better.

If you think of a design and have some time ahead of you, I'm open to ideas!

fge commented 11 years ago

OK, I have a plan. First, schema walking will be split in two: one walking strategy will not resolve the refs, the other will.

Then there will be an interface:

public interface SchemaListener
{
    void onWalk(final SchemaTree tree);

    void onTreeChange(final SchemaTree oldTree, final SchemaTree newTree);
}

The first will be called each time the walk function is entered; the second will be called each time the current schema tree changes due to ref resolution. Of course, when not resolving refs, the second method will never be triggered.

I'll implement this and let you know how it went.
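A bare-bones rendition of the non-resolving walking strategy might look like this (plain maps stand in for SchemaTree, and every object member is descended into, which a real walker would restrict to schema-bearing keywords):

```java
import java.util.Map;

public final class MiniWalker {
    // minimal analogue of the SchemaListener idea: onWalk fires once per
    // (sub)schema entered; no onTreeChange here, since refs are not resolved
    public interface Listener {
        void onWalk(Map<String, Object> tree);
    }

    @SuppressWarnings("unchecked")
    public static void walk(final Map<String, Object> tree,
            final Listener listener) {
        listener.onWalk(tree);
        for (final Object value : tree.values())
            if (value instanceof Map)
                walk((Map<String, Object>) value, listener);
    }
}
```

A listener that substitutes $ref members as trees are entered is then one possible implementation, which is the expansion use case this issue started from.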

fge commented 11 years ago

OK, good news, I have a fairly complete working schema walker, with associated listeners. I could implement schema substitution the way you initially asked for, and this will also help me for Avro, so it is close to being done.

You talked about other uses for this, I'd be curious to know them?

Note that the interface is not finalized yet, I need to find better names, document etc.

fge commented 11 years ago

OK, the walk branch is now obsolete, the code has been merged into the master branch. Note, however: work is currently in progress to make this code part of -core, like I hinted earlier.

kelvinpho commented 11 years ago

Thanks! I will take a look.

Basically here are my use cases:

1) Walk the schema or data to unmarshall a payload of bytes into a JSON object that will pass the JSON schema.
2) Walk the schema or data to marshall a JSON object back into bytes.
3) Be able to load schemas and their references from any URI.
4) Maintain schemas in an in-memory dictionary/library that is loaded only once.
5) There will be multiple versions of the same schema (i.e. versions 1 through 5 of schema A).

fge commented 11 years ago

OK, that makes things more clear, and I have some questions ;)

As to point 1, this unmarshalling can be done with a separate processor, which means only -core is needed, right? What is left to do is to build the appropriate inputs for -validator to operate; what is more, you say "schema or data": if it is a schema only, what about the data? If it is the data, what about the schema? See below for more, however.

As to point 2: am I correct in assuming this is why you needed ref resolved (for the schema)?

As to point 3: SchemaLoader provides everything you need here, since you can support any URI scheme, redirect URIs, preload schemas and so on; however, it is not publicly documented as being a feature, since its primary use at the moment is to be used by a RefResolver to resolve references. Do I understand you want this to be more "public" so as to provide SchemaTrees?

As to point 4: here again, SchemaLoader has what it takes. And, by the way, all this is done via a LoadingConfiguration.

And I don't quite understand point 5?

fge commented 11 years ago

And here is the "below for more". I have the intent to provide, in -core, a mechanism to fuse the output of two processors into the input for another:

public interface ProcessorJoiner<OUT1, OUT2, IN>
{
    IN join(final OUT1 out1, final OUT2 out2);
}

and the same for split. In your use case, this could be used, for instance, to plug in a processor producing a ValueHolder<JsonTree> from a binary source, join it with a ValueHolder<SchemaTree>, and make a FullData. I still have to work out the details, however.
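A toy rendition of the joiner idea, with strings standing in for the two processor outputs and a simple map entry standing in for the fused input:

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

public final class JoinerDemo {
    // the ProcessorJoiner shape from the comment above
    public interface ProcessorJoiner<OUT1, OUT2, IN> {
        IN join(OUT1 out1, OUT2 out2);
    }

    // fuse a "schema" output and a "data" output into one input pair
    public static final ProcessorJoiner<String, String, Map.Entry<String, String>>
        PAIRING = SimpleImmutableEntry::new;
}
```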

fge commented 11 years ago

Note: I have just committed the removal of the walking mechanism from -validator, it is now in -core.

Which means I'll continue work there. The discussion can go on in this issue however.

kelvinpho commented 11 years ago

Comments on your comments:

(response to 1, including the reference to below): I think that would work well, since one processor will walk the schema, and one custom processor will have logic to walk the data, and the output of both of these values will be sent to "join" for processing. The result can be collected in the return value. In some cases, such as walking an array, the same schema will be passed with each value in the array.

(response to 2): The refs will need to be resolved for marshalling and unmarshalling the bytes. I will need refs resolved in both cases, whether my "walking" is driven by the schema or the data.

(response to 3): I think in some cases, it will be extremely valuable to me to directly access the SchemaTree of any SchemaNode. I could then actually store more metadata in the JSON Schema, and access it through the exposed SchemaTree. It will definitely give me more flexibility and allow me to ask questions about the schema without having to traverse the entire thing.

(response to 5): For my protocol specifications, I could potentially have the following:

{
  "title":"person v1",
  "type":"object",
  "properties": {
    "name" : {
      "type": "string",
      "required":true
    }
  }
}

And also

{
  "title":"person v2",
  "type":"object",
  "properties": {
    "name" : {
      "type": "string",
      "required":true
    }, 
    "age" : {
      "type": "number"
    }
  }
}

My parsing logic will first have to inspect the data to determine if I should parse the payload with v1 or v2. Then I will lookup "person v2" from some dictionary, so that I can parse the payload with the proper schema. This is really just a namespace issue and I think this is already supported.