Closed: kelvinpho closed this 11 years ago
If I understand correctly, say you have:
{
    "type": "array",
    "items": { "$ref": "foo://bar#" }
}
and at JSON Reference foo://bar#
you have:
{
    "type": "string",
    "minLength": 2
}
then you would like to have:
{
    "type": "array",
    "items": {
        "type": "string",
        "minLength": 2
    }
}
and this, recursively?
Yes, that is exactly right. Is this possible with the current API? (I'm using version 2.0.0)
Well, by writing a processor, yes it is. However, it is quite delicate. Consider this schema:
{
    "type": "array",
    "minItems": 2,
    "maxItems": 2,
    "items": { "oneOf": [ { "type": "integer" }, { "$ref": "#" } ] }
}
This is, by essence, a recursive schema -- it would lead to an infinite loop on expansion.
More generally, such a processor should fail if a resolved JSON Reference is contained within the schema, and the pointer in the schema at which this reference was found is a child of the pointer of the reference. Oh, and there is the case of enum, for which we must not expand references.
And that is not all: if we expand a draft v4 schema, we don't want a draft v3 schema to come into the picture either.
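The recursion hazard above can be made concrete with a small sketch. To be clear, this is NOT json-schema-core's API: schemas are plain nested Maps, "$ref" values resolve against a hypothetical in-memory registry, and a stack of visited URIs guards against loops like the `{ "$ref": "#" }` case (the enum exception is honored too):

```java
import java.util.*;

// Sketch only -- NOT json-schema-core's API. Schemas are plain nested Maps,
// and "$ref" values are resolved against a hypothetical in-memory registry.
// A stack of visited URIs catches recursive schemas such as { "$ref": "#" }.
public final class RefExpansionSketch {
    private final Map<String, Map<String, Object>> registry;

    public RefExpansionSketch(final Map<String, Map<String, Object>> registry) {
        this.registry = registry;
    }

    public Object expand(final Object node, final Deque<String> seen) {
        if (!(node instanceof Map))
            return node; // scalars (and, for brevity, arrays) pass through
        final Map<?, ?> obj = (Map<?, ?>) node;
        final Object ref = obj.get("$ref");
        if (ref instanceof String) {
            final String uri = (String) ref;
            if (seen.contains(uri)) // the infinite-loop case above
                throw new IllegalStateException("ref loop: " + uri);
            final Map<String, Object> target = registry.get(uri);
            if (target == null)
                throw new NoSuchElementException("unresolvable ref: " + uri);
            seen.push(uri);
            final Object expanded = expand(target, seen);
            seen.pop();
            return expanded;
        }
        final Map<String, Object> out = new LinkedHashMap<>();
        for (final Map.Entry<?, ?> e : obj.entrySet()) {
            // "enum" members are opaque data, never schemas: copy verbatim
            final Object v = "enum".equals(e.getKey()) ? e.getValue()
                : expand(e.getValue(), seen);
            out.put((String) e.getKey(), v);
        }
        return out;
    }
}
```

With the registry mapping foo://bar# to the string schema above, expanding the array schema yields the inlined form from the first message, while a self-referencing schema fails fast instead of looping.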
As to other cases:

- RefResolver;
- SyntaxProcessor;
- SchemaTree equivalence is handled by SchemaTreeEquivalence.getInstance().

Such a processor is writable, but not easy!
Hmm no, this is not quite right: SchemaTreeEquivalence doesn't fit the bill, since it checks the current resolution context, whereas what is needed here is equivalence of the loading URI and base node. So another equivalence needs to be written, but this is the easy part.
Anyway, I think this is an interesting thing to have, so I'll give a go at it -- but not immediately: I have an Avro converter to write!
There is also the option of you giving it a go yourself, of course. In that case, if you have questions, do not hesitate to ask ;)
Thanks! This api is new to me but I'll give it a shot.
You will need to have a good shot at the API of json-schema-core as well since this is what you will use to build your chains:
http://fge.github.com/json-schema-core/stable/index.html
In particular, look at the processing package.
fge,
Perhaps I'm going about this the wrong way, but I basically need to "walk" the schema the same way that ValidationProcessor "walks" the schema. I see that you use the ArraySchemaSelector and ObjectSchemaSelector as helpers to cache/lookup things. Is there a reason that ObjectSchemaSelector, ObjectSchemaDigester, and ArraySchemaDigester are public, but ArraySchemaSelector has package access?
Uhm, that is a visibility bug... The initial plan was to have them all package visible!
But now that you mention it, maybe they should not... Your need here, and another one I will have in the near future, will require that I walk a JSON Schema as well (note that syntax validation also walks it, in some way), so maybe this could be factorized away and reused for syntax validation, instance validation, and your use case.
Out of curiosity, how did you plan to use these selectors? And would you mind sending a pull request making ArraySchemaSelector public? For the general case I'll have to think it out some more.
Basically, I am doing deserialization/serialization of proprietary formats. Internally, we want our data representation and parsing to be very flexible.
For example, if I have a JSON schema:

{
    "type" : "object",
    "properties" : {
        "name" : { "type" : "string", "required" : true },
        "age" : { "type" : "number", "required" : false }
    }
}
I can write a generic serializer/deserializer from this schema. I will "walk" this schema, then I can know that the first field is called "name" and the 2nd field is called "age", and also validate that the data received is in the appropriate format (i.e. required, string, date, number etc). Of course, any property can be a reference to another schema if the format is very complex.
When we need to adjust the schema, all it will take is extending/modifying the JSON schema, and our parsing logic can remain intact. Doing this with POJOs is too difficult to maintain. The way you are walking the schema and JSON value during validation is exactly what I need to do. But instead of a JSON value, I will be walking a proprietary serialization format.
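As an illustration of this idea (purely hypothetical, not part of json-schema-validator): the property map of an object schema, held as a plain ordered Map, can drive the parsing of one flat comma-separated record, with "type" and "required" interpreted per field:

```java
import java.util.*;

// Hypothetical illustration -- not json-schema-validator's API. The property
// map of an object schema (as a plain, ordered Map) drives the parsing of
// one comma-separated record; "type" and "required" are honored per field.
public final class SchemaDrivenCsvParser {
    public static Map<String, Object> parse(
            final Map<String, Map<String, Object>> properties, final String csvLine) {
        final String[] fields = csvLine.split(",", -1);
        final List<String> names = new ArrayList<>(properties.keySet());
        final Map<String, Object> result = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            final Map<String, Object> propSchema = properties.get(names.get(i));
            final String raw = i < fields.length ? fields[i] : "";
            if (raw.isEmpty()) {
                if (Boolean.TRUE.equals(propSchema.get("required")))
                    throw new IllegalArgumentException("missing required field: " + names.get(i));
                continue; // optional field absent: skip it
            }
            // coerce according to the declared "type" (string/number only here)
            final Object value = "number".equals(propSchema.get("type"))
                ? Double.valueOf(raw) : raw;
            result.put(names.get(i), value);
        }
        return result;
    }
}
```

The Nth CSV field maps to the Nth declared property, so the map must preserve insertion order (e.g. a LinkedHashMap); JSON object member order is not guaranteed in general, which is one of the things a real implementation would have to address.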
What do you mean by proprietary format? A "proprietary JSON Schema"?
If yes, there may be another solution: write a processor which converts this to a JSON Schema, and if some constraints are not enforceable by existing keywords, you can create your own.
This is what I am currently doing with Avro Schema (I am writing an Avro to JSON Schema translator at the moment).
No, this is not a proprietary json schema, this is a format such as CSV, XML, fixed width, flat file, or some other proprietary format based on raw bytes.
OK, and you need to "flatten" JSON Schemas for your particular use case?
The more I think of it, the more I think what is needed is this:
- KeywordValidator: change the interface so that its first and third argument can change;
- ValidationProcessor: make its walking logic available to other processors.

Comments?
My original work around was to create a flattened/resolved json schema, and stuff it into a JsonNode that I could walk.
After looking at the way ValidationProcessor is written, I think that your approach is much better. Making the walking logic generic would essentially remove the need for me to create a flattened json schema. I would basically take a set of bytes, and break it up into sub segments as I traversed down into the subfields of the JSON schema. Of course, as I walked through the bytes and schema, I would be building up a value to return to the user of my custom processor.
Theoretically, if the behavior was pulled out of ValidationProcessor, I could choose to walk the schema and build the flattened version, or walk the schema and parse my payload directly.
OK, there is something to account for: as you may have seen, the logic in ValidationProcessor is driven by the data -- and, specifically, JSON, and even more specifically, Jackson's JsonNode.
But if I understand correctly, you want other types than JSON to be handled? Or do you convert your data to JSON before processing?
I did notice that it was driven by the data. I was able to put together a Processor that mimics the behavior of the ValidationProcessor, but instead of walking the data, it walks the schema while resolving references. I have two major things to figure out now.
1) if I'm walking the schema, is it possible to "break up" my data in the appropriate way for the recursive call? 2) how do I generate a JsonNode and build it up while I walk my schema?
The more I think about it, the more difficult I think it will be to break the coupling between walking the schema and walking my data. The knowledge on how to "walk" the schema lives in the process, processArray, and processObject methods. These methods also need to contain customized logic on how to break down a chunk of data into the appropriate sub components. The decision on how to break up the chunks can be anywhere between commas (CSV files), to a field on the payload telling you how many chunks follow, and how long each one is.
As to your first question:

{
    "properties": { "p9": {} },
    "patternProperties": { "^p": {}, "\\d+$": {} }
}
If you have a member with name "p9", it will have to be valid against all three schemas. Which means you will end up breaking the data into its individual components anyway, and be driven by the data...
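That selection rule can be sketched as follows (plain strings stand in for schemas and pointers, the regexes are matched unanchored as JSON Schema requires, and JSON Pointer escaping is elided):

```java
import java.util.*;
import java.util.regex.Pattern;

// Sketch of the selection rule above: a member is validated against the
// exact-name schema in "properties" plus every "patternProperties" schema
// whose (unanchored) regex matches the member name. Pointers are plain
// strings here, and JSON Pointer escaping is elided.
public final class MemberSchemaSelector {
    public static List<String> schemasFor(final String member,
            final Set<String> properties, final Set<String> patterns) {
        final List<String> pointers = new ArrayList<>();
        if (properties.contains(member))
            pointers.add("/properties/" + member);
        for (final String regex : patterns)
            if (Pattern.compile(regex).matcher(member).find())
                pointers.add("/patternProperties/" + regex);
        return pointers;
    }
}
```

For the schema above, member "p9" picks up all three subschemas: the exact-name one plus both pattern matches.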
2: I have created a limited-purpose class for that in my Avro to JSON Schema processor (which I have just completed), but the real deal will be when I "do" to Jackson what I promised its author to do: apply the freeze/thaw pattern to JsonNode. At the moment however, and this is what I use, all types of nodes can be created using a JsonNodeFactory. There is one provided by JacksonUtils.nodeFactory(). Here is the very limited-purpose class that I have created:
The main difference with, say, JsonTree and SchemaTree is that where both of these return a JsonNode, in MutableTree they return an ObjectNode -- and as such all mutation methods of ObjectNode are available.
With future Jackson, it will be possible to write something like this:
final JsonNode newNode = node.thaw().put("foo", "bar").etc().etc().freeze();
I plan to extend MutableTree, probably making a Frozen/Thawed pair (not unlike the code above, in fact!), because I think it can be very useful. That would go in -core, of course.
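The freeze/thaw pattern being discussed can be sketched generically. FrozenMap and ThawedMap are made-up names for illustration, not Jackson's planned classes; the point is the shape: an immutable view paired with a chainable mutable builder.

```java
import java.util.*;

// Generic sketch of the freeze/thaw pattern mentioned above. FrozenMap and
// ThawedMap are made-up names; this is not Jackson's (planned) API, just
// the shape of it: an immutable view paired with a chainable builder.
public final class FrozenMap {
    private final Map<String, Object> map;

    private FrozenMap(final Map<String, Object> map) { this.map = map; }

    public static FrozenMap of() { return new FrozenMap(Collections.emptyMap()); }

    public Object get(final String key) { return map.get(key); }

    public ThawedMap thaw() { return new ThawedMap(new LinkedHashMap<>(map)); }

    public static final class ThawedMap {
        private final Map<String, Object> map;

        private ThawedMap(final Map<String, Object> map) { this.map = map; }

        public ThawedMap put(final String key, final Object value) {
            map.put(key, value);
            return this; // chainable, as in node.thaw().put(...).freeze()
        }

        public FrozenMap freeze() {
            return new FrozenMap(Collections.unmodifiableMap(new LinkedHashMap<>(map)));
        }
    }
}
```

Note that thawing copies, so a frozen instance is never mutated behind a reader's back; that is the whole appeal of the pattern.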
Now, to your second comment: yes, I detected that difficulty as well. What would be needed is a generic way to walk the data. Not impossible, mind. After all, this is part of the plan for Jackson as well, with a beefed-up TreeNode. But basically, it means being able to break up data the way JSON is "broken up": null, boolean, string, number, array and object.
(why does GitHub mess up numbered list items?)
Hello again,
Is the source code of your processor available somewhere, or does it "touch private matters" already? I'd like to see how you did it; I must say I lack inspiration to get started on generalizing ValidationProcessor.
Here is a link to the basic stripped down Validator. I called it ParserProcessor.
1) This version walks the schema and not the data
2) I haven't figured out the best way to break down the data yet
3) I haven't figured out the best way to collect a return value yet
This works for the ObjectNodes, but I haven't tested ArrayNodes yet.
Let me know if you have any suggestions or if I'm making any heinous mistakes here.
Thanks!
OK, a couple of remarks:

- null as a first argument to the ref resolver processor is, uhm ;) You can use a DevNullReport if you don't care about the messages, but since RefResolver only throws exceptions, see below;
- use a SchemaHolder instead of a FullData;
- instead of System.out.println(), you can use report.debug() and initialize a ConsoleProcessingReport with LogLevel.DEBUG as the log level -- this report implementation uses System.out.println();
- use JsonPointer.of(index) instead -- this is the only place where I forgot this!
- similarly, JsonPointer.of("properties", field);
;{
"additionalProperties": { "$ref": "some://where" }
}
the digester will not tell you with the current code. In fact, you not only need to walk properties
but also additionalProperties
and patternProperties
.
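That remark can be sketched as a tiny collector (schemas as plain Maps, pointers as strings; this is not the real digester code):

```java
import java.util.*;

// Sketch of the remark above -- not the real digester code. Object schemas
// (plain Maps here) hide subschemas under "properties", "patternProperties"
// AND "additionalProperties"; all three must be walked.
public final class ObjectKeywordPointers {
    @SuppressWarnings("unchecked")
    public static List<String> collect(final Map<String, Object> schema) {
        final List<String> pointers = new ArrayList<>();
        for (final String keyword : new String[] { "properties", "patternProperties" }) {
            final Object node = schema.get(keyword);
            if (node instanceof Map)
                for (final String member : ((Map<String, Object>) node).keySet())
                    pointers.add("/" + keyword + "/" + member); // escaping elided
        }
        // "additionalProperties" is either a boolean or one single schema
        if (schema.get("additionalProperties") instanceof Map)
            pointers.add("/additionalProperties");
        return pointers;
    }
}
```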
Anyway: as I need such a walker for what I am going to do next (JSON Schema to Avro), I'm going to have a go at it too. And I have thought about it some more: maybe what is more suited is the way SyntaxProcessor works: SyntaxCheckers already recursively scan schemas, and only schemas.
OK, look at the commit referenced above: it contains the general idea.
If you look at, for instance, DraftV4PropertiesSyntaxChecker, you will see that it collects pointers in the Collection&lt;JsonPointer&gt; so that the SyntaxProcessor which called it can process these pointers in the tree afterwards.
And this is the general idea of walking here: all syntax checkers have the logic in place (and tested); that just needs to be made more generic. The KeywordWalkers will collect pointers so that the SchemaWalker can process them, and one possible processing is to substitute $ref ;)
There can be many other uses for this. I just have to think a little more about how to make this really generic.
Thanks! I'll take a look and try the SyntaxProcessor/SchemaWalker way of doing things.
In fact I'm already on it ;)
I have transformed it so as to have a pure walker at first. Hold your breath for a couple of minutes and I should have a first version of it in working order real soon.
OK, I'll hang on ;)
Mind helping me a little? It won't be that hard, but it needs to be done ;) If you are willing, I'll explain what to do. It is really not hard.
Sure, what do you need?
Do you see the commits I have done so far testing collectors? The principle is as follows: all keywords which can contain subschemas need to be tested. The way to recognize such keywords is to look, in src/test/resources/syntax, at all JSON test files where there are pointerTests entries: these are the keywords which need to be tested.
The process is as follows:

- copy the file (for instance draftv4/allOf.json) from syntax to walk, and modify the file so that the pointerTests array remains alone;
- write a PointerCollector in the source;
- modify DraftV4PointerCollectorDictionary to add the keyword and collector.

It is really fast, since all the test infrastructure/code infrastructure etc. is already created. Note how AbstractPointerCollector is done: you can use getNode(tree) to get the node for this keyword, and basePointer contains a JsonPointer which you only need to .append() to in order to build the pointer to add.
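The basePointer/.append() mechanics follow RFC 6901 JSON Pointers; here is a minimal standalone sketch (this is not json-schema-core's JsonPointer class, just the same idea):

```java
// Minimal RFC 6901 sketch of the basePointer/.append() mechanics above;
// this is not json-schema-core's JsonPointer class, just the same idea.
public final class PointerSketch {
    private final String pointer;

    public PointerSketch(final String pointer) { this.pointer = pointer; }

    public PointerSketch append(final String token) {
        // RFC 6901 escaping: "~" becomes "~0", "/" becomes "~1" (~ first!)
        final String escaped = token.replace("~", "~0").replace("/", "~1");
        return new PointerSketch(pointer + "/" + escaped);
    }

    public PointerSketch append(final int index) {
        return new PointerSketch(pointer + "/" + index);
    }

    @Override
    public String toString() { return pointer; }
}
```

Appending "properties" then a member name yields pointers like /properties/name; the escape order matters, since replacing "/" first would corrupt a token containing "~1".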
If you wish, you can attack draft v4, I'll do draft v3 ;) Note: I have just finished dependencies, which is in fact in common.
Note: I'll do items since it is also common to both drafts.
OK, I will give it a shot for v4
Some of the files did not contain any pointerTests. Should I delete these files, or leave them empty?
The files currently look like: { }
Oh, that's true.
Just ignore them.
By the way, "properties" also goes in the common section, so no need to worry about it either.
It looks like you have dependencies in common. Should I ignore this one too? Additionally, your common properties does not have the "pointerTests" and just starts with the array. Should I do the same? My .json files look like:

{
    "pointerTests": [
        {
            "schema": { "dependencies": { "b": {}, "a++": {}, "c": null } },
            "pointers": [ "/dependencies/a++", "/dependencies/b" ]
        }
    ]
}
Ah, yes, it is only the array which needs to be copied over, not the object.
And yes, "dependencies" is in common/.
In fact, the only keywords you need to care about now are "anyOf", "allOf", "oneOf", "not" and "definitions". The first three can share a common base class (for instance SchemaArrayPointerCollector) and the last one can reuse SchemaMapPointerCollector, which is also used for patternProperties and properties.
Right now I am testing the core mechanics of SchemaWalker itself -- this also needs to be done ;)
OK,
I have all the files fixed, and I'm going through the PointerCollector collect method logic. I have definitions working, but I'm struggling with "not".
According to the json-schema.org "latest specification":

5.5.6. not
5.5.6.1. Valid values
This keyword's value MUST be an object. This object MUST be a valid JSON Schema.

So I should check that the tree node is an object before adding the base pointer. However, this fails your first test case, since "inMyLife" is a string, not an object.
[
    {
        "schema": { "not": "inMyLife" },
        "pointers": [ "/not" ]
    },
    {
        "schema": { "not": {} },
        "pointers": [ "/not" ]
    }
]
All the tests now pass. Once you let me know what you want to do with "not", I'll get the pull request ready.
Note: I did not make Dependencies or PatternProperties extend SchemaMapPointerCollector. I can do that also, but I didn't want to stomp on any changes you were making in the common package.
As to not, it is quite simple:

pointers.add(basePointer);

Its argument is just a schema.
Oh, I see what you mean wrt not.
As I wrote these tests for syntax validation to begin with, the pointer was always appended: it was up to syntax validation (when entering validate()) not to go any further, since it tested that the node was an object.
But for SchemaWalker, the schema will be valid -- that is a prerequisite. You can just remove the inMyLife test.
OK, I am doing a first version of RefExpander.
Note: I think I'll move the code into -core, since it will ultimately be a general-purpose walking mechanism -- and you can update the -core dependency independently. In the meanwhile, if you can test with your own source code, this will be the walk branch of my repo.
I may also change some things after I have written the first version. For instance, I think the .doProcess() method will change to make it more generic. But, as I said, right now, a first version.
OK, I have a first version. But it is butt ugly. It works. But it's ugly. And I know how I can make it break quite easily.
I need to have a mutable tree that I fill on the go, this version is real crap. But... Well... For simple cases like the one I wrote, it works OK...
OK, I need to think about it some more.
The problem is not with the logic of SchemaWalker and PointerCollector, which is pretty sound. The problem is plugging in whatever work is needed. And I think a Processor is not the way to do it.
I'll think about it some more; right now I need sleep ;) But basically, we need to pass a mutable object along the whole chain and process it as we walk. .walk() can stay, but .processCurrent() certainly needs to be given the boot for something better.
If you think of a design and have some time ahead of you, I'm open to ideas!
OK, I have a plan. First, schema walking will be split in two: one walking strategy will not resolve the refs, the other will.
Then there will be an interface:
public interface SchemaListener
{
    void onWalk(final SchemaTree tree);
    void onTreeChange(final SchemaTree oldTree, final SchemaTree newTree);
}
The first will be called each time the walk function is entered; the second will be called each time the current schema tree changes due to ref resolving. Of course, when not resolving refs, the second method will never be triggered.
I'll implement this and let you know how it went.
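A minimal interpretation of that two-callback contract can be sketched with Map-based "trees" standing in for SchemaTree (hypothetical, not the final API; loop detection and proper keyword-aware walking are elided):

```java
import java.util.*;

// Sketch of the listener contract described above, with Map-based "trees"
// standing in for SchemaTree. Hypothetical, not the final API; loop
// detection and proper keyword-aware walking are elided.
public final class WalkerSketch {
    public interface Listener {
        void onWalk(Map<String, Object> tree);
        void onTreeChange(Map<String, Object> oldTree, Map<String, Object> newTree);
    }

    private final Map<String, Map<String, Object>> registry;

    public WalkerSketch(final Map<String, Map<String, Object>> registry) {
        this.registry = registry;
    }

    @SuppressWarnings("unchecked")
    public void walk(final Map<String, Object> tree, final Listener listener) {
        final Object ref = tree.get("$ref");
        if (ref instanceof String) {
            final Map<String, Object> target = registry.get(ref);
            listener.onTreeChange(tree, target); // fired only when a ref resolves
            walk(target, listener);
            return;
        }
        listener.onWalk(tree); // fired for every schema entered
        for (final Object value : tree.values())
            if (value instanceof Map) // crude: treat every object member as a subschema
                walk((Map<String, Object>) value, listener);
    }
}
```

A non-resolving walking strategy would simply never call onTreeChange, matching the split described above.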
OK, good news: I have a fairly complete working schema walker, with associated listeners. I could implement schema substitution the way you initially asked for, and this will also help me for Avro, so it is close to being done.
You talked about other uses for this, I'd be curious to know them?
Note that the interface is not finalized yet, I need to find better names, document etc.
OK, the walk branch is now obsolete; the code has been merged into the master branch. Note, however: work is currently in progress to make this code part of -core, as I hinted earlier.
Thanks! I will take a look.
Basically, here are my use cases:

1) Walk the schema or data to unmarshall a payload of bytes into a JSON object that will pass the JSON schema.
2) Walk the schema or data to marshall a JSON object back into bytes.
3) Be able to load schemas and their references from any URI.
4) Maintain schemas in an in-memory dictionary/library that is loaded only once.
5) Support multiple versions of the same schema (i.e. versions 1 through 5 of schema A).
OK, that makes things more clear, and I have some questions ;)
As to point 1, this unmarshalling can be done with a separate processor, which means only -core is needed, right? What is left to do is to build the appropriate inputs for -validator to operate; what is more, you say "schema or data": if it is a schema only, what about the data? If it is the data, what about the schema? See below for more, however.
As to point 2: am I correct in assuming this is why you needed ref resolved (for the schema)?
As to point 3: SchemaLoader provides everything you need here, since you can support any URI scheme, redirect URIs, preload schemas and so on; however, it is not publicly documented as being a feature, since its primary use at the moment is to be used by a RefResolver to resolve references. Do I understand you want this to be more "public" so as to provide SchemaTrees?
As to point 4: here again, SchemaLoader has what it takes. And, by the way, all of this is done via a LoadingConfiguration.
And I don't quite understand point 5?
And here is the "below for more". I have the intent to provide, in -core, a mechanism to fuse the output of two processors into the input for another:
public interface ProcessorJoiner<OUT1, OUT2, IN>
{
    IN join(final OUT1 out1, final OUT2 out2);
}
and the same for split. In your use case, this could be used, for instance, to plug in a processor producing a ValueHolder&lt;JsonTree&gt; from a binary source, join it to a ValueHolder&lt;SchemaTree&gt;, and make a FullData. I still have to work out the details, however.
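As a standalone illustration of how such a joiner could be wired up (Holder and FullInput are made-up stand-ins for ValueHolder and FullData, nothing more):

```java
// Standalone sketch of how the ProcessorJoiner above could be wired up.
// Holder and FullInput are made-up stand-ins for ValueHolder and FullData.
public final class JoinerSketch {
    public interface ProcessorJoiner<OUT1, OUT2, IN> {
        IN join(OUT1 out1, OUT2 out2);
    }

    public static final class Holder<T> {
        public final T value;
        public Holder(final T value) { this.value = value; }
    }

    public static final class FullInput {
        public final String schema;
        public final String data;
        public FullInput(final String schema, final String data) {
            this.schema = schema;
            this.data = data;
        }
    }

    // fuse the outputs of a schema-producing and a data-producing processor
    public static final ProcessorJoiner<Holder<String>, Holder<String>, FullInput> JOINER =
        (schemaOut, dataOut) -> new FullInput(schemaOut.value, dataOut.value);
}
```

Since the interface has a single abstract method, a joiner is just a lambda; the split counterpart would be the mirror image, one input fanned out to two outputs.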
Note: I have just committed the removal of the walking mechanism from -validator, it is now in -core.
Which means I'll continue work there. The discussion can go on in this issue however.
Comments on your comments:
(response to 1, including the reference to below): I think that would work well, since one processor will walk the schema, and one custom processor will have logic to walk the data, and the output of both of these values will be sent to "join" for processing. The result can be collected in the return value. In some cases, such as walking an array, the same schema will be passed with each value in the array.
(response to 2): The refs will need to be resolved for marshalling and unmashalling the bytes. I will need refs resolved in both cases, and also if my "walking" is driven by the schema or the data.
(response to 3): I think in some cases, it will be extremely valuable to me to directly access the SchemaTree of any SchemaNode. I could then actually store more metadata in the JSON Schema, and access it through the exposed SchemaTree. It will definitely give me more flexibility and allow me to ask questions about the schema without having to traverse the entire thing.
(response to 5): For my protocol specifications, I could potentially have the following:
{
    "title": "person v1",
    "type": "object",
    "properties": {
        "name" : {
            "type": "string",
            "required": true
        }
    }
}
And also
{
    "title": "person v2",
    "type": "object",
    "properties": {
        "name" : {
            "type": "string",
            "required": true
        },
        "age" : {
            "type": "number"
        }
    }
}
My parsing logic will first have to inspect the data to determine if I should parse the payload with v1 or v2. Then I will lookup "person v2" from some dictionary, so that I can parse the payload with the proper schema. This is really just a namespace issue and I think this is already supported.
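That dictionary lookup can be sketched as follows (hypothetical, not part of the library; it assumes the caller has already inspected the payload to decide which version applies):

```java
import java.util.*;

// Sketch of the versioned-lookup idea above: schemas keyed by "title" in an
// in-memory dictionary. Hypothetical; assumes the payload's version has
// already been determined by inspecting the data.
public final class SchemaDictionarySketch {
    private final Map<String, Map<String, Object>> byTitle = new HashMap<>();

    public void register(final Map<String, Object> schema) {
        byTitle.put((String) schema.get("title"), schema);
    }

    public Map<String, Object> lookup(final String name, final int version) {
        return byTitle.get(name + " v" + version);
    }
}
```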
Hi,
I'm not having any issues with SchemaNode and validation, but I would like a way to print my SchemaNode with $ref resolved. An example of this is at http://www.jsonschema.net/, where they can pretty-print the JSON Schema in JSON format.
Is there any way to do this with json-schema-validator's SchemaNode or SchemaTree?
Thanks!