networknt / json-schema-validator

A fast Java JSON schema validator that supports draft V4, V6, V7, V2019-09 and V2020-12
Apache License 2.0
847 stars 324 forks

Performance Issue #99

Closed abhinku2 closed 8 months ago

abhinku2 commented 6 years ago

I have JSON data of about 20 MB (with 3 nested levels of children). Validating it against the schema takes 11 seconds. How can we improve the performance? Are you using parallel processing? If not, is there any scope for introducing it?

stevehu commented 6 years ago

@abhinku2 The current implementation is optimized for small JSON objects. I am planning to rewrite it with the light-4j service module so that users can choose which set of validators to use, for example a specific draft between v4 and v7, or a set optimized for small or for large documents. I need to complete my current task before I can move to this module. Thanks.

lollito commented 5 years ago

@stevehu How are you getting on with your current tasks? Performance is struggling over here.

stevehu commented 5 years ago

@lollito There are so many high-priority tasks at the moment. It is still in the pipeline, but I cannot say when it will start.

jawaff commented 5 years ago

@abhinku2 @stevehu I think I've got some ideas on how to improve performance for this particular case. The loading of the validators into the JsonSchema seems to be an independent issue (which limits us to v4 partially). I think if we just create an ExecutorService (and allow it to be overridden in the JsonSchemaFactory.Builder), then we can have the validators (whatever they may be) processing the JsonNodes in parallel.

I might put together a test project and make the change if I find some time. It doesn't seem like the change would be too difficult.
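A minimal sketch of the ExecutorService idea above, assuming the validators really are independent of each other. `NodeValidator` and `ParallelValidation` are hypothetical names for illustration and are not part of json-schema-validator; a plain `Object` stands in for a JsonNode, and the pool is passed in by the caller so that it could be overridden.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

// Hypothetical stand-in for the library's validator type; not the real API.
interface NodeValidator {
    List<String> validate(Object node); // error messages, empty when valid
}

final class ParallelValidation {
    // Runs each validator as its own task on the supplied pool and
    // collects the errors in the same order as the validators.
    static List<String> validateAll(ExecutorService pool,
                                    List<NodeValidator> validators,
                                    Object node) {
        List<Callable<List<String>>> tasks = new ArrayList<>();
        for (NodeValidator v : validators) {
            tasks.add(() -> v.validate(node));
        }
        List<String> errors = new ArrayList<>();
        try {
            // invokeAll blocks until every task has completed.
            for (Future<List<String>> f : pool.invokeAll(tasks)) {
                errors.addAll(f.get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
        return errors;
    }
}
```

A caller would create the pool once (for example with `Executors.newFixedThreadPool`), reuse it across validations, and shut it down when done. This sketch does not address validators that depend on an outer validator's context.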

On another note, I think you should try to keep this json-schema-validator with as few dependencies as possible. The projects that I'm using this in don't have any knowledge of the light-4j platform. If necessary, this project can accept some sort of supplier interface that supplies the validators. That would allow users to define custom validators and inject whatever draft validators they need.

It would be really cool if this project was split into a multi-module project and had a draft v4 and v7 module. That would allow users to depend on the core json-schema-validator project and then inject the validators from the draft module they're interested in (or they can use a custom draft implementation).
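The supplier idea could look roughly like the sketch below. Every type here (`KeywordValidator`, `ValidatorSupplier`, `ToyDraftSupplier`, `ToySchema`) is hypothetical and invented for illustration; none of them exist in json-schema-validator, and the two keywords are simplified far below what a real draft requires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical types for illustration; not the library's API.
interface KeywordValidator {
    List<String> validate(Object value); // error messages, empty when valid
}

interface ValidatorSupplier {
    // Empty when this supplier does not define the keyword.
    Optional<KeywordValidator> create(String keyword, Object schemaValue);
}

// A toy "draft" module that only knows two keywords.
final class ToyDraftSupplier implements ValidatorSupplier {
    @Override
    public Optional<KeywordValidator> create(String keyword, Object schemaValue) {
        if ("minimum".equals(keyword)) {
            double min = ((Number) schemaValue).doubleValue();
            KeywordValidator v = value ->
                    value instanceof Number && ((Number) value).doubleValue() < min
                            ? List.of("value is below the minimum")
                            : List.of();
            return Optional.of(v);
        }
        if ("maxLength".equals(keyword)) {
            int max = ((Number) schemaValue).intValue();
            KeywordValidator v = value ->
                    value instanceof String && ((String) value).length() > max
                            ? List.of("string is too long")
                            : List.of();
            return Optional.of(v);
        }
        return Optional.empty();
    }
}

// The core depends only on the supplier interface, not on any particular draft.
final class ToySchema {
    private final List<KeywordValidator> validators = new ArrayList<>();

    ToySchema(Map<String, Object> schema, ValidatorSupplier supplier) {
        // Keywords the supplier does not recognize are skipped.
        schema.forEach((keyword, schemaValue) ->
                supplier.create(keyword, schemaValue).ifPresent(validators::add));
    }

    List<String> validate(Object value) {
        List<String> errors = new ArrayList<>();
        for (KeywordValidator v : validators) {
            errors.addAll(v.validate(value));
        }
        return errors;
    }
}
```

With this shape, a separate draft module would only need to ship its own `ValidatorSupplier`, and a user could pass in a custom one for custom keywords.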

stevehu commented 5 years ago

@jawaff I thought about running the validators in multiple threads, but I don't know how to resolve the context and relationships between validators. If all validators worked independently, it would be very easy; however, with AnyOf, OneOf, and AllOf in the picture, we need to pass the context of the outer validator into the inner validators, and they need to be executed in sequence. When I first designed this library, I made the mistake of assuming that all validators work independently. If I rewrote it, I wouldn't design it the current way.

I agree that this library should be independent and should not rely on light-4j. To make it configurable and extensible, I was thinking of introducing the light-4j config and service modules, which are very small libraries that handle externalized configuration and dependency injection. They are designed for microservices, and I cannot find any equivalent libraries that are smaller. I am open to recommendations.

jawaff commented 5 years ago

I guess I just don't know much about light-4j. I usually just accept interfaces in order to handle dependency injection. Something is going to operate the JsonSchemaFactory. I was thinking that operator could also inject some DraftV4 implementation of a ValidatorFactory or something like that. I'm just thinking of simple solutions, but some small light-4j dependencies wouldn't hurt if they provide value.


jawaff commented 5 years ago

I may need to check out the allOf, anyOf and oneOf validators more closely. It still seems like there should be a way to get validators running in parallel with this design. I'll see what I can figure out and I'll get back to you. I personally really like how this project has all of the validators defined separately.


stevehu commented 5 years ago

@jawaff I totally agree with you. If we can find a simple way to handle the config and injection, we don't need extra dependencies. This is how we handle configuration for now: a lot of light-4j developers wanted to use the config module, but in the end we just provide the config file in this library and load it in light-rest-4j instead. Although this library is mainly used by the light-4j framework, a lot of users use it as an independent library.

I thought about allOf, oneOf, and anyOf again, and I think we can treat each of them as a single unit. Once we are inside one of these validators, the same thread just keeps validating down to the leaves. That way we can still process the rest of the JSON in parallel. It is very rare for the root to be one of them.
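A rough sketch of that idea, under simplifying assumptions: each top-level property is validated as one task, and a oneOf evaluates all of its branches sequentially on whichever thread picked up that task. `SubtreeValidator`, `OneOfValidator`, and `RootValidation` are hypothetical names, plain Java maps stand in for JSON objects, and a parallel stream (which runs on the common fork/join pool) is used only to keep the example short.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical types for illustration; not the library's API.
interface SubtreeValidator {
    List<String> validate(Object value); // error messages, empty when valid
}

// Treated as a single unit: branches run in sequence on the calling thread.
final class OneOfValidator implements SubtreeValidator {
    private final List<SubtreeValidator> branches;

    OneOfValidator(List<SubtreeValidator> branches) {
        this.branches = branches;
    }

    @Override
    public List<String> validate(Object value) {
        long matches = branches.stream()
                .filter(b -> b.validate(value).isEmpty())
                .count();
        return matches == 1
                ? List.of()
                : List.of("expected exactly one matching branch, found " + matches);
    }
}

final class RootValidation {
    // Toy type check used as a leaf validator in the example.
    static SubtreeValidator isType(Class<?> type) {
        return v -> type.isInstance(v)
                ? List.of()
                : List.of("expected " + type.getSimpleName());
    }

    // Parallelism is applied only across top-level properties;
    // everything below a property stays on one thread.
    static List<String> validate(Map<String, SubtreeValidator> propertyValidators,
                                 Map<String, Object> root) {
        return propertyValidators.entrySet().parallelStream()
                .flatMap(e -> e.getValue().validate(root.get(e.getKey())).stream())
                .collect(Collectors.toList());
    }
}
```

Whether this pays off depends on how evenly the work is spread across the top-level properties; a document dominated by one large array would still be validated by a single thread.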

ddobrin commented 5 years ago

These three usually sit at least one level below the root of the body. At my client, oneOf is used consistently in a selection object one level down from the requestBody.

jawaff commented 5 years ago

@stevehu If we split this project into multiple modules, then you could add a module that serves as an adapter for plugging this project into the light-4j platform. That would then allow the core part of this library to be independent. That's usually how I handle making projects independent, but able to be easily integrated with a particular platform. That also would allow additional platform adapters to be created in the future. For example, there could be a Jersey adapter that validates REST request/response bodies.

I'm a bit confused about the allOf, oneOf and anyOf validators. Are you saying that they can be merged into a single AllOf/OneOf/AnyOf validator? They seem perfect the way they are from my point of view. I still don't fully understand how the validators are structured though.

I still need to go back to examine how the JsonSchema deals with its validators. I've been picturing a tree-like validator structure. I figure that different branches of the json data can be validated in parallel and only relevant branches of the validator tree would be needed. Json schema validation is kind of confusing. I'm going to need to do some more thinking on that.

jawaff commented 5 years ago

@stevehu I started trying to address the need for parallel validations. My first attempt was to just make all of the validators asynchronous and nonblocking with CompletableFutures. I then used your performance test project to check the results and the validation ended up being slower unfortunately (~5x slower I think). I believe that the issue is with the threads only doing a little bit of work at a time (lots of context switching). I'm going to attempt to make this solution better, but I might have to go back to the drawing board and avoid CompletableFutures.

Here is my current in-progress solution: https://github.com/jawaff/json-schema-validator/tree/fix/%2399-performance-improvement
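For readers who have not opened the branch, the general shape of a CompletableFuture-based approach is sketched below. This is not the code from that branch; `AsyncValidation` is a hypothetical name, and each check is modelled as a plain `Function` from a node to a list of error messages.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not the library's API.
final class AsyncValidation {
    // Submits every check to the executor and returns a future right away,
    // without blocking the calling thread.
    static CompletableFuture<List<String>> validateAsync(
            Executor executor,
            List<Function<Object, List<String>>> checks,
            Object node) {
        List<CompletableFuture<List<String>>> futures = checks.stream()
                .map(check -> CompletableFuture.supplyAsync(() -> check.apply(node), executor))
                .collect(Collectors.toList());

        // allOf completes once every check has finished; join() below
        // therefore never blocks.
        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
                .thenApply(done -> futures.stream()
                        .flatMap(f -> f.join().stream())
                        .collect(Collectors.toList()));
    }
}
```

When each check does only a tiny amount of work, as most schema keywords do, the cost of creating and scheduling one future per check can easily exceed the cost of the check itself, which is consistent with the slowdown reported above.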

stevehu commented 5 years ago

@jawaff The result is interesting. I agree with your analysis of the context switching. Another problem with CompletableFutures is that they are blocking: if one validator is slow, the others will wait for it and the threads all get stuck. I am leaning more toward the fork/join approach with work stealing.

jawaff commented 5 years ago

@stevehu I was thinking about the fork/join approach as well. I'm just not as familiar with it.

The CompletableFuture approach isn't entirely blocking though. If one validator is being slow, then the other validators can just use the other threads in the pool. I tried to make the validators as nonblocking as possible. They should be submitting their work to the shared thread pool and returning the CompletableFuture as soon as possible.
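For comparison, a fork/join version might look like the sketch below. It is a toy under stated assumptions: nested Java `Map`/`List` values stand in for a JSON tree, the only rule checked is "strings must not be empty", and `TreeCheckTask` is a hypothetical name. Leaves are checked inline and only container nodes are forked, so that tasks are not created for trivially small pieces of work.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.concurrent.RecursiveTask;

// Hypothetical task for illustration; counts invalid leaves in a nested tree.
final class TreeCheckTask extends RecursiveTask<Integer> {
    private final Object node;

    TreeCheckTask(Object node) {
        this.node = node;
    }

    // Toy leaf rule (an assumption for the demo): strings must not be empty.
    private static int leafErrors(Object leaf) {
        return leaf instanceof String && ((String) leaf).isEmpty() ? 1 : 0;
    }

    private static boolean isContainer(Object o) {
        return o instanceof Map || o instanceof List;
    }

    @Override
    protected Integer compute() {
        if (!isContainer(node)) {
            return leafErrors(node);
        }
        Collection<?> children;
        if (node instanceof Map) {
            children = ((Map<?, ?>) node).values();
        } else {
            children = (List<?>) node;
        }
        int errors = 0;
        List<TreeCheckTask> subtasks = new ArrayList<>();
        for (Object child : children) {
            if (isContainer(child)) {
                subtasks.add(new TreeCheckTask(child)); // fork only real subtrees
            } else {
                errors += leafErrors(child);            // cheap work stays inline
            }
        }
        // invokeAll forks the subtasks and waits for them; idle workers
        // in the pool can steal the forked subtasks.
        for (TreeCheckTask task : invokeAll(subtasks)) {
            errors += task.join();
        }
        return errors;
    }
}
```

The task would be started with something like `ForkJoinPool.commonPool().invoke(new TreeCheckTask(tree))`. A real implementation would also need to carry the matching sub-schema alongside each node, which this sketch leaves out.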

stevehu commented 5 years ago

Agreed. We don't know what works and we need to try out different approaches. It is hard work to optimize it and we will learn a lot from this exercise.

kosty commented 5 years ago

Just a few points on the matter. I validate documents that are hundreds of MB in size, and validation takes under 5 seconds on average.

Although this thread is mostly about speedup through parallel execution, I got some numbers for a speedup of minimum/maximum in a single-threaded setting in #174.

stevehu commented 8 months ago

With the latest changes from @justin-tay, the performance is not an issue anymore.