Question - Validator reference loading and resource management

dvstans commented 3 years ago

Hi, not sure if this is the right place to ask this, but I'm considering using this library within a multi-threaded C++ service to validate JSON per user request and using many different (custom) schemas, with references. I need to implement schema resource sharing and management (for validator instances) across all of the client threads to reduce memory use and schema loading latency. (We have bursty use cases where a client may need to validate using the same schema thousands of times in a row.)

My first question is - is it possible/safe for multiple client threads to concurrently validate different JSON documents against the same validator instance (same schema)?

Next, I see that the schema loader expects the schema JSON to be assigned and returned via a JSON parameter. In my scenario, it would be better if I could return a reference to an already initialized schema validator instance (that would be cached in my schema resource pool). I don't mind making this change, but I wanted to ask if there would be any architectural caveats to this approach before I look into it. The idea is for each client thread to assemble a specific validator from existing loaded and initialized validators to reduce latency and total memory use.

Thanks!

pboettch commented 3 years ago

> Not sure if this is the right place to ask this,

Right place! 😃

> but I'm considering using this library within a multi-threaded C++ service to validate JSON per user request and using many different (custom) schemas, with references.

Multiple schemas, or one big schema where everything is referenced and you would validate instances against sub-schemas?

> I need to implement schema resource sharing and management (for validator instances) across all of the client threads to reduce memory use and schema loading latency. (We have bursty use cases where a client may need to validate using the same schema thousands of times in a row.)

OK.

> My first question is - is it possible/safe for multiple client threads to concurrently validate different JSON documents against the same validator instance (same schema)?

Yes. The validator is thread-safe for validating (see the constness of the validate() method), so it can be used from different threads concurrently. Not so for loading: you would need to load the schema in one (main) thread before using it.
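A minimal sketch of that rule: load in one thread, then share the instance read-only. `ToyValidator` and `validate_concurrently` are hypothetical stand-ins written for illustration, not this library's types; the only property carried over from the discussion is that loading mutates state while `validate()` is `const`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a validator: mutable while loading,
// read-only (const) while validating.
class ToyValidator {
public:
    // Loading mutates internal state, so call it from one thread only.
    void set_required_prefix(std::string prefix) { prefix_ = std::move(prefix); }

    // const: reads internal state but never modifies it.
    bool validate(const std::string &doc) const {
        return doc.compare(0, prefix_.size(), prefix_) == 0;
    }

private:
    std::string prefix_;
};

// Validate every document on its own thread against one shared validator.
std::vector<bool> validate_concurrently(const ToyValidator &validator,
                                        const std::vector<std::string> &docs) {
    std::vector<char> ok(docs.size(), 0); // one slot per thread, nothing shared
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < docs.size(); ++i)
        workers.emplace_back([&, i] { ok[i] = validator.validate(docs[i]); });
    for (auto &w : workers)
        w.join();
    return std::vector<bool>(ok.begin(), ok.end());
}
```

The workers only ever see a `const` reference, so the compiler rejects any accidental call to the loading method from a worker thread.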

> Next, I see that the schema loader expects the schema JSON to be assigned and returned via a JSON parameter. In my scenario, it would be better if I could return a reference to an already initialized schema validator instance (that would be cached in my schema resource pool). I don't mind making this change, but I wanted to ask if there would be any architectural caveats to this approach before I look into it. The idea is for each client thread to assemble a specific validator from existing loaded and initialized validators to reduce latency and total memory use.

Interesting idea and problem.

I have never used huge schemas, but I had one with references of every kind (recursive, partly nested). It worked well.

While reading your description I immediately thought of making a super-schema which references all of your schemas, so parsing and instantiating would be done once and all validators would only exist once - without changing the library.

Then we would need to add a method that takes a URL and returns a (C++) reference to a sub-schema (which is itself a schema), which can then be used to validate.

Or, we would add an optional parameter to the validate-method which defines the entry-point of the sub-schema to use. I prefer this method - I'm not sure getting sub-schemas as C++-references will be trivial to implement.
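To make the entry-point variant concrete, here is a rough sketch under stated assumptions. `SuperSchema`, its `add()` method, and the two-argument `validate()` are all hypothetical names invented for illustration; nothing here is the library's current API. Each sub-schema is reduced to a simple callable so the example stays self-contained.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical registry: every sub-schema is compiled once and stored under its URI.
class SuperSchema {
public:
    using Check = std::function<bool(const std::string &)>;

    // Loading phase: register a compiled sub-schema under an absolute URI.
    void add(const std::string &uri, Check check) { checks_[uri] = std::move(check); }

    // Validation phase: the extra parameter selects the entry point.
    bool validate(const std::string &instance, const std::string &entry_uri) const {
        auto it = checks_.find(entry_uri);
        if (it == checks_.end())
            throw std::out_of_range("unknown sub-schema: " + entry_uri);
        return it->second(instance);
    }

private:
    std::map<std::string, Check> checks_;
};
```

Because `validate()` stays `const`, one fully loaded `SuperSchema` could serve every worker thread, with each request naming only the sub-schema it needs.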

Your idea of combining different instances of validators won't work, because of the absolute URLs used to reference sub-schemas. There would probably be duplicated schemas.

If memory usage is a problem, the validator could release some memory after it has finished loading (and resolving) all references; it currently does not. Out of laziness I never took care of that.

Another optimization could be to remove duplicated (sub-)schemas.

Could you provide an example?

dvstans commented 3 years ago

Thanks for the info! The general use case is having many concurrent worker threads validating an unknown composition of user-defined schemas - some large/complex, some small, some nested, etc. The schemas being validated per worker at any given point in time are independent of each other, but there is a high probability of sequential and/or sub-schema reuse (i.e. there may be common sub-schemas that are referenced from many other schemas). The JSON being validated is domain-specific scientific metadata and is generally fairly small (< 50 KB).

The idea of schema resource management is an optimization that would allow us to sustain more concurrent requests as well as reduce latency per request. This isn't necessarily a requirement up-front, but something we would like to move towards in the near future. In the meantime, I believe I can still cache referenced schemas in JSON using the current API to reduce schema load times (all of our schemas will be stored in a local database). I would also say that memory conservation is more important than latency in our case, so, for the time being, worker threads would simply dispose of their validators at the end of request processing.
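Roughly what I have in mind for the interim cache, as a sketch only. `SchemaTextCache` and its `Fetch` callback are hypothetical names; the callback stands in for the database lookup, and the cache stores raw schema text rather than parsed JSON to keep the example free of dependencies.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical cache of raw schema text, keyed by schema ID.
class SchemaTextCache {
public:
    using Fetch = std::function<std::string(const std::string &)>;

    explicit SchemaTextCache(Fetch fetch) : fetch_(std::move(fetch)) {}

    // Return the cached text, fetching it (e.g. from the database) on first use.
    std::string get(const std::string &id) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = cache_.find(id);
        if (it == cache_.end())
            it = cache_.emplace(id, fetch_(id)).first;
        return it->second;
    }

private:
    Fetch fetch_;
    std::mutex mutex_;
    std::unordered_map<std::string, std::string> cache_;
};
```

A per-thread schema loader would call `get()` instead of querying the database directly. The lock is held during the fetch, which keeps the sketch simple but serializes first-time loads; a production version would probably want finer-grained locking.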

If I have any cycles soon, I'd be happy to help support your idea of super-schemas.

pboettch commented 3 years ago

My idea of a super-schema is much simpler than you might think. One question for that: do you have all of your schemas ready at the start of the process, or do new schemas appear while your process (which hosts the worker threads) is running?

dvstans commented 3 years ago

Within the context of a single server worker thread validating a user request, the involved schema(s) would already exist; however, over time, users/admins may create new schemas, so the server will need to be able to find/load them on first use. (I will be implementing a custom schema loader.) Eventually, there will likely be many thousands of schemas defined. (FYI, this system is a domain-agnostic, federated scientific data management system: DataFed.)

pboettch commented 3 years ago

Another question: how will the worker threads choose the schema/validator to be used?

dvstans commented 3 years ago

Ah - the JSON being validated is a sub-document within a record that includes, among other items, the root schema ID. So the worker will be able to load the root schema, init a validator, then as the validator progresses, additional schemas would be loaded as needed.

pboettch commented 3 years ago

> Ah - the JSON being validated is a sub-document within a record that includes, among other items, the root schema ID. So the worker will be able to load the root schema, init a validator, then as the validator progresses, additional schemas would be loaded as needed.

I see now what you want to do. The most difficult part will be adding a new schema to the validator from a worker thread, as it will block all other threads using the same validator.

Adding a schema will modify internal structures (mostly vectors and maps) and they are not thread-safe.
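One common way to handle this, sketched under assumptions: guard the internal containers with a reader/writer lock so validation runs in parallel and only schema insertion takes exclusive access. `GrowableValidator` is a hypothetical stand-in, not the library's class, and the sketch needs C++17 for `std::shared_mutex`.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>

// Hypothetical validator whose schema table can grow at run time.
class GrowableValidator {
public:
    // Writer: exclusive lock, blocks all validating threads while it runs.
    void add_schema(const std::string &uri, const std::string &required_prefix) {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        prefixes_[uri] = required_prefix;
    }

    // Reader: shared lock, many threads may validate at the same time.
    bool validate(const std::string &uri, const std::string &doc) const {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        auto it = prefixes_.find(uri);
        return it != prefixes_.end() &&
               doc.compare(0, it->second.size(), it->second) == 0;
    }

private:
    mutable std::shared_mutex mutex_;
    std::map<std::string, std::string> prefixes_;
};
```

This does not remove the blocking described above; it only limits it to the moments when a schema is actually being added.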

dvstans commented 3 years ago

> I see now what you want to do. The most difficult part will be adding a new schema to the validator from a worker thread, as it will block all other threads using the same validator.
>
> Adding a schema will modify internal structures (mostly vectors and maps) and they are not thread-safe.

I wasn't thinking so much that shared validator instances would be changed over time, simply that when a new validator needs to load a sub-schema, it could do so by utilizing an existing (cached) initialized validator (which, itself, would not need to be changed, hopefully). The new validator would be an aggregate object with a local root schema and links to other existing (immutable) validators for sub-schemas.

pboettch commented 3 years ago

The more you describe the project, the more I find it interesting.

Clearly this library is not exactly ready for the way you would want to use it. However, I'm not sure referencing validators is the best way to do it.

There is also a race condition between multiple worker threads receiving instances to be validated with the same new schema. Who is right? How to synchronize? I think a single thread has to do the house-keeping and update the schema-validator on requests of the workers.
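A sketch of one way to centralize that house-keeping, assuming a copy-on-write table. `SchemaRegistry`, `snapshot()`, and `publish()` are hypothetical names; the sketch serializes updates behind one mutex rather than using a dedicated thread, and the first request for a given URI wins.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical registry that publishes immutable snapshots of the schema table.
class SchemaRegistry {
public:
    using Table = std::map<std::string, std::string>;

    // Workers: grab the current immutable snapshot; the lock is held briefly.
    std::shared_ptr<const Table> snapshot() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return current_;
    }

    // House-keeping: copy, extend, publish. Returns false if the URI is
    // already present, i.e. another worker's request got there first.
    bool publish(const std::string &uri, const std::string &schema_text) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (current_->count(uri))
            return false;
        auto next = std::make_shared<Table>(*current_);
        (*next)[uri] = schema_text;
        current_ = std::move(next);
        return true;
    }

private:
    mutable std::mutex mutex_;
    std::shared_ptr<const Table> current_ = std::make_shared<Table>();
};
```

Workers that already hold an older snapshot keep validating against it undisturbed; they pick up the new schema the next time they call `snapshot()`. Copying the whole table on each update is only reasonable if new schemas are rare compared to validations.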

What about the JSON Schema standard version required for your project? This library implements draft 7; is that the version you need?

Could we discuss off-list by mail? If so, I think you have access to my email address on GitHub?

dvstans commented 3 years ago

Oops - let me try that again: DataFed

pboettch commented 3 years ago

Did you advance on this subject?

dvstans commented 3 years ago

Schema support was put on hold for a while due to higher priority issues, but I have just recently resumed work on it. For now, I'm not attempting to optimize schema validation across threads - there are other more significant performance impediments that need to be addressed first.

pboettch commented 3 years ago

By pure chance (?) a similar request (partly at least) was made and I tried my hand at an implementation. What do you think?

https://github.com/pboettch/json-schema-validator/commit/2194addca4994a6bc2c0e8c71edca962ae79aed3#diff-d584b8fb27883417319b4f50effb4f09f37520061dcd115e374357f9488f7957R64

pboettch commented 2 years ago

I'm closing this for the moment as I think the implementation for #154 could solve your issue/feature-request at least partly. Do not hesitate to re-open or to open a new issue.

pboettch / json-schema-validator

Question - Validator reference loading and resource management #135