Closed. dvstans closed this issue 2 years ago.
Not sure if this is the right place to ask this,
Right place! 😃
but I'm considering using this library within a multi-threaded C++ service to validate JSON per user request and using many different (custom) schemas, with references.
Multiple schemas, or one big schema where everything is referenced and you would want to validate instances against sub-schemas?
I need to implement schema resource sharing and management (for validator instances) across all of the client threads to reduce memory use and schema loading latency. (We have bursty use cases where a client may need to validate using the same schema thousands of times in a row.)
OK.
My first question is - is it possible/safe for multiple client threads to concurrently validate different JSON documents against the same validator instance (same schema)?
Yes. The validator is thread-safe for validation (note the constness of the validate() method), so it can be used from different threads concurrently. Not so for loading: you would need to load the schema in one (main) thread before using it.
Next, I see that the schema loader expects the schema JSON to be assigned and returned via a JSON parameter. In my scenario, it would be better if I could return a reference to an already initialized schema validator instance (that would be cached in my schema resource pool). I don't mind making this change, but I wanted to ask if there would be any architectural caveats to this approach before I look into it. The idea is for each client thread to assemble a specific validator from existing loaded and initialized validators to reduce latency and total memory use.
Interesting idea and problem.
I never used huge schemas, but I had one with references in every sense (recursive, partly nested). It worked well.
While reading your description I immediately thought of making a super-schema which references all of your schemas, so parsing and instantiating would be done once and all validators would only exist once - without changing the library.
Then we would need to add a method to get a (C++) reference to a sub-schema (which is itself a schema) via a URL, which could then be used to validate.
Or, we could add an optional parameter to the validate() method which defines the entry point of the sub-schema to use. I prefer this approach; I'm not sure getting sub-schemas as C++ references would be trivial to implement.
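To illustrate the super-schema idea (the URIs and property names here are made up), such a schema could simply $ref every real schema once, so a single load pass parses and resolves everything:

```json
{
  "$id": "https://example.org/super-schema",
  "type": "object",
  "properties": {
    "experiment": { "$ref": "https://example.org/schemas/experiment" },
    "sample":     { "$ref": "https://example.org/schemas/sample" }
  }
}
```

A validate call with an entry-point parameter would then select one of the referenced sub-schemas, e.g. by its $id or by a JSON Pointer into the super-schema.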
Your idea of combining different validator instances won't work because of the absolute URLs used to reference sub-schemas; there would probably be duplicated schemas.
If memory usage is a problem: currently, the validator could release some memory after it has finished loading (and resolving) all references. Out of laziness, I never took care of that.
Another optimization could be to remove duplicated (sub-)schemas.
Could you provide an example?
Thanks for the info! The general use case is having many concurrent worker threads validating an unknown composition of user-defined schemas - some large/complex, some small, some nested, etc. The schemas being validated per worker at any given point in time are independent of each other, but there is a high probability of sequential and/or sub-schema reuse (i.e. there may be common sub-schemas that are referenced from many other schemas). The JSON being validated is domain-specific scientific metadata and is generally fairly small (< 50 KB).
The idea of schema resource management is an optimization that would allow us to sustain more concurrent requests as well as reduce latency per request. This isn't necessarily a requirement up-front, but something we would like to move towards in the near future. In the meantime, I believe I can still cache referenced schemas in JSON using the current API to reduce schema load times (all of our schemas will be stored in a local database). I would also say that memory conservation is more important than latency in our case, so, for the time being, worker threads would simply dispose of their validators at the end of request processing.
If I have any cycles soon, I'd be happy to help support your idea of super-schemas.
My idea of a super-schema is much simpler than you might think. One question for that: do you have all of your schemas ready at the start of the process, or do new schemas appear while your process (which hosts the worker threads) is running?
Within the context of a single server worker thread validating a user request, the involved schema(s) would already exist; however, over time, users/admins may create new schemas, so the server will need to be able to find/load them on first use. (I will be implementing a custom schema loader.) Eventually, there will likely be many thousands of schemas defined. (FYI, this system is a domain-agnostic, federated scientific data management system: DataFed.)
Another question: how will the worker threads choose the schema/validator to be used?
Ah - the JSON being validated is a sub-document within a record that includes, among other items, the root schema ID. So the worker will be able to load the root schema, init a validator, then as the validator progresses, additional schemas would be loaded as needed.
I see now what you want to do. The most difficult part will be adding a new schema to the validator from a worker thread, as it will block all other threads using the same validator. Adding a schema modifies internal structures (mostly vectors and maps), and those are not thread-safe.
I wasn't thinking so much that shared validator instances would change over time, simply that when a new validator needs to load a sub-schema, it could do so by using an existing (cached) initialized validator (which itself would not need to change, hopefully). The new validator would be an aggregate object with a local root schema and links to other existing (immutable) validators for its sub-schemas.
The more you describe the project, the more I find it interesting.
Clearly this library is not exactly ready for the way you would want to use it. However, I'm not sure referencing validators is the best way to do it.
There is also a race condition between multiple worker threads receiving instances to be validated against the same new schema. Who is right? How do we synchronize? I think a single thread has to do the housekeeping and update the schema validator at the request of the workers.
What about the JSON-Schema standard version required for your project? This library implements draft 7; would that work for you?
Could we discuss this off-list by mail? If so, I think you have access to my email address on GitHub?
Did you advance on this subject?
Schema support was put on hold for a while due to higher-priority issues, but I have just recently resumed work on it. For now, I'm not attempting to optimize schema validation across threads - there are other, more significant performance impediments that need to be addressed first.
By pure chance(?), a similar request (at least in part) was made, and I attempted an implementation. What do you think?
I'm closing this for the moment as I think the implementation for #154 could solve your issue/feature-request at least partly. Do not hesitate to re-open or to open a new issue.