jhthorsen / json-validator

:cop: Validate data against a JSON schema
https://metacpan.org/release/JSON-Validator
56 stars 58 forks source link

Feature request: caching validated output #246

Closed minusdavid closed 3 years ago

minusdavid commented 3 years ago

This is a feature request rather than a bug report.

At the moment, I have a server that runs 60 Mojolicious apps as PSGI apps using Starman. Starman forks off 2 worker processes each, so 120 processes, and each one of those processes is loading and validating a OpenAPI spec file via Mojolicious::Plugin::OpenAPI.

This is creating huge load on the system, even though each process is using the same spec file.

It would be great if JSON::Validator could create a hash of the spec file, validate the spec file, and then cache the hash in a special "validated_cache" within the cache_paths.

Alternatively, having an option to turn off the validation would be great, since I can pre-validate it during dev/test/staging.

minusdavid commented 3 years ago

Actually, maybe this is a case where we just need to use a newer Mojolicious::Plugin::OpenAPI::Perl version, since it almost seems like Mojolicious::Plugin::OpenAPI::Perl 4.02 and JSON::Validator 4.16 might not validate the OpenAPI spec at plugin registration time?

Am I missing something?

minusdavid commented 3 years ago

Oh I think 'my $errors = $self->validator->errors;' is doing the validation? That's obscure.

jhthorsen commented 3 years ago

Are you sure it's the validation that takes a long time, and not just loading the spec file and finding potential "$ref"?

What is Mojolicious::Plugin::OpenAPI::Perl ?

That's obscure.

Please provide some sort of constructive suggestion.

minusdavid commented 3 years ago

Thanks for the quick response!

Are you sure it's the validation that takes a long time, and not just loading the spec file and finding potential "$ref"?

I am not at all sure. At this point, I'm reviewing other people's code, and I've mostly just traced the slowness to Mojolicious::Plugin::OpenAPI. They were doing an extra load_and_validate_schema outside of Mojolicious::Plugin::OpenAPI and that was killing performance. Someone else suggested the validation being the slowness, and I was following it up, but when you put it like that... the loading/resolving $ref could certainly be the time consuming bit. It is a massive spec file with a lot of references.

Comparing 1.08 and 4.16... it looks like 4.16 might have caching for resolved schemas? I don't understand where the "id" comes from or whether that just applies to URLs.

What is Mojolicious::Plugin::OpenAPI::Perl ?

That must be a typo. Just Mojolicious::Plugin::OpenAPI.

That's obscure.

Please provide some sort of constructive suggestion.

My apologies. I should've said "that seems obscure", as I wasn't sure if I was understanding correctly or not.

Usually, if I see an errors() method, it's called after a method that was supposed to do some work. For instance with DBI it would be $sth->execute($foo) or die $dbh->errstr(). Or "cgi_error()" after upload() in CGI.

When I was reading Mojolicious::Plugin::OpenAPI, I figured $self->validator->errors must be a method to collect the errors from a previous validator method, but it wasn't clear what method that could've possibly been. That's when I looked at the errors method and saw it contained the logic for the resolve and validate in it. I just found that confusing as the behaviour didn't fit with the name of the method. Not a big drama though.

minusdavid commented 3 years ago

Looking at all the $ref references and it seems that there are about 410 of them all up.

minusdavid commented 3 years ago

What is Mojolicious::Plugin::OpenAPI::Perl ?

In hindsight, I may have been expanding "libmojolicious-plugin-openapi-perl" and left on the last Perl by accident heh...

minusdavid commented 3 years ago

My apologies. After more investigation, it seems that the problem is elsewhere in Mojolicious::Plugin::OpenAPI. The loading and validating is actually quite quick.

I think that it actually might be memory related. But I'll open up a new issue on Mojolicious::Plugin::OpenAPI...

minusdavid commented 3 years ago

After hacking on the Mojolicous::Plugin::OpenAPI and JSON::Validator libraries directly, I've narrowed down the problem to JSON::Validator::_validate(). (My earlier tests at https://github.com/jhthorsen/json-validator/issues/246#issuecomment-826583541 were using bundle() which just did loading and resolving not validating...)

When only 1 process is run, it's reasonably fast. However, when you run many processes in parallel, it takes a much longer time to validate the schema.

As I mentioned in my opening comment, I'm using the same schema/spec across many instances, so it's a waste of resources to re-validate it 120 times at once.

That said, creating a checksum for a deep Perl hash is easier said than done.

In my use case, we're loading a static spec/schema file but then hacking the data structure returned by JSON::Validator->bundle, so we can't use a checksum from the actual file either. I suppose in that case... it would be handy if there were a way to add routes to Mojolicious::Plugin::OpenAPI beyond mangling the original schema/spec...)

minusdavid commented 3 years ago

Ohh or maybe not.

In load_and_validate_schema(), I can use the following to get a consistent checksum between different Perl runs:

my $bytes = Mojo::JSON::encode_json($spec); my $digest = Digest::MD5::md5_hex($bytes);

That would be perfect. Admittedly that's for an older JSON::Validator, but the concept would work the same for the new JSON::Validator.

Probably JSON::Validator::Schema->errors would have a check around here: https://metacpan.org/source/JHTHORSEN/JSON-Validator-4.16/lib/JSON/Validator/Schema.pm#L12

(Although JSON::Validator::Schema might be too generic for the check to be there. You'll know more about the implementation specific details.)

minusdavid commented 3 years ago

I figure < 1 second to calculate a checksum vs > 30 seconds spent validating would be a big win.

minusdavid commented 3 years ago

(Apologies for the high volume of comments and the stream-of-consciousness. Hopefully it helps to clarify the problem and proposed solution.)

minusdavid commented 3 years ago

Without the validation, I'm able to restart 120 processes in about 18 seconds with very little load on the server.

I think it's worth having the validation, but if a checksum can be cached on a schema/spec that we know has passed validation... then I think we'd be in a good place.

(Something I might not be considering... what are we validating this schema/spec again... there might need to be more data cached than I initially suspect)

jhthorsen commented 3 years ago

@minusdavid: Would it fix your issue if applied this diff to M::P::OpenAPI?

diff --git a/lib/Mojolicious/Plugin/OpenAPI.pm b/lib/Mojolicious/Plugin/OpenAPI.pm
index 9060943..24a1ae9 100644
--- a/lib/Mojolicious/Plugin/OpenAPI.pm
+++ b/lib/Mojolicious/Plugin/OpenAPI.pm
@@ -21,7 +21,7 @@ sub register {
   $self->validator->allow_invalid_ref(1)                    if $config->{allow_invalid_ref};
   $self->_set_schema_version($config->{version_from_class}) if $config->{version_from_class};

-  my $errors = $self->validator->errors;
+  my $errors = $config->{skip_schema_validation} ? [] : $self->validator->errors;
   die @$errors if @$errors;

   unless ($app->defaults->{'openapi.base_paths'}) {
jhthorsen commented 3 years ago

I've added support for "skip_schema_validation" in https://github.com/jhthorsen/mojolicious-plugin-openapi/commit/673079d19f827ce8c8ab3a2943a4abc798fa1e18

minusdavid commented 3 years ago

Thanks, Jan. I think that's a useful option.

However, I like the idea of the validation cache a bit better, because it mean that you're still doing validation on the schema, but you're only doing it 1 time. If your schema changes, then you validate it again, but just the once.

That being said, with a bunch of parallel processes, there would be a bit of a race, so in practice you'd probably be validating it N times before the first one completes successfully. But that would still be a better scenario than validating all the time for every process.

jhthorsen commented 3 years ago

I'm not going to implement what you request, but I'll consider a PR.

What I could however consider is to add support for passing in a JSON::Validator object to M::P::OpenAPI. That way, you might be able to freeze and store that object and restore it from disk... No idea if that idea will actually work though.

minusdavid commented 3 years ago

Fair enough. I was planning on writing a PR, so I'm glad to hear that you'll consider one.

I'm not sure I follow the logic of sharing a JSON::Validator object between different processes. That seems riskier than checking a checksum?

minusdavid commented 3 years ago

Actually, thinking about it again, I suppose that https://github.com/jhthorsen/mojolicious-plugin-openapi/commit/673079d19f827ce8c8ab3a2943a4abc798fa1e18 should be sufficient.

Since I have the schema before giving it to M::P::OpenAPI, I can do my own checksum calculation and set the configuration option based on my analysis. The caching doesn't have to be done within JSON::Validator at all really.

minusdavid commented 3 years ago

Although... I suppose that would man we'd either have to use JSON::Validator separately before M::P::OpenAPI, or set the cache after M::P::OpenAPI has registered itself without dying due to validation errors. I guess that's OK.