jhthorsen / json-validator

:cop: Validate data against a JSON schema
https://metacpan.org/release/JSON-Validator
56 stars 58 forks source link

Optimize checksum generation for simple cases #202

Closed EreMaijala closed 4 years ago

EreMaijala commented 4 years ago

Summary

Calling Data::Dumper is relatively slow, so avoiding it when easily possible can provide a significant performance improvement.

All the tests pass, and I've also run them several times with code that chooses the new path randomly only 50% of time. I've also verified that all the cases encountered in the tests and validation of the real schema (see below), the new code and Data::Dumper create equals keys.

Motivation

Avoiding Data::Dumper for undef, string and numeric variables makes the validation about 40 percent faster for my test case of loading and validating Koha's REST API schema.

References

My test script: https://gist.github.com/EreMaijala/5ba3ae6e3739c948f774a8a1246ca879 My test results: https://gist.github.com/EreMaijala/163b135c280be694e615c2be4e3ac2f7 Koha's REST API schema: https://github.com/Koha-Community/Koha/blob/master/api/v1/swagger/swagger.json

EreMaijala commented 4 years ago

Begs the question why that is, but I couldn't spend more time digging any deeper. I profiled that Dump calls Dumpxs which should be fast native code, but it's obviously not very fast for these cases.

jhthorsen commented 4 years ago

I was wondering if we could use Sereal, Storable or something else, but I haven't checked if they will make the same data structure / checksum on each run.

EreMaijala commented 4 years ago

I'll check Sereal and Storable out quickly...

EreMaijala commented 4 years ago

Storable won't do it since it can't sort hash keys. Sereal would work and seems to be about as fast for my test case. Want me to switch?

jhthorsen commented 4 years ago

I forgot about YAML::XS. I'm going to depend on that in future versions anyways, so maybe that's better to use? It's fine if it's not as fast as Sereal. It just can't be crazy slow like Dumper.

EreMaijala commented 4 years ago

YAML::XS is usable but a bit slower, in my test case it takes ~6.8 seconds vs ~6.1 seconds. I could live with that, but it would be hard as I tested also the custom code + YAML::XS instead of Data::Dumper, and it would give ~5.5 seconds..

jhthorsen commented 4 years ago

I think I want to go with yaml_shortcut() in the example below. Either way, this change needs some more tests to get merged. Once the tests are in place, then we can also opt in for Sereal.

The tests cannot check the returned MD5 sum, since it will change when we change from Data::Dumper to YAML::XS or Sereal. Instead the test need to check that objects are not ignored, and the number logic works. I'm not sure if the dummy test below works, but take it as a rough example:

my $d_hash  = {foo => {}};
my $d_undef = {foo => undef};
my $d_obj   = {foo => JSON::Validator::Error->new};
isnt data_checksum($d_hash), data_checksum($d_undef);
isnt data_checksum($d_hash), data_checksum($d_obj);
isnt data_checksum($d_obj), data_checksum($d_undef);
isnt data_checksum(3.14), md5_sum(3.15);

One way to check if the tests do what they are supposed to could be to replace Data::Dumper with Mojo::JSON. Mojo::JSON will ignore the JSON::Validator::Error object, which should make one of the isnt tests fail.

Btw, below are my benchmark results. Sereal is so incredible fast, but I think in many cases yaml_shortcut() will be fast enough. (I don't want to have too many dependencies, if I can avoid it)

use Mojo::Base -strict;
use Data::Dumper ();
use Mojo::Util 'md5_sum';
use Sereal::Encoder ();
use YAML::XS        ();

sub dd {
  return md5_sum(
    Data::Dumper->new([$_[0]])->Indent(0)->Sortkeys(1)->Terse(1)->Useqq(1)
      ->Dump);
}

my $s = Sereal::Encoder->new({canonical => 1, sort_keys => 1});
sub sereal { return md5_sum($s->encode($_[0])) }

sub yaml { return md5_sum(YAML::XS::Dump($_[0])) }

sub dd_shortcut {
  return md5_sum 'undef' unless defined $_[0];
  return md5_sum $_[0] =~ /^\d+$/ ? $_[0] : qq("$_[0]") unless ref $_[0];
  return md5_sum(
    Data::Dumper->new([$_[0]])->Indent(0)->Sortkeys(1)->Terse(1)->Useqq(1)
      ->Dump);
}

sub yaml_shortcut {
  return md5_sum 'undef' unless defined $_[0];
  return md5_sum $_[0] =~ /^\d+$/ ? $_[0] : qq("$_[0]") unless ref $_[0];
  return md5_sum(YAML::XS::Dump($_[0]));
}

bm(hash   => {foo => 42});
bm(undef  => undef);
bm(number => 42);
bm(string => 'foo');

sub bm {
  my ($desc, $data) = @_;
  use Benchmark 'cmpthese';

  my %tests;
  for my $name (qw(dd dd_shortcut yaml yaml_shortcut sereal)) {
    my $code = main->can($name) or die $name;
    $tests{$name} = sub { $code->($data) };
  }

  warn "\n#--- $desc\n";
  cmpthese($ENV{COUNT} || 500_000, \%tests);
}

__DATA__
$ COUNT=500000 perl b.pl
#--- hash
                  Rate dd_shortcut         dd yaml_shortcut       yaml    sereal
dd_shortcut    77399/s          --        -2%          -53%       -54%      -90%
dd             78740/s          2%         --          -53%       -54%      -90%
yaml_shortcut 166113/s        115%       111%            --        -2%      -79%
yaml          170068/s        120%       116%            2%         --      -78%
sereal        781250/s        909%       892%          370%       359%        --

#--- undef
            (warning: too few iterations for a reliable count)
            (warning: too few iterations for a reliable count)
                   Rate        dd       yaml    sereal yaml_shortcut dd_shortcut
dd             105263/s        --       -61%      -91%          -96%        -96%
yaml           270270/s      157%         --      -78%          -89%        -89%
sereal        1219512/s     1059%       351%        --          -51%        -51%
yaml_shortcut 2500000/s     2275%       825%      105%            --          0%
dd_shortcut   2500000/s     2275%       825%      105%            0%          --

#--- number
            (warning: too few iterations for a reliable count)
            (warning: too few iterations for a reliable count)
                   Rate        dd       yaml    sereal yaml_shortcut dd_shortcut
dd             101420/s        --       -59%      -92%          -93%        -94%
yaml           245098/s      142%         --      -80%          -82%        -85%
sereal        1219512/s     1102%       398%        --          -12%        -24%
yaml_shortcut 1388889/s     1269%       467%       14%            --        -14%
dd_shortcut   1612903/s     1490%       558%       32%           16%          --

#--- string
            (warning: too few iterations for a reliable count)
                   Rate        dd       yaml    sereal dd_shortcut yaml_shortcut
dd             100200/s        --       -57%      -92%        -92%          -93%
yaml           230415/s      130%         --      -81%        -82%          -83%
sereal        1219512/s     1117%       429%        --         -2%          -12%
dd_shortcut   1250000/s     1147%       442%        2%          --          -10%
yaml_shortcut 1388889/s     1286%       503%       14%         11%            --
EreMaijala commented 4 years ago

Nice benchmark! Sereal is awesomely fast, but I understand the desire to minimize dependencies. Latest commit switches to YAML::XS and adds tests (yours and a couple more). I verified that the tests will fail if MOJO::JSON or Storable is used. Does something else need to be changed since YAML::XS becomes a hard dependency?

karenetheridge commented 4 years ago

Mojo::JSON will ignore the JSON::Validator::Error object, which should make one of the isnt tests fail

But we're never doing checksums on Error objects, are we? So this shouldn't be important. I was wondering how YAML::XS compared to Mojo::JSON (which is really Cpanel::JSON::XS beneath) for speed.

EreMaijala commented 4 years ago

But we're never doing checksums on Error objects, are we? So this shouldn't be important. I was wondering how YAML::XS compared to Mojo::JSON (which is really Cpanel::JSON::XS beneath) for speed.

I can't say I know much about that, but several tests failed with Mojo::JSON. Perhaps it's the boolean class?

karenetheridge commented 4 years ago

That would depend on how the tests failed? :)

EreMaijala commented 4 years ago

@karenetheridge Well, err, uh.. I must have made a mistake somewhere when I tested Mojo:JSON. It's actually working fine, is as fast as YAML::XS and all the tests are passing. The latest patch now uses it, so no hard dependency on YAML::XS needed for this. Thanks for your perseverance on this! :)

EreMaijala commented 4 years ago

I resorted to Test::Without::Module as I couldn't make the @INC manipulation to work (no matter what I tried, the sub didn't get called).

karenetheridge commented 4 years ago

Devel::Hide and Test::Without::Module are both good; they work slightly differently and have different sets of prereqs, in case any of that is a factor.

jhthorsen commented 4 years ago

I’ll fix the @INC part. Thanks for work! This is great 👍

jhthorsen commented 4 years ago

I’ll fix the @INC part. Thanks for work! This is great 👍