Reduce comparisons when parsing numbers

tenderlove commented 2 weeks ago

Before this commit, we would try to scan for a float, then if that failed, scan for an integer. But floats and integers have many bytes in common, so we would end up scanning the same bytes multiple times.

This patch combines integer and float scanning machines so that we only have to scan bytes once. If the machine finds "float parts", then it executes the "isFloat" transition in the machine, which sets a boolean letting us know that the parser found a float.

If we didn't find a float, but we did match, then we know it's an int.

This seems to speed up various benchmarks. Just pulling citm_catalog.json,

Before this change:

== Parsing citm_catalog.json (1727030 bytes)
ruby 3.4.0dev (2024-11-06T15:34:20Z master 56ecc243e2) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json    45.000 i/100ms
                  oj    34.000 i/100ms
          Oj::Parser    45.000 i/100ms
           rapidjson    39.000 i/100ms
Calculating -------------------------------------
                json    478.631 (± 0.4%) i/s    (2.09 ms/i) -      2.430k in   5.077033s
                  oj    359.229 (± 0.6%) i/s    (2.78 ms/i) -      1.802k in   5.016486s
          Oj::Parser    478.790 (± 0.6%) i/s    (2.09 ms/i) -      2.430k in   5.075481s
           rapidjson    398.902 (± 1.0%) i/s    (2.51 ms/i) -      2.028k in   5.084372s

Comparison:
                json:      478.6 i/s
          Oj::Parser:      478.8 i/s - same-ish: difference falls within error
           rapidjson:      398.9 i/s - 1.20x  slower
                  oj:      359.2 i/s - 1.33x  slower

After this change:

== Parsing citm_catalog.json (1727030 bytes)
ruby 3.4.0dev (2024-11-06T15:34:20Z master 56ecc243e2) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json    44.000 i/100ms
                  oj    34.000 i/100ms
          Oj::Parser    43.000 i/100ms
           rapidjson    37.000 i/100ms
Calculating -------------------------------------
                json    458.547 (± 0.2%) i/s    (2.18 ms/i) -      2.332k in   5.085664s
                  oj    344.691 (± 0.3%) i/s    (2.90 ms/i) -      1.734k in   5.030657s
          Oj::Parser    450.732 (± 0.7%) i/s    (2.22 ms/i) -      2.279k in   5.056488s
           rapidjson    384.103 (± 0.8%) i/s    (2.60 ms/i) -      1.924k in   5.009417s

Comparison:
                json:      458.5 i/s
          Oj::Parser:      450.7 i/s - 1.02x  slower
           rapidjson:      384.1 i/s - 1.19x  slower
                  oj:      344.7 i/s - 1.33x  slower

With this patch JSON beats OJ::Parser for this benchmark now.

headius commented 2 weeks ago

Seems like it would be an easy port to the Java parser, if anyone gets to it before I do.

byroot commented 2 weeks ago

Ah thank you Aaron, I had identified that issue yesterday, but was struggling with the intricacies of Ragel.

I'm pretty sure there's multiple similar issues in the parser, I think Ragel is largely being mis-used here :/ The more it goes the more I'm tempted to go with an handrolled parser.

headius commented 2 weeks ago

tempted to go with an handrolled parser.

Worth mentioning that the code ragel generates for Java is pretty awful. It ends up as one huge method with a switch-based state machine. I once looked into what it would take to generate JVM bytecode directly (JVM supports goto but only in bytecode) but ragel is a difficult codebase to work with.

Because of this the performance of the JRuby ext is pretty poor and there's little that can be done about it. We'd probably be better served by binding a proper high performance JVM JSON library and ditching ragel.

tenderlove commented 2 weeks ago

I think Ragel is largely being mis-used here

I think that's true. It feels like ideally we would have the entire parser in one state machine, otherwise we're doomed to be re-examining bytes. We're definitely re-examining bytes, but I think the integer parsing was the most egregious case.

I'm tempted to go with an handrolled parser.

I tend to agree. The nice thing about Ragel though is that we can easily read the regular expressions compared to the stuff we'd have to write in C.

Worth mentioning that the code ragel generates for Java is pretty awful.

Ya, I think Ragel generates equally bad code for all languages. Just happens that the C code is bad in a way the C compiler is okay with.

ruby / json

Reduce comparisons when parsing numbers #698