Closed tenderlove closed 2 weeks ago
Seems like it would be an easy port to the Java parser, if anyone gets to it before I do.
Ah thank you Aaron, I had identified that issue yesterday, but was struggling with the intricacies of Ragel.
I'm pretty sure there's multiple similar issues in the parser, I think Ragel is largely being mis-used here :/ The more it goes the more I'm tempted to go with an handrolled parser.
tempted to go with an handrolled parser.
Worth mentioning that the code ragel generates for Java is pretty awful. It ends up as one huge method with a switch-based state machine. I once looked into what it would take to generate JVM bytecode directly (JVM supports goto but only in bytecode) but ragel is a difficult codebase to work with.
Because of this the performance of the JRuby ext is pretty poor and there's little that can be done about it. We'd probably be better served by binding a proper high performance JVM JSON library and ditching ragel.
I think Ragel is largely being mis-used here
I think that's true. It feels like ideally we would have the entire parser in one state machine, otherwise we're doomed to be re-examining bytes. We're definitely re-examining bytes, but I think the integer parsing was the most egregious case.
I'm tempted to go with an handrolled parser.
I tend to agree. The nice thing about Ragel though is that we can easily read the regular expressions compared to the stuff we'd have to write in C.
Worth mentioning that the code ragel generates for Java is pretty awful.
Ya, I think Ragel generates equally bad code for all languages. Just happens that the C code is bad in a way the C compiler is okay with.
Before this commit, we would try to scan for a float, then if that failed, scan for an integer. But floats and integers have many bytes in common, so we would end up scanning the same bytes multiple times.
This patch combines integer and float scanning machines so that we only have to scan bytes once. If the machine finds "float parts", then it executes the "isFloat" transition in the machine, which sets a boolean letting us know that the parser found a float.
If we didn't find a float, but we did match, then we know it's an int.
This seems to speed up various benchmarks. Just pulling
citm_catalog.json
,Before this change:
After this change:
With this patch JSON beats
OJ::Parser
for this benchmark now.