Closed jeffreykegler closed 2 years ago
So I tried by reducing the number of terminals in the grammar. In JSON, a part of the grammar that I can safely optimize is the string. In all previous tests this was:
string ::= '"' string_parts '"'
i.e. always at least three terminals.
I made an evolution to ESLIF to do regular expression substitution in addition to regular expression match. In other words I can now say that a terminal value is a substitution done directly by the regular expression engine: the processing to remove the surrounding double quotes still exist but:
Concentrating only on the most probable strings that are basic strings, i.e. those that contain no escaped character, the string part of the grammar becomes:
:symbol ::= /"([^"\\\x00-\x1F]*)"/u -> '$1' name => BASIC_STRING priority => 1
string ::= $BASIC_STRING | '"' string_parts '"'
and the result is indeed much better: the ESLIF's PP version now always perform better than JSON::PP, getting closer to the wanted linearity:
1. Without regex substitution optimization (i.e. always at least 3 terminals for a string): a. On a short stream /home/jdurand/git/Marpa--R2/blog/json/test.json :
Rate JSON::PP ESLIF JSON PP JSON::XS
JSON::PP 1347/s -- -2% -99%
ESLIF JSON PP 1370/s 2% -- -99%
JSON::XS 124842/s 9170% 9014% --
b. On a long stream:
ESLIF JSON PP 1.32/s -- -28% -99%
JSON::PP 1.85/s 40% -- -98%
JSON::XS 109/s 8136% 5791% --
2. With regex substitution optimization (i.e. almost always 1 terminal for string): a. On a short stream /home/jdurand/git/Marpa--R2/blog/json/test.json :
JSON::PP 1347/s -- -18% -99%
ESLIF JSON PP 1643/s 22% -- -99%
JSON::XS 123675/s 9084% 7427% --
b. On a long stream:
JSON::PP 1.85/s -- -6% -98%
ESLIF JSON PP 1.96/s 6% -- -98%
JSON::XS 106/s 5635% 5316% --
In conclusion, the parse length is a factor, obviously. I am probably near to close the ticket ;)
Re the cache misses in Libmarpa: they would be non-linear with Marpa::R2 as well, and if I recall correctly, while Marpa::R2 was much slower than ESLIF, it is linear.
From first principles, it seems that the cache miss overhead in Libmarpa should be linear. Simplying, assume h1
hits for Earley set (ES) for small parses. Here "small" means n < c
where c
is a constant which depends on the memory available for cache, and n
is the size of the parse in Earley sets. Assume that, in building the bocage, that above size n
, we need to read the ES into the cache, as a cost of h2
. Then our cost in cache hits, is h1
where n < c
, and h1+h2
worst case and therefore O(n*(h1+h2))=O(n)
or linear. In less theoretical terms, there will be a "bump" at some point as the extra cache hits kick in, but once the parse gets large, the Libmarpa cache hit cost should be linear. I think the Marpa::R2 numbers bore this out.
To bring things into perspective, we picked the JSON target because it was hard. JSON::PP is custom-written for a grammar that was designed to be parsed easily by the weaker parser used in JSON::PP, so you emerged the winner in a contest that was rigged against you.
Someone IMHO should blog this. I'm overwhelmed at the moment -- someone has given Libmarpa a close look and suggested all sorts of necessary improvements, mainly doc and packaging, but also an apparent bug, and I need to focus on these. Can you do a writeup?
Also, there's a new CPAN module which uses Marpa::R2. It strikes me that if ESLIF were dropped in R2's place, the result should be an immense speedup. This would be a comparison in which ESLIF might really shine. R2 was Alex Zuber's best choice, and the ESLIF would inherit that spot, showing for that application, and most likely a wide class of others, ESLIF is the parser of choice.
Yes, the dependency seems to be in terms of grammar terminals, but the numbers do not prove that it is libmarpa's fault. As far as I remember it is asking libmarpa which terminals are expected first, do a sort v.s. priority, and then give them a try. I'll look into https://metacpan.org/dist/MarpaX-G4/view/pod/MarpaX_G4.pod#Real-life-example-:-Translating-a-PL/SQL-grammar btw :) Thanks
From https://github.com/jddurand/c-marpaESLIF/issues/9#issuecomment-1152519230:
As far as I remember it is asking libmarpa which terminals are expected first, do a sort v.s. priority, and then give them a try.
Asking Libmarpa which terminals are expected, can be expensive IIRC. It can be much cheaper simply to try them and note the acceptance or rejection.
Ah no sorry, I looked the source and did another implementation some weeks ago:
Ah no sorry, I looked the source and in fact ESLIF is doing that differently:
Also I re-profiled and found that there are a lot of function calls. By just forcing the compiler to inline the hotests functions I guess the issue is now near to be closed:
perl -I blib/lib -I blib/arch -I ~/git/Marpa--R2/blog/json ~/git/Marpa--R2/blog/json/bench.pl ~/git/Marpa--R2/blog/json/test.json
Using /home/jdurand/git/Marpa--R2/blog/json/test.json
Using Marpa::R2 4
Using MarpaX::ESLIF 6.0.25
Rate JSON::PP ESLIF JSON PP (minimal) JSON::XS
JSON::PP 1357/s -- -19% -99%
ESLIF JSON PP (minimal) 1674/s 23% -- -99%
JSON::XS 122530/s 8928% 7220% --
perl -I blib/lib -I blib/arch -I ~/git/Marpa--R2/blog/json ~/git/Marpa--R2/blog/json/bench.pl /usr/share/iso-codes/json/iso_639-3.json
Using /usr/share/iso-codes/json/iso_639-3.json
Using Marpa::R2 4
Using MarpaX::ESLIF 6.0.25
(warning: too few iterations for a reliable count)
(warning: too few iterations for a reliable count)
Rate JSON::PP ESLIF JSON PP (minimal) JSON::XS
JSON::PP 1.82/s -- -8% -98%
ESLIF JSON PP (minimal) 1.98/s 9% -- -98%
JSON::XS 107/s 5796% 5314% --
Looks very nice! A lot of work behind it.
Are there final touches to be added, or are we ready to tackle writing it up?
Yep one other thing I would have like to try, requiring an update of my libmarpa wrapper, that is unfortunately blocked by https://github.com/jeffreykegler/libmarpa/issues/119
I have succesfully tried to inline the AVL calls, so that an enormous of jump calls are avoided. I gained something like 5% it seems. Do you think it is worth to do a pull request ? FYI this is the commit: https://github.com/jddurand/libmarpa/commit/ae2173993fc1a81bb757006317d9b0b6b050dd94 With gcc -O3 none of the cmp functions are called. Only non-tree implementations are really affected.
I am really reluctant to lexically inline that much code. So, thanks, but no, don't do the pull request.
No problem - I give up on this libmarpa dev.
Are you ready to do a final set of timings?
I have done a new release of c-marpaWrapper using the pristine tested branch of libmarpa: https://github.com/jddurand/c-marpaWrapper/releases/tag/1.0.98 In the meantime I was doing a last test, but from libmarpa client side instead of libmarpa core:
Suppose that at a given "latest earley set Id" I have a given progress report that I'd call P1. Then I do a set of alternatives and one complete. I redo the progress at "latest earley set Id" and store the result as P2. Then a lot of alternative, complete happens. Suppose that I observe later on that the progress at "latest earley set Id" is equivalent to P1. Is it correct to assume that the next round of alternatives followed by a complete will produce a new progress that is equivalent to P2 ?
If this is true then this mean I could store the notion of "predicted progress"es, avoiding a lot of calls that I currently do to marpa_r_progress_report_start...marpa_r_progress_item...marpa_r_progress_report_finish.
Thanks.
I may need some think time on this. But let me ask for some clarifications.
I assume when you say that the progress reports are "identical" you mean identical "modulo" the latest Earley set ID. Because obviously the latest ES ID will differ in P1 and P2.
And should I assume all three sets of alternatives are identical, except for location?
Indeed, I am concerned only the tuples {rule number, position}. Earley Set Ids (current and origin) are not considered in this comparison.
I meant given a set of {rule,position}, is the next set of {rule,position} always the same after a complete ;)
When you say "set of {rule,position}" do you mean a multiset, where the "sets" differ if one of their elements occurs a different number of times? This makes a difference in case of, for example, right recursion.
Also, I think the sets of alternatives accepted before P1 and P2 must be identical. Because if they differ the set of pre-dot terminal symbols in the rules of P1 and P2 will differ, and therefore P1 and P2 must differ.
A set is defined as:
Two sets are equivalent if they have the same number of items and all items, in order, are equivalent.
At a given latest earley set id L, progress is represented by set PL, calculated from L to L. Then complete happens. Earley set id increases and now have a progress represented by PL+1, calculated from L+1 to L+1.
The question is if for any set Px equivalent to PL, the Px+1 set will always be equivalent to PL+1, where x > L (it is of course true if x == L).
Le 4 août 2022, 21:32, à 21:32, Jeffrey Kegler @.***> a écrit:
When you say "set of {rule,position}" do you mean a multiset, where the "sets" differ if one of their elements occurs a different number of times? This makes a difference in case of, for example, right recursion.
Also, I think the sets of alternatives accepted before P1 and P2 must be identical. Because if they differ the set of pre-dot terminal symbols in the rules of P1 and P2 will differ, and therefore P1 and P2 must differ.
-- Reply to this email directly or view it on GitHub: https://github.com/jddurand/c-marpaESLIF/issues/9#issuecomment-1205686741 You are receiving this because you commented.
Message ID: @.***>
Ok, please forget about this. I made a test and it appears that, no, it does not work. I'm going to do a marpaESLIF release.
Version 6.0.25 of marpaESLIF has been released and uploaded to CPAN if you want to give it a try.
I talked about ESLIF on the IRC channel: https://colabti.org/irclogger/irclogger_log/marpa?date=2022-08-05#l8.
For the record here at the timings with version 6.0.25 of MarpaX::ESLIF:
perl -I ~/git/Marpa--R2/blog/json ~/git/Marpa--R2/blog/json/bench.pl ~/git/Marpa--R2/blog/json/test.json
Using /home/jdurand/git/Marpa--R2/blog/json/test.json
Using Marpa::R2 4
Using MarpaX::ESLIF 6.0.25
Rate SLIF JSON JSON::PP ESLIF JSON PP (minimal) ESLIF JSON JSON::XS
SLIF JSON 449/s -- -64% -72% -86% -100%
JSON::PP 1244/s 177% -- -24% -61% -99%
ESLIF JSON PP (minimal) 1629/s 263% 31% -- -48% -99%
ESLIF JSON 3162/s 605% 154% 94% -- -97%
JSON::XS 118154/s 26238% 9402% 7153% 3637% --
perl -I ~/git/Marpa--R2/blog/json ~/git/Marpa--R2/blog/json/bench.pl /usr/share/iso-codes/json/iso_639-3.json
Using /usr/share/iso-codes/json/iso_639-3.json
Using Marpa::R2 4
Using MarpaX::ESLIF 6.0.25
(warning: too few iterations for a reliable count)
(warning: too few iterations for a reliable count)
(warning: too few iterations for a reliable count)
Rate SLIF JSON JSON::PP ESLIF JSON PP (minimal) ESLIF JSON JSON::XS
SLIF JSON 0.641/s -- -65% -68% -82% -99%
JSON::PP 1.83/s 186% -- -9% -48% -98%
ESLIF JSON PP (minimal) 2.03/s 216% 10% -- -43% -98%
ESLIF JSON 3.54/s 452% 93% 75% -- -97%
JSON::XS 103/s 15933% 5501% 4970% 2803% --
For the ESLIF JSON PP (minimal) version grammar is:
# --------------------------------------------------
# Meta settings
# --------------------------------------------------
:desc ::= 'Strict JSON Grammar'
:default ::= action => ::shift fallback-encoding => UTF-8 discard-is-fallback => 1
# ---------------------------------
# Discard unsignificant whitespaces
# ---------------------------------
:discard ::= /[\x{9}\x{A}\x{D}\x{20}]+/
# -----------------------------------------------------------------------
# Explicit terminals. Non-ascii strings are special and disable :discard.
# -----------------------------------------------------------------------
:symbol ::= '"' name => DQUOTE symbol-action => ::undef pause => after event => :discard[switch]
:symbol ::= '{' name => LBRACKET symbol-action => ::undef
:symbol ::= '}' name => RBRACKET symbol-action => ::undef
:symbol ::= '[' name => LSQUARE symbol-action => ::undef
:symbol ::= ']' name => RSQUARE symbol-action => ::undef
:symbol ::= ',' name => COMMA symbol-action => ::undef
:symbol ::= ':' name => COLON symbol-action => ::undef
:symbol ::= 'true' name => TRUE symbol-action => ::true
:symbol ::= 'false' name => FALSE symbol-action => ::false
:symbol ::= 'null' name => NULL symbol-action => ::undef
:symbol ::= /"([^"\\\x00-\x1F]*)"/u -> '$1' name => BASIC_STRING priority => 1
:symbol ::= /[^"\\\x00-\x1F]+/u name => BASIC_PART symbol-action => ::transfer
:symbol ::= /(?:\\["\\\/bfnrt])+/ name => ESCAPE_PART symbol-action => ::transfer
:symbol ::= /(?:\\u[[:xdigit:]]{4})+/ name => UNICODE_PART symbol-action => ::transfer
:symbol ::= /-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?/ name => NUMBER symbol-action => ::transfer
# ----------
# JSON value
# ----------
value ::= object
| array
| string
| constant
| number
# -----------
# JSON object
# -----------
object ::= $LBRACKET members $RBRACKET action => ::copy[1]
members ::= member* separator => $COMMA proper => 1 action => members
member ::= string (- $COLON -) value action => ::row
# ----------
# JSON Array
# ----------
array ::= $LSQUARE elements $RSQUARE action => ::copy[1]
elements ::= value* separator => $COMMA proper => 1 hide-separator => 1 action => ::row
# -----------
# JSON String
# -----------
string ::= $BASIC_STRING
| $DQUOTE string_parts $DQUOTE action => ::copy[1]
string_parts ::= string_part* action => ::concat
string_part ::= $BASIC_PART
| $ESCAPE_PART action => string_escape_part
| $UNICODE_PART action => string_unicode_part
# -------------
# JSON constant
# -------------
constant ::= $TRUE
| $FALSE
| $NULL
# -----------
# JSON number
# -----------
number ::= $NUMBER action => number
As you can see, I have put the notion of alias in the grammar... you define a symbol once and reference it using a '$' before the alias name. I found that very handy ;)
The perl actions are:
#
# Action for: members ::= member*
# -------------------------------
sub members {
my ($self) = shift;
my %output = ();
while (@_) {
my $member = pop(@_);
$output{$member->[0]} = $member->[1];
#
# Eat separator
#
pop(@_);
}
return \%output
}
sub string_escape_part {
my ($self, $part) = @_;
my $c = substr($part, 0, 1);
if ($c eq 'b') { $c = '\b' }
elsif ($c eq 'f') { $c = '\f' }
elsif ($c eq 'n') { $c = '\n' }
elsif ($c eq 'r') { $c = '\r' }
elsif ($c eq 't') { $c = '\t' }
return $c
}
sub string_unicode_part {
my ($self, $part) = @_;
my @uint32 = ();
my $p = 0;
my $pmax = length($part);
my $output;
while ($p < $pmax) {
my $u = substr($part, $p+2, 4);
push(@uint32, hex($u));
$p += 6;
}
for (my $i = 0, my $j = 1; $i <= $#uint32; $i++, $j++) {
my $c = $uint32[$i];
if (($j <= $#uint32) && ($c >= 0xD800) && ($c <= 0xDBFF) && ($uint32[$j] >= 0xDC00) && ($uint32[$j] <= 0xDFFF)) {
# Surrogate UTF-16 pair
$c = 0x10000 + (($c & 0x3FF) << 10) + ($uint32[$j] & 0x3FF);
++$i;
++$j;
}
if (($c >= 0xD800) && ($c <= 0xDFFF)) {
$c = 0xFFFD; # Replacement character
}
if ($c < 0x80) {
$output .= chr($c);
next;
}
my @q = ();
if ($c < 0x800) {
push(@q, 0xC0 + ($c >> 6));
goto t1;
}
if ($c < 0x10000) {
push(@q, 0xE0 + ($c >> 12));
goto t2;
}
push(@q, 0xF0 + ($c >> 18));
push(@q, 0x80 + (($c >> 12) & 0x3F));
t2:
push(@q, 0x80 + (($c >> 6) & 0x3F));
t1:
push(@q, 0x80 + ($c & 0x3F));
$c = pack "C*", @q;
$output .= $c;
}
return $output
}
sub number {
my ($self, $number) = @_;
my $output = 0+$number;
return $output
}
The built-in JSON parser inside ESLIF is doing almost exactly what JSON::XS is doing: lookahead, done in C. But except that, of course, it goes into the normal alternative/complete sequence. It is just the discovery that is optimized with a direct lookahead into an UTF-8 converted internal buffer.
Unless you disagree, may I close this issue.
Yes, time to close it. I want to discuss how to blog / document this, the best to do that via a separate issue.
FYI I worked on the lua timings mentionned in this ticket, and next version of marpaESLIF will have equivalent timings between perl and lua.
Very nice! By coincidence I just added Lua to the libmarpa repo. My motivation was the need for a fuller test suite, which cannot reasonably be written in C. (Right now Marpa::R2 is Libmarpa's de facto test suite.) I wanted an embeddable language and that essentially means Lua. My Libmarpa Lua work is in early stages, at the moment.
After a couple of benchmarks, it seems that there is a non-linearity in the speed of the ESLIF built-in JSON parser. In the benchmarks included, note the relative speeds of SLIF JSON, JSON::PP and JSON::XS remain roughly the same, while the performance of ESLIF JSON plummets from better than JSON::PP to worse than the SLIF JSON.
Small file benchmark: https://gist.github.com/jeffreykegler/cdf4f8eceb07fcedb9d5dc52bd69a56a
Large file benchmark: https://gist.github.com/jeffreykegler/895e4136578ec9563468e6c103d1b9ae