jddurand / c-marpaESLIF

Extended perl's Marpa::R2 SLIF grammar writen in C
MIT License
6 stars 3 forks source link

ESLIF time is no-linear as a function of input size. #9

Closed jeffreykegler closed 2 years ago

jeffreykegler commented 3 years ago

After a couple of benchmarks, it seems that there is a non-linearity in the speed of the ESLIF built-in JSON parser. In the benchmarks included, note the relative speeds of SLIF JSON, JSON::PP and JSON::XS remain roughly the same, while the performance of ESLIF JSON plummets from better than JSON::PP to worse than the SLIF JSON.

Small file benchmark: https://gist.github.com/jeffreykegler/cdf4f8eceb07fcedb9d5dc52bd69a56a

Large file benchmark: https://gist.github.com/jeffreykegler/895e4136578ec9563468e6c103d1b9ae

jddurand commented 2 years ago

So I tried by reducing the number of terminals in the grammar. In JSON, a part of the grammar that I can safely optimize is the string. In all previous tests this was:

string ::= '"' string_parts '"'

i.e. always at least three terminals.

I made an evolution to ESLIF to do regular expression substitution in addition to regular expression match. In other words I can now say that a terminal value is a substitution done directly by the regular expression engine: the processing to remove the surrounding double quotes still exist but:

Concentrating only on the most probable strings that are basic strings, i.e. those that contain no escaped character, the string part of the grammar becomes:

:symbol ::= /"([^"\\\x00-\x1F]*)"/u -> '$1' name => BASIC_STRING priority => 1
string ::= $BASIC_STRING | '"' string_parts '"'

and the result is indeed much better: the ESLIF's PP version now always perform better than JSON::PP, getting closer to the wanted linearity:

1. Without regex substitution optimization (i.e. always at least 3 terminals for a string): a. On a short stream /home/jdurand/git/Marpa--R2/blog/json/test.json :

                  Rate      JSON::PP ESLIF JSON PP      JSON::XS
JSON::PP        1347/s            --           -2%          -99%
ESLIF JSON PP   1370/s            2%            --          -99%
JSON::XS      124842/s         9170%         9014%            --

b. On a long stream:

ESLIF JSON PP 1.32/s            --          -28%          -99%
JSON::PP      1.85/s           40%            --          -98%
JSON::XS       109/s         8136%         5791%            --

2. With regex substitution optimization (i.e. almost always 1 terminal for string): a. On a short stream /home/jdurand/git/Marpa--R2/blog/json/test.json :

JSON::PP        1347/s            --          -18%          -99%
ESLIF JSON PP   1643/s           22%            --          -99%
JSON::XS      123675/s         9084%         7427%            --

b. On a long stream:

JSON::PP      1.85/s            --           -6%          -98%
ESLIF JSON PP 1.96/s            6%            --          -98%
JSON::XS       106/s         5635%         5316%            --

In conclusion, the parse length is a factor, obviously. I am probably near to close the ticket ;)

jeffreykegler commented 2 years ago

Re the cache misses in Libmarpa: they would be non-linear with Marpa::R2 as well, and if I recall correctly, while Marpa::R2 was much slower than ESLIF, it is linear.

From first principles, it seems that the cache miss overhead in Libmarpa should be linear. Simplying, assume h1 hits for Earley set (ES) for small parses. Here "small" means n < c where c is a constant which depends on the memory available for cache, and n is the size of the parse in Earley sets. Assume that, in building the bocage, that above size n, we need to read the ES into the cache, as a cost of h2. Then our cost in cache hits, is h1 where n < c, and h1+h2 worst case and therefore O(n*(h1+h2))=O(n) or linear. In less theoretical terms, there will be a "bump" at some point as the extra cache hits kick in, but once the parse gets large, the Libmarpa cache hit cost should be linear. I think the Marpa::R2 numbers bore this out.

jeffreykegler commented 2 years ago

To bring things into perspective, we picked the JSON target because it was hard. JSON::PP is custom-written for a grammar that was designed to be parsed easily by the weaker parser used in JSON::PP, so you emerged the winner in a contest that was rigged against you.

Someone IMHO should blog this. I'm overwhelmed at the moment -- someone has given Libmarpa a close look and suggested all sorts of necessary improvements, mainly doc and packaging, but also an apparent bug, and I need to focus on these. Can you do a writeup?

Also, there's a new CPAN module which uses Marpa::R2. It strikes me that if ESLIF were dropped in R2's place, the result should be an immense speedup. This would be a comparison in which ESLIF might really shine. R2 was Alex Zuber's best choice, and the ESLIF would inherit that spot, showing for that application, and most likely a wide class of others, ESLIF is the parser of choice.

jddurand commented 2 years ago

Yes, the dependency seems to be in terms of grammar terminals, but the numbers do not prove that it is libmarpa's fault. As far as I remember it is asking libmarpa which terminals are expected first, do a sort v.s. priority, and then give them a try. I'll look into https://metacpan.org/dist/MarpaX-G4/view/pod/MarpaX_G4.pod#Real-life-example-:-Translating-a-PL/SQL-grammar btw :) Thanks

jeffreykegler commented 2 years ago

From https://github.com/jddurand/c-marpaESLIF/issues/9#issuecomment-1152519230:

As far as I remember it is asking libmarpa which terminals are expected first, do a sort v.s. priority, and then give them a try.

Asking Libmarpa which terminals are expected, can be expensive IIRC. It can be much cheaper simply to try them and note the acceptance or rejection.

jddurand commented 2 years ago

Ah no sorry, I looked the source and did another implementation some weeks ago:

jddurand commented 2 years ago

Ah no sorry, I looked the source and in fact ESLIF is doing that differently:

Also I re-profiled and found that there are a lot of function calls. By just forcing the compiler to inline the hotests functions I guess the issue is now near to be closed:

perl -I blib/lib -I blib/arch -I ~/git/Marpa--R2/blog/json ~/git/Marpa--R2/blog/json/bench.pl ~/git/Marpa--R2/blog/json/test.json
Using /home/jdurand/git/Marpa--R2/blog/json/test.json
Using Marpa::R2 4
Using MarpaX::ESLIF 6.0.25
                            Rate    JSON::PP ESLIF JSON PP (minimal)    JSON::XS
JSON::PP                  1357/s          --                    -19%        -99%
ESLIF JSON PP (minimal)   1674/s         23%                      --        -99%
JSON::XS                122530/s       8928%                   7220%          --

perl -I blib/lib -I blib/arch -I ~/git/Marpa--R2/blog/json ~/git/Marpa--R2/blog/json/bench.pl /usr/share/iso-codes/json/iso_639-3.json
Using /usr/share/iso-codes/json/iso_639-3.json
Using Marpa::R2 4
Using MarpaX::ESLIF 6.0.25
            (warning: too few iterations for a reliable count)
            (warning: too few iterations for a reliable count)
                          Rate     JSON::PP ESLIF JSON PP (minimal)     JSON::XS
JSON::PP                1.82/s           --                     -8%         -98%
ESLIF JSON PP (minimal) 1.98/s           9%                      --         -98%
JSON::XS                 107/s        5796%                   5314%           --
jeffreykegler commented 2 years ago

Looks very nice! A lot of work behind it.

Are there final touches to be added, or are we ready to tackle writing it up?

jddurand commented 2 years ago

Yep one other thing I would have like to try, requiring an update of my libmarpa wrapper, that is unfortunately blocked by https://github.com/jeffreykegler/libmarpa/issues/119

jddurand commented 2 years ago

I have succesfully tried to inline the AVL calls, so that an enormous of jump calls are avoided. I gained something like 5% it seems. Do you think it is worth to do a pull request ? FYI this is the commit: https://github.com/jddurand/libmarpa/commit/ae2173993fc1a81bb757006317d9b0b6b050dd94 With gcc -O3 none of the cmp functions are called. Only non-tree implementations are really affected.

jeffreykegler commented 2 years ago

I am really reluctant to lexically inline that much code. So, thanks, but no, don't do the pull request.

jddurand commented 2 years ago

No problem - I give up on this libmarpa dev.

jeffreykegler commented 2 years ago

Are you ready to do a final set of timings?

jddurand commented 2 years ago

I have done a new release of c-marpaWrapper using the pristine tested branch of libmarpa: https://github.com/jddurand/c-marpaWrapper/releases/tag/1.0.98 In the meantime I was doing a last test, but from libmarpa client side instead of libmarpa core:

Suppose that at a given "latest earley set Id" I have a given progress report that I'd call P1. Then I do a set of alternatives and one complete. I redo the progress at "latest earley set Id" and store the result as P2. Then a lot of alternative, complete happens. Suppose that I observe later on that the progress at "latest earley set Id" is equivalent to P1. Is it correct to assume that the next round of alternatives followed by a complete will produce a new progress that is equivalent to P2 ?

If this is true then this mean I could store the notion of "predicted progress"es, avoiding a lot of calls that I currently do to marpa_r_progress_report_start...marpa_r_progress_item...marpa_r_progress_report_finish.

Thanks.

jeffreykegler commented 2 years ago

I may need some think time on this. But let me ask for some clarifications.

I assume when you say that the progress reports are "identical" you mean identical "modulo" the latest Earley set ID. Because obviously the latest ES ID will differ in P1 and P2.

And should I assume all three sets of alternatives are identical, except for location?

jddurand commented 2 years ago

Indeed, I am concerned only the tuples {rule number, position}. Earley Set Ids (current and origin) are not considered in this comparison.

jddurand commented 2 years ago

I meant given a set of {rule,position}, is the next set of {rule,position} always the same after a complete ;)

jeffreykegler commented 2 years ago

When you say "set of {rule,position}" do you mean a multiset, where the "sets" differ if one of their elements occurs a different number of times? This makes a difference in case of, for example, right recursion.

Also, I think the sets of alternatives accepted before P1 and P2 must be identical. Because if they differ the set of pre-dot terminal symbols in the rules of P1 and P2 will differ, and therefore P1 and P2 must differ.

jddurand commented 2 years ago

A set is defined as:

Two sets are equivalent if they have the same number of items and all items, in order, are equivalent.

At a given latest earley set id L, progress is represented by set PL, calculated from L to L. Then complete happens. Earley set id increases and now have a progress represented by PL+1, calculated from L+1 to L+1.

The question is if for any set Px equivalent to PL, the Px+1 set will always be equivalent to PL+1, where x > L (it is of course true if x == L).

Le 4 août 2022, 21:32, à 21:32, Jeffrey Kegler @.***> a écrit:

When you say "set of {rule,position}" do you mean a multiset, where the "sets" differ if one of their elements occurs a different number of times? This makes a difference in case of, for example, right recursion.

Also, I think the sets of alternatives accepted before P1 and P2 must be identical. Because if they differ the set of pre-dot terminal symbols in the rules of P1 and P2 will differ, and therefore P1 and P2 must differ.

-- Reply to this email directly or view it on GitHub: https://github.com/jddurand/c-marpaESLIF/issues/9#issuecomment-1205686741 You are receiving this because you commented.

Message ID: @.***>

jddurand commented 2 years ago

Ok, please forget about this. I made a test and it appears that, no, it does not work. I'm going to do a marpaESLIF release.

jddurand commented 2 years ago

Version 6.0.25 of marpaESLIF has been released and uploaded to CPAN if you want to give it a try.

jeffreykegler commented 2 years ago

I talked about ESLIF on the IRC channel: https://colabti.org/irclogger/irclogger_log/marpa?date=2022-08-05#l8.

jddurand commented 2 years ago

For the record here at the timings with version 6.0.25 of MarpaX::ESLIF:

perl -I ~/git/Marpa--R2/blog/json ~/git/Marpa--R2/blog/json/bench.pl ~/git/Marpa--R2/blog/json/test.json
Using /home/jdurand/git/Marpa--R2/blog/json/test.json
Using Marpa::R2 4
Using MarpaX::ESLIF 6.0.25
                            Rate SLIF JSON JSON::PP ESLIF JSON PP (minimal) ESLIF JSON JSON::XS
SLIF JSON                  449/s        --     -64%                    -72%       -86%    -100%
JSON::PP                  1244/s      177%       --                    -24%       -61%     -99%
ESLIF JSON PP (minimal)   1629/s      263%      31%                      --       -48%     -99%
ESLIF JSON                3162/s      605%     154%                     94%         --     -97%
JSON::XS                118154/s    26238%    9402%                   7153%      3637%       --

perl -I ~/git/Marpa--R2/blog/json ~/git/Marpa--R2/blog/json/bench.pl /usr/share/iso-codes/json/iso_639-3.json
Using /usr/share/iso-codes/json/iso_639-3.json
Using Marpa::R2 4
Using MarpaX::ESLIF 6.0.25
            (warning: too few iterations for a reliable count)
            (warning: too few iterations for a reliable count)
            (warning: too few iterations for a reliable count)
                           Rate SLIF JSON JSON::PP ESLIF JSON PP (minimal) ESLIF JSON JSON::XS
SLIF JSON               0.641/s        --     -65%                    -68%       -82%     -99%
JSON::PP                 1.83/s      186%       --                     -9%       -48%     -98%
ESLIF JSON PP (minimal)  2.03/s      216%      10%                      --       -43%     -98%
ESLIF JSON               3.54/s      452%      93%                     75%         --     -97%
JSON::XS                  103/s    15933%    5501%                   4970%      2803%       --
jddurand commented 2 years ago

For the ESLIF JSON PP (minimal) version grammar is:

# --------------------------------------------------
# Meta settings
# --------------------------------------------------
:desc                  ::= 'Strict JSON Grammar'
:default               ::= action => ::shift fallback-encoding => UTF-8 discard-is-fallback => 1

# ---------------------------------
# Discard unsignificant whitespaces
# ---------------------------------
:discard               ::= /[\x{9}\x{A}\x{D}\x{20}]+/

# -----------------------------------------------------------------------
# Explicit terminals. Non-ascii strings are special and disable :discard.
# -----------------------------------------------------------------------
:symbol                ::= '"'                                                    name => DQUOTE         symbol-action => ::undef     pause => after event => :discard[switch]
:symbol                ::= '{'                                                    name => LBRACKET       symbol-action => ::undef
:symbol                ::= '}'                                                    name => RBRACKET       symbol-action => ::undef
:symbol                ::= '['                                                    name => LSQUARE        symbol-action => ::undef
:symbol                ::= ']'                                                    name => RSQUARE        symbol-action => ::undef
:symbol                ::= ','                                                    name => COMMA          symbol-action => ::undef
:symbol                ::= ':'                                                    name => COLON          symbol-action => ::undef
:symbol                ::= 'true'                                                 name => TRUE           symbol-action => ::true 
:symbol                ::= 'false'                                                name => FALSE          symbol-action => ::false
:symbol                ::= 'null'                                                 name => NULL           symbol-action => ::undef
:symbol                ::= /"([^"\\\x00-\x1F]*)"/u -> '$1'                        name => BASIC_STRING                                priority => 1
:symbol                ::= /[^"\\\x00-\x1F]+/u                                    name => BASIC_PART     symbol-action => ::transfer
:symbol                ::= /(?:\\["\\\/bfnrt])+/                                  name => ESCAPE_PART    symbol-action => ::transfer
:symbol                ::= /(?:\\u[[:xdigit:]]{4})+/                              name => UNICODE_PART   symbol-action => ::transfer
:symbol                ::= /-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?/ name => NUMBER         symbol-action => ::transfer

# ----------
# JSON value
# ----------
value                  ::= object
                         | array
                         | string
                         | constant
                         | number

# -----------
# JSON object
# -----------
object                 ::= $LBRACKET members $RBRACKET                                                   action => ::copy[1]
members                ::= member*                   separator => $COMMA proper => 1                     action => members
member                 ::= string (- $COLON -) value                                                     action => ::row

# ----------
# JSON Array
# ----------
array                  ::= $LSQUARE elements $RSQUARE                                                    action => ::copy[1]
elements               ::= value*                    separator => $COMMA proper => 1 hide-separator => 1 action => ::row

# -----------
# JSON String
# -----------
string                 ::= $BASIC_STRING
                         | $DQUOTE string_parts $DQUOTE                                                  action => ::copy[1]
string_parts           ::= string_part*                                                                  action => ::concat
string_part            ::= $BASIC_PART
                         | $ESCAPE_PART                                                                  action => string_escape_part
                         | $UNICODE_PART                                                                 action => string_unicode_part

# -------------
# JSON constant
# -------------
constant               ::= $TRUE
                         | $FALSE
                         | $NULL

# -----------
# JSON number
# -----------
number                 ::= $NUMBER                                                                        action => number

As you can see, I have put the notion of alias in the grammar... you define a symbol once and reference it using a '$' before the alias name. I found that very handy ;)

The perl actions are:

#
# Action for: members ::= member*
# -------------------------------
sub members {
    my ($self) = shift;

    my %output = ();
    while (@_) {
        my $member = pop(@_);
        $output{$member->[0]} = $member->[1];
        #
        # Eat separator
        #
        pop(@_);
    }

    return \%output
}

sub string_escape_part {
    my ($self, $part) = @_;

    my $c = substr($part, 0, 1);

    if    ($c eq 'b') { $c = '\b' }
    elsif ($c eq 'f') { $c = '\f' }
    elsif ($c eq 'n') { $c = '\n' }
    elsif ($c eq 'r') { $c = '\r' }
    elsif ($c eq 't') { $c = '\t' }

    return $c
}

sub string_unicode_part {
    my ($self, $part) = @_;

    my @uint32 = ();
    my $p = 0;
    my $pmax = length($part);
    my $output;

    while ($p < $pmax) {
        my $u = substr($part, $p+2, 4);
        push(@uint32, hex($u));
        $p += 6;
    }

    for (my $i = 0, my $j = 1; $i <= $#uint32; $i++, $j++) {
        my $c = $uint32[$i];
        if (($j <= $#uint32) && ($c >= 0xD800) && ($c <= 0xDBFF) && ($uint32[$j] >= 0xDC00) && ($uint32[$j] <= 0xDFFF)) {
            # Surrogate UTF-16 pair
            $c = 0x10000 + (($c & 0x3FF) << 10) + ($uint32[$j] & 0x3FF);
            ++$i;
            ++$j;
        }
        if (($c >= 0xD800) && ($c <= 0xDFFF)) {
            $c = 0xFFFD; # Replacement character
        }

        if ($c < 0x80) {
            $output .= chr($c);
            next;
        }

        my @q = ();
        if ($c < 0x800) {
            push(@q, 0xC0 + ($c >> 6));
            goto t1;  
        }
        if ($c < 0x10000) {
            push(@q, 0xE0 + ($c >> 12));
            goto t2;
        }
        push(@q, 0xF0 + ($c >> 18));
        push(@q, 0x80 + (($c >> 12) & 0x3F));
      t2:
        push(@q, 0x80 + (($c >> 6) & 0x3F));
      t1:
        push(@q, 0x80 + ($c & 0x3F));

        $c = pack "C*", @q;
        $output .= $c;
    }

    return $output
}

sub number {
    my ($self, $number) = @_;

    my $output = 0+$number;

    return $output
}

The built-in JSON parser inside ESLIF is doing almost exactly what JSON::XS is doing: lookahead, done in C. But except that, of course, it goes into the normal alternative/complete sequence. It is just the discovery that is optimized with a direct lookahead into an UTF-8 converted internal buffer.

jddurand commented 2 years ago

Unless you disagree, may I close this issue.

jeffreykegler commented 2 years ago

Yes, time to close it. I want to discuss how to blog / document this, the best to do that via a separate issue.

jddurand commented 2 years ago

Closed as per https://github.com/jddurand/c-marpaESLIF/issues/9#issuecomment-1206695341

jddurand commented 2 years ago

FYI I worked on the lua timings mentionned in this ticket, and next version of marpaESLIF will have equivalent timings between perl and lua.

jeffreykegler commented 2 years ago

Very nice! By coincidence I just added Lua to the libmarpa repo. My motivation was the need for a fuller test suite, which cannot reasonably be written in C. (Right now Marpa::R2 is Libmarpa's de facto test suite.) I wanted an embeddable language and that essentially means Lua. My Libmarpa Lua work is in early stages, at the moment.