phorward / libphorward

C/C++ library for dynamic data structures, regular expressions, lexical analysis & more...
MIT License

Increasing scanner performance #11

Closed FreeBASIC-programmer closed 5 years ago

FreeBASIC-programmer commented 6 years ago

I wanted to speed up a plex generated scanner and I found that using the option PREGEX_RUN_UCHAR made a huge difference. If you do not use PREGEX_RUN_UCHAR, plex will perform the following actions for every byte in the input:

  ch = u8_char( ptr );
  ptr += u8_seqlen( ptr );

After using the option PREGEX_RUN_UCHAR I figured it would be possible to take the plex generated dfa and turn it into a hard-coded scanner. Maybe that would lead to an even bigger increase in performance. So I wrote a program that, given a dfa:

- writes one label (sometimes two labels) for every state in the dfa;
- writes case statements to represent transitions;
- writes goto statements to perform transitions.

BASIC syntax allows for representing case statements in a fairly straightforward manner. For example, suppose the valid range of transitions is a to z. The BASIC code looks like this (foo denotes the label of the state, bar is the destination label):

foo:
  idx += 1
  select case as const input_string[idx]
  case asc("a") to asc("z"): goto bar
  end select
In case there is a transition from state x to state x and state x is a final state, two labels get created: one used as jump target for external transitions and one used as jump target for internal transitions. The difference between the two is that at the first label (the one used as target for external transitions) the ID is set. At the second label the ID does not get set, as there is no need to set it again and again due to internal transitions. An example of a state that represents matching the second part of an identifier (first character of the identifier already read, valid continuation of the identifier is [a-zA-Z0-9_]*):

foo:
  id = IDENTIFIER
bar:
  idx += 1
  match = idx + 1
  select case as const input_string[idx]
  case asc("a") to asc("z"),_
          asc("A") to asc("Z"),_
          asc("0") to asc("9"),_
          asc("_"): goto bar
  case else: goto exit_label
  end select

Case else is the default transition. Whenever a mismatch occurs a transition is made to exit_label (located near the end of the scanner routine).

There are some caveats when turning the dfa into code. In plex, flags can be set at two levels:

- the level of a single token (when defining a single token), or
- the level of the scanner.

In a hard-coded scanner the flags can only be set at the level of a single token. And the scanner works as if the user set the options PREGEX_RUN_UCHAR, PREGEX_RUN_NOANCHORS and PREGEX_RUN_NOREF. Using those options it is still possible to process UTF-8 encoded text by defining the patterns using multiple bytes per code-point (hex escape sequences).

Long story short: you can turn a plex generated dfa into code (with some restrictions). But is it worth it? If the hard-coded scanner generated tokens at a rate lower than plex_lex, then creating the tool that turns a dfa into a hard-coded scanner would have been a waste of time.

Luckily for me the hard-coded scanner generated tokens five times faster than plex_lex. That increase in performance did not come as a surprise to me. A re2c generated scanner consists of a combination of switch statements and goto statements as well. And a re2c generated scanner generates tokens at the speed of a handwritten scanner.

As hard-coding the scanner worked so well an obvious next step would be to hard-code the parser as well. In a separate issue I have posted a problem with a parser generated by libphorward. If the parser related issue gets resolved (assuming there is a problem and that it is fixable) it should be very possible to hard-code the parser as well. By combining the hard-coded scanner and the hard-coded parser I should be able to get a lalr parser that parses input at a speed that puts the Flex/Bison combination to shame.

The way I see it hard-coding the scanner/parser is only worthwhile if performance becomes an issue. If pp_par_parse is able to parse input in an amount of time that is short (enough) then there is no need to generate code.

But if performance becomes an issue it's nice to know you can squeeze some more performance out of libphorward by simply hard-coding the scanner/parser.

phorward commented 6 years ago

Hello @FreeBASIC-programmer, and thanks for your detailed description about increasing scanner and parser performance when hard-coding them.

By combining the hard-coded scanner and the hard-coded parser I should be able to get a lalr parser that parses input at a speed that puts the Flex/Bison combination to shame.

Generating this kind of scanner/parser combination would be the next target to achieve once libphorward is conceptually stable, and I think we're on a good way towards this now. The latest developments gave me much more motivation, fun and power in getting things done here. Generating hard source code for the scanner or parser from the phorward generated dfa representation would be the thing that differentiates libphorward from UniCCv2, or which could also be the benefit of UniCCv2. I think that making this difference between the two projects would be the best way to go.

What do you think? I like this combination very much! I think we should not lose track of this issue, but put it on hold for a while. It becomes important when starting UniCCv2, which, in my opinion, can be started in a few months.

One thing from your issue can still be discussed right now:

After using the option PREGEX_RUN_UCHAR I figured it would be possible to take the plex generated dfa and turn it into a hard-coded scanner. Maybe that would lead to an even bigger increase in performance.

Yep. What do you think about making the scanner even "pushable", some kind of push-scanning, similar to push-parsing we already implemented?

Having something like

ch = getchar();
while( plex_pushlex( lex, ch, &tok ) == PLEX_STATE_NEXT )
    ch = getchar();

so it depends on the caller how the input data is handled and where it comes from. This would avoid the UTF-8 processing which is currently done, or the restriction to "only" handle wide characters with the PREGEX_RUN_UCHAR flag?

The only thing where I think more focus is required is the state context of both the push-parser and the push-scanner, so that the current parser's or lexer's state is stored apart from the lexer or parser object (plex, ppar) itself. The current libphorward develop branch (formerly push-parsing) holds the current state of the parser together with the parser object, but this is not necessary and becomes a problem when the same parser object should be used multi-threaded or recursively. I think this is an important topic to focus on first; then the scanner can be made pushable and the input handling can be optimized or changed.

So long, Jan

phorward commented 6 years ago

Hello, I drafted a push-scanner for plex in the push-scanning branch.

Here is a working example program in C:

#include <phorward.h>

int main()
{
    char*   tok[] = { "keyword", "literal", "identifier", "operator", "other" };
    int     ch;     /* Current character */
    plex*   l;      /* Lexer */
    plexctx c;      /* Context */
    int     i;      /* Position */
    int     w;      /* Current width of token */
    int     s;      /* Current start of token */

    memset( &c, 0, sizeof( plexctx ) );

    /* Set up a lexer */
    l = plex_create( 0 );

    /* Define tokens */
    plex_define( l, "if|else|while|continue|break", 1, 0 );
    plex_define( l, "\\d+|\\d*\\.\\d+|\\d+\\.\\d*|true|false", 2, 0 );
    plex_define( l, "\\w+", 3, 0 );
    plex_define( l, "=|\\+|-|\\*|/|^|>|<|==|>=|<=|!=", 4, 0 );
    plex_define( l, ";|:|\\(|\\)|{|}|\\[\\]", 5, 0 );

    /* Prepare for execution */
    plex_prepare( l );

    /* Tokenize a string */
    for( i = w = s = 0; ( ch = getc( stdin ) ) != EOF; i++, w++ )
    {
        if( !plex_pushlex( l, &c, ch ) )
        {
            if( c.handle )
            {
                ungetc( ch, stdin );
                printf( "%-10s %d,%d\n", tok[ c.handle - 1 ], s, w );
            }

            memset( &c, 0, sizeof( plexctx ) );
            w = -1;
            s = i;
        }

        if( ch == '\n' )
            break;
    }

    printf( "DONE!\n" );
}

This is an example run:

$ ./lexing 
a = 12; if( x == 10 ) b = true;
identifier 0,1
operator   2,1
literal    5,2
other      8,1
keyword    11,2
other      14,1
identifier 17,1
operator   20,2
literal    24,2
other      28,1
identifier 31,1
operator   34,1
literal    37,4
other      42,1
DONE!

You can see that this example already uses a context structure holding the current state, which is plexctx. A similar structure has to be created for ppar as well.

What do you think about this draft?

FreeBASIC-programmer commented 6 years ago

Push parsing a single token at a time should work quite well. But I think push parsing an arbitrary number of tokens at a time should also be possible. Same for push-lexing.

In case of push-lexing it could be more efficient to have the scanner process chunks at a time instead of one character at a time. Of course the size of a chunk can be 1 char (1 token in case of the parser), but pushing bigger chunks (e.g. multiple characters/tokens) at a time should be possible.

UniCCv2 using libphorward as a library to create an in-memory representation of a grammar/parser: that's a great combination. It would be nice if there could be various back ends. Given a certain grammar it should be possible to generate not just a lalr parser but also an ll(1), ll(k), slr(1) or lalr(k) parser (perhaps also non-canonical slr(1) parsing or lc parsing).

And several generalized approaches, ranging from Unger to Earley to glr (glc?). And let the user choose what he wants: a table driven parser, a parser consisting of a bunch of functions (recursive ascent), or my personal favorite: a parser consisting of a bunch of labels, goto statements and some supporting code.

It's a good thing I do not have to write UniCCv2 (it's easy to come up with features if you are not the one implementing them ;) ).

phorward commented 6 years ago

In case of push-lexing it could be more efficient to have the scanner process chunks at a time instead of one character at a time. Of course the size of a chunk can be 1 char (1 token in case of the parser) but pushing bigger chunks (eg multiple characters/tokens) at a time should be possible.

You're right, but then things need to be cached. I'm merging it altogether for now into develop and then we look...phorward! ;-)

phorward commented 5 years ago

I'm closing this issue now because of inactivity. Please do also take a look at #21.