phorward / libphorward

C/C++ library for dynamic data structures, regular expressions, lexical analysis & more...
MIT License
24 stars 7 forks source link

Push parsing #7

Closed phorward closed 6 years ago

phorward commented 6 years ago

Originally from issue #6, by @FreeBASIC-programmer

I looked at the way parsing (lalr1) is performed in libphorward. And I have something of a feature request.

Would it be possible to have an external scanner call the parser ('push parsing')? The interface to the libphorward generated parser would consist of a single function that should get called by the scanner for every token that the parser should parse.

Having an external scanner call the parser could make it a lot easier to parse input from anything besides a static string (eg stream-ish style parsing). Also the parser could be part of a 'larger' parser that would decide what subparser would get fed a certain token. It would be great to use x subparsers to parse some 'exotic' language (for example C++ :)).

Parsing one token using push parsing could look something like this.

token = get_token(scanner)
parse(parse_state,token)
if (parse_state->error_) then
  ''error message
end if

I am assuming the parser needs a place (parse_state) to keep track of parsing state. The numbering scheme for tokens could be straightforward: first token to appear in the grammar would get assigned a possible user defined number (or 1 or something else). Subsequent tokens appearing in the grammar would get numbered upward (or downward) from that first , possible user defined number.

parse_state could be used to store other, user-defined information (pppar could be extended to contain some user defined data in the shape of an any ptr or a plist or...?).

phorward commented 6 years ago

Hello @FreeBASIC-programmer,

this sounds interesting and should not be that heavy to implement. Currently, the whole LALR parser's state machine is located at src/parse/parse.c. Thanks to the library's modularity, it won't be a problem to write a parser that can be invoked on demand and holds it state in another kind of structure.

Having an external scanner call the parser could make it a lot easier to parse input from anything besides a static string (eg stream-ish style parsing).

Indeed, this an important feature you mention here. I thought of it as bringing in some kind of "dynamic character fetch" function that can be used to feed the input stream, but I think your proposal of bringing in push-parsing from the scanner to the parser is the more straightforward way. Especially because the origin of every parser lays in the tokens it gets.

The numbering scheme for tokens could be straightforward: first token to appear in the grammar would get assigned a possible user defined number (or 1 or something else). Subsequent tokens appearing in the grammar would get numbered upward (or downward) from that first , possible user defined number.

At this point, I need some time to think about a conforming approach. One priority of libphorward is, to let the user work with objects (in this case, ppsym for grammar symbols) rather than ids. Because objects are referencing one realy entity which exists (and, in this case, is part of the grammar the parser works on), any kind of consecutive number can be error-prone or misunderstood.

But the general idea, making the parser driven by the scanner, is a good one, and I'm thinking about an implementation. I will post any new perception afterwards to this issue!

So long, Jan

phorward commented 6 years ago

Hi @FreeBASIC-programmer,

I've created a first draft of a push-parsing algorithm as you described above. It's in the branch push-parsing (87c8996f2b2360d5ecc823a9e012c0c0eedcd8fc) and adds a function pp_par_pushparse() additionally to the exsiting pp_par_parse(). The current parser state is hold in the pppar object.

The first example I got to work is pushparse.c. I hope you like this approach, I think there is still more modularization to be done before it can be merged into the main branch.

So long, Jan

FreeBASIC-programmer commented 6 years ago

That draft looks mighty interesting. Using the current approach you'd need to known the numbering scheme used by the parser. And you'd need to define tokens twice which can easily become a problem.

A workaround for the above 'problems' would be to only allow uppercase identifiers for terminals and lowercase identifiers for nonterminals.

The user comes up with the names of the tokens. The parser associates numbers with those names. When writing an external scanner you need not know the exact numbering scheme (as long as you are the one in control of the symbolic description of tokens).

After creating the grammar the user can get the numbers associated with the symbols. And then use those numbers in the scanner. The expression grammar would look like this

    pp_gram_from_pbnf( grm,
                "<<     PLUS MIN;"
        "<<      MUL FDIV;"

        "expr$   : expr MUL expr = mul"
                "| expr FDIV expr = div"
                "| expr ADD  expr = add"
                "| expr MIN expr = sub"
                "| LP expr RP"
                "| INT"
                ";" );

I am not quite sure how to get the number associated with a terminal but I am guessing the following code snippet might just work

symbol = pp_sym_get_by_name(grm,"PLUS")
symbol_idx = symbol->idx   
/*  do something with symbol_idx */

The grammar format looses a bit in terms of 'look and feel' but disallowing regular expressions in the grammar file would make parsing the grammar a lot easier. And you could have the grammar parser figure out what the user wants just by looking at the input. If there are only identifiers (either uppercase or lowercase) and the predefined symbols (eg << | ; ) in the input then it is clear that the user is going to be using an external lexer/push parsing.

If you want a push parser (external lexer) then you need to specify the grammar using the 'ugly', identifier only format (<< = etc.. remain valid). If you do not want to use an external lexer then you can specify the grammar using the regular expression format.

Problem solved :) Or rather: the above solution would work fine for me.

phorward commented 6 years ago

Hello @FreeBASIC-programmer,

thanks for taking a look at my first draft.

A workaround for the above 'problems' would be to only allow uppercase identifiers for terminals and lowercase identifiers for nonterminals.

This behavior is already implemented. Any symbol that is named (= has an identifier) that starts with a lower-case letter is treated as a nonterminal, any other, named symbol is treated as a terminal. So your example grammar from above will already work exactly the way you described it, also the way how you obtain the symbol's numeric ID. But latter one is the sticking point that needs to be solved in a nice, user-friendly way. Because when you have to obtain the symbols' ID every time you write a parser first, to the tokenize the input, you won't get any benefit to my first push-parsing draft.

If you take a look into the example at tests/pushparse.c, line 62, there is the function call

pp_par_pushparse( par, sym, token->start, token->end )

it would be much nicer if this could be

pp_par_pushparse( par, pp_sym_get_by_name( par->gram, "INT"), token->start, token->end )

or more simplyfied, just

pp_par_pushparse( par, "INT", token->start, token->end )

so the symbol's identifier becomes the only thing the user needs to handle, instead of juggling around with numbers. But this breaks libphorward's working with objects design it currently is focused on.

The grammar format looses a bit in terms of 'look and feel' but disallowing regular expressions in the grammar file would make parsing the grammar a lot easier.

This push-parsing thing is really a great benefit to libphorward, and I'm currently thinking about changing the entire framework to support push-parsing as its only parsing method. So we can think about providing several functions, one accepting the ppsym pointer of the symbol (like pp_par_pushparse() above), another accepting the symbol's unique name (as drafted above) and maybe one more that grabs the symbol by its ID. The regular expression-based grammars can still be used to generate a lexer from the grammar that directly works together with the parser, as it is implemented in libphorward 0.22.

What do you think about it?

phorward commented 6 years ago

Hey @FreeBASIC-programmer, I've created a bunch of new functions and made push-parsing the general parsing method that is used by libphorward in branch push-parsing. See here for the new functions.

I hope you like the new changes. It makes the full thing very flexible and the prevous calls work like before with only minor changes. Stay tuned, more will follow soon.

phorward commented 6 years ago

Hi @FreeBASIC-programmer, I made some more changes to the branch push-parsing and think about merging it into develop soon. I now started to support the pany-datatype in the ppast structures, which is a universal data type in libphorward. You are not beholden to use it, but its easier to still support the on-demand parser construction like with the pparse command-line utility.

phorward commented 6 years ago

I've merged the branch into develop now, so it becomes available with the next release :-) Thanks for your recommendations!

cogmeta commented 6 years ago

Is it merged into develop branch? I cant find these new functions. can you please confirm?

phorward commented 6 years ago

Is it merged into develop branch? I cant find these new functions. can you please confirm?

Sure, you can find it here https://github.com/phorward/phorward/blob/develop/src/parse/parse.c#L300 and below...

cogmeta commented 6 years ago

Found it. Thanks.