phorward / unicc

LALR parser generator targeting C, C++, Python, JavaScript, JSON and XML
MIT License

FreeBASIC template #2

Closed FreeBASIC-programmer closed 6 years ago

FreeBASIC-programmer commented 6 years ago

When I started writing a FreeBASIC template I figured I'd translate the code in /targets/Csource/ to FreeBASIC and create a template using that code. In hindsight that was actually a bad idea. It would have been easier to follow the suggestion in c.tlt:

It should be possible to abstract this template and given expansion macros
to several other programming languages, not only C-like ones. 

I ran into two issues when using unicc.

Issue one has to do with unicc crashing.

The first thing I did when trying to create a FreeBASIC template was to build unicc and have a look at the resulting C code when using the expression example that comes with unicc (expr.c.par). That way I could get an idea of what the generated code looks like, and I needed a working unicc to test the FreeBASIC template.

When using any option that does not result in the generation of a parser (e.g. -V, -h or an empty command line) all is well: unicc shows a help screen and exits. When using any option that does lead to the generation of a parser I have to include the option -v or else unicc will crash. If I use no option, or any of the other options, I get a Windows message box stating that unicc stopped working.

I compiled the code using MinGW 3.4.5 and MSYS 1.0 (OS: 64-bit Windows 7 Pro). I tried to figure out why unicc crashes when not using -v but could not find the cause. The workaround for this issue is simple: just use -v and there will be no problem.

Issue two is a FreeBASIC-specific issue.

The syntax of unicc contains a terminal defined as @ (terminal #48). The @ serves the same role as the $ in Bison.

@1 refers to the semantic value associated with symbol 1 on the rhs of a rule, and @@ refers to the semantic value associated with the symbol on the lhs of a rule. For example, in expression -> expression '+' expression [* @@ = @1 + @3 *] the action adds the values of the first and third rhs symbols and assigns the result to the lhs symbol.

In C the $ symbol has no special meaning (it is not used as a separator in the C programming language), which is why Bison can use it. Likewise, the @ symbol has no special meaning in C, which is why unicc can use it. In FreeBASIC, however, the @ symbol does have special meaning: it is a unary operator with the same semantics as the prefix unary & operator in C. Apart from being a unary operator it is also an operator that can be overloaded.

The above means that FreeBASIC users must watch out when using the @ operator. unicc will rewrite any part of a semantic action that involves the use of @, and that can lead to unexpected results when compiling the resulting code. An example: @1 on its own would yield a syntax error in FreeBASIC (assuming operator @ has not been overloaded), but after unicc has rewritten @1 the code containing the syntax error has vanished. The resulting code (the code produced after unicc has rewritten the semantic actions) may or may not compile correctly; in either case the initial syntax error is no longer reported.

How much of a problem the use of @ as a terminal in the unicc syntax poses I do not know. What I do know is that writing a unicc template for any language whose grammar contains the terminal @ will require some action on the part of the template writer.

At the very least the documentation inside the template should include a prominent warning. It should be made clear to users of the template that the symbol @, when used inside a semantic action, is treated in a special way by unicc.

There are other solutions to the @ issue that involve a bit more than adding a comment to a template. I think the issue deserves some looking into: you want people who write a DSL using unicc to be free in their choice of terminals.

phorward commented 6 years ago

Hello @FreeBASIC-programmer,

sorry for the late answer, but I didn't have much time the last few days. Thank you for your very impressive report about the issues you ran into.

The reason why unicc crashes seems to be a Windows-specific issue. I ran into this problem when porting the latest setup version 1.3.1 (96329ea5fade5c4aefad8ccee94dfb84db21d77b) of UniCC; in that case the problem occurred in the C template, where invalid pointers were processed. Your issue also seems to be a pointer problem, because the option "-v" temporarily solved the problem in my case on Windows as well. I'll take a look at it soon, but it would be very useful to test it with the template you already implemented, to get closer to the problem.

Your second problem with the "@" is totally understandable to me and needs a closer look at UniCC's handling of semantic actions. Can you provide some examples of how the @ is normally used in FreeBASIC? Maybe we can find a workaround or a more flexible solution for this, which fixes your problem. The idea of a DSL for writing templates sounds nice, but I would prefer a templating language for this later on. The XML solution currently used was initially started in 2008, but is not modern anymore for several reasons. This must be considered for a new UniCC major version.

phorward commented 6 years ago

I've just compiled a new setup package for UniCC 1.3.3, and here is also the exe only: unicc-1.3.3.zip

Can you please check if this version also crashes with the files you feed to it and without the -v option?

FreeBASIC-programmer commented 6 years ago

I downloaded unicc-1.3.3.zip, unpacked the executable into the directory of unicc and ran unicc.exe without using option -v (I used unicc -b prefix expr.c.par). And... problem solved!

The @ operator is used exactly like the unary & (address-of) operator in C. I translated a huge amount of C code some time ago and I found 130 occurrences of @ in the translated code. Some snippets taken from that code:

static shared g_cmdname as const zstring ptr = @"packcc" '' replaced later with actual one
node_const_array__copy(ctx->stderr__,@node->data.action.vars, vars)
node_const_array__term(@a)
if (b) then
  node_const_array__init(ctx->stderr__,@a, ARRAY_INIT_SIZE)
  capts = @a
end if

I thought about solutions that would solve the issue for FreeBASIC and other languages. I came up with two solutions.

Solution one: let the user define a replacement for the @ token. In the case of FreeBASIC the symbols that could be used are `, ~ and %. Of those three I think % would be the best to use. A semantic action would look like this (using a rule from expr.c.par):

expression      -> expression '+' expression [* %% = %1 + %3 *]

The other tokens (` and ~) are less usable. If you used one of those you would get:

expression      -> expression '+' expression [* ~~ = ~1 + ~3 *]
expression      -> expression '+' expression [* `` = `1 + `3 *]

Neither looks promising.

Solution two is the approach used by several parser generators. One of those is lemon (part of the SQLite source code and used to generate the SQL parser used by SQLite). It uses the following syntax:

expression(A) ::= expression(B) PLUS expression(C).  { A = B+C; }

This looks very much like what unicc uses (at least on the rhs of a production), although there are some differences.

Using a lemon-like syntax while keeping the unicc style of grammar rules would give

A:expression -> B:expression PLUS C:expression  { A = B+C; }

The types of A, B and C must be known before their usage. In lemon you use the directive %type nonterminal {type-name} to let the parser know what type is to be associated with a nonterminal:

%type expression {integer}

Regardless of the programming language the lemon solution should work. There is a (slight) problem with the lemon solution, as the identifiers used could be reserved words in the target programming language. If you rewrite the above (when using FreeBASIC as the target language) to

expression(abs) ::= expression(sin) PLUS expression(tan). {abs = sin + tan;}

then compiling the output will result in errors: abs, sin and tan are reserved words in FreeBASIC, and none of them is used correctly in the semantic action.

The first option (a user-defined replacement for the @ token) seems to be the easiest to implement (the amount of changes to unicc would be kept to a minimum). The second option (the lemon solution) involves a lot more changes to unicc.

Now for something totally unrelated: the next version of unicc. I read in the unicc wiki https://github.com/phorward/unicc/wiki/UniCC-v2 that you are thinking about using GLR parsing to enable parsing of arbitrary context-free grammars. The most likely candidate would then be RNGLR parsing (right-nulled GLR parsing). But what about GLL parsing? An RNGLR parser uses a GSS (graph-structured stack) for parsing and an SPPF (shared packed parse forest) to represent the parse forest. A GLL (generalized LL) parser uses the same data structures (GSS and SPPF). GLL parsing might be something to consider (instead of RNGLR).

https://github.com/YaccConstructor/YaccConstructor/tree/dev/src contains implementations of the GLL and RNGLR parsing algorithms (generators for both GLL parsers and RNGLR parsers). Hime is a parser generator that generates an RNGLR parser in C#: https://bitbucket.org/account/user/cenotelie/projects/HIME. Iguana is a parser generator that generates GLL parsers: https://github.com/iguana-parser/iguana

phorward commented 6 years ago

Hello @FreeBASIC-programmer, and thanks for your very constructive posting.

Both ways of resolving this issue are fine and have their right to exist. The lemon-style approach would, indeed, be too target-language specific, as the example you chose shows. This was also the reason why I chose the "@" followed by a specific postfix in UniCC. Your first idea, making the "@" replaceable, is the most obvious one to me, because it provides more freedom. Maybe other methods can be tried in a later excursion, but for UniCC v1 this is totally sufficient. If you'd like to check it out, I've created and pushed the branch action_prefix, which makes the "@" replaceable by the target language template.

In a first test, I extended c.tlt with the line

<action_prefix>HI!</action_prefix>

and tested with the input file

//Some grammar-related directives
#!language      "C";
#whitespaces    ' \t';
#lexeme         integer;
#default action [* HI!@ = HI!1; *];

#left           '+' '-';
#left           '*' '/';

//Defining the grammar
calculator$     -> expression                [* printf( "= %d\n",
                                                  HI!expression ) *]
                ;

expression      -> expression '+' expression [* HI!@ = HI!1 + HI!3 *]
                | expression '-' expression  [* HI!@ = HI!1 - HI!3 *]
                | expression '*' expression  [* HI!@ = HI!1 * HI!3 *]
                | expression '/' expression  [* HI!@ = HI!1 / HI!3 *]
                | '(' expression ')'         [* HI!@ = HI!2 *]
                | integer
                ;

integer         -> '0-9'                     [* HI!@ = HI!1 - '0' *]
                | integer '0-9'              [* HI!@ = HI!integer * 10 + HI!2 - '0' *]
                ;

and the result worked as expected. UniCC still also compiles itself with this modification. Would it be an adequate extension to resolve your problem with the "@"? Please note that "@@" in this case corresponds to "HI!@", because the second "@" is static.

Ok, now the other thing with the parsing algorithms. I like your idea of implementing GLL as well, but haven't done any research on it yet. All my current efforts are mostly in the area of LR parsing, which is more accessible to me, maybe because I've implemented it so often. You may have recognized my other project called pynetree, which implements a packrat parser with left-recursion handling. This is, for now, my most powerful attempt at some kind of recursive-descent LL parsing.

With GLR parsing, I'm still doing research right now and really would like to implement this algorithm in an easy-to-use manner, as the next target to achieve. You seem to be familiar with different parsing algorithms, so why is GLL in your opinion a better choice than GLR/RNGLR for a UniCC v2?

Please note that my knowledge of current parser developments is not the greatest, maybe because I never studied any of this and I'm unfortunately not working scientifically on these topics. All this stuff is done in my spare time, for fun, and out of interest. But I'm really hungry for new ideas, algorithms and things relating to these topics, so please let me know what you think about it.

So long, Jan

FreeBASIC-programmer commented 6 years ago

I (finally) finished my FreeBASIC code generator. At first I had some problems with the error messages I got from unicc while testing with freebasic.tlt. The error message I got referred to a problem in the vicinity of line 1. As line 1 was correct, I copy-pasted the template into an XML editor.

Luckily for me the XML editor (in this case XML Copy Editor) pinpointed the problem. The problem had to do with some operators (<= and >=) that needed to be translated to XML entities (&lt;= and &gt;=). After those minor operator adjustments the code generator worked like a charm. It wasn't all that hard to create a FreeBASIC template.

The output looks exactly as expected. The only thing that seems odd is the indentation of semantic action code: code following a #line directive does not get indented (at all). I checked whether the C template produced code with a different layout, but it laid out the code in the same manner. It's not a big 'issue' (it's hardly an issue at all).

What's left to do now is to come up with a nice example of a unicc parser. And that can only be a fairly complete C parser or a FreeBASIC expression parser.

FreeBASIC-programmer commented 6 years ago

With GLR parsing, I'm still doing research right now and really would like to implement this algorithm in an easy-to-use manner, as the next target to achieve. You seem to be familiar with different parsing algorithms, so why is GLL in your opinion a better choice than GLR/RNGLR for a UniCC v2?

I just thought GLL parsing is an option you could consider. Whether it's a better choice remains to be seen. In terms of parsing speed it does about as well as RNGLR parsing. And it's a parsing technique that is still very much 'under construction'. Creating a GLL-style parser isn't particularly hard; there are multiple papers available online on how to create a recognizer/parser.

The originators of GLL parsing claim that you can write a GLL parser by hand. The same cannot be said of a GLR parser. This is due to the fact that the starting point of GLL parsing is top-down parsing (LL(1) parsing). And writing a top-down parser by hand is something programmers do a lot these days.
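To illustrate what "by hand" means here, a minimal C sketch (unrelated to unicc's generated code, and using a non-left-recursive sum-of-digits variant of the expression/integer idea from the expr example):

#include <stdio.h>

static const char *p;               /* current input position */

static int parse_integer(void)      /* integer -> '0'-'9' */
{
    return *p++ - '0';
}

static int parse_expression(void)   /* expression -> integer ('+' integer)* */
{
    int value = parse_integer();
    while (*p == '+') {
        p++;
        value += parse_integer();
    }
    return value;
}

int main(void)
{
    p = "1+2+3";
    printf("= %d\n", parse_expression());  /* prints "= 6" */
    return 0;
}

Each function corresponds directly to a grammar rule, which is exactly the property the GLL authors want to keep while adding the machinery needed for full context-free grammars.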

To overcome the limitations of top-down parsing a GSS is used, which enables a top-down parser to deal with (hidden) left recursion. It is a bit like getting a packrat parser to deal with left recursion. The reason for using a GSS in GLL parsing is actually twofold: to support left recursion and to deal with non-determinism in the grammar. I think the following quote from the originators of the GLL algorithm says it all (a small sketch of a GSS node follows after the quote):

We develop the fully general GLL parsing technique which is recursive descent-like, and has the property that the parse follows closely the structure of the grammar rules, but uses RNGLR-like machinery to handle non-determinism
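A rough C sketch of the GSS idea, just to illustrate it (the names and fields are my own and not taken from any of the generators mentioned here): each node records a label and an input position and can have edges to several parent nodes, so different parse attempts share the common part of their stacks.

#include <stdio.h>
#include <stdlib.h>

struct gss_node {
    int label;                    /* grammar slot / return label */
    int position;                 /* input position the node was created at */
    struct gss_node *parents[4];  /* edges back into the graph */
    int nparents;
};

static struct gss_node *gss_new(int label, int position)
{
    struct gss_node *n = calloc(1, sizeof *n);
    n->label = label;
    n->position = position;
    return n;
}

static void gss_add_edge(struct gss_node *child, struct gss_node *parent)
{
    child->parents[child->nparents++] = parent;
}

int main(void)
{
    /* one shared base node, two alternative stack tops pointing back to it */
    struct gss_node *base = gss_new(0, 0);
    struct gss_node *alt1 = gss_new(1, 3);
    struct gss_node *alt2 = gss_new(2, 3);

    gss_add_edge(alt1, base);
    gss_add_edge(alt2, base);

    printf("both alternatives return to label %d at position %d\n",
           alt1->parents[0]->label, alt2->parents[0]->position);

    free(alt2); free(alt1); free(base);
    return 0;
}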

GLL uses (besides other data) "grammar slots" to track parsing progress. A grammar slot is a production with a dot somewhere on its rhs (for example, expression -> expression · '+' expression), which is the same as an item in LR parsing. You can read the paper about GLL parsing here: https://www.sciencedirect.com/science/article/pii/S1571066110001209

In the paper GLL parsing is implemented using labels and gotos (using states (numbers) instead of labels, and a switch statement embedded inside a loop instead of goto, works just as well).
A paper with tips on how to implement a GLL parser can be found here http://alexandria.tue.nl/extra1/afstversl/wsk-i/cappers2014.pdf

Implementing GLL parsing using labels and goto is not going to work in, say, Python (or JavaScript or many other languages, though it would work in Lua, I think). Changing the labels to states and then using a switch or some other choice construct should work in quite a few languages.
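A minimal C sketch of that label-to-state transformation (the state names are made up in the style of the paper's labels; this is only the dispatch skeleton, not a GLL parser):

#include <stdio.h>

enum state { L_S, L_S1, L_DONE };   /* one constant per former label */

int main(void)
{
    enum state next = L_S;          /* corresponds to an initial "goto L_S" */
    int steps = 0;

    while (next != L_DONE) {
        switch (next) {
        case L_S:                   /* body that followed "L_S:" */
            steps++;
            next = L_S1;            /* instead of "goto L_S1" */
            break;
        case L_S1:                  /* body that followed "L_S1:" */
            steps++;
            next = L_DONE;          /* instead of "goto L_DONE" */
            break;
        default:
            next = L_DONE;
            break;
        }
    }

    printf("dispatched %d states\n", steps);
    return 0;
}

In a real GLL parser the next state would be taken from a worklist of pending descriptors rather than assigned directly, but the loop-plus-switch dispatch stays the same.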

A table-driven approach (not described in any of the mentioned papers) would most likely work in any language. And generating a GLL parser that uses parse tables is feasible. The question becomes: what would those tables look like? Turning an LL(1) grammar into a table-driven parser results in the creation of two tables (actually one, but that table has variable-sized columns, so it is easier to chop it up into two). The smaller of the two has a size equal to N * T, where N is the number of nonterminals and T is the number of terminals. The larger of the two has as many entries as the sum of all symbols appearing on the rhs of productions (a size that is 'equal' to the size of the grammar).
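To make the two tables concrete, here is a minimal C sketch using a made-up LL(1) grammar S -> 'a' S | 'b' (this is not unicc output, just an illustration of the layout: an N * T prediction table plus a flat rhs table whose size matches the size of the grammar):

#include <stdio.h>

enum { T_A, T_B, NUM_T };               /* terminals 'a' and 'b' */
enum { NT_S, NUM_N };                   /* nonterminal S */
enum { SYM_A, SYM_B, SYM_S };           /* symbol codes used in the rhs table */

/* smaller table: N * T entries, value = production index (-1 would mean error) */
static const int predict[NUM_N][NUM_T] = {
    /* S: */ { 0, 1 }                   /* on 'a' use production 0, on 'b' production 1 */
};

/* larger table: all right-hand sides stored back to back */
static const int rhs[]       = { SYM_A, SYM_S,  SYM_B };
static const int rhs_start[] = { 0,             2 };
static const int rhs_len[]   = { 2,             1 };

int main(void)
{
    const char *input = "aab";
    int stack[64], sp = 0, pos = 0;

    stack[sp++] = SYM_S;                /* push the start symbol */

    while (sp > 0) {
        int top = stack[--sp];
        int ch  = input[pos];
        int tok = (ch == 'a') ? T_A : (ch == 'b') ? T_B : -1;

        if (top == SYM_S) {             /* nonterminal: expand via the tables */
            int prod;
            if (tok < 0 || (prod = predict[NT_S][tok]) < 0) {
                puts("syntax error");
                return 1;
            }
            for (int i = rhs_len[prod] - 1; i >= 0; i--)   /* push rhs reversed */
                stack[sp++] = rhs[rhs_start[prod] + i];
        } else {                        /* terminal: must match the input */
            if ((top == SYM_A && ch != 'a') || (top == SYM_B && ch != 'b')) {
                puts("syntax error");
                return 1;
            }
            pos++;
        }
    }

    puts(input[pos] == '\0' ? "accepted" : "trailing input");
    return 0;
}

A GLL generator would need more than this (the GSS and SPPF on top of the prediction machinery), but the shape of the tables could stay roughly the same.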

There is plenty of room to optimize the code generated by a GLL parser generator. The generated parser could restrict full GLL parsing to those parts of the input that actually need it; the rest could be parsed using much simpler LL(1) (or maybe LL(k)) parsing.

One thing that could give GLL parsing something of an edge over GLR parsing has to do with error handling/error recovery. Top-down parsers tend to do better in terms of 'ease' of implementing error handling/error recovery than bottom-up parsers. As GLL parsing is a top-down parsing technique, chances are that dealing with syntax errors in a GLL parser is 'easier' than dealing with syntax errors in a GLR parser.

phorward commented 6 years ago

Hello @FreeBASIC-programmer, and apologies again for another late answer; it's due to several busy things currently going on.

I'm glad to hear that you finished the FreeBASIC template; that would be awesome and will, indeed, provide two new target languages, C++ and BASIC, to UniCC v1.4 soon. As you noticed, you're right that you need to encode ">" and "<" with their XML counterparts; that's one of the reasons why I wrote an awk script that merges the original source files into the template, as is done for C.source and now also C++.source.

The output looks exactly as expected. The only thing that seems odd is the indentation of semantic action code: code following a #line directive does not get indented (at all). I checked whether the C template produced code with a different layout, but it laid out the code in the same manner. It's not a big 'issue' (it's hardly an issue at all).

Well, the indentation currently is done as stated in the template. Any whitespace in front of any variable that is replaced by UniCC during the parser's build time is copied in place. Maybe you need to remove the whitespace at the appropriate positions. I had a similar problem when implementing the Python template, and the problem is, well, currently not really fixed. So the grammar writer has to pay attention to how his code is indented. This is one of the issues that need to be fixed in a UniCC v2.

I'm really excited about your template; do you have anything to check out already, somewhere? Feel free to do a pull request and put the template right into the targets/ directory of the unicc git repo!

phorward commented 6 years ago

Thank you very much for your useful information on GLL parsing. I didn't know anything about it yet, but it sounds very interesting, and I've already started to read some papers on it.

Most things I've done so far have taken place in the area of bottom-up parsing; the only top-down thing I dealt with has been pynetree, which implements a left-recursive packrat parser. In my opinion, that is much too memory-consuming and slow, so I thought dealing with LR-related things was the better option. But GLL seems to be a way to make even LL parsers more attractive, when left recursion is supported. I really like it when a grammar can be defined once and produces the same parse trees, whether they are constructed the bottom-up or the top-down way.

I think the first thing I now have to do is to read the papers and understand the algorithm before I do anything about an implementation. In the past, I wanted to make phorward some kind of multi-algorithm parser generator, but didn't follow that approach for now. Maybe it becomes real with both GLL and GLR? But that means much work to do...

FreeBASIC-programmer commented 6 years ago

I read your replies over a week ago (maybe even longer), but for reasons unknown to me I did not respond right then and there.

When I first started using libphorward I was mostly interested in the scanner generator (plex). It seemed like a great way to get flex-like functionality without the need to hack the flex code to get it to produce FreeBASIC code (aside: with flex I mean the scanner generator flex). Using plex I got a C scanner working in no time. It was far from flawless, but that was mostly due to mistakes I made. After that I learned better ways to use plex, and my next C scanner is going to be spot on.

My initial plan was to use plex to turn C code into tokens and then feed (some of) those tokens one by one to a C parser generated by some parser generator. Unfortunately I had not considered using the parser generator that comes as part of libphorward.

As I was in need of a parser generator I turned my attention to unicc to generate a C parser. After creating the FreeBASIC template I realized that combining a lexer created using plex and a parser generated using unicc is not a good idea. And I felt that defining both lexer and parser using unicc isn't an option (at least not when parsing non-preprocessed C code).

Overall I'd say I feel much more comfortable using libphorward as opposed to using unicc. The fact that libphorward gives me access to both flex- and bison-like functionality without having to port a single line of code to FreeBASIC is great. Of course I have to create FreeBASIC bindings for libphorward, but that's fairly trivial to do.

Which is why I am going to be using libphorward for both the C scanner and the C parser. I will return to unicc (I am not going to throw away my FreeBASIC template) but for now I am sticking with libphorward.

I am also going to have a better look at the latest unicc manual to check it for typos; I was reading it the other day and I thought I saw some. And I'll have a good look at the libphorward manual (I saw some typos in that manual as well). I'll try to issue a pull request after I have made changes to the libphorward documentation (I'll be forking libphorward sometime later this week).

phorward commented 6 years ago

Hello @FreeBASIC-programmer, and thanks for your reply. It is OK with me if you use libphorward - I think it's the better choice. Anyway, once you have finished the FreeBASIC template for UniCC, I would be glad if you shared it with us, and maybe did a pull request enabling UniCC to build parsers for FreeBASIC or other BASIC dialects as well.

So long, Jan