jplevyak / dparser

A Scannerless GLR parser/parser generater.
https://github.com/jplevyak/dparser
BSD 3-Clause "New" or "Revised" License
105 stars 14 forks source link

Nondeterministic result with "if-then-else" expressions #31

Closed Valo13 closed 2 years ago

Valo13 commented 2 years ago

We have used dparser as the parser in our project for years now, but have recently discovered some non-deterministic behaviour from the parser, see this issue. This has to do with the combination of an "if then" binary operator and an "if then else" ternary operator in our language. The syntax for these operators used by dparser is defined in this file. Since we could not find support for ternary operators in dparser, the "if then else" operator was defined (after quite some experimentation, done before I personally joined the project) using two unary_op_right keywords, which has worked as desired for years until we found this specific case. I can understand that this might not be a valid way to define such an operator, but I think that the non-deterministic behaviour is concerning regardless. I would expect that either dparser parses the expression the same always or dparser returns an error saying that it has found an ambiguity.

To reproduce the non-deterministic behaviour, you need a specification file test.mcrl2 with the following contents:

proc P = true -> tau.true -> (tau.P) <> P;
init P;

Also you need the mcrl22lps tool. You can download a recent version here with some additional debug output that shows the parsed specification (which is done by adding std::cout << spec; right after this line).

Then if you run the command mcrl22lps --check-only test.mcrl2 repeatedly, you should see differences in the debug output between runs, namely that parentheses are placed differently in the parsed specification, either

proc P = true -> tau . (true -> tau . P <> P);

or

proc P = true -> tau . (true -> tau . P) <> P;

I'm not exactly sure which version of the dparser is used in our project by default, but I could reproduce this with the latest release of dparser.

Do you think this non-determinism can be fixed or do you have any advice on how to handle ternary operators in dparser?

jplevyak commented 2 years ago

There was a bug in the way priorities were compared and I have pushed a fix.

Your use of unary priorities for if-then-else is interesting. The intersection of precedence and GLR parsing is a bit troublesome since it is not always clear how to value two precedences in different subtrees and at different levels. I suppose I should do a paper search when I get some time and see if there is a simpler solution.

PTAL

On Thu, Feb 24, 2022 at 7:54 AM Olav Bunte @.***> wrote:

We have used dparser as the parser in our project https://github.com/mCRL2org/mCRL2 for years now, but have recently discovered some non-deterministic behaviour from the parser, see this issue https://github.com/mCRL2org/mCRL2/issues/1665. This has to do with the combination of an "if then" binary operator and an "if then else" ternary operator in our language. The syntax for these operators used by dparser is defined in this file https://github.com/mCRL2org/mCRL2/blob/9daba43f760670c75c0aee16080304fb1bfb7ca8/libraries/core/source/mcrl2_syntax.g#L183 . Since we could not find support for ternary operators in dparser, the "if then else" operator was defined (after quite some experimentation, done before I personally joined the project) using two unary_op_right keywords, which has worked as desired for years until we found this specific case. I can understand that this might not be a valid way to define such an operator, but I think that the non-deterministic behaviour is concerning regardless. I would expect that either dparser parses the expression the same always or dparser returns an error saying that it has found an ambiguity.

To reproduce the non-deterministic behaviour, you need a specification file test.mcrl2 with the following contents:

proc P = true -> tau.true -> (tau.P) <> P; init P;

Also you need the mcrl22lps tool. You can download a recent version here https://github.com/jplevyak/dparser/files/8133347/mcrl22lps.zip with some additional debug output that shows the parsed specification (which is done by adding std::cout << spec; right after this line https://github.com/mCRL2org/mCRL2/blob/9daba43f760670c75c0aee16080304fb1bfb7ca8/tools/release/mcrl22lps/mcrl22lps.cpp#L194 ).

Then if you run the command mcrl22lps --check-only test.mcrl2 repeatedly, you should see differences in the debug output between runs, namely that parentheses are placed differently in the parsed specification, either

proc P = true -> tau . (true -> tau . P <> P);

or

proc P = true -> tau . (true -> tau . P) <> P;

I'm not exactly sure which version of the dparser is used in our project by default, but I could reproduce this with the latest release of dparser.

Do you think this non-determinism can be fixed or do you have any advice on how to handle ternary operators in dparser?

— Reply to this email directly, view it on GitHub https://github.com/jplevyak/dparser/issues/31, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACEXIDWH64FBHIAD7KYU53U4ZICFANCNFSM5PHYUJPA . You are receiving this because you are subscribed to this thread.Message ID: @.***>

jplevyak commented 2 years ago

It is amazing to me that this project has been around so long and yet you have found another bug.

The way that I had intended the productions to be written is:

| DataExprUnit '->' ProcExpr $binary_right 5 | DataExprUnit '->' ProcExpr '<>' ProcExpr $right 6

However I noticed that you were using $unary_op_right so I tried the above and discovered a bug in the indexing into the priority comparison table.

That is also now fixed which should enable you to switch to the above simpler syntax.

Please tell me if this works for you or if you discover any other oddities.

Valo13 commented 2 years ago

Yeah we were surprised about this bug too. It is probably found this late on our end because the "if-then-else" operator isn't used often and because this is a rather specific situation (for instance, removing the parentheses around tau.P makes it deterministic).

I will try out the latest version of dparser with your fixes sometime soon. Thanks for the quick reply and suggestions!

jplevyak commented 2 years ago

I believe this is fixed. Please reopen if that is not the case.

Valo13 commented 2 years ago

I can't seem to be able to reopen the issue (the button doesn't appear), but I'd like to do so.

I used the latest version (1.32) of dparser with your fixes and now the parser gives an ambiguity error between the two previously non-determinically chosen options, so that is an improvement. However, when I adopted your suggested syntax rules, it gave the following error for the example:

[error]   Line 1, column 9: unexpected parse node!
[error]     proc P = true -> tau.true -> (tau.P) <> P;
[error]              ^
[error]   symbol      = ProcExpr
[error]   string      = true -> tau.true -> (tau.P) <> P
[error]   child_count = 3
[error]   child 0 = DataExprUnit true
[error]   child 1 = -> ->
[error]   child 2 = ProcExpr tau.true -> (tau.P) <> P

I also tried using | DataExprUnit ('->' $binary_right 5) ProcExpr, but that gave the following ambiguity errors instead

Ambiguity: (true -> (tau . (true -> (( (tau . P) )))) <> P)
Ambiguity: (true -> (tau . (true -> (( (tau . P) )) <> P)))
Ambiguity: ((true -> tau) . (true -> (( (tau . P) )) <> P))
[error]   Unresolved ambiguity.

Since this gave a different result, I was wondering whether we should change all our rules of the form E ('op' $binary_op_left n) E to E 'op' E $binary_op_left n, which is more in line with the structure of the syntax you suggested. What do you think?

As a side note, I had to remove #include strings.h from d.h to make it build on Windows.

jplevyak commented 2 years ago

I generally recommend that people use the least powerful parser which solves their problem. So LALR(1) yacc or Bison, PEG or if you need it, ANTLR which is LL(N).

DParser is a scannerless GLR parser which is, frankly, usually unnecessarily powerful. It makes it far too easy to shoot one's foot off, and it is far too permissive because it doesn't do any semantic checking of the grammar.

For example:

X : X ('+' $binary_op_right 4) X

is identical to

X: X '+' X $binary_right 4

and the latter is clearer.

X: 'true' $left 20

doesn't make sense unless you have

X: 'true' foo ....

as a production since priorities are only considered for productions which can otherwise conflict, and even then it doesn't make sense since if you gave the first production higher priority then the second would never be evaluated and if you give the second higher priority then it is redundant with the longest match disambiguation rule.

you have

DataExpr : 'true' $left 20

but all the rest of the productions are:

DataExpr: DataExpr ...

so that production will have to be reduced for 'true' in any case so the priority is unnecessary. Also,

| DataExpr '[' DataExpr '->' DataExpr ']' $unary_left 13

doesn't make sense because it is an NARY production. You could create a subproduction e.g.

CondIndex: '[' DataExpr '->' DataExpr ']';

then

| DataExr CondIndex $unary_left 13

or use an NARY priority:

| DataExpr '[' DataExpr '->' DataExpr ']' $left 13

Note that NARY productions with priority are always of the forms:

X: X ... $left N

or

X: ... X $right N

because that is all that makes sense. Similarly, binary expression should be of the form:

X: X op X $binary_X N;

DParser doesn't report any errors because it doesn't do a lot of semantic checking compared to something like ANTLR.

The trick you are using to handle if-then-else by breaking it up into unary operators would overly constrain the parse tree because the priorities are evaluated before there is an ambiguity and that can result in a failed parse. Instead you want to use the priorities only to eliminate ambiguities in recursive productions e.g. recursive unary and binary expressions.

Note that when you insert a parenthesized block e.g. ((DataExprUnit '->' $unary_op_right 6) ProcExpr, the parenthesized expression is rewritten as an anonymous production which means that that priority is meaningless since they apply only to those with the same LHS, and you have a $unary_op_X which is attached to a unary expression rather than an operator. You would need to make this ((DataExprUnit ('->' $unary_op_right 6))), but as I said it would not work since it is now in an anonymous production. You can see the de-sugared productions by running make_dparser with the -vv options.

If you have a regression suite I might be able to help you restructure your grammar.

Please find attached a draft version with the above changes.

On Mon, Mar 7, 2022 at 9:58 AM Olav Bunte @.***> wrote:

I can't seem to be able to reopen the issue (the button doesn't appear), but I'd like to do so.

I used the latest version (1.32) of dparser with your fixes and now the parser gives an ambiguity error between the two previously non-determinically chosen options, so that is an improvement. However, when I adopted your suggested syntax rules, it gave the following error for the example:

[error] Line 1, column 9: unexpected parse node! [error] proc P = true -> tau.true -> (tau.P) <> P; [error] ^ [error] symbol = ProcExpr [error] string = true -> tau.true -> (tau.P) <> P [error] child_count = 3 [error] child 0 = DataExprUnit true [error] child 1 = -> -> [error] child 2 = ProcExpr tau.true -> (tau.P) <> P

I also tried using | DataExprUnit ('->' $binary_right 5) ProcExpr, but that gave the following ambiguity errors instead

Ambiguity: (true -> (tau . (true -> (( (tau . P) )))) <> P) Ambiguity: (true -> (tau . (true -> (( (tau . P) )) <> P))) Ambiguity: ((true -> tau) . (true -> (( (tau . P) )) <> P)) [error] Unresolved ambiguity.

Since this gave a different result, I was wondering whether we should change all our rules of the form E ('op' $binary_op_left n) E to E 'op' E $binary_op_left n, which is more in line with the structure of the syntax you suggested. What do you think?

As a side note, I had to remove #include strings.h from d.h to make it build on Windows.

— Reply to this email directly, view it on GitHub https://github.com/jplevyak/dparser/issues/31#issuecomment-1060967388, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACEXIBJSXJY6A6RHLVPPB3U6Y7UPANCNFSM5PHYUJPA . You are receiving this because you modified the open/close state.Message ID: @.***>

jplevyak commented 2 years ago

I think what you want is:

(( proc ( P = ( true -> ( tau . ( true -> ( ( ( tau . ( P )) ) ) <> ( P )))) ; ))( init ( P ) ; ))

is that correct?

I pulled the starting production to the top so that I could test with:

./test_parser -v mc2.g mc.1

in the dparser directory. This will build the parsing tables for mc2.g and load them before parsing mc.1

On Mon, Mar 7, 2022 at 9:46 PM John Plevyak @.***> wrote:

I generally recommend that people use the least powerful parser which solves their problem. So LALR(1) yacc or Bison, PEG or if you need it, ANTLR which is LL(N).

DParser is a scannerless GLR parser which is, frankly, usually unnecessarily powerful. It makes it far too easy to shoot one's foot off, and it is far too permissive because it doesn't do any semantic checking of the grammar.

For example:

X : X ('+' $binary_op_right 4) X

is identical to

X: X '+' X $binary_right 4

and the latter is clearer.

X: 'true' $left 20

doesn't make sense unless you have

X: 'true' foo ....

as a production since priorities are only considered for productions which can otherwise conflict, and even then it doesn't make sense since if you gave the first production higher priority then the second would never be evaluated and if you give the second higher priority then it is redundant with the longest match disambiguation rule.

you have

DataExpr : 'true' $left 20

but all the rest of the productions are:

DataExpr: DataExpr ...

so that production will have to be reduced for 'true' in any case so the priority is unnecessary. Also,

| DataExpr '[' DataExpr '->' DataExpr ']' $unary_left 13

doesn't make sense because it is an NARY production. You could create a subproduction e.g.

CondIndex: '[' DataExpr '->' DataExpr ']';

then

| DataExr CondIndex $unary_left 13

or use an NARY priority:

| DataExpr '[' DataExpr '->' DataExpr ']' $left 13

Note that NARY productions with priority are always of the forms:

X: X ... $left N

or

X: ... X $right N

because that is all that makes sense. Similarly, binary expression should be of the form:

X: X op X $binary_X N;

DParser doesn't report any errors because it doesn't do a lot of semantic checking compared to something like ANTLR.

The trick you are using to handle if-then-else by breaking it up into unary operators would overly constrain the parse tree because the priorities are evaluated before there is an ambiguity and that can result in a failed parse. Instead you want to use the priorities only to eliminate ambiguities in recursive productions e.g. recursive unary and binary expressions.

Note that when you insert a parenthesized block e.g. ((DataExprUnit '->' $unary_op_right 6) ProcExpr, the parenthesized expression is rewritten as an anonymous production which means that that priority is meaningless since they apply only to those with the same LHS, and you have a $unary_op_X which is attached to a unary expression rather than an operator. You would need to make this ((DataExprUnit ('->' $unary_op_right 6))), but as I said it would not work since it is now in an anonymous production. You can see the de-sugared productions by running make_dparser with the -vv options.

If you have a regression suite I might be able to help you restructure your grammar.

Please find attached a draft version with the above changes.

On Mon, Mar 7, 2022 at 9:58 AM Olav Bunte @.***> wrote:

I can't seem to be able to reopen the issue (the button doesn't appear), but I'd like to do so.

I used the latest version (1.32) of dparser with your fixes and now the parser gives an ambiguity error between the two previously non-determinically chosen options, so that is an improvement. However, when I adopted your suggested syntax rules, it gave the following error for the example:

[error] Line 1, column 9: unexpected parse node! [error] proc P = true -> tau.true -> (tau.P) <> P; [error] ^ [error] symbol = ProcExpr [error] string = true -> tau.true -> (tau.P) <> P [error] child_count = 3 [error] child 0 = DataExprUnit true [error] child 1 = -> -> [error] child 2 = ProcExpr tau.true -> (tau.P) <> P

I also tried using | DataExprUnit ('->' $binary_right 5) ProcExpr, but that gave the following ambiguity errors instead

Ambiguity: (true -> (tau . (true -> (( (tau . P) )))) <> P) Ambiguity: (true -> (tau . (true -> (( (tau . P) )) <> P))) Ambiguity: ((true -> tau) . (true -> (( (tau . P) )) <> P)) [error] Unresolved ambiguity.

Since this gave a different result, I was wondering whether we should change all our rules of the form E ('op' $binary_op_left n) E to E 'op' E $binary_op_left n, which is more in line with the structure of the syntax you suggested. What do you think?

As a side note, I had to remove #include strings.h from d.h to make it build on Windows.

— Reply to this email directly, view it on GitHub https://github.com/jplevyak/dparser/issues/31#issuecomment-1060967388, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACEXIBJSXJY6A6RHLVPPB3U6Y7UPANCNFSM5PHYUJPA . You are receiving this because you modified the open/close state.Message ID: @.***>

Valo13 commented 2 years ago

Yes that is how I would expect the expression to be parsed. You mentioned an attachment, but I can't seem to find any (in GitHub or in my mail client).

I would need to discuss with the others how far we want to go with this. Significant changes to the grammer might result in significant changes needed in the toolset. An ad hoc solution we thought of was to simply force users to put parentheses when using the if then else operator, but I agree that fixing the syntax definition is a nicer solution.

Concerning why we use dparser, I asked around and was told that they did try to use BISON when they were searching for a good parser about 10 years ago, but felt like there was too much effort needed to get it working properly for the language (such as manual ambiguity fixes), while in dparser it was enough to define the grammar as it is defined now, using grammar rules that are rather easy to read.

But at least it seems that the error lies more on our side now than on the side of dparser, though I still wonder why it gave the error it did (especially why child 1 = -> ->, I would expect it to be child 1 = ->, but I am not the expert here).

jplevyak commented 2 years ago

The file seems to have been stripped. Here it is mc2.g.txt .

jplevyak commented 2 years ago

Note: github is pedantic about extensions and will not allow .g files.

jplevyak commented 2 years ago

I added all of your examples as tests and your grammar to dparser (some are disabled because the statistics are non-deterministic because of the way the nodes are hashed). This is with your original grammar (with the Start production at the top). I also added the example you provided above and it passes as well.

PTAL

It is probably not worth trying to change your grammar since you have one which works for you. However if I substitute:

| DataExprUnit '->' ProcExpr $right 5

into the mc2.g that I sent you that grammar also passes all the tests (although the stats are different).

I had this as $binary_right, but it isn't binary since the LHS of the operator isn't ProcExpr but DataExprUnit.

Valo13 commented 2 years ago

Your comment (and commit) seem to insinuate that the original example parses properly on the original grammar, but when I put dparser 1.32 or the latest source in our project it says that there is an ambiguity, namely

Ambiguity: (((true ->) (tau . ((true ->) (( (tau . P) )))) <>) P)
Ambiguity: ((true ->) (tau . (((true ->) (( (tau . P) )) <>) P)))

This ambiguity error is fine to me, I'm just curious why there seems to be a different result for this example on your end.

But yeah, changing our grammar is probably not worth it right now, but I'll add a "long-term" issue to our project for improvements of the grammar with your suggestions in it.

jplevyak commented 2 years ago

That is odd. The latest code contains your parser and this test and the result:

https://github.com/jplevyak/dparser/blob/master/tests/mcrl2_syntax.test.g https://github.com/jplevyak/dparser/blob/master/tests/mcrl2_syntax.test.g.1

and the correct result which passes:

https://github.com/jplevyak/dparser/blob/master/tests/mcrl2_syntax.test.g.1.check

72 states 45 scans 45 shifts 45 reductions 1 compares 0 ambiguities (( proc ( P = (( true -> )( tau . ((( true -> )( ( ( tau . ( P )) ) ) <> )( P )))) ; ))( init ( P ) ; ))

which is to say that there is no ambiguity AFAICT.

Perhaps you can tell me how to reproduce you result?

jplevyak commented 2 years ago

Ah, you are setting

dont_use_greediness_for_disambiguation dont_use_height_for_disambiguation

Let me investigate.

Valo13 commented 2 years ago

Well, I can give you a newer version of the mcrl22lps tool that uses the latest version of dparser, but I don't think you can extract much info from that. Also it seems you may already have found the culprit. I don't know why those settings are set, but I could ask around. I wonder how the parsing of mCRL2 specs would change if we would not set those settings.

jplevyak commented 2 years ago

OK, I set your options in the tests. My previous fix didn't work for your test case and I didn't notice because my tests we falling back to disambiguation via greediness. The particular feature you are leaning on is deep priority comparisons which occur when there are two possible parses which are both legal in terms of parent-child priorities, but one might be preferred because it includes high priority operations deeper in the tree. Of all the bits of disambiguation this is perhaps the least well defined and I need to do some research and define the ordering formally.

However, in the short term your test now passes with your options. PTAL

Valo13 commented 2 years ago

I have tested the latest version of dparser and the original example now parses deterministically on my end too. In my eyes this issue is closed.

Thanks a lot for your effort!

I will add your suggestions towards improving the grammar definition to a (long term, enhancement) issue in our project.