Open GreyCat opened 7 years ago
It turns out that things are not as easy as they look like at the first glance. I've tried a naïve approach, introducing a "priority" integer that gets compared to that's inside. Literals get priority of 0, exponentiation gets 20, multiplication gets 40, addition gets 50, etc. Various translators for various target languages may output different values, as some languages have slightly different ideas on operator priorities. A very simple example ~5**2
gets evaluated differently:
(~5)**2
= 36~(5**2)
= -25However, things get ugly fast, as there's also operator's associativity. For example:
1 - 2 - 3
gets parsed as ((1 - 2) - 3)
(1 - 2) - 3
gets parsed as ((1 - 2) - 3)
1 - (2 - 3)
gets parsed as (1 - (2 - 3))
Current priority-only based algorithm would translate all these examples to 1 - 2 - 3
, as they're all of equal priority, and thus no parenthesis is needed. This is true for associative operations (i.e. (1 + 2) + 3
= 1 + (2 + 3)
, so it's somewhat safe to translate that to 1 + 2 + 3
), but not true for operators like -
, /
, etc.
Moreover, I'm not totally sure that it's a good idea to omit parenthesis in 1 + (2 + 3)
translation as well. It may be correct from arithmetic POV, but I'm not sure if it's a good idea to take so much liberty with original expression author's intent.
I would not omit parentheses as they can make the generated code more understandable when the author uses them to separate logically different parts.
I've started a comparison table on priorities in different languages:
https://docs.google.com/spreadsheets/d/13Cs3SidXHJVDCqydHCBwnk3uQ2FjGmHsjZa6SHkrE34/edit?usp=sharing
I would not omit parentheses as they can make the generated code more understandable when the author uses them to separate logically different parts.
Unfortunately, it's not completely possible. From the AST point of view:
(1 + 2) + 3
1 + 2 + 3
((((1 + 2)) + 3))
are translated to exactly the same
BinOp(
BinOp(
IntNum(1),
Add,
IntNum(2)
),
Add,
IntNum(3)
)
After we've got AST, it's not possible to distinguish them.
They could be
(1 + 2) + 3
BinOp(
Paren(
BinOp(
IntNum(1),
Add,
IntNum(2)
)
),
Add,
IntNum(3)
)
1 + 2 + 3
BinOp(
BinOp(
IntNum(1),
Add,
IntNum(2)
),
Add,
IntNum(3)
)
((((1 + 2)) + 3))
Paren(
Paren(
BinOp(
Paren(
Paren(
BinOp(
IntNum(1),
Add,
IntNum(2)
)
)
)
Add,
IntNum(3)
)
)
)
But I don't know whether this is the proper way to do it, or is it worth it at first place...
Well, it's generally not how AST parsing works. More or less, the general idea is to eliminate parenthesis and all these operator application priority steps, and that's actually what we need, as we can't rely on particular operator priorities (and parenthesis application) of a single language, as they can be different for different targets...
Yes, but we can interpret parentheses in ksy expressions as "force parenthesis here" operators. I cannot think of any case where these "forced" parentheses would change the meaning of the expression in any of the target languages.
Of course sometimes we need to put additional parentheses into the target language which won't be in the AST where the target language's operator precedence does not match with the .ksy expression precedence.
So the AST would still describe the same expression, but would add support for additional used-only-for-clarity-parentheses.
Or this could be even an optional compiler option: keep original parentheses from ksy or not.
I've pushed what I have now into distinct branch optimal_expressions, so anyone can track my progress.
https://github.com/kaitai-io/kaitai_struct_compiler/pull/277 was merged, so I would continue using this issue to track remaining problems, this is roughly the plan:
I've started playing with comparisons, and immediately figured out that it's not so simple even with very basic set of comparison operators. Some languages have all of comparison operators on same level of precedence (e.g. Python), some languages have ==
and !=
as higher precedence (e.g. C++, Java, and many others).
A simple test: 1 < 2 == 3 < 4
.
An also interesting bit is that modern gcc issues a warning on this:
expr_run.cpp: In function 'int main()':
expr_run.cpp:3:25: warning: suggest parentheses around comparison in operand of '==' [-Wparentheses]
3 | std::cout << (1 < 2 == 3 < 4) << std::endl;
| ~~^~~
So, it looks like at least for Python (and likely for C++ too to avoid the warning) we need to modify generation to parenthesize. (1 < 2) == (3 < 4)
yields correct "True" result in Python.
A few other data points:
Go blows up building 1 < 2 == 3 < 4
with the following error, parenthesized version works as expected (returning true):
./expr_run.go:4:23: invalid operation: 1 < 2 == 3 (mismatched types untyped bool and untyped int)
Lua blows up on 1 < 2 == 3 < 4
with the following error, parenthesized version works as expected (returning true):
(command line):1: attempt to compare boolean with number
Nim blows up on 1 < 2 == 3 < 4
with the following error, parenthesized version works as expected (returning true):
/work/expr_run.nim(1, 12) Error: type mismatch: got <bool, int literal(3)>
but expected one of:
proc `==`(x, y: bool): bool
first type mismatch at position: 2
required type for y: bool
but expression '3' is of type: int literal(3)
Perl works the same on unparenthesized and parenthesized versions (as many other C-based languages), returning !!1
(its equivalent of true).
So, looks like we'll have to build that distinction in the custom per-language precedence tables.
While experimenting with expression, I've whipped up a tool all-expression that allows me to quickly test some ideas in many languages we support:
$ ./all-expression -h
Usage: ./all-expression [OPTIONS] EXPRESSION"
Interpretes EXPRESSION in various programming languages using Docker images
and prints the result.
Options:
--parallel, -p Run all targets in parallel
--help, -h Show this help message
$ ./all-expression '1 + 2'
* C++ (gcc11): 3
* C++ (clang11): 3
* Go: 3
* Java: 3
* JavaScript: 3
* Lua: 3
* Nim: 3
* Perl: $VAR1 = 3;
* PHP: int(3)
* Python: 3
* Ruby: 3
Right now, generation of target code by translators yield more or less spontaneous parenthesis generation. In some cases, it proves to be unoptimal (i.e. extra parenthesis are added when there's no need for them) and sometimes it yields wrong results, i.e.:
A few examples that should probably provide more optimal translation:
1 + 2
(1 + 2)
1 + 2
1 + 2 + 5
((1 + 2) + 5)
1 + 2 + 5
(1 + 2) / (7 * 8)
((1 + 2) / (7 * 8))
(1 + 2) / (7 * 8)
I propose to:
Hopefully, it would also provide an ultimate answer to all these bugs above as well.