Optimal operator priority pretty-printing in translators

GreyCat commented 7 years ago

Right now, generation of target code by translators yield more or less spontaneous parenthesis generation. In some cases, it proves to be unoptimal (i.e. extra parenthesis are added when there's no need for them) and sometimes it yields wrong results, i.e.:

A few examples that should probably provide more optimal translation:

1 + 2
- now yields (1 + 2)
- should be 1 + 2
1 + 2 + 5
- now yields ((1 + 2) + 5)
- should be 1 + 2 + 5
(1 + 2) / (7 * 8)
- now yields ((1 + 2) / (7 * 8))
- should be (1 + 2) / (7 * 8)

I propose to:

Define formally the order of operators used in our expression language
Fix translators in some way that would produce optimal translation for every target language; this is intrivial, as every language has somewhat different idea of operator priorities and availability.

Hopefully, it would also provide an ultimate answer to all these bugs above as well.

GreyCat commented 7 years ago

It turns out that things are not as easy as they look like at the first glance. I've tried a naïve approach, introducing a "priority" integer that gets compared to that's inside. Literals get priority of 0, exponentiation gets 20, multiplication gets 40, addition gets 50, etc. Various translators for various target languages may output different values, as some languages have slightly different ideas on operator priorities. A very simple example ~5**2 gets evaluated differently:

in Ruby: (~5)**2 = 36
in Python: ~(5**2) = -25

However, things get ugly fast, as there's also operator's associativity. For example:

1 - 2 - 3 gets parsed as ((1 - 2) - 3)
(1 - 2) - 3 gets parsed as ((1 - 2) - 3)
1 - (2 - 3) gets parsed as (1 - (2 - 3))

Current priority-only based algorithm would translate all these examples to 1 - 2 - 3, as they're all of equal priority, and thus no parenthesis is needed. This is true for associative operations (i.e. (1 + 2) + 3 = 1 + (2 + 3), so it's somewhat safe to translate that to 1 + 2 + 3), but not true for operators like -, /, etc.

Moreover, I'm not totally sure that it's a good idea to omit parenthesis in 1 + (2 + 3) translation as well. It may be correct from arithmetic POV, but I'm not sure if it's a good idea to take so much liberty with original expression author's intent.

koczkatamas commented 7 years ago

I would not omit parentheses as they can make the generated code more understandable when the author uses them to separate logically different parts.

GreyCat commented 7 years ago

I've started a comparison table on priorities in different languages:

https://docs.google.com/spreadsheets/d/13Cs3SidXHJVDCqydHCBwnk3uQ2FjGmHsjZa6SHkrE34/edit?usp=sharing

GreyCat commented 7 years ago

I would not omit parentheses as they can make the generated code more understandable when the author uses them to separate logically different parts.

Unfortunately, it's not completely possible. From the AST point of view:

(1 + 2) + 3
1 + 2 + 3
((((1 + 2)) + 3))
etc

are translated to exactly the same

BinOp(
  BinOp(
    IntNum(1),
    Add,
    IntNum(2)
  ),
  Add,
  IntNum(3)
)

After we've got AST, it's not possible to distinguish them.

koczkatamas commented 7 years ago

They could be

(1 + 2) + 3

BinOp(
  Paren(
    BinOp(
      IntNum(1),
      Add,
      IntNum(2)
    )
  ),
  Add,
  IntNum(3)
)

1 + 2 + 3

BinOp(
  BinOp(
    IntNum(1),
    Add,
    IntNum(2)
  ),
  Add,
  IntNum(3)
)

((((1 + 2)) + 3))

Paren(
  Paren(
    BinOp(
      Paren(
        Paren(
          BinOp(
            IntNum(1),
            Add,
            IntNum(2)
          )
        )
      )
      Add,
      IntNum(3)
    )
  )
)

But I don't know whether this is the proper way to do it, or is it worth it at first place...

GreyCat commented 7 years ago

Well, it's generally not how AST parsing works. More or less, the general idea is to eliminate parenthesis and all these operator application priority steps, and that's actually what we need, as we can't rely on particular operator priorities (and parenthesis application) of a single language, as they can be different for different targets...

koczkatamas commented 7 years ago

Yes, but we can interpret parentheses in ksy expressions as "force parenthesis here" operators. I cannot think of any case where these "forced" parentheses would change the meaning of the expression in any of the target languages.

Of course sometimes we need to put additional parentheses into the target language which won't be in the AST where the target language's operator precedence does not match with the .ksy expression precedence.

So the AST would still describe the same expression, but would add support for additional used-only-for-clarity-parentheses.

Or this could be even an optional compiler option: keep original parentheses from ksy or not.

GreyCat commented 7 years ago

I've pushed what I have now into distinct branch optimal_expressions, so anyone can track my progress.

GreyCat commented 7 months ago

https://github.com/kaitai-io/kaitai_struct_compiler/pull/277 was merged, so I would continue using this issue to track remaining problems, this is roughly the plan:

comparison operators and boolean operators
ternary operator
integer operations with extreme integer constants
unary operators

I've started playing with comparisons, and immediately figured out that it's not so simple even with very basic set of comparison operators. Some languages have all of comparison operators on same level of precedence (e.g. Python), some languages have == and != as higher precedence (e.g. C++, Java, and many others).

A simple test: 1 < 2 == 3 < 4.

C++ (gcc11): 1
C++ (clang14): 1
Java: true
JavaScript: true
PHP: bool(true)
Python: False
Ruby: true

An also interesting bit is that modern gcc issues a warning on this:

expr_run.cpp: In function 'int main()':
expr_run.cpp:3:25: warning: suggest parentheses around comparison in operand of '==' [-Wparentheses]
    3 |         std::cout << (1 < 2 == 3 < 4) << std::endl;
      |                       ~~^~~

So, it looks like at least for Python (and likely for C++ too to avoid the warning) we need to modify generation to parenthesize. (1 < 2) == (3 < 4) yields correct "True" result in Python.

GreyCat commented 7 months ago

A few other data points:

Go blows up building 1 < 2 == 3 < 4 with the following error, parenthesized version works as expected (returning true):
```
./expr_run.go:4:23: invalid operation: 1 < 2 == 3 (mismatched types untyped bool and untyped int)
```
Lua blows up on 1 < 2 == 3 < 4 with the following error, parenthesized version works as expected (returning true):
```
(command line):1: attempt to compare boolean with number
```

Nim blows up on 1 < 2 == 3 < 4 with the following error, parenthesized version works as expected (returning true):

/work/expr_run.nim(1, 12) Error: type mismatch: got <bool, int literal(3)>
but expected one of:
proc `==`(x, y: bool): bool
first type mismatch at position: 2
required type for y: bool
but expression '3' is of type: int literal(3)

Perl works the same on unparenthesized and parenthesized versions (as many other C-based languages), returning !!1 (its equivalent of true).

So, looks like we'll have to build that distinction in the custom per-language precedence tables.

GreyCat commented 7 months ago

While experimenting with expression, I've whipped up a tool all-expression that allows me to quickly test some ideas in many languages we support:

$  ./all-expression -h
Usage: ./all-expression [OPTIONS] EXPRESSION"

Interpretes EXPRESSION in various programming languages using Docker images
and prints the result.

Options:
  --parallel, -p  Run all targets in parallel
  --help, -h      Show this help message

$ ./all-expression '1 + 2'
* C++ (gcc11): 3
* C++ (clang11): 3
* Go: 3
* Java: 3
* JavaScript: 3
* Lua: 3
* Nim: 3
* Perl: $VAR1 = 3;
* PHP: int(3)
* Python: 3
* Ruby: 3

kaitai-io / kaitai_struct

Optimal operator priority pretty-printing in translators #69