kaitai-io / kaitai_struct

Kaitai Struct: declarative language to generate binary data parsers in C++ / C# / Go / Java / JavaScript / Lua / Nim / Perl / PHP / Python / Ruby
https://kaitai.io
4.02k stars 197 forks source link

Optimal operator priority pretty-printing in translators #69

Open GreyCat opened 7 years ago

GreyCat commented 7 years ago

Right now, generation of target code by translators yield more or less spontaneous parenthesis generation. In some cases, it proves to be unoptimal (i.e. extra parenthesis are added when there's no need for them) and sometimes it yields wrong results, i.e.:

A few examples that should probably provide more optimal translation:

I propose to:

  1. Define formally the order of operators used in our expression language
  2. Fix translators in some way that would produce optimal translation for every target language; this is intrivial, as every language has somewhat different idea of operator priorities and availability.

Hopefully, it would also provide an ultimate answer to all these bugs above as well.

GreyCat commented 7 years ago

It turns out that things are not as easy as they look like at the first glance. I've tried a naïve approach, introducing a "priority" integer that gets compared to that's inside. Literals get priority of 0, exponentiation gets 20, multiplication gets 40, addition gets 50, etc. Various translators for various target languages may output different values, as some languages have slightly different ideas on operator priorities. A very simple example ~5**2 gets evaluated differently:

However, things get ugly fast, as there's also operator's associativity. For example:

Current priority-only based algorithm would translate all these examples to 1 - 2 - 3, as they're all of equal priority, and thus no parenthesis is needed. This is true for associative operations (i.e. (1 + 2) + 3 = 1 + (2 + 3), so it's somewhat safe to translate that to 1 + 2 + 3), but not true for operators like -, /, etc.

Moreover, I'm not totally sure that it's a good idea to omit parenthesis in 1 + (2 + 3) translation as well. It may be correct from arithmetic POV, but I'm not sure if it's a good idea to take so much liberty with original expression author's intent.

koczkatamas commented 7 years ago

I would not omit parentheses as they can make the generated code more understandable when the author uses them to separate logically different parts.

GreyCat commented 7 years ago

I've started a comparison table on priorities in different languages:

https://docs.google.com/spreadsheets/d/13Cs3SidXHJVDCqydHCBwnk3uQ2FjGmHsjZa6SHkrE34/edit?usp=sharing

GreyCat commented 7 years ago

I would not omit parentheses as they can make the generated code more understandable when the author uses them to separate logically different parts.

Unfortunately, it's not completely possible. From the AST point of view:

are translated to exactly the same

BinOp(
  BinOp(
    IntNum(1),
    Add,
    IntNum(2)
  ),
  Add,
  IntNum(3)
)

After we've got AST, it's not possible to distinguish them.

koczkatamas commented 7 years ago

They could be

BinOp(
  Paren(
    BinOp(
      IntNum(1),
      Add,
      IntNum(2)
    )
  ),
  Add,
  IntNum(3)
)
BinOp(
  BinOp(
    IntNum(1),
    Add,
    IntNum(2)
  ),
  Add,
  IntNum(3)
)
Paren(
  Paren(
    BinOp(
      Paren(
        Paren(
          BinOp(
            IntNum(1),
            Add,
            IntNum(2)
          )
        )
      )
      Add,
      IntNum(3)
    )
  )
)

But I don't know whether this is the proper way to do it, or is it worth it at first place...

GreyCat commented 7 years ago

Well, it's generally not how AST parsing works. More or less, the general idea is to eliminate parenthesis and all these operator application priority steps, and that's actually what we need, as we can't rely on particular operator priorities (and parenthesis application) of a single language, as they can be different for different targets...

koczkatamas commented 7 years ago

Yes, but we can interpret parentheses in ksy expressions as "force parenthesis here" operators. I cannot think of any case where these "forced" parentheses would change the meaning of the expression in any of the target languages.

Of course sometimes we need to put additional parentheses into the target language which won't be in the AST where the target language's operator precedence does not match with the .ksy expression precedence.

So the AST would still describe the same expression, but would add support for additional used-only-for-clarity-parentheses.

Or this could be even an optional compiler option: keep original parentheses from ksy or not.

GreyCat commented 7 years ago

I've pushed what I have now into distinct branch optimal_expressions, so anyone can track my progress.

GreyCat commented 7 months ago

https://github.com/kaitai-io/kaitai_struct_compiler/pull/277 was merged, so I would continue using this issue to track remaining problems, this is roughly the plan:

I've started playing with comparisons, and immediately figured out that it's not so simple even with very basic set of comparison operators. Some languages have all of comparison operators on same level of precedence (e.g. Python), some languages have == and != as higher precedence (e.g. C++, Java, and many others).

A simple test: 1 < 2 == 3 < 4.

An also interesting bit is that modern gcc issues a warning on this:

expr_run.cpp: In function 'int main()':
expr_run.cpp:3:25: warning: suggest parentheses around comparison in operand of '==' [-Wparentheses]
    3 |         std::cout << (1 < 2 == 3 < 4) << std::endl;
      |                       ~~^~~

So, it looks like at least for Python (and likely for C++ too to avoid the warning) we need to modify generation to parenthesize. (1 < 2) == (3 < 4) yields correct "True" result in Python.

GreyCat commented 7 months ago

A few other data points:

So, looks like we'll have to build that distinction in the custom per-language precedence tables.

GreyCat commented 7 months ago

While experimenting with expression, I've whipped up a tool all-expression that allows me to quickly test some ideas in many languages we support:

$  ./all-expression -h
Usage: ./all-expression [OPTIONS] EXPRESSION"

Interpretes EXPRESSION in various programming languages using Docker images
and prints the result.

Options:
  --parallel, -p  Run all targets in parallel
  --help, -h      Show this help message

$ ./all-expression '1 + 2'
* C++ (gcc11): 3
* C++ (clang11): 3
* Go: 3
* Java: 3
* JavaScript: 3
* Lua: 3
* Nim: 3
* Perl: $VAR1 = 3;
* PHP: int(3)
* Python: 3
* Ruby: 3