Closed torhovland closed 1 year ago
@tbagrel1 you may want to comment on what you did in Ormolu.
In Ormolu, I dealt with operator chains (either term-level or type-level operators), like 1 + 2 * 3 + 4 + 5 or Int :*: Int :+: Int. I decided to build an n-ary tree in which all operations of the same priority are located on the same tree level. So I first flattened the tree, and then (recursively) split on the lowest-priority operator at a given level, or introduced a subtree when adjacent leaves are "joined" by a higher-priority operator (see https://github.com/tweag/ormolu/blob/master/src/Ormolu/Printer/Meat/Declaration/OpTree.hs#L48). The two actions, "split" and "group", give the same result when operator priority is a total order, but they are complementary when we only have a partial order on priority/fixity.
.+.*.+.*.          .+...+..
| | | | |   -->    |   |     |
1 2 3 4 5          1   *     *   (this can be either "split on +" or "group on *")
                      / \   / \
                     2   3 4   5
And then each level of the tree uses the upper bound of the original "line discipline" of its nodes (where multiline > singleline). So basically I dropped recursion in favor of a more global/omniscient analysis of the operator tree.
Feel free to ask if you want more details about what I implemented in Ormolu.
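The "split" step described above can be sketched roughly as follows. This is only an illustration, not Ormolu's actual code: the PRECEDENCE table and the split_chain function are hypothetical names, the table assumes a total order on priorities, and the "group" step needed for partial orders is left out.

```python
# Hypothetical precedence table (higher number binds tighter).
# Assumption: priorities form a total order, so "split" alone is enough.
PRECEDENCE = {"+": 6, "*": 7}


def split_chain(operands, operators):
    """Turn a flat operator chain into an n-ary tree.

    `operands` has one more element than `operators`. Each tree level holds
    the operators sharing the lowest priority found in that part of the chain.
    """
    if not operators:
        return operands[0]
    lowest = min(PRECEDENCE[op] for op in operators)
    children, child_ops = [], []
    start = 0
    for i, op in enumerate(operators):
        if PRECEDENCE[op] == lowest:
            # Everything between two lowest-priority operators becomes a subtree.
            children.append(split_chain(operands[start:i + 1], operators[start:i]))
            child_ops.append(op)
            start = i + 1
    children.append(split_chain(operands[start:], operators[start:]))
    return ("chain", children, child_ops)


# The chain from the diagram: 1 + 2 * 3 + 4 * 5
tree = split_chain([1, 2, 3, 4, 5], ["+", "*", "+", "*"])
```

Here the top level contains 1 and the two "*" subtrees, joined by the two "+" operators, which matches the right-hand side of the diagram.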
Thanks for the explanation. In tree-sitter-formatter, we don't have the "luxury" of knowing anything about the rules of the language, such as operator precedence. We only know the syntax tree that tree-sitter gives us. So I think the best we can do is to choose between single-line and full multi-line formatting.
The problem reported above is that this may turn out oddly for recursive chains, because the inner terms may be single-line, while the outer terms are multi-line. So we may end up with something like this:
1
+ 2
* 3 + 4 + 5
But we should be able to solve this by making multi-line detection a little smarter.
> Thanks for the explanation. In tree-sitter-formatter, we don't have the "luxury" of knowing anything about the rules of the language, such as operator precedence.
It is mostly externally configured in Ormolu, actually, because the precedence of operators is configurable in Haskell. Ormolu does have the advantage of knowing where to look to populate the configuration with sensible defaults, though.
I see. It might be possible to do something on the configuration level in the query files (where we specify indent level). But that would have to go in a separate issue. The first step (this issue) is to at least have a consistent multi-line formatting of recursive chains.
Agreed.
I mean, even without knowing anything about fixity/priority, you could still decide that the function arrow has a lower priority than anything else.
For term-level operators, it is also possible to make a base of sensible defaults (at least for arithmetic and boolean operators).
I don't see how you would get something like this:
1
+ 2
* 3 + 4 + 5
If you don't know anything about the operator priorities, then the n-ary tree stays "flattened" (that is exactly how it works in ormolu when unknown operators are used), and then we use the upper bound of the "line discipline" of the nodes; that is, multiline in this case.
So we would get
1
+ 2
* 3
+ 4
+ 5
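A minimal sketch of that fallback, assuming each operand carries a flag saying whether it was multi-line in the input (the render_flat_chain name and the boolean flags are made up for illustration, and this is not how Ormolu represents line discipline internally):

```python
def render_flat_chain(operands, operators, operand_is_multiline):
    """Render a flat (unsplit) operator chain.

    The chain takes the upper bound of its nodes' line disciplines
    (multiline > singleline): one multi-line node makes the whole level
    multi-line, so every operator starts its own line.
    """
    multiline = any(operand_is_multiline)
    separator = "\n" if multiline else " "
    out = operands[0]
    for op, operand in zip(operators, operands[1:]):
        out += f"{separator}{op} {operand}"
    return out


operands = ["1", "2", "3", "4", "5"]
operators = ["+", "*", "+", "+"]
print(render_flat_chain(operands, operators, [True, False, False, False, False]))
```

With all flags set to False, the same call keeps the chain on a single line.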
> I mean, even without knowing anything about fixity/priority, you could still decide that the function arrow has a lower priority than anything else.
In our formatter, "->" is just an anonymous node that happens to be provided by the OCaml grammar. In other grammars, it would not be a valid node. In the OCaml grammar file, we can say that we'd like soft line breaks after "->", but we cannot easily say that "->" has a lower priority than "+".
> I don't see how you would get something like this:
> 1 + 2 * 3 + 4 + 5
If the grammar gave us a tree like the following, it would be fine:
But instead it gives us something like this:
But if you look at your example, you might end up with
(?used_slot: bool ref ->
override_flag ->
Env.t ->
Location.t ->
Longident.t loc ->
Path.t *
Env.t)
if both * and -> are treated with the same (unknown) priority. So you might not want a completely consistent formatting at the end of the day.
Yes, this is true. So being able to specify operator precedence in the query file, and honouring that when formatting, would be very useful.
About your last message: the GHC parser first gives a binary tree, which Ormolu transforms into a (flattened) n-ary tree. This n-ary tree is then processed (split and group) and analyzed to determine the printing discipline.
Yes, I actually already read that 🙂
Our challenge is that we have to make do with the tree that the specific grammar gives us. The OCaml grammar is yielding these recursive chains, while a different grammar for another language might give us something more tree-like.
Could you really get something very different from either a binary tree or an n-ary tree?
> Our challenge is that we have to make do with the tree that the specific grammar gives us. The OCaml grammar is yielding these recursive chains, while a different grammar for another language might give us something more tree-like.
Without considering precedence for now, we may consider having a query thingy (the @property things, I don't know what they are called) saying that we want to collapse a chain into a list somehow. When I find an a + b, I mark it as @collapse_chain, and somehow the engine treats the entire chain as a single node.
We might have to combine that with the input line break detection, i.e. the same that we do for figuring out if a comment is supposed to be at the end of a line or on the next line. So that
something_short + something_else_short
can be kept on a single line if it is on a single line in the input, while
something_very_very_very_long
+ something_else_very_very_very_long
can be split if it is split in the input.
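As a rough sketch of how collapsing a chain and input line-break detection could fit together: the Leaf and BinOp classes and the collapse_chain and format_chain functions below are hypothetical stand-ins, not the formatter's actual data structures or API, and the sketch ignores precedence entirely.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    text: str
    line: int  # input line on which this operand appears


@dataclass
class BinOp:
    left: Union["BinOp", Leaf]
    op: str
    right: Union["BinOp", Leaf]


def collapse_chain(node):
    """Flatten a nested chain of binary nodes into (operands, operators)."""
    if isinstance(node, Leaf):
        return [node], []
    operands, operators = collapse_chain(node.left)
    right_operands, right_operators = collapse_chain(node.right)
    return operands + right_operands, operators + [node.op] + right_operators


def format_chain(node):
    """Treat the whole chain as one node: all on one line, or fully split."""
    operands, operators = collapse_chain(node)
    # Input line-break detection: did the chain span several input lines?
    spans_lines = len({leaf.line for leaf in operands}) > 1
    separator = "\n" if spans_lines else " "
    out = operands[0].text
    for op, leaf in zip(operators, operands[1:]):
        out += f"{separator}{op} {leaf.text}"
    return out


# (a + b) + c, where c sits on a second input line
chain = BinOp(BinOp(Leaf("a", 1), "+", Leaf("b", 1)), "+", Leaf("c", 2))
print(format_chain(chain))
```

Because the decision is made once for the collapsed chain, the inner a + b can no longer stay on one line while the outer node is split.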
Note that this issue also affects formatting of tuples. For instance,
(
1, 2,
3, 4
)
is formatted as
(
1, 2,
3,
4
)
This is because the grammar parses tuples as left-associative nested pairs.
The formatter currently does this:
rather than this:
This is because the tree is recursive like this:
So the innermost functions aren't multi-line, hence won't get formatted as such.
When deciding if a node is multi-line, we should check if there is a line of same-type ancestors (such as function_type), where at least one of them is multi-line.
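A minimal sketch of that check, assuming a simplified node type with a kind, a line span, and a parent pointer (the Node class and is_multiline function here are illustrative, not the formatter's real types):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    kind: str
    start_line: int
    end_line: int
    parent: Optional["Node"] = None


def is_multiline(node):
    """A node counts as multi-line if it, or any ancestor in an unbroken
    line of same-kind ancestors, spans more than one input line."""
    current = node
    while current is not None and current.kind == node.kind:
        if current.end_line > current.start_line:
            return True
        current = current.parent
    return False


# Three nested function_type nodes; only the innermost fits on one line.
outer = Node("function_type", 1, 3)
middle = Node("function_type", 2, 3, parent=outer)
inner = Node("function_type", 3, 3, parent=middle)
print(is_multiline(inner))
```

The walk stops at the first ancestor of a different kind, so a single-line node inside an unrelated multi-line parent is still treated as single-line.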