Open Manahip opened 2 years ago
@awelc Does move-analyzer plan to support the textDocument/formatting PSL protocol, and starcoin-ide also wants to integrate this function?
I've been meddling with Move ASTs in my Move-to-TypeScript transpiler. I can probably do a formatter myself. Do you guys know where I could get a grant to speed things up? We're not raising yet so any bit of cash would help.
@Manahip You can participate in https://github.com/starcoinorg/startrek organized by starcoin, there will be bonuses after graduation.
@Manahip I found a project that has implemented move-fmt, repository: https://github.com/victoryang00/move-fmt
@LemonHX Do you know the author?
The move-fmt project is for Move IR, an older (and simpler) instance of Move.
The move-fmt project is for Move IR, an older (and simpler) instance of Move.
The move-fmt project uses pest parsing to get the AST, and then does the formatting. If the move grammar he configured is older, we can update the grammar file https://github.com/victoryang00/move-fmt/blob/master/src/grammar.pest
The move fmt could use the internal move parser to get the ast and do the formatting https://github.com/move-language/move/blob/07f10682c63b7b758b0c7cd6ef91af4e86d25cb4/language/move-compiler/src/parser/mod.rs#L29-L132 .
In that case, move fmt could be consistent with the compiler even if there are some updates.
The move fmt could use the internal move parser to get the ast and do the formatting https://github.com/move-language/move/blob/main/language/move-compiler/src/parser/mod.rs#L29-L132 .
In that case, move fmt could be consistent with the compiler even if there are some updates.
this is a good idea.
this is a good idea.
I want to give this idea a try and get myself into it now. After we get the ast, the move fmt could do something similar like rustfmt https://github.com/rust-lang/rustfmt/blob/2403f827bf1e427a04ce45c88a64cbebe91041d7/src/rewrite.rs#L16-L19
We could define the Rewrite
trait for each ast type.
I am not sure if the idea being currently explored involves (somehow) re-using the existing formatter or writing one from scratch, but here's a general idea of what would be good to have:
Oh, and I guess I did not answer @yubing744's original question. The auto-formatter is indeed on our radar, but the work hasn't started and it's unlikely that we will have resources to start it soon. It would be amazing to have this implemented as a community contribution and I am happy to help with this as much as I can.
I consulted the original author of https://github.com/victoryang00/move-fmt, he is willing to donate this project to the Move community, I think we can continue to iterate this project, and then apply for transfer to https: //github.com/move-language.
Work that can now move forward includes:
I am not sure if starting with the existing formatter is the best idea. Not only using "official" Move parser would result in a different AST which would require modifying pretty much all the AST traversal code, but also from looking at the existing prototype, the formatter is a bit too simplistic - in particular it does not seem to be taking the text width into consideration at all when constructing the output.
If we want a tool that would be used by the generations of Move programmers to come, we need something at the same quality level as gofmt
or rustfmt
. There is a really good article about difficulties of writing a good auto-formatter, and handling line-breaks to accommodate max text width emerges as one of the more difficult problems. Admittedly, the article talks about auto-formatting Dart and doing this for Move may (should?) be simpler, but I am not convinced that adding the required logic to the existing codebase (along with all the refactorings that would be required to accommodate a different parsing infrastructure) is going to be simpler than writing the code from scratch.
Formatting is hard because:
comments which need to be preserved (this excludes e.g. using the current parser, which throws comments away even before the lexer is run)
Indeed, existing parser would have to be extended/modified. Arguably using it is still a preferred solution (e.g., for maintenance purposes) over building parsing infrastructure from scratch, though.
I put together what can be done:
The comments could be retrieved from MatchedFileCommentMap. It is explained in https://github.com/move-language/move/blob/main/language/move-compiler/src/parser/lexer.rs#L346-L349 although I am not quite sure whether it works in the correct way.
Anyway, I agree that the best way is to refactor the parser to include comments in the AST.
- Refer to rust-fmt-rfcs to determine move fmt-rfcs
rust-fmt-rfcs
has been iterated after several years already. I think we could decide some basic initial requirements(features) of the formatters and write a very first prototype of the formatter. These initial requirements should benefit the move-fmt code architecture. Then move-fmt should be published and we could go forward to move-fmt-rfcs based on the user feedback.
I wonder Is there any progress on this issue.
@awelc @wrwg
Implementing a formatter for programming language always with a lot of work.
A formatter just print a AST to a string.
first of all a language AST always has a lot of variant data structure.
like move expression.
pub enum Exp_ {
Value(Value),
// move(x)
Move(Var),
// copy(x)
Copy(Var),
// [m::]n[<t1, .., tn>]
Name(NameAccessChain, Option<Vec<Type>>),
// f(earg,*)
// f!(earg,*)
Call(NameAccessChain, bool, Option<Vec<Type>>, Spanned<Vec<Exp>>),
// tn {f1: e1, ... , f_n: e_n }
Pack(NameAccessChain, Option<Vec<Type>>, Vec<(Field, Exp)>),
// vector [ e1, ..., e_n ]
// vector<t> [e1, ..., en ]
Vector(
/* name loc */ Loc,
Option<Vec<Type>>,
Spanned<Vec<Exp>>,
),
// if (eb) et else ef
IfElse(Box<Exp>, Box<Exp>, Option<Box<Exp>>),
// while (eb) eloop
While(Box<Exp>, Box<Exp>),
// loop eloop
Loop(Box<Exp>),
// { seq }
Block(Sequence),
// fun (x1, ..., xn) e
Lambda(BindList, Box<Exp>), // spec only
// forall/exists x1 : e1, ..., xn [{ t1, .., tk } *] [where cond]: en.
Quant(
QuantKind,
BindWithRangeList,
Vec<Vec<Exp>>,
Option<Box<Exp>>,
Box<Exp>,
), // spec only
// (e1, ..., en)
ExpList(Vec<Exp>),
// ()
Unit,
...
}
This is just Expression
variants.
There are also Function
,Module
,Struct
,etc.
Implement a formatter you have to deal all the data structure.
Another complex thing about formatter is Comments
.
Most programming language support two forms of comments.
The tricky part is Block comment
.
Block comment
can write anywhere in source code.
In order to keep user's comments you have to keep comments in AST like below.
// f(earg,*)
// f!(earg,*)
Call(NameAccessChain,
Vec<Comment> , // keep in AST.
bool, Option<Vec<Type>>, Spanned<Vec<Exp>>),
This will make things below ugly.
In general We need keep a Vec<Comment>
in every AST structure.
Is there a way to slove this puzzle.
The key idea about this post to simplfy AST
to far more simpler tree type which I call it TokenTree
.
function(a) {
if (a) {
return 1
} else {
return 2
}
}
TokenTree
mainly contains two category.
SimpleToken
in previous code snippet,if
,return
,a
are SimpleToken
.Nested
in previous code snippet, paired ()
and paired {}
will form a Nested
Token.So a source program may represents like this.
pub enum TokenTree {
SimpleToken {
content: String,
pos: u32, // start position of file buffer.
},
Nested {
elements: Vec<TokenTree>,
kind: NestKind,
},
}
pub type AST = Vec<TokenTree>;
Instead of dealing a lot data structure we simple the puzzle to dump Vec<TokenTree>
.
TokenTree
is just another very simple version of AST
.
TokenTree
is very easy to parse,simple as.
...
if(tok == Tok::LParent){ // current token is `(`
parse_nested(Tok::LParent);
}
...
Right now verything looks fine.
But There are some token can be both SimpleToken
and Nested
.
Typical for a language support type parameter like Vec<X>
.
A Token <
can be type parameter sign
or mathematic less than
.
This can be solved by consult the AST
before parse TokenTree
.
Because we are writting a formatter for exist programming language.
It is always easy for us to get the real AST
.
We can traval the AST
the decide <
is either a SimpleToken
or Nested
.
TokenTree
.Vec<TokenTree>
is a tree type, It is very easy to decide how many ident,etc.
And comment can pour into result
base on the pos
relate to SimpleToken
.pos
.
As far as I know one of the main problems with full-fledged auto-formatting is deciding where and how to split lines when they get too long. I don't think a choice of a particular representation is going to help here much...
More generally, hopefully we will eventually get to writing the auto-formatter but at least on my side, this is not going to happen sooner than in a few (or several rather) months.
I just want to bring up a few ideas on this topic now since I have done some work on the formatter before although none is merged.
@awelc You are right that line splitting is a very difficult point and @tnowacki shared a post about the difficulty line splitting with. I also want to share it here for those who are interested.
However, the current main blocker for the formatter is not line splitting but having a syntax tree that is lossless or full fidelity!!. I don't think we have been in the trouble of line spliting yet. The particular representation proposed by @yuyang-ok is trying to solve the current blocker(a syntax tree that is lossless).
The reason to have another representation on the ast is that the comments are ignored in the current syntax tree which is also mentioned in @yuyang-ok comments.
There is another way to have a formatter without having another full fidelity syntax tree, though. The solution is to look back at the original source code checking whether there is some comments every time we check on ast based on the pos
. This could be a solution although I haven't tried at all. But this solution have some limitation on formatting and bad performance.
I think @yuyang-ok makes a very good point. I just want to extend the content here with something in my mind.
It would be great to introduce a red-green tree which is also used by the rust-analyzer. I would just drop rust-analyser doc link which talks a lot about the red-green tree and the things we could do with it. Some of the ideas are aligned wtih @yuyang-ok idea but it is more. We could manipulate the token trees directly on the formatter to genreate new formatted codes.
@awelc @zsluedem I think I want give the TokenTree
a try.
@awelc @zsluedem I think I want give the
TokenTree
a try.
On our side we normally prefer re-using (and modifying if necessary) the existing tool chain rather than building a new one (which I think is what the TokenTree
proposal involves) to avoid additional maintenance effort and difficulties related to the language (and existing toolchain) evolution. This does not have to necessarily stop you from pursuing the TokenTree
idea, and we are happy to see external contributions to the ecosystem, but I also wanted to clarify our own position on this.
@awelc Ok, thanks . TokenTree
will not modify the current tool chain at all.
@wrwg any updates on this?
Yes: Aptos foundation registry grant project is in the testing phase, should be released in Q1.
@wrwg excellent, thanks for the update. Is the repo open source at the moment, or perhaps in alpha? I'd be glad to try it out and start working it into our repo CI/CD
🚀 Feature Request
Move formatter.
move fmt
, just like howcargo fmt
formats rust code.Motivation
Useful in enforcing consistent coding style.