move-language / move

Apache License 2.0
2.26k stars 689 forks source link

[Feature Request] move fmt #166

Open Manahip opened 2 years ago

Manahip commented 2 years ago

🚀 Feature Request

Move formatter. move fmt, just like how cargo fmt formats rust code.

Motivation

Useful in enforcing consistent coding style.

yubing744 commented 2 years ago

@awelc Does move-analyzer plan to support the textDocument/formatting PSL protocol, and starcoin-ide also wants to integrate this function?

Manahip commented 2 years ago

I've been meddling with Move ASTs in my Move-to-TypeScript transpiler. I can probably do a formatter myself. Do you guys know where I could get a grant to speed things up? We're not raising yet so any bit of cash would help.

yubing744 commented 2 years ago

@Manahip You can participate in https://github.com/starcoinorg/startrek organized by starcoin, there will be bonuses after graduation.

yubing744 commented 2 years ago

@Manahip I found a project that has implemented move-fmt, repository: https://github.com/victoryang00/move-fmt

jolestar commented 2 years ago

@LemonHX Do you know the author?

wrwg commented 2 years ago

The move-fmt project is for Move IR, an older (and simpler) instance of Move.

yubing744 commented 2 years ago

The move-fmt project is for Move IR, an older (and simpler) instance of Move.

The move-fmt project uses pest parsing to get the AST, and then does the formatting. If the move grammar he configured is older, we can update the grammar file https://github.com/victoryang00/move-fmt/blob/master/src/grammar.pest

zsluedem commented 2 years ago

The move fmt could use the internal move parser to get the ast and do the formatting https://github.com/move-language/move/blob/07f10682c63b7b758b0c7cd6ef91af4e86d25cb4/language/move-compiler/src/parser/mod.rs#L29-L132 .

In that case, move fmt could be consistent with the compiler even if there are some updates.

yubing744 commented 2 years ago

The move fmt could use the internal move parser to get the ast and do the formatting https://github.com/move-language/move/blob/main/language/move-compiler/src/parser/mod.rs#L29-L132 .

In that case, move fmt could be consistent with the compiler even if there are some updates.

this is a good idea.

zsluedem commented 2 years ago

this is a good idea.

I want to give this idea a try and get myself into it now. After we get the ast, the move fmt could do something similar like rustfmt https://github.com/rust-lang/rustfmt/blob/2403f827bf1e427a04ce45c88a64cbebe91041d7/src/rewrite.rs#L16-L19

We could define the Rewrite trait for each ast type.

awelc commented 2 years ago

I am not sure if the idea being currently explored involves (somehow) re-using the existing formatter or writing one from scratch, but here's a general idea of what would be good to have:

Oh, and I guess I did not answer @yubing744's original question. The auto-formatter is indeed on our radar, but the work hasn't started and it's unlikely that we will have resources to start it soon. It would be amazing to have this implemented as a community contribution and I am happy to help with this as much as I can.

yubing744 commented 2 years ago

I consulted the original author of https://github.com/victoryang00/move-fmt, he is willing to donate this project to the Move community, I think we can continue to iterate this project, and then apply for transfer to https: //github.com/move-language.

Work that can now move forward includes:

  1. Replace the AST parser with the move official one
  2. Add Rust API to facilitate other project calls
  3. Refer to rust-fmt-rfcs to determine move fmt-rfcs
awelc commented 2 years ago

I am not sure if starting with the existing formatter is the best idea. Not only using "official" Move parser would result in a different AST which would require modifying pretty much all the AST traversal code, but also from looking at the existing prototype, the formatter is a bit too simplistic - in particular it does not seem to be taking the text width into consideration at all when constructing the output.

If we want a tool that would be used by the generations of Move programmers to come, we need something at the same quality level as gofmt or rustfmt. There is a really good article about difficulties of writing a good auto-formatter, and handling line-breaks to accommodate max text width emerges as one of the more difficult problems. Admittedly, the article talks about auto-formatting Dart and doing this for Move may (should?) be simpler, but I am not convinced that adding the required logic to the existing codebase (along with all the refactorings that would be required to accommodate a different parsing infrastructure) is going to be simpler than writing the code from scratch.

wrwg commented 2 years ago

Formatting is hard because:

awelc commented 2 years ago

comments which need to be preserved (this excludes e.g. using the current parser, which throws comments away even before the lexer is run)

Indeed, existing parser would have to be extended/modified. Arguably using it is still a preferred solution (e.g., for maintenance purposes) over building parsing infrastructure from scratch, though.

yubing744 commented 2 years ago

I put together what can be done:

  1. Modify the official Move parser to keep comments in the AST
  2. Refer to rust-fmt-rfcs to determine move fmt-rfcs
  3. Design move-fmt code architecture
  4. Initialize the move-fmt project in the language directory
zsluedem commented 2 years ago

The comments could be retrieved from MatchedFileCommentMap. It is explained in https://github.com/move-language/move/blob/main/language/move-compiler/src/parser/lexer.rs#L346-L349 although I am not quite sure whether it works in the correct way.

Anyway, I agree that the best way is to refactor the parser to include comments in the AST.

  1. Refer to rust-fmt-rfcs to determine move fmt-rfcs

rust-fmt-rfcs has been iterated after several years already. I think we could decide some basic initial requirements(features) of the formatters and write a very first prototype of the formatter. These initial requirements should benefit the move-fmt code architecture. Then move-fmt should be published and we could go forward to move-fmt-rfcs based on the user feedback.

yuyang-ok commented 1 year ago

I wonder Is there any progress on this issue.

yuyang-ok commented 1 year ago

@awelc @wrwg

Implementing a language formatter with less work.

Implementing a formatter for programming language always with a lot of work.

A formatter just print a AST to a string.

first of all a language AST always has a lot of variant data structure.

like move expression.

pub enum Exp_ {
    Value(Value),
    // move(x)
    Move(Var),
    // copy(x)
    Copy(Var),
    // [m::]n[<t1, .., tn>]
    Name(NameAccessChain, Option<Vec<Type>>),

    // f(earg,*)
    // f!(earg,*)
    Call(NameAccessChain, bool, Option<Vec<Type>>, Spanned<Vec<Exp>>),

    // tn {f1: e1, ... , f_n: e_n }
    Pack(NameAccessChain, Option<Vec<Type>>, Vec<(Field, Exp)>),

    // vector [ e1, ..., e_n ]
    // vector<t> [e1, ..., en ]
    Vector(
        /* name loc */ Loc,
        Option<Vec<Type>>,
        Spanned<Vec<Exp>>,
    ),

    // if (eb) et else ef
    IfElse(Box<Exp>, Box<Exp>, Option<Box<Exp>>),
    // while (eb) eloop
    While(Box<Exp>, Box<Exp>),
    // loop eloop
    Loop(Box<Exp>),

    // { seq }
    Block(Sequence),
    // fun (x1, ..., xn) e
    Lambda(BindList, Box<Exp>), // spec only
    // forall/exists x1 : e1, ..., xn [{ t1, .., tk } *] [where cond]: en.
    Quant(
        QuantKind,
        BindWithRangeList,
        Vec<Vec<Exp>>,
        Option<Box<Exp>>,
        Box<Exp>,
    ), // spec only
    // (e1, ..., en)
    ExpList(Vec<Exp>),
    // ()
    Unit,

    ...
}

This is just Expression variants.

There are also Function,Module,Struct,etc.

Implement a formatter you have to deal all the data structure.

Another complex thing about formatter is Comments. Most programming language support two forms of comments.

The tricky part is Block comment.

Block comment can write anywhere in source code.

In order to keep user's comments you have to keep comments in AST like below.

    // f(earg,*)
    // f!(earg,*)
    Call(NameAccessChain,
        Vec<Comment> , // keep in AST.
        bool, Option<Vec<Type>>, Spanned<Vec<Exp>>),

This will make things below ugly.

In general We need keep a Vec<Comment> in every AST structure.

Is there a way to slove this puzzle.

TokenTree solution

The key idea about this post to simplfy AST to far more simpler tree type which I call it TokenTree.

function(a) {
    if (a) { 
        return 1
    } else {
        return 2
    }
}

TokenTree mainly contains two category.

So a source program may represents like this.

pub enum TokenTree {
    SimpleToken {
        content: String,
        pos: u32,  // start position of file buffer.
    },
    Nested {
        elements: Vec<TokenTree>,
        kind: NestKind,
    },
}

pub type AST = Vec<TokenTree>;

Instead of dealing a lot data structure we simple the puzzle to dump Vec<TokenTree>.

TokenTree is just another very simple version of AST.

TokenTree is very easy to parse,simple as.

...
if(tok == Tok::LParent){ // current token is `(`
    parse_nested(Tok::LParent);    
}
...

Right now verything looks fine.

But There are some token can be both SimpleToken and Nested.

Typical for a language support type parameter like Vec<X>.

A Token < can be type parameter sign or mathematic less than.

This can be solved by consult the AST before parse TokenTree.

Because we are writting a formatter for exist programming language.

It is always easy for us to get the real AST.

We can traval the AST the decide < is either a SimpleToken or Nested.

how to generate base on TokenTree.

Vec<TokenTree> is a tree type, It is very easy to decide how many ident,etc.

And comment can pour into result base on the pos relate to SimpleToken.pos.

awelc commented 1 year ago

As far as I know one of the main problems with full-fledged auto-formatting is deciding where and how to split lines when they get too long. I don't think a choice of a particular representation is going to help here much...

More generally, hopefully we will eventually get to writing the auto-formatter but at least on my side, this is not going to happen sooner than in a few (or several rather) months.

zsluedem commented 1 year ago

I just want to bring up a few ideas on this topic now since I have done some work on the formatter before although none is merged.

@awelc You are right that line splitting is a very difficult point and @tnowacki shared a post about the difficulty line splitting with. I also want to share it here for those who are interested.

However, the current main blocker for the formatter is not line splitting but having a syntax tree that is lossless or full fidelity!!. I don't think we have been in the trouble of line spliting yet. The particular representation proposed by @yuyang-ok is trying to solve the current blocker(a syntax tree that is lossless).

The reason to have another representation on the ast is that the comments are ignored in the current syntax tree which is also mentioned in @yuyang-ok comments.

There is another way to have a formatter without having another full fidelity syntax tree, though. The solution is to look back at the original source code checking whether there is some comments every time we check on ast based on the pos. This could be a solution although I haven't tried at all. But this solution have some limitation on formatting and bad performance.

I think @yuyang-ok makes a very good point. I just want to extend the content here with something in my mind.

It would be great to introduce a red-green tree which is also used by the rust-analyzer. I would just drop rust-analyser doc link which talks a lot about the red-green tree and the things we could do with it. Some of the ideas are aligned wtih @yuyang-ok idea but it is more. We could manipulate the token trees directly on the formatter to genreate new formatted codes.

yuyang-ok commented 1 year ago

@awelc @zsluedem I think I want give the TokenTree a try.

awelc commented 1 year ago

@awelc @zsluedem I think I want give the TokenTree a try.

On our side we normally prefer re-using (and modifying if necessary) the existing tool chain rather than building a new one (which I think is what the TokenTree proposal involves) to avoid additional maintenance effort and difficulties related to the language (and existing toolchain) evolution. This does not have to necessarily stop you from pursuing the TokenTree idea, and we are happy to see external contributions to the ecosystem, but I also wanted to clarify our own position on this.

yuyang-ok commented 1 year ago

@awelc Ok, thanks . TokenTree will not modify the current tool chain at all.

alnoki commented 9 months ago

@wrwg any updates on this?

wrwg commented 9 months ago

Yes: Aptos foundation registry grant project is in the testing phase, should be released in Q1.

alnoki commented 9 months ago

@wrwg excellent, thanks for the update. Is the repo open source at the moment, or perhaps in alpha? I'd be glad to try it out and start working it into our repo CI/CD