Retain comments in AST - Githubissues

turbolent commented 4 years ago

Issue To Be Solved

Currently the AST only retains comments for declarations if they are docstrings (/// line comments, a and /** block comments).

To be able to use the AST for formatting, it must retain also non-docstring comments.

Retain comments for:

Declarations
Statements
Expressions

Definition of Done

[ ] Add AST elements for comments:
- [ ] Block comment
- [ ] Line comment
[ ] Add a leading/trailing comments property
- [ ] Declarations: Line and block comments
- [ ] Statements: Line and block comments
- [ ] Expressions: Block comments
[ ] Add comments to AST elements when parsing
- [ ] Declarations
- [ ] Statements
- [ ] Expressions
[ ] Tests

SupunS commented 3 years ago

I think we may have to capture the comments at a further granular level. Because a comment can exist at any place where whitespace can occur, not only at declarations/statements/expressions level, and we would have to retain those as well. e.g. let /*Im a comment*/ x: Int

A solution would be to:

Capture doc comments only at the declaration level. (It is useful to have them separately)
Capture other comments (and any trivia including whitespace, newlines, etc, if needed) at the token level.

Roslyn (C#) compiler follows a similar approach: https://github.com/dotnet/roslyn/blob/main/docs/wiki/Roslyn-Overview.md#syntax-trivia

This way, it also divides the whitespace/comments between two consecutive non-trivia tokens as leading and trailing trivia (rather than having leading trivia only, or trailing trivia only), so that the comments are attached to the correct node/token. And moving that node around (code organizing/formatting), would also move the comment along with it.

e.g: See the bellow comments, which is the 'usual' way followed by most people.

// (1) The comment immediately before the decl is usually about 'x' decl
let x: Int  // (2) A comment on the same line also talks about 'x' decl (or about `Int`)
// (3) Similarly, this is a comment about 'y' decl
let y: Int  // (4) This is also a comment about 'y' decl

In the above, comments (1) and (2) would be a part of x-var-decl. Moving let x: 5 around would also move the comments (1) and (2) along with it. Similarly, (3) and (4) would be a part of y-var-decl.

lkadian commented 3 years ago

hi @SupunS Great comment, agree with your proposed solution. Have you started work on this issue? I ask this because I had actually worked on this but never got around to completing it. I was planning to put up a PR soon, so would just like to confirm if you are also working on it so that we can avoid duplicate work!

SupunS commented 3 years ago

@lkadian I haven't done or started any work on this. Great to hear that you have already been working on it! :clap: Would love to have a PR!

Looping in @MaxStalker.

lkadian commented 3 years ago

This issue has been a bit trickier than I thought as it touches a lot of parts of the parsing code, and as a result it affects a lot of the tests. Posting some info on how I am implementing this, so that if anything seems completely off, someone can let me know! Also to mention that it's still in progress

The original issue mentions retaining trivia for declarations, statement and expressions, but as @SupunS pointed out it looks like we may need to capture leading/trailing trivia for each token. Basically, we'll have a full syntax tree where we hold every token, and each token will know about its leading/trailing trivia.

[x] Restructure lexer so that tokens and trivia are separate concepts
[x] Fix broken lexer tests
[x] Add lexertests

Currently we hold trivia as regular tokens, but we need these to be a separate entity. Then when scanning, we will emit each token accompanied with it's leading and trailing trivia:

type Token struct {
    Type  TokenType
    Value interface{}
    ast.Range
    LeadingTrivia  []Trivia
    TrailingTrivia []Trivia
}

Rules for leading/trailing trivia (taken from Swift):

A token owns all of its trailing trivia up to, but not including, the next newline character.
Looking backward in the text, a token owns all of the leading trivia up to and including the first contiguous sequence of newlines characters

[ ] Restructure parser to work with above changes and support all tokens in the AST (full syntax tree)
[ ] Fix broken parser tests
[ ] Add parser tests Currently not all tokens are stored in the AST as they are not needed in later stages. But if we want this sort of "full syntax tree", we will need to keep all tokens for each syntax node.

SupunS commented 3 years ago

Thanks, @lkadian for taking this forward and for the great progress! Sorry for not being in touch recently - I somehow miss updates from this thread, probably have drowned with other stuff.

Currently we hold trivia as regular tokens, but we need these to be a separate entity. Then when scanning, we will emit each token accompanied with it's leading and trailing trivia:
type Token struct {
  Type  TokenType
  Value interface{}
  ast.Range
  LeadingTrivia  []Trivia
  TrailingTrivia []Trivia
}
Rules for leading/trailing trivia (taken from Swift):

A token owns all of its trailing trivia up to, but not including, the next newline character.

Looking backward in the text, a token owns all of the leading trivia up to and including the first contiguous sequence of newlines characters

This sounds like a pretty good solution to me! :clap:

[ ] Restructure parser to work with above changes and support all tokens in the AST (full syntax tree)

[ ] Fix broken parser tests

[ ] Add parser tests Currently not all tokens are stored in the AST as they are not needed in later stages. But if we want this sort of "full syntax tree", we will need to keep all tokens for each syntax node.

That's a good point. I think we'll have to retain the full syntax tree, if we need to retain all the comments.

In the future, we could add an option to switch capturing the full syntax-tree on and off, depending on the use-case. For example, we need the full syntax tree and the trivia, when using in the IDE/playground during the development time, but don't really need them when parsing and running the code on-chain.

Maybe we can keep a flag for now, but have it 'on' by default, and later we can make it configurable. I'm not sure how big would that change (adding a flag) be. Another option is to add the flag altogether as a separate change, so that current changes don't get too big.

j1010001 commented 1 year ago

We can split this into 2 milestones:

Cover the majority of cases – we don't need to have perfect solution to cover every edge-case, good benchmark is to preserve all comments in contracts on Mainnet.
Cover the remaining edge-cases

onflow / cadence

Retain comments in AST #308

Issue To Be Solved

Definition of Done