rome / tools

Unified developer tools for JavaScript, TypeScript, and the web
https://docs.rome.tools/
MIT License
23.71k stars 660 forks source link

Proposal: CST-like Bogus AST Nodes #1848

Closed jamiebuilds closed 2 years ago

jamiebuilds commented 2 years ago

Description

Note: I'm using "bogus" instead of "unknown" (See #1819)

CST nodes have a lot of qualities that make it nice to represent both errors and error recovery.

Take the following code:

function %error% foo() {}

It should look something like this in CST form:

JsProgram:
  - JsFunctionStatement:
    - "function"
    - unexpected: "%error%"
    - JsIdentifierBinding: ["foo"]
    - JsParameterList: ["(", ")"]
    - JsFunctionBody: ["{", "}"]

We can tell a few things from this:

Just as easily we could have created this CST:

JsProgram:
  - JsFunctionStatement:
    - "function"
    - unexpected: "%"
    - JsIdentifierBinding: ["error"]
    - unexpected: "% foo"
    - JsParameterList: ["(", ")"]
    - JsFunctionBody: ["{", "}"]

Or this one:

JsProgram:
  - JsFunctionStatement:
    - "function"
    - unexpected: "%error% foo() {}"

Or this one:

JsProgram:
  - unexpected: "function %error% foo() {}"

It that sense, the CST is a very clear representation of how our error recovery works, and it provides a lot of flexibility for us to improve.

But we throw all that out as soon as we're using the AST.

In our current design we have a limited set of "Bogus" (Unknown) nodes where we can say an error exists. But it doesn't provide us the flexibility to recover inside of the node. It also forces us to pretend an error is more wide-spread than it really should be.

JsProgram:
  - BogusJsStatement

But what if we made these bogus nodes more CST-like?

BogusJsStatement {
  kind: JsFunctionStatement,
  children: Vec<Node | Token | Error>,
}

This provides us with all of the flexibility of the CST:

// Give up immediately:
BogusJsStatement {
  kind: JsFunctionStatement,
  children: [
    JsFunctionKeyword("function"),
    Unexpected("%error% foo() {}"),
  ],
}

// Don't assume the foo is a JsIdentifierBinding:
BogusJsStatement {
  kind: JsFunctionStatement,
  children: [
    JsFunctionKeyword("function"),
    Unexpected("%error% foo"),
    JsParameterList(paren_open: "(", list: List(), paren_close: ")"),
    JsFunctionBody(brace_open: "{", brace_close: "}")
  ],
}

// Assume the foo is a JsIdentifierBinding:
BogusJsStatement {
  kind: JsFunctionStatement,
  children: [
    JsFunctionKeyword("function"),
    JsUnexpected("%error%"),
    IdentifierBinding(text: "foo"),
    JsParameterList(paren_open: "(", list: List(), paren_close: ")"),
    JsFunctionBody(brace_open: "{", brace_close: "}")
  ],
}

AST navigation

Let's take this AST:

JsProgram {
  items: [
    BogusJsStatement {
      kind: JsFunctionStatement,
      children: [
        JsFunctionKeyword("function"),
        JsUnexpected("%error%"),
        IdentifierBinding(text: "foo"),
        JsParameterList(paren_open: "(", list: List(), paren_close: ")"),
        JsFunctionBody(brace_open: "{", brace_close: "}")
      ],
    }
  ]
}

There are a number of different operations an analyzer could run on this:

find_a_node::<JsScript>() // Some(JsScript)
find_a_node::<JsScript>()?.items() // Iterator(AnyJsStatement)
find_a_node::<JsScript>()?.items().get(0) // Some(AnyJsStatement)
find_a_node::<JsScript>()?.items().get(0)?.cast::<JsFunctionStatement>() // Err(BogusJsStatement)
find_a_node::<JsScript>()?.items().get(0)?.cast::<BogusJsStatement>() // Ok(BogusJsStatement)
find_a_node::<JsFunctionStatement>() // None
find_a_node::<BogusJsStatement>() // Some(BogusJsStatement)
find_a_node::<JsFunctionBody>() // Some(JsFunctionBody)

To summarize:

While I think you should be able to access the BogusJsStatement, I don't think it should give you access to its children.

My biggest concern is that analyzers, particularly third-party ones, would eventually end up depending on indexes of the BogusJsStatement and it would make every change to our error recovery a breaking change.

Luckily, Rust allows us to make it so that those indexes are not a public API:

impl BogusJsStatement {
  pub(crate) fn children();
}

However, we should consider what types we return for child nodes trying to reach back up the tree into the bogus node.

find_a_node::<JsFunctionBody>().parent() // ???

FunctionBody::parent() could return any of:

I'm not sure which is the best choice, but the last two of those would require us to codegen .parent() methods everywhere.

MichaReiser commented 2 years ago

Luckily, Rust allows us to make it so that those indexes are not a public API:

This is true but I don't think it adds much value because authors can always fall back to bogus.syntax().children() and then continue to depend on offsets.

The question would then also be why we even need the function in the first place because it would then only be visible to the parser crate (or the future, an AST crate) but the parser has no interest in ever iterating over the AST. I assume what you wanted is to make this method "public to rome only". I don't know of a way to do that in Rust that has no "friendly assemblies" that are allowed to poke into internals.

FunctionBody::parent() could return any of:

The AST currently doesn't expose a parent() method. The only way to get a node's parent is to use body.syntax().parent() and it's then to you to cast it to the appropriate type (it returns a SyntaxNode<JsLangauge>). I don't see a reason for not adding it though.

Result<AnyJsNode, AnyBogusJsNode>

AnyBogusJsNode should be part of the AnyJsNode. These are possible JsNodes. It also breaks with the rest of our API where we carefully removed the need for Result to express that a node can either be a valid or a bogus node.

I'm not sure which is the best choice, but the last two of those would require us to codegen .parent() methods everywhere.

or write them by hand... The main challenge I see is that every union needs its dedicated name. What would you call AnyJsFunction | AnyBogusJsNode and how to defer the name inside of the codegen? What about syntaxes that have many possible different parents:

class_declaration.parent() -> ExportDeclarationClause | BlockStatement | FunctionBody 

We would probably need to have a JsAnyClassDeclarationParent but that will lead to many additional unions.

ematipico commented 2 years ago

For @jamiebuilds

My first question is: what's the different between an Unknown node and a Bogus node? Unless they are the same, then I didn't understand the whole proposal.

While I think you should be able to access the BogusJsStatement, I don't think it should give you access to its children.

How would that work with linters? If we decorate the whole statement - like in this example - in a bogus, analyzers won't be able see anything. Unless we will have analyzers with different access rights.

For @MichaReiser

or write them by hand...

I believe .parent() will end up to be a combination of manual code and generated code. A child that doesn't have a union as a parent, can be predicted without issues. For unions, probably we should provide a cast method, similar to what we do when we want to get a certain child.

MichaReiser commented 2 years ago

I believe .parent() will end up to be a combination of manual code and generated code. A child that doesn't have a union as a parent, can be predicted without issues. For unions, probably we should provide a cast method, similar to what we do when we want to get a certain child.

My main worry is about the number of types this will introduce. We also need to consider what parents a node can have if a AST is invalid. For example, a FunctionBody can be a child of an UnknownStatement if the function uses typescript syntax (or is invalid for other reasons).