RFC: codegen AST related codes

Boshen commented 4 months ago

Requirements

no build.rs published to the user
all generated code are checked into git
no nightly
Rust code is source of truth, need to read parse #[visited_node]

Workflow

a user changes the oxc_crate
a watch change picks it up
parse all #[visited_node]
saves them into a model, e.g. struct ASTModel { name: String, attributes: Vec<Attributes> }
generate code and save file

Code to generate

ASTKind (Milestone 1)
GetSpan (Milestone 2)
Hash, PartialEq implementations (Milestone 3)
ASTBuilder (Milestone 4)
Visit (Milestone 5)
VisitMut (Milestone 5)
Traverse, remove current js code (Milestone 6)
~typescript definition for napi and wasm~ moving the scope to AST transfer.

Every milestone should be a stack of PRs, since they will involve changing other files as well.

Things we need to discuss:

Rust or JavaScript based code generation? traverse currently uses JavaScript, I think it's fine, but does it scale to supporting these many code patterns?

update: The consensus is stay with our current approach and parse #[visited_node] as the source of truth.

Topic not part of this RFC:

Change or reduce AST visit patterns

Boshen commented 4 months ago

@overlookmotel @rzvxa @Dunqing

I think we can start prototyping once we decide on the generation method.

I vote for all Rust, but I'm not sure about its implications or downsides.

rzvxa commented 4 months ago

👍for all Rust, Now that we know our requirements and are done with prototyping we could leverage syn for our code generation.

It is harder to iterate on but we won't need to parse and generate rust code ourselves, We can treat it like a global macro if you are willing to use nightly for codegen, We quite literally can use !#[code_gen] at the top of the AST definition file, macro expand it for the build and call it a day.

If we want to break our ast definition into separate files or don't use nightly at all; We can manually parse the string using syn and vice versa.

Nice to have: we might want to generate the AST itself from our definition file, So it is easy to do AST-wide refactors in the future.

Boshen commented 4 months ago

Good catch, we should move all ast implementations to another file, so we can parse ast nodes easily.

Or maybe we should scan the file using syn, and extract everything marked as #[visited_node].

On nightly: hard no, sorry.

rzvxa commented 4 months ago

Good catch, we should move all ast implementations to another file, so we can parse ast nodes easily.

Or maybe we should scan the file using syn, and extract everything marked as #[visited_node].

I'm more in favor of having the definition file(s) for ast containing only the struct/enum and its attribute/derives, With this we are not only simplifying the build process(since any item in the global scope of definition file is an AST type) but also provide a way to add to the codegen, For example, if you remember for ast_node_id I wanted to add an ast_node attribute that would add the id and implement its trait, Which slows down the build process. But if we can easily add such things as a part of the code gen process we get the change fast and free (in terms of build times). But if we only process the types and use it to generate the visitors/traversals we won't be able to change ast types themselves. On the other hand, if we read these structs and modify them before generating the actual AST types we get to do magical things such as reducing all inherit macros in the generated code, It would be a much better experience if the actual type is there, Both in terms of LSP and documentation, Not to mention easier to understand if the resulting code is there instead of an inherit.

Boshen commented 4 months ago

I wanted to follow how React Compiler does this

but @overlookmotel prefers writing it in Rust because it feels more natural.

I don't mind either.

Also good catch on wanting to add common attributes to the nodes.

rzvxa commented 4 months ago

I think in the end both the rust syntax and these JSON files are the same thing. Having it as rust syntax and parsing it to get our serialized data is more steps for sure but I have to agree with @overlookmotel, It is much easier to understand and reason about our schema this way, Unless we want to share these schemas with other projects I think having ast_def.rs is > than ast_def.json.

overlookmotel commented 4 months ago

Rust or JS?

I vote Rust too. I only wrote Traverse codegen in JS to be able to bang it out quickly. It's a hack, and brittle, as it doesn't parse the Rust code properly - it's basically using regexes - so it could easily break. My intent was always to "do it properly" and rewrite in Rust using syn etc, once the impl was stable.

Design

I propose we do this in 2 parts:

A build script to build a schema of AST types.
Other scripts to generate code from that schema.

The schema would be generated by:

Parsing all relevant .rs files (NB: includes files in oxc_span and oxc_syntax, not just oxc_ast).
Generating a StructDefinition / EnumDefinition for each type.
Linking up the types (i.e. resolve Bar in struct Foo { bar: Bar } to enum Bar defined elsewhere).

Schema types could be something like https://github.com/overlookmotel/layout_inspect/blob/master/layout_inspect/src/defs.rs (we don't need align and size at present - we can miss those parts out)

This schema can then be output as both:

JSON
Rust types data serialized with rkyv which can be loaded into memory as a static (see https://github.com/oxc-project/oxc-browserslist/issues/23)

Why?

2 reasons:

Firstly, AST transfer needs a schema just like this - so we can kill 2 birds with 1 stone.

Secondly, the schema can also be used in proc macros. Fixing oxc-project/oxc#4297 requires replacing the dodgy inherit_variants! macro with a proc macro. Problem is that proc macros only have access to the section of code they're applied to, but inherited enums need to know what variants other enums have. The #[inherit_variants] proc macro can get that info from the schema.

(see "Option 3" in oxc-project/oxc#4297)

We don't need to implement rkyv serialization initially. We can just use JSON to serialize the schema to start with. rkyv is an optimization to avoid having serde and serde_json as dependencies of the macro crate, which will massively reduce the compile time impact of those macros.

Generating AST types

Personally I am anti generating the AST types themselves. I think we should write the AST types in Rust (as at present) and then generate everything around them.

I laid out my reasoning for this in oxc-project/oxc#4297 (at length, sorry probably way too much length).

When we want to alter the AST types themselves, we can use proc macros. This does have a downside in terms of compile time, but the rkyv-serialized schema thing is designed to reduce that impact to a minimal level.

These are just my thoughts... feel free to disagree!

rzvxa commented 4 months ago

Generating AST types

Personally I am anti generating the AST types themselves. I think we should write the AST types in Rust (as at present) and then generate everything around them.

I laid out my reasoning for this in oxc-project/oxc#4297 (at length, sorry probably way too much length).

When we want to alter the AST types themselves, we can use proc macros. This does have a downside in terms of compile time, but the rkyv-serialized schema thing is designed to reduce that impact to a minimal level.

These are just my thoughts... feel free to disagree!

What if we have definitions (schema) written as rust struct/enums? Then if we even need to output a JSON schema we can create it from the syntax data instead of json information.

Look at this and tell me it isn't possible to write all of this in rust https://github.com/facebook/react/blob/main/compiler/crates/react_estree_codegen/src/ecmascript.json.

What are the differences between these 2 pieces of string other than deserialization method?

        "StringLiteral": {
            "fields": {
                "value": {
                    "type": "String",
                    "hermes_convert_with": "convert_string_value"
                }
            }
        },

struct StringLiteral {
    #[hermes_convert_with("convert_string_value")]
    value: String,
}

My point is that we can have our cake and eat it too, We can write definitions in the rust itself, yet still use this information and build upon it to generate the actual AST types.

rzvxa commented 4 months ago

Example usecase: codegen/ast_def.rs

#[magic_attr]
struct StringLiteral<'a> {
    #[hermes_convert_with("convert_string_value")]
    pub value: Atom<'a>,
}

ast/ast.rs

#[hermes_stuff]
struct StringLiteral<'a> {
    #[hermes_convert_with("convert_string_value")]
    pub value: Atom<'a>,
    magical_field: MagicType<'a>,
}

impl<'a, 'm> BlackMagic<'a, 'm>for StringLiteral<'a> {
    fn cast(&'m self) -> MagicMisile<'a, 'm> {
        self.magical_field.cast(MagicKind::Black).into()
    }
}

rzvxa commented 4 months ago

AST definition file should have its own prelude so we are confined to the use of known types in our ast definitions(happens in the proc macro/codegen). This way we can disallow the use statements or path qualifiers. With this imposed limitation on our ast definition we no longer have to resolve and link types, We just use a hashmap since they are all known beforehand. Ast uses other AST types, Box, Vec, Atom, str, and a handful of other types but there is no random type there(as it should be).

overlookmotel commented 4 months ago

What are the differences between these 2 pieces of string other than deserialization method?

Agreed. No difference. Except... then we have 2 .rs files which both look like they are the AST definitions file. Which do you edit? As I said under option "2c" on oxc-project/oxc#4297:

However, there's a major DX problem. IDEs' "jump to definition" would take you to the built folder, not src. I am sure people would end up editing the files in built and then be surprised when all the code they wrote is lost when the build script re-runs and overwrites these files.

One of our goals is for the codebase to be accessible to new contributors. The AST is a central pillar of Oxc, probably the first place you go when you start exploring the codebase. So I think it's particularly important that the AST "just makes sense". Entering a world of confusion on your very first step could be a massive turn-off.

To summarize: as I understand it, the major pluses and minuses are:

Generate AST from schema: Minimize compile time as work happens in build script, rather than in macros.
Write AST as Rust types and alter them with proc macros as required (as now): Less magic to understand.

So, my view is:

In this particular case, increased approachability of codebase is more important than compile time.
By avoiding using serde in macros, we can minimize the compile time impact.

But... I've said my piece now. It's a preference rather than a strongly held view - I can see both sides of the argument. If you both feel strongly that you prefer to codegen AST, please feel free to overrule me. I won't be upset!

Only thing I ask is please do try to go through my thinking in oxc-project/oxc#4297, and try to follow my reasoning. If you get my arguments, but don't agree with them, no problem. But please do consider them.

MagicKind::Black

🤣🤣🤣 Love it!

rzvxa commented 4 months ago

One of our goals is for the codebase to be accessible to new contributors. The AST is a central pillar of Oxc, probably the first place you go when you start exploring the codebase. So I think it's particularly important that the AST "just makes sense". Entering a world of confusion on your very first step could be a massive turn-off.

It is ironic that it is my exact point for going in the opposite direction😆 I feel since AST doesn't change often after we hit stable release, it wouldn't result in confusion it is almost impossible for a first-time contributor to have to change it if it is all up to spec, All of our traits and impls are in a separate file that isn't generated. What we are generating with the AST are additions that would happen to the structures such as inherits and adding fields or macros.

I think since these are the actual things existing on the type, jumping to the generated definition is self-documenting because of no magic code, basically, what you see is what you get.

so I can see oxc_ast with this structure(I think the generated directory communicates well enough that these files shouldn't be edited directly, And if it isn't enough we can always add a giant docs comment at the top of the file stating this fact):

oxc_ast/
|_src/
   |_______generated/
                  |_______ast.rs(or js.rs,ts.rs,...)
                  |_______also contains visit/VisitMut,Visit and traverse/Traverse, etc.
   |_______ast_impl.rs
   |_______lib.rs
   |_______....
oxc_ast_def(or maybe oxc_ast_gen)/
|_src/
   |______ast_def.rs(or js_def.rs,ts_def.rs,...)

Then we manually export these generated codes in our lib.rs.

overlookmotel commented 4 months ago

I think for you to make the call on this @Boshen!

Boshen commented 4 months ago

Haha, can we start somewhere small?

I propose:

Gather the AST nodes somehow (we don't need the attributes yet),

and layout the backbone of codegen for

ASTKind
GetSpan
Visit
VisitMut
Traverse
ASTBuilder
typescript definition for napi and wasm

once this is in place, we can experiment with different ast definition strategies and pick the one that fits my requirements.

If parsing #[visit_node] works for the time being, then we can stick with it and iterate later.

The pipeline is

Definition -> Model in Rust -> an easy mechanism for writing out boilerplates -> code.

The first milestone will just be the backbone of the [Model in Rust -> an easy mechanism for writing out boilerplates -> code] part.

overlookmotel commented 4 months ago

I'm going to go for a walk and mull this over. Give me an hour and I'll come back with some comments.

rzvxa commented 4 months ago

To back my point, This is a common pattern in our current AST definition:

inherit_variants! {
/// Expression
///
/// Inherits variants from [`MemberExpression`]. See [`ast` module docs] for explanation of inheritance.
///
/// [`ast` module docs]: `super`
#[visited_node]
#[repr(C, u8)]
#[derive(Debug, Hash)]
#[cfg_attr(feature = "serialize", derive(Serialize, Tsify))]
#[cfg_attr(feature = "serialize", serde(untagged))]
pub enum Expression<'a> {
    BooleanLiteral(Box<'a, BooleanLiteral>) = 0,
    NullLiteral(Box<'a, NullLiteral>) = 1,
    NumericLiteral(Box<'a, NumericLiteral<'a>>) = 2,
    BigintLiteral(Box<'a, BigIntLiteral<'a>>) = 3,
    RegExpLiteral(Box<'a, RegExpLiteral<'a>>) = 4,
    StringLiteral(Box<'a, StringLiteral<'a>>) = 5,
    TemplateLiteral(Box<'a, TemplateLiteral<'a>>) = 6,

    Identifier(Box<'a, IdentifierReference<'a>>) = 7,

    MetaProperty(Box<'a, MetaProperty<'a>>) = 8,
    Super(Box<'a, Super>) = 9,

    ArrayExpression(Box<'a, ArrayExpression<'a>>) = 10,
    ArrowFunctionExpression(Box<'a, ArrowFunctionExpression<'a>>) = 11,
    AssignmentExpression(Box<'a, AssignmentExpression<'a>>) = 12,
    AwaitExpression(Box<'a, AwaitExpression<'a>>) = 13,
    BinaryExpression(Box<'a, BinaryExpression<'a>>) = 14,
    CallExpression(Box<'a, CallExpression<'a>>) = 15,
    ChainExpression(Box<'a, ChainExpression<'a>>) = 16,
    ClassExpression(Box<'a, Class<'a>>) = 17,
    ConditionalExpression(Box<'a, ConditionalExpression<'a>>) = 18,
    FunctionExpression(Box<'a, Function<'a>>) = 19,
    ImportExpression(Box<'a, ImportExpression<'a>>) = 20,
    LogicalExpression(Box<'a, LogicalExpression<'a>>) = 21,
    NewExpression(Box<'a, NewExpression<'a>>) = 22,
    ObjectExpression(Box<'a, ObjectExpression<'a>>) = 23,
    ParenthesizedExpression(Box<'a, ParenthesizedExpression<'a>>) = 24,
    SequenceExpression(Box<'a, SequenceExpression<'a>>) = 25,
    TaggedTemplateExpression(Box<'a, TaggedTemplateExpression<'a>>) = 26,
    ThisExpression(Box<'a, ThisExpression>) = 27,
    UnaryExpression(Box<'a, UnaryExpression<'a>>) = 28,
    UpdateExpression(Box<'a, UpdateExpression<'a>>) = 29,
    YieldExpression(Box<'a, YieldExpression<'a>>) = 30,
    PrivateInExpression(Box<'a, PrivateInExpression<'a>>) = 31,

    JSXElement(Box<'a, JSXElement<'a>>) = 32,
    JSXFragment(Box<'a, JSXFragment<'a>>) = 33,

    TSAsExpression(Box<'a, TSAsExpression<'a>>) = 34,
    TSSatisfiesExpression(Box<'a, TSSatisfiesExpression<'a>>) = 35,
    TSTypeAssertion(Box<'a, TSTypeAssertion<'a>>) = 36,
    TSNonNullExpression(Box<'a, TSNonNullExpression<'a>>) = 37,
    TSInstantiationExpression(Box<'a, TSInstantiationExpression<'a>>) = 38,

    // `MemberExpression` variants added here by `inherit_variants!` macro
    @inherit MemberExpression
}
}

It is only this way because it isn't feasible to copy all of these manually and keep track of them when one changes.

So with code-gening the AST we still have this syntax for our definitions to keep it consistent,

/// Expression
///
/// Inherits variants from [`MemberExpression`]. See [`ast` module docs] for explanation of inheritance.
///
/// [`ast` module docs]: `super`
#[visited_node]
#[repr(C, u8)]
#[derive(Debug, Hash)]
#[cfg_attr(feature = "serialize", derive(Serialize, Tsify))]
#[cfg_attr(feature = "serialize", serde(untagged))]
pub enum Expression<'a> {
    BooleanLiteral(Box<'a, BooleanLiteral>) = 0,
    NullLiteral(Box<'a, NullLiteral>) = 1,
    NumericLiteral(Box<'a, NumericLiteral<'a>>) = 2,
    BigintLiteral(Box<'a, BigIntLiteral<'a>>) = 3,
    RegExpLiteral(Box<'a, RegExpLiteral<'a>>) = 4,
    StringLiteral(Box<'a, StringLiteral<'a>>) = 5,
    TemplateLiteral(Box<'a, TemplateLiteral<'a>>) = 6,

    Identifier(Box<'a, IdentifierReference<'a>>) = 7,

    MetaProperty(Box<'a, MetaProperty<'a>>) = 8,
    Super(Box<'a, Super>) = 9,

    ArrayExpression(Box<'a, ArrayExpression<'a>>) = 10,
    ArrowFunctionExpression(Box<'a, ArrowFunctionExpression<'a>>) = 11,
    AssignmentExpression(Box<'a, AssignmentExpression<'a>>) = 12,
    AwaitExpression(Box<'a, AwaitExpression<'a>>) = 13,
    BinaryExpression(Box<'a, BinaryExpression<'a>>) = 14,
    CallExpression(Box<'a, CallExpression<'a>>) = 15,
    ChainExpression(Box<'a, ChainExpression<'a>>) = 16,
    ClassExpression(Box<'a, Class<'a>>) = 17,
    ConditionalExpression(Box<'a, ConditionalExpression<'a>>) = 18,
    FunctionExpression(Box<'a, Function<'a>>) = 19,
    ImportExpression(Box<'a, ImportExpression<'a>>) = 20,
    LogicalExpression(Box<'a, LogicalExpression<'a>>) = 21,
    NewExpression(Box<'a, NewExpression<'a>>) = 22,
    ObjectExpression(Box<'a, ObjectExpression<'a>>) = 23,
    ParenthesizedExpression(Box<'a, ParenthesizedExpression<'a>>) = 24,
    SequenceExpression(Box<'a, SequenceExpression<'a>>) = 25,
    TaggedTemplateExpression(Box<'a, TaggedTemplateExpression<'a>>) = 26,
    ThisExpression(Box<'a, ThisExpression>) = 27,
    UnaryExpression(Box<'a, UnaryExpression<'a>>) = 28,
    UpdateExpression(Box<'a, UpdateExpression<'a>>) = 29,
    YieldExpression(Box<'a, YieldExpression<'a>>) = 30,
    PrivateInExpression(Box<'a, PrivateInExpression<'a>>) = 31,

    JSXElement(Box<'a, JSXElement<'a>>) = 32,
    JSXFragment(Box<'a, JSXFragment<'a>>) = 33,

    TSAsExpression(Box<'a, TSAsExpression<'a>>) = 34,
    TSSatisfiesExpression(Box<'a, TSSatisfiesExpression<'a>>) = 35,
    TSTypeAssertion(Box<'a, TSTypeAssertion<'a>>) = 36,
    TSNonNullExpression(Box<'a, TSNonNullExpression<'a>>) = 37,
    TSInstantiationExpression(Box<'a, TSInstantiationExpression<'a>>) = 38,

    @inherit MemberExpression
}

But we give the end user this well-documented code to read:

/// Expression
///
/// Inherits variants from [`MemberExpression`]. See [`ast` module docs] for explanation of inheritance.
///
/// [`ast` module docs]: `super`
#[visited_node]
#[repr(C, u8)]
#[derive(Debug, Hash)]
#[cfg_attr(feature = "serialize", derive(Serialize, Tsify))]
#[cfg_attr(feature = "serialize", serde(untagged))]
pub enum Expression<'a> {
    BooleanLiteral(Box<'a, BooleanLiteral>) = 0,
    NullLiteral(Box<'a, NullLiteral>) = 1,
    NumericLiteral(Box<'a, NumericLiteral<'a>>) = 2,
    BigintLiteral(Box<'a, BigIntLiteral<'a>>) = 3,
    RegExpLiteral(Box<'a, RegExpLiteral<'a>>) = 4,
    StringLiteral(Box<'a, StringLiteral<'a>>) = 5,
    TemplateLiteral(Box<'a, TemplateLiteral<'a>>) = 6,

    Identifier(Box<'a, IdentifierReference<'a>>) = 7,

    MetaProperty(Box<'a, MetaProperty<'a>>) = 8,
    Super(Box<'a, Super>) = 9,

    ArrayExpression(Box<'a, ArrayExpression<'a>>) = 10,
    ArrowFunctionExpression(Box<'a, ArrowFunctionExpression<'a>>) = 11,
    AssignmentExpression(Box<'a, AssignmentExpression<'a>>) = 12,
    AwaitExpression(Box<'a, AwaitExpression<'a>>) = 13,
    BinaryExpression(Box<'a, BinaryExpression<'a>>) = 14,
    CallExpression(Box<'a, CallExpression<'a>>) = 15,
    ChainExpression(Box<'a, ChainExpression<'a>>) = 16,
    ClassExpression(Box<'a, Class<'a>>) = 17,
    ConditionalExpression(Box<'a, ConditionalExpression<'a>>) = 18,
    FunctionExpression(Box<'a, Function<'a>>) = 19,
    ImportExpression(Box<'a, ImportExpression<'a>>) = 20,
    LogicalExpression(Box<'a, LogicalExpression<'a>>) = 21,
    NewExpression(Box<'a, NewExpression<'a>>) = 22,
    ObjectExpression(Box<'a, ObjectExpression<'a>>) = 23,
    ParenthesizedExpression(Box<'a, ParenthesizedExpression<'a>>) = 24,
    SequenceExpression(Box<'a, SequenceExpression<'a>>) = 25,
    TaggedTemplateExpression(Box<'a, TaggedTemplateExpression<'a>>) = 26,
    ThisExpression(Box<'a, ThisExpression>) = 27,
    UnaryExpression(Box<'a, UnaryExpression<'a>>) = 28,
    UpdateExpression(Box<'a, UpdateExpression<'a>>) = 29,
    YieldExpression(Box<'a, YieldExpression<'a>>) = 30,
    PrivateInExpression(Box<'a, PrivateInExpression<'a>>) = 31,

    JSXElement(Box<'a, JSXElement<'a>>) = 32,
    JSXFragment(Box<'a, JSXFragment<'a>>) = 33,

    TSAsExpression(Box<'a, TSAsExpression<'a>>) = 34,
    TSSatisfiesExpression(Box<'a, TSSatisfiesExpression<'a>>) = 35,
    TSTypeAssertion(Box<'a, TSTypeAssertion<'a>>) = 36,
    TSNonNullExpression(Box<'a, TSNonNullExpression<'a>>) = 37,
    TSInstantiationExpression(Box<'a, TSInstantiationExpression<'a>>) = 38,

    // begin inherit from `MemberExpression` (if you know what inherit means in this context good for you, Otherwise you still get all the information without needing to jump 2 or more times first to the inherit_variants and then to the MemberExpression to understand it.)

    /// `MemberExpression[?Yield, ?Await] [ Expression[+In, ?Yield, ?Await] ]`
    ComputedMemberExpression(Box<'a, ComputedMemberExpression<'a>>) = 48,
    /// `MemberExpression[?Yield, ?Await] . IdentifierName`
    StaticMemberExpression(Box<'a, StaticMemberExpression<'a>>) = 49,
    /// `MemberExpression[?Yield, ?Await] . PrivateIdentifier`
    PrivateFieldExpression(Box<'a, PrivateFieldExpression<'a>>) = 50,

    // end inherit from `MemberExpression`
}

We would also generate the relevant match macros and those are also there to read(without nested declarative macro calls which can confuse newcomers) and maybe even omit the discriminates(in the definition file) since now we can calculate them at comptime (but having them explicitly defined is better in my opinion).

I can't help but find a generated file more understandable than how @inherit and inherit_variants works.

We can call our definition file ast.d.rs to further communicate that it isn't a normal rust file(and there are things like @inherit available).

If we don't want to have magical @inherit we can use this syntax in our definition:


#[inheritance]
enum E {
   A(A),
   #[inherit]
   B(B),
}

overlookmotel commented 4 months ago

Not going to go into the detail of your comment, Rez, because this thread is getting too long. But I agree inherit_variants! is a mess as it is now. There are various ways to fix it, and one of those ways is what you've suggested.

More generally, I've had a think, and have a few comments:

To gen or not to gen?

I have loosened my opposition to generating the AST types. I think what I was really opposed to is defining the schema as JSON (like Hermes/React Compiler does) which is verbose, ugly, and hard to read. Defining the schema as Rust types is much better. If we do the following, I think it would answer all my criticisms:

Name the "schema" Rust files as .def.rs or something to indicate they are "special".
Name all generated files .gen.rs, or put them in a generated directory.
Link the "schema" Rust files into a crate somehow so Rust Analyser gives you "jump to definition" and all that good stuff (jump to definition within the schema, not from outside).
In codegen-ed files, precede every type with a comment // This file is generated. Do not edit. Instead edit source at ../codegen/js.def.rs:321.
Split up codegen-ed AST types into multiple files, not one massive long one.

(not claiming all these ideas as my own, just summarizing in one place)

How to begin

As Boshen says, we don't need to jump to doing this straight away. We can initially write our codegen for e.g. Visit by reading the existing .rs files in oxc_ast, like Traverse's codegen does.

If we're agreed that in the end our "schema" will be written as Rust types, making the change later to "schema files" should not involve a huge change to the codegen.

Schema

AST transfer needs a schema of all the AST types as JSON. So could we please build our codegen foundation along these lines:

Parse definition Rust files.
Generate schema (like layout_inspect's definition types).
Generate output Rust code from the schema.

If we do it like that, we can kill 2 birds with one stone, rather than having to build a 2nd parser/codegen for AST transfer later.

(I think this is the plan anyway, but just repeating to make sure)

`Visit` and `VisitMut`

We should factor in https://github.com/oxc-project/oxc/issues/3392

TypeScript definitions

TS definitions will be trickier to generate from Rust types schema because there isn't an exact correspondence between Rust types and JS types, due to the modifications we do with custom serde serializers to conform to ESTree format.

I suggest making generating TS defs a separate effort. I was intending to tackle that as part of AST transfer impl - probably easier because AST transfer deals in the JS types not the Rust types.

rzvxa commented 4 months ago

I am glad that you've backed off a little bit on your opposition, I believe all of your points are addressable(And I have a vague idea of how to achieve the rust-analyzer compatibility with preluding types and have a mock macro for inheritance to prevent syntax errors).

Since you've written the Traverse codegen, Would you like to also give this one a go?

I can start working on it next week so it is up to you to decide.

Just a note: Even if we don't want to have rust schema we might want to consider starting with extracting type definitions from impls in our ast files, Having separate files like js.rs and js_impl.rs can help to simplify the rust parsing part, We no longer have to parse a bunch of random items and check if they are AST type or not. I think private ast fields all can be pub(crate) as a compromise to get them separated into 2 files.

rzvxa commented 4 months ago

Gleam uses Cap'n proto for its codegen. https://github.com/gleam-lang/gleam/blob/main/compiler-core/schema.capnp

Boshen commented 4 months ago

Apparently we can't transfer an issue from a private repo to a public repo.

I'll chat with @rzvxa on discord and set the milestones once he finishes all cfg related code.

I'll create the issues in oxc when we have a consensus on what to build - the final version is in this issue's description.

Boshen commented 4 months ago

Question: what should be done with code surrounded by inherit_variants!? Feels like a blocker if we want #[visited_node] to be the source of truth.

overlookmotel commented 4 months ago

We could also codegen for all AST types:

Hash impls (which should ignore any Span, ScopeId, SymbolId, ReferenceId fields).
PartialEq impls (also ignoring above fields).
CloneIn impls (#50).

NB: CloneIn could also be implemented on AST types with #[derive(CloneIn)], but it may improve compile time if we codegen the impls, to avoid macro expansion at compile time.

overlookmotel commented 4 months ago

Apparently we can't transfer an issue from a private repo to a public repo.

Oh dear! It would be ideal for all of the conversation in this issue to be made public, so people can see not only what decisions were reached, but also the reasons for reaching them.

Personally, I think this is quite important. When I find code in Oxc which I don't understand the rationale for, I often look at git blame to find out in what PR that code was introduced, and then I read the surrounding issues. This helps me to understand why things have been designed the way they have.

As a workaround, can we temporarily make backlog repo public, transfer the issue to oxc repo, then quickly make backlog repo private again??

overlookmotel commented 4 months ago

Question: what should be done with code surrounded by inherit_variants!? Feels like a blocker if we want #[visited_node] to be the source of truth.

The codegen will need to parse and understand the @inherits Expression syntax. This is what Traverse's current codegen does.

We can use that understanding to codegen all the impls which inherit_variants! macro currently generates, so inherit_variants! becomes much slimmer than it is now.

overlookmotel commented 4 months ago

Sorry @rzvxa I didn't answer your question:

Since you've written the Traverse codegen, Would you like to also give this one a go?

I'd be happy to, but I only have a couple of days before I go away, and am keen to finish off correcting spans and scopes in the transformer before I do. So if you're willing to work on this, by all means go ahead.

rzvxa commented 4 months ago

ASTKind (Milestone 1)

I think this one should be done after Visit and VisitMut Since without those there are going to be a lot of errors that we have to mark as todo. So we have to do a bunch of manual refactoring just to redo it again after generating the new visitors.

Otherwise, it is the easiest to generate and has already been added to the ast_codegen.

Boshen commented 4 months ago

ASTKind (Milestone 1)

I think this one should be done after Visit and VisitMut Since without those there are going to be a lot of errors that we have to mark as todo. So we have to do a bunch of manual refactoring just to redo it again after generating the new visitors.

Otherwise, it is the easiest to generate and has already been added to the ast_codegen.

Oh ok, pick the easiest one to have it codegened so we can get it reviewed, merged, and replaced 😅

rzvxa commented 4 months ago

ASTKind (Milestone 1)

I think this one should be done after Visit and VisitMut Since without those there are going to be a lot of errors that we have to mark as todo. So we have to do a bunch of manual refactoring just to redo it again after generating the new visitors. Otherwise, it is the easiest to generate and has already been added to the ast_codegen.

Oh ok, pick the easiest one to have it codegened so we can get it reviewed, merged, and replaced 😅

I've already did, GetSpan is ready to replace the old one. https://github.com/oxc-project/oxc/pull/3852

Didn't mark it for review so you guys can give a green light to the underlying architecture first.

Edit: We are basically done with both `AstKind` and `GetSpan` generators, We just hold onto the `AstKind` till the visitor generators get ready.

Dunqing commented 4 months ago

These visit functions are outdated, do we have a better way to reimplement them here? It is now difficult to manually update https://github.com/oxc-project/oxc/blob/2fd55b0da22c6c8af9ec23dca976c2b47e5957bb/crates/oxc_semantic/src/builder.rs#L454-L1650

overlookmotel commented 4 months ago

These visit functions are outdated

Outdated as in field visitation order is wrong?

overlookmotel commented 4 months ago

I've moved this issue to main oxc repo as we're currently working on this.

Dunqing commented 4 months ago

These visit functions are outdated

Outdated as in field visitation order is wrong?

or missing visits to some fields

Boshen commented 4 months ago

These visit functions are outdated, do we have a better way to reimplement them here? It is now difficult to manually update

https://github.com/oxc-project/oxc/blob/2fd55b0da22c6c8af9ec23dca976c2b47e5957bb/crates/oxc_semantic/src/builder.rs#L454-L1650

Consensus was manually patch them, we eat our own 💩

rzvxa commented 4 months ago

It can be fine to have a slightly different visitation order for some implementations of our visits if it is necessary for them. The good thing about the generated code is that we have the correct walk order by default but it shouldn't prevent anyone from doing custom walks. For example, a specific implementation might require visiting a statement and then visiting the decorators for it.

It is fine as long as we acknowledge the difference and know it has a reason to be that way.

rzvxa commented 4 months ago

I'm mostly scared of scope events since they can drift apart from the source of truth and we use the source of truth version everywhere but the semantics.

Maybe we should generate the scope events as their own functions? something like scope::enter_statement(visitor) that uses the scope attribute information.

rzvxa commented 4 months ago

or implement scope visit as a trait for our types so we can write stmt.enter_scope(visitor).

Dunqing commented 4 months ago

We need a visit_scope_id method. This way we can easily set the current_scope_id for different ast's scope_id in the semantic builder

overlookmotel commented 4 months ago

Could we pass the AST node (&mut T) into enter_scope? That would avoid need for adding visit_scope_id method.

rzvxa commented 4 months ago

Could we pass the AST node (&mut T) into enter_scope? That would avoid need for adding visit_scope_id method.

I was thinking about this when I saw the @Dunqing's comment. But the type erasure will prevent us from doing anything useful to the T. Maybe we can implement a trait so this can be bound to something like Scoped<T>?

If we don't want to pass in T we can pass AstType + &mut node.scope_id

overlookmotel commented 4 months ago

we can pass AstType + &mut node.scope_id

Maybe that's better. I think just &node.scope_id would do it. That removes need for generics (and only & is required, not &mut, since it's a Cell).

rzvxa commented 4 months ago

That removes need for generics

I used AstType instead of AstKind for the same reason, but if we don't need it, then we shouldn't pass it in. I thought we may want to do some checking depending on the type we are entering but with scopeFlags, it should be enough information already.

and only & is required, not &mut, since it's a Cell

This one I just forgot, You are absolutely right.

Dunqing commented 4 months ago

pass scope_id to enter_scope just enough for now

rzvxa commented 4 months ago

pass scope_id to enter_scope just enough for now

It seems we all agree on this so I'll go ahead and implement it

overlookmotel commented 4 months ago

He done it! #4168

Boshen commented 4 months ago

RFC is done, remains issues can be tracked in https://github.com/oxc-project/oxc/issues/3819

oxc-project / oxc