oxc-project / backlog

backlog for collborators only
1 stars 0 forks source link

Visit GAT (Generic Associated Types) + Fold pattern #90

Open rzvxa opened 4 months ago

rzvxa commented 4 months ago

A bit of context before you read this: oxc-project/oxc#4112

I'm working on the AST template macro, My current approach is something like this:

a. use oxc_ast_codegen to generate a visitor with support for every visitable AST type(placed at oxc_ast_quote/src/generated/{here}) that would:

  1. accept an interop resolver or map
  2. visit each node
  3. visit inner nodes to get their TokenStream
  4. produce a TokenStream to create that node at runtime
  5. short circuit on literals and enums that are either unit or wrapping literals

b. proc-macro

  1. substitute and map the interpolation spans to their inner value(which would support identifiers for now but it can be extended to accept rust expressions as well) i. In the initial implementation we use raw string manipulation here ii. as our next step we can add support for QuoteInterpolation in our parser with an ast_quote_parser feature to prevent multiple iterations on the source code(but with caching, we might not need this).
  2. parse the macro input with oxc_parser
  3. check for errors and retrieve the valid program
  4. use the visitor generated in step a to create something like Statement(BlockStatement { ... })
  5. add support for generating tracing boilerplate to mimic what ast_builder would've done before as our single point of construction(e.g. counting allocations in debug builds)

As you might already see it feels like a good place to introduce the Fold pattern to our arsenal, So I'm re-proposing one of my old suggestions which is adding a generic associated type Result to both Visit and VisitMut this way we won't need a new visiting paradigm while being able to selectively return from our visitor implementations.

It goes like this:

pub trait VisitResult: Default {
    fn extend(&mut self, next: Self);
}

impl VisitResult for () {
    fn extend(&mut self, next: Self) {}
}

pub trait Visit<'a>: Sized {
    type Result: VisitResult;

    fn visit_program(&mut self, it: &Program<'a>) -> Self::Result {
        walk_program(self, it)
    }
    // ...
}

pub fn walk_program<...>(visitor: V, it: &Program<'a>) -> V::Result {
    let mut result = V::Result::default();
    let kind = AstKind::Program(visitor.alloc(it));
    visitor.enter_node(kind);
    visitor.enter_scope(
        {
            let mut flags = ScopeFlags::Top;
            if it.source_type.is_strict() || it.directives.iter().any(Directive::is_use_strict)
            {
                flags |= ScopeFlags::StrictMode;
            }
            flags
        },
        &it.scope_id,
    );
    result.extend(visitor.visit_directives(&it.directives));
    if let Some(hashbang) = &it.hashbang {
        result.extend(visitor.visit_hashbang(hashbang));
    }
    result.extend(visitor.visit_statements(&it.body));
    visitor.leave_scope();
    visitor.leave_node(kind);
    result
}

Our current implementations would only need to add this one line:

impl<'a> Visit<'a> for X {
    type Result = ();
    // ...
}

A similar implementation in rustc.

rzvxa commented 4 months ago

@oxc-project/core Would you guys care to pitch in? Both in terms of my general approach and this fold pattern. I'm still in really early stages and haven't even gotten to a working demo so it's just the right time to bikeshed over this.

rzvxa commented 4 months ago

The macro in question is going to have multiple variants and one configurable version which are something like this:

// ideally
let js_stmt = quote_js!(let x = 123).into_in(&self.alloc);
let ts_stmt = quote_ts!(let x: number = 123).into_in(&self.alloc);
let jsx_stmt = quote_jsx!(<div>{123}</div>).into_in(&self.alloc);
let tsx_stmt = quote_tsx!(<div>{123 as T}</div>).into_in(&self.alloc);

// worst case, it'd have all of the above plus
let js_expr = quote_js_expr!(123).into_in(&self.alloc);
let ts_expr = quote_ts_expr!(123 as T).into_in(&self.alloc);

// with configuration
let ast = quote_ast! {
    #![allocator(&self.alloc), source_type(js), as(Vec<'a, Statement<'a>>)]
    let x = 123;
    let y = 456;
    let z = x + y;
};

For now, our interpolation syntax is something like Ruby's which is an invalid js syntax, See this comment for more information.

let stmt = quote_js!(let #{name} = 123).into_in(&self.alloc);

As of right now return value of quote macros are lazy and need to call into_in with the allocator to actually create the nodes(or maybe .build(&self.alloc) to be more explicit.

overlookmotel commented 4 months ago

A lot here, I'm going to go through it one by one:

Parsing style

I suggest parsing everything as TSX, to avoid needing quote_js, quote_ts etc. Just quote_ast!.

In practice I can't imagine many (or any) cases where a snippet needs to be parsed specifically as JS/JSX/TS/TSX to be valid. In unlikely event that we do hit such a case, we could just write out the code explicitly (as we do now) and not use the macro. Simplify for the common case.

Macro name

How about ast!, snippet! or fragment!?

Syntax

I don't think ast!(...).into_in(self.allocator) will work. Where there's multiple levels of Box/Vecs required, the macro will need to insert alloc calls in its output.

I suggest:

trait Builder<'a> {
  fn allocator() -> &'a Allocator;
}

Implement this trait on everything which has access to allocator - e.g. AstBuilder, TransformCtx, TraverseCtx.

Then you can use any of these as first arg to macro for a less verbose style:

let stmt = ast!( self, let x = #{foo}; );
let stmt = ast!( ctx, let x = #{foo}; );
let stmt = ast!( ast, let x = #{foo}; );

Macro can access the allocator via Builder::allocator for use in Box::new_in, Vec::new_in, into_in etc.

Builder trait could be later extended with a method to generate AstNodeIds, if we end up adding them to every node.

Type coercion

syn's parse_quote! macro manages to determine output type via type coercion. I think ideal API (from macro user's perspective) would be if we do the same. So only 1 macro, and no "config" required. e.g. all these would be valid:

let assignment: AssignmentExpression = ast!( self, x = 123 );
let expr: Expression = ast!( self, x = 123 );
let stmt: Statement = ast!( self, x = 123 );
let stmts: Vec<Statement> = ast!( self, x = 123 );

Similarly for interpolation:

let expr: Expression = get_expression_somehow();
// `expr` gets converted to a `Statement`
let func_decl: Statement = ast!( self, function foo() { #{expr} } );

I don't know how parse_quote! achieves this, but I'm guessing something like .into_in(). Maybe it's impractical, or will increase compile times too much, but I think worth investigating.

Scopes

Ideally macro would also generate correct scope/symbol info, e.g. for things like this:

let stmt: Statement = ast!( self, function foo(x) { return x; } );
let name: BindingIdentifier = get_binding_id_somehow();
// NB: `name` in `return #{name}` gets converted to an `IdentifierReference`
let stmt = ast!( self, function foo(#{name}) { return #{name}; } );

Access to scope tree and symbols could be via Builder trait.

Not sure how it'd know what the parent scope is. Maybe macro should only generate internal scope tree, and linking the snippet into existing scope tree should be left to the user.

Probably the stuff with identifiers is too tricky for initial implementation. We could leave that to macro user initially and add more capability to the macro later.

Visitor + macro

This architecture makes sense to me. Nice!

Fold pattern

I'm not opposed to this. Presumably the generic will make Rust compiler generate much more code (multiple versions of Visit, walk_*), but that's probably worth it if it solves a problem.

However, I don't entirely see how extend would work for this use case. Are you sure that's enough?

If it does work, but we don't want to alter current Visit trait, we could alternatively create a new trait in which visit_* methods return Self::Result. Thanks to your codegen, generating a whole new trait is not such a burden.

Side note: Just to mention something I've probably not said out loud before: I am hoping we can condense Visit, VisitMut and Traverse into a one-size-fits-all visitation trait in future. I think Traverse could be extended to do everything that Visit and VisitMut do (and more), and be just as performant. That's just a vague thought at the moment though - may not come to fruition.

GAT

Sorry if I'm ignorant, but what is a GAT?

(side note: I think it's preferable to avoid abbreviations and unexplained compiler jargon in issues if possible - there are smart people out there who could be great contributors to Oxc, but may not be steeped in knowledge of compilers - let's try to be accessible)

Boshen commented 4 months ago

We need a meeting session so I can have a clear mental model of this, before we write down any code.

rzvxa commented 4 months ago

Parsing style

I suggest parsing everything as TSX, to avoid needing quote_js, quote_ts etc. Just quote_ast!.

In practice I can't imagine many (or any) cases where a snippet needs to be parsed specifically as JS/JSX/TS/TSX to be valid. In unlikely event that we do hit such a case, we could just write out the code explicitly (as we do now) and not use the macro. Simplify for the common case.

You are probably right on this one, We might not need to differentiate between them at all. All of these are different invocations of the ast macro with different configurations so at the end if you really need to parse with a specific configuration you can use that.

Macro name

How about ast!, snippet! or fragment!?

I think having the quote in the name is better than having generic names such as ast!, expr! or snippet!. We are quoting another type of code here and it feels natural to use because we already have quote (quote crate) and parse_quote (syn) in the rust ecosystem. Suppose someone has seen any proc-macro code before they'd understand what a quote_js or quote_ast would do(or at least have a guess). I just went with the convention established in that space and used by SWC as well.

Syntax

I don't think ast!(...).into_in(self.allocator) will work. Where there's multiple levels of Box/Vecs required, the macro will need to insert alloc calls in its output.

I planned to wrap the result as a lazy closure and have it be there lazily until you call into_in or possibly build on it. I'd like to avoid passing anything but the valid JS as our macro input. look at this example:

let a = quote!((ctx), (ctx, ));

It just confuses me because we don't want to wrap our code in strings, You can't easily see where the rust expr ends and js one starts.

If Instead of that we create a type that would be something like this:

struct LazyQuoteAst<'a, R, F: FnOnce(&'a Allocator) -> R> {
    fun: F
}

impl<...> LazyQuoteAst<...> {
    pub fn build(self, alloc: &'a Allocator) -> R {
        self.fun(alloc)
    }
}

We can output this in our macro:

LazyQuoteAst { fun: |alloc| {
    // stuff...
}}

And use it like so:

let lazy_a = quote!(let x = 123;);
let lazy_b = quote!(#{lazy_a} let y = 456;);
let stmts = lazy_b.build(some_random_allocator_with_no_need_for_traits);

Type coercion

syn's parse_quote! macro manages to determine output type via type coercion. I think ideal API (from macro user's perspective) would be if we do the same. So only 1 macro, and no "config" required.

This one is something I'm really fond of, It is a bit complicated because we need to rely on our return type to know the result and that should happen in runtime. syn gets away with this because they already parse in peace meals and can afford to have a dedicated Parse implementation for each resulting type, We would like to create the code regardless of the result.

With that we are going to experience runtime errors as opposed to comptimes that's why I think it is going to be hard to adopt this without sacrificing something(either perf or runtime safety).

I do a really nifty trick for distinguishing between when we return Statement vs. Vec<Statement>. I parse the input and see how many statements are there if one we return one otherwise we return a vector. We can force a single ExpressionStatement to be parsed as Expression but it creates other problems for us when we want to explicitly parse expressions as statements.

Scopes

I think this should be left for the user, It isn't part of a template system. And you can easily have your identifier references with something like this:

let name: BindingIdentifier = get_binding_id_somehow();
// NB: `name` in `return #{name}` gets converted to an `IdentifierReference`
let stmt = ast!( self, function foo(#{name}) { return #{get_binding_reference_somehow(name)}; } );

It has a much better macro hygiene as well. It won't do anything magical for you, It just allows you to write less boilerplate.

Fold pattern

I'm not opposed to this. Presumably the generic will make Rust compiler generate much more code (multiple versions of Visit, walk_*), but that's probably worth it if it solves a problem.

About the compile time costs, We only pay for it where we need it so anything with a unit result would be the same as before. But I think we shouldn't go too crazy with it, Right now the plan is to only use it in our comptime, It shouldn't be a common pattern in the project but it allows custom folds which can help in some sticky situations.

However, I don't entirely see how extend would work for this use case. Are you sure that's enough?

As I've said, I'm yet to get to a working demo but It seems promising, think of it as an accumulation on results instead of a traditional fold, If we ever need anything we can just keep it in our result type(for example it can contain ASTNodeId or the nodes themselves).

For this specific case I was thinking about the result as a TokenStream so extend would do just that, We want to add together a bunch of token streams and return them. There might be some edge cases that I haven't thought of yet, We'll find them as we move on with the implementation.

If it does work, but we don't want to alter current Visit trait, we could alternatively create a new trait in which visit_* methods return Self::Result. Thanks to your codegen, generating a whole new trait is not such a burden.

I didn't want to introduce yet another trait for it but a Fold trait might be a good choice here since we can generate it for our macro only and don't put it in the production. It won't have any effect on my process so it is up to Boshen I guess.

Side note: Just to mention something I've probably not said out loud before: I am hoping we can condense Visit, VisitMut and Traverse into a one-size-fits-all visitation trait in future. I think Traverse could be extended to do everything that Visit and VisitMut do (and more), and be just as performant. That's just a vague thought at the moment though - may not come to fruition.

Oh, we talked about this in one of our meetings I remember the discussion. If I'm not mistaken we talked about the possibility of using traverse underneath. But the visitor pattern is just too universal to get rid of. Every developer has used a visitor pattern before so it is easy to understand, Traverse on the other hand is unconventional and only there to solve a limitation of visitors specifically with rust borrow checker and aliasing.

That's why I think we should keep visitors as an easy means to visit nodes where you don't need all of the freedom of Traverse.

Every beginner programming guide has a section for visitors so people should be familiar with it. Just search top 20 design patterns and you'll see every single result would have a visitor section.

GAT

Sorry if I'm ignorant, but what is a GAT?

(side note: I think it's preferable to avoid abbreviations and unexplained compiler jargon in issues if possible - there are smart people out there who could be great contributors to Oxc, but may not be steeped in knowledge of compilers - let's try to be accessible)

Sir, you are absolutely right. Generic Associated Type is a relatively new term(stabilized in 2022?) And I should've used the full name here just to avoid confusion.

Let me know if I've missed anything.

rzvxa commented 4 months ago

We need a meeting session so I can have a clear mental model of this, before we write down any code.

Aye aye cap⚓My schedule was a bit weird this week(and sadly continues to be through the following week), I'll hit you up on Discord so we can figure out a time that would work for everyone.

overlookmotel commented 4 months ago
let lazy_a = quote!(let x = 123;);
let lazy_b = quote!(#{lazy_a} let y = 456;);
let stmts = lazy_b.build(some_random_allocator_with_no_need_for_traits);

Ah ha this is nice. We can also inject other info with this pattern e.g.:

Supporting nested lazies is quite a complication, maybe should skip that initially. It's not too bad for user to do this instead:

let not_lazy_a = quote!(let x = 123;).build(alloc);
let lazy_b = quote!(#{not_lazy_a} let y = 456;);
let stmts = lazy_b.build(alloc);

I think scopes should be left for the user

I disagree. Ideally macro would do it for them. From experience updating the transformer to set scopes/symbols/references correctly, it's a lot of boilerplate code, and quite error-prone.

Type coercion

Ah yes, we can't use type coercion as parsing happens at compile time.

But I do think it's important to be able to generate all kinds of AST nodes. Options:

let expr = quote_expression!( x = 1 );
let expr = quote!( Expression: x = 1 );

I prefer the latter, as otherwise there'll be a ton of quote_* macros, and I think it's good to have the output type as clear as possible in the code (IDE colours Expression, which makes it easier to read).

We can still use parser without altering it, by wrapping the code snippet in appropriate surrounding code before parsing, and then just pull out the part of the AST we want. e.g.:

let params = quote!( FormalParameters: {x = 1}, y, z ).build(alloc);

would wrap the snippet as (function( {x = 1}, y, z ) {}) before parsing, and return program.body[0].expression.expression.params. This is a little bit inefficient (but very little - parser is fast) and that cost is at compile time, not run time.

I'm not saying we have to do all this in first iteration, but just trying to define what our end goal is.

rzvxa commented 3 months ago

Ah ha this is nice. We can also inject other info with this pattern e.g.:

* `quote!(let x = 123;).span(span).build(alloc);`

* `quote!(let x = 123;).scope(scope_id).build(alloc);` (to support generating scope binding for `x`)

Wow! I like how you expanded on this builder to do the scope stuff. It also allows the end user to provide their custom logic via traits. For example, you may want a custom strategy for the allocation or scope management of a specific node(e.g. identifier references), You implement a custom trait for your specific node, And on paper, you should be golden.

trait LazyScope<R> {
    fn build_in_scope(self, semantic: &Semantic) -> R;
}

impl<'a, F> LazyScope<IdentifierReference<'a>> for LazyQuoteAst<'a, IdentifierReference<'a>, F>
    where F: FnMut(&'a Allocator) -> IdentifierReference<'a>
{
    fn build_in_scope(self, &'a alloc: &'a Allocator, semantic: &Semantic) -> IdentifierReference<'a> {
        let node = self.build(alloc);
        // stuff...
        node
    }
}

Supporting nested lazies is quite a complication, maybe should skip that initially. It's not too bad for user to do this instead:

let not_lazy_a = quote!(let x = 123;).build(alloc);
let lazy_b = quote!(#{not_lazy_a} let y = 456;);
let stmts = lazy_b.build(alloc);

Yes sir, We can leave it for later.

I disagree. Ideally macro would do it for them. From experience updating the transformer to set scopes/symbols/references correctly, it's a lot of boilerplate code, and quite error-prone.

I totally agree with your point, What I was trying to say with "...It isn't part of a template system..." was that it should happen through other means than a proc-macro, But you are making quite a strong case for having some simple infrastructure to help with it, And maybe adding it to the quote macros is the way to go. Let's have a chat about it to figure out our requirements for this. It has to wait for the next iterations but if it benefits us to have such a thing in our macro then we should do it.

But I do think it's important to be able to generate all kinds of AST nodes

Yes, we should find a way to do so, To be honest, it is quite a complicated process to be able to produce all of the ast kinds(especially with how interpolations should be substituted). I suggest for the start we only provide 2 macros, quote_stmt for statements and quote_expr for expressions. Later on, we can expand our infrastructure to support different kinds and then we can expose a more general macro called quote.

I think to support this we need an actual parser to parse interpolations which can happen in our parser with a feature flag or we can have 2 passes on the input source.