oxc-project / oxc

⚓ A collection of JavaScript tools written in Rust.
https://oxc.rs
MIT License

Implement `serde::Serialize` on AST types via `#[generate_derive]` #6347

Open overlookmotel opened 2 weeks ago

overlookmotel commented 2 weeks ago

We currently use serde's derive macros to implement Serialize on AST types.

We could use #[generate_derive] to generate these impls instead.

Why is that a good thing?

1. Reduce compile time

serde's macro is pretty expensive at compile time for the NAPI build. We can remove it.

2. Reduce boilerplate

serde's derive macro is less powerful than ast_tools. Because #[derive(Serialize)] is a proc macro, all it can see is the type it is attached to. ast_tools, by contrast, builds a schema of the entire AST, so it knows not just about the type it's deriving an impl for, but also about all the other types and how they link to each other.

Currently we have to put #[serde] attributes everywhere:

#[ast]
#[cfg_attr(feature = "serialize", derive(Serialize, Tsify))]
#[serde(tag = "type")]
pub struct ClassBody<'a> {
    #[serde(flatten)]
    pub span: Span,
    pub body: Vec<'a, ClassElement<'a>>,
}

#[ast]
#[cfg_attr(feature = "serialize", derive(Serialize, Tsify))]
#[serde(tag = "type")]
pub struct PrivateIdentifier<'a> {
    #[serde(flatten)]
    pub span: Span,
    pub name: Atom<'a>,
}

#[ast]
#[cfg_attr(feature = "serialize", derive(Serialize, Tsify))]
pub struct Span {
    start: u32,
    end: u32,
}

Instead, we can use ast_tools in 2 ways to remove this boilerplate:

  1. Make things that we implement on every type the defaults, so they don't need to be stated over and over.
  2. Use ast_tools's knowledge of the whole AST to move the instruction to flatten Span onto the Span type itself. The "flatten this" instruction then does not need to be repeated on every type that contains Span.

#[ast]
#[generate_derive(ESTree)]
pub struct ClassBody<'a> { // <-- no `#[serde(tag = "type")]` attr
    pub span: Span, // <-- no `#[serde(flatten)]` attr
    pub body: Vec<'a, ClassElement<'a>>,
}

#[ast]
#[generate_derive(ESTree)]
pub struct PrivateIdentifier<'a> { // <-- no `#[serde(tag = "type")]` attr
    pub span: Span, // <-- no `#[serde(flatten)]` attr
    pub name: Atom<'a>,
}

#[ast]
#[generate_derive(ESTree)]
#[estree(flatten)] // <-- `flatten` is here now
pub struct Span {
    start: u32,
    end: u32,
}

I think this is an improvement. How types are serialized is not core to the function of the AST. I don't see moving the serialization logic elsewhere as "hiding it away", but rather a nice separation of concerns.
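
To make the idea concrete, here is a rough sketch of the kind of code the codegen could emit once the `type` tag is a default and the flatten instruction lives on Span. It is std-only and writes JSON by hand; `ToEstreeJson`, `write_fields` and `to_json` are names made up for this illustration and are not part of serde or ast_tools. The types are simplified (no lifetimes, `String` instead of `Atom<'a>`).

```rust
// Illustrative only: `ToEstreeJson`, `write_fields` and `to_json` are
// hypothetical names, not an existing oxc or serde API.

/// Simplified stand-in for the real `Span`.
pub struct Span {
    pub start: u32,
    pub end: u32,
}

/// Simplified stand-in for the real `PrivateIdentifier<'a>`.
pub struct PrivateIdentifier {
    pub span: Span,
    pub name: String, // `Atom<'a>` in the real AST
}

pub trait ToEstreeJson {
    /// Write this value's fields as `"key":value` pairs, without braces.
    fn write_fields(&self, out: &mut String);
}

impl ToEstreeJson for Span {
    fn write_fields(&self, out: &mut String) {
        out.push_str(&format!("\"start\":{},\"end\":{}", self.start, self.end));
    }
}

// What the codegen could emit for `PrivateIdentifier`: the `type` tag is
// added by default, and `span` is inlined because `Span` itself carries
// the flatten instruction.
impl ToEstreeJson for PrivateIdentifier {
    fn write_fields(&self, out: &mut String) {
        out.push_str("\"type\":\"PrivateIdentifier\",");
        self.span.write_fields(out); // flattened: no nested "span" object
        // No string escaping in this sketch.
        out.push_str(&format!(",\"name\":\"{}\"", self.name));
    }
}

/// Wrap a node's fields in braces to form a JSON object.
pub fn to_json<T: ToEstreeJson>(node: &T) -> String {
    let mut out = String::from("{");
    node.write_fields(&mut out);
    out.push('}');
    out
}
```

With this shape, neither the tag nor the flattening appears on `PrivateIdentifier` itself; both decisions are made once, in the generator.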

3. Open the door to different serializations

In the example above, Serialize has been replaced by ESTree. This is to allow for different serialization methods in future. For example:

Different serializers for plain JS AST and TS AST

When serializing a plain JS file, we could produce JSON which skips all the TS fields, to make an AST which exactly aligns with canonical ESTree. We'd add a #[ts] attribute to all TS-related fields, and the ESTreeJS serializer would skip those fields. This would make the AST faster to deserialize on the JS side.

The other advantage is the TS-less AST should perfectly match classic ESTree, so we can test it in full using Acorn's test suite.

Users who are not interested in type info can also request the cheaper JS-only AST, even when parsing TS code.
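
A minimal sketch of the skipping logic, assuming each field carries a flag derived from the proposed #[ts] attribute. `Flavor`, `Field` and `serialize_node` are hypothetical names for illustration; field values are passed in as already-serialized JSON to keep the example std-only.

```rust
// Illustrative only: none of these names exist in oxc today.

#[derive(Clone, Copy, PartialEq)]
pub enum Flavor {
    Js, // plain ESTree: TS-only fields are omitted
    Ts, // full AST including TS fields
}

pub struct Field {
    pub key: &'static str,
    pub json: String,  // already-serialized JSON value
    pub ts_only: bool, // would be set from a `#[ts]` attribute
}

/// Serialize a node, dropping TS-only fields when the JS flavor is requested.
pub fn serialize_node(type_name: &str, fields: &[Field], flavor: Flavor) -> String {
    let mut out = format!("{{\"type\":\"{}\"", type_name);
    for field in fields {
        if field.ts_only && flavor == Flavor::Js {
            continue;
        }
        out.push_str(&format!(",\"{}\":{}", field.key, field.json));
    }
    out.push('}');
    out
}
```

In generated code the check would presumably be resolved at codegen time (two separate impls) rather than at runtime; the runtime flag here is only to show the two outputs side by side.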

Serialize to other AST variants

e.g. #[generate_derive(Babel)] to serialize to a Babel-compatible JSON AST.

const {program} = parse(code, {flavor: 'babel'});

Not sure if this is useful, but this change makes it a possibility if we want to.

4. Simplify implementation of custom serialization

Currently we have pretty complex custom Serialize impls for massaging Oxc's AST into ESTree-compatible shape in oxc_ast/src/serialize.rs.

We can remove most of them if we use ast_tools to generate Serialize impls for us, guiding it with attributes on the AST types themselves:

#[ast]
#[generate_derive(ESTree)]
pub struct ObjectPattern<'a> {
    pub span: Span,
    pub properties: Vec<'a, BindingProperty<'a>>,
    #[estree(append_to_previous)]
    pub rest: Option<Box<'a, BindingRestElement<'a>>>,
}
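
For illustration, this is roughly the behaviour #[estree(append_to_previous)] is meant to describe: the optional `rest` is emitted as the last element of the preceding `properties` array instead of as its own field. The sketch is std-only, omits `span`, and takes child nodes as pre-serialized JSON strings; `serialize_object_pattern` is a made-up name.

```rust
// Illustrative only: a simplified `ObjectPattern` whose children are
// already-serialized JSON strings.
pub struct ObjectPattern {
    pub properties: Vec<String>,
    pub rest: Option<String>,
}

/// Emit `rest` (if present) as the final element of `properties`.
pub fn serialize_object_pattern(pattern: &ObjectPattern) -> String {
    let mut items: Vec<&str> = pattern.properties.iter().map(String::as_str).collect();
    if let Some(rest) = &pattern.rest {
        items.push(rest.as_str());
    }
    format!(
        "{{\"type\":\"ObjectPattern\",\"properties\":[{}]}}",
        items.join(",")
    )
}
```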

5. Simplify AST transfer code

AST transfer's JS-side deserializer (and eventually serializer too) can be simplified in the same way: we can generate code for the JS-side deserializer which matches the Rust-side one exactly, without writing the same logic twice and having to keep the two in sync.

6. TS type generation

What "massaging" of the Rust AST we do to turn it into an ESTree-compatible JSON AST is now encoded as static attributes. We can use this to generate TS types, and we can get rid of Tsify.
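
As a rough sketch of what that could look like, the generator below turns a small static description of a type into a TS type def. `TypeDef`, `FieldDef` and `generate_ts` are hypothetical names, the schema shape is an assumption, and the choice to emit flattened types as `extends` clauses is just one option.

```rust
// Illustrative only: a toy schema and TS generator, not ast_tools' real schema.

pub struct FieldDef {
    pub name: &'static str,
    pub ts_type: &'static str,
}

pub struct TypeDef {
    pub name: &'static str,
    /// Types whose fields are flattened into this one (e.g. `Span`).
    pub flattened: Vec<&'static str>,
    pub fields: Vec<FieldDef>,
}

/// Emit a TS interface for one AST type, with the default `type` tag.
pub fn generate_ts(def: &TypeDef) -> String {
    let extends = if def.flattened.is_empty() {
        String::new()
    } else {
        format!(" extends {}", def.flattened.join(", "))
    };
    let mut out = format!("export interface {}{} {{\n", def.name, extends);
    out.push_str(&format!("    type: \"{}\";\n", def.name));
    for field in &def.fields {
        out.push_str(&format!("    {}: {};\n", field.name, field.ts_type));
    }
    out.push_str("}\n");
    out
}
```

The same `TypeDef` values could in principle drive the Rust-side serializer too, which is what keeps the type defs and the JSON output from drifting apart.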

How difficult is this?

serde's derive macro looks forbiddingly complex. But this is because it handles every conceivable case, almost all of which we don't use. The output it generates for our AST types is actually not so complicated.

So creating a codegen for impl Serialize I don't think would be too difficult.

overlookmotel commented 2 weeks ago

JS-only AST (as discussed in point 3 above) has been requested by a user: https://github.com/oxc-project/oxc/issues/6284

Personally, I think it's a completely reasonable ask.

overlookmotel commented 2 weeks ago

@Boshen we have a contributor (@ottomated - see #6284) keen to work on this. Before he gets going, do you see any problem with my proposal above?

overlookmotel commented 2 weeks ago

I spoke to Boshen. He's happy with the direction of this PR. Sounds like @ottomated is ready to get stuck in to implementation.

I suggest doing this in phases:

1. Generate Serialize impls with oxc_ast_tools

Replace:

#[ast]
#[cfg_attr(feature = "serialize", derive(Serialize))]
#[serde(tag = "type", rename = "RestElement")]
pub struct AssignmentTargetRest<'a> {
    #[serde(flatten)]
    pub span: Span,
    #[serde(rename = "argument")]
    pub target: AssignmentTarget<'a>,
}

with:

#[ast]
#[generate_derive(Serialize)]
#[serde(tag = "type", rename = "RestElement")]
pub struct AssignmentTargetRest<'a> {
    #[serde(flatten)]
    pub span: Span,
    #[serde(rename = "argument")]
    pub target: AssignmentTarget<'a>,
}

2. Remove #[serde] attrs boilerplate

#[ast]
#[generate_derive(Serialize)]
#[serde(rename = "RestElement")] // <-- `tag = "type"` removed
pub struct AssignmentTargetRest<'a> {
    // `#[serde(flatten)]` removed - `#[serde(flatten)]` on `Span` struct instead
    pub span: Span,
    #[serde(rename = "argument")]
    pub target: AssignmentTarget<'a>,
}

Handle these in oxc_ast_tools codegen instead.

3. Replace Tsify

4. Rename trait to ESTree

#[ast]
#[generate_derive(ESTree)]
#[estree(rename = "RestElement")]
pub struct AssignmentTargetRest<'a> {
    pub span: Span,
    #[estree(rename = "argument")]
    pub target: AssignmentTarget<'a>,
}

I'm actually not quite sure how to do this, while still using serde::Serialize under the hood.

5. Remove the custom Serialize impls

#[ast]
#[generate_derive(ESTree)]
pub struct ObjectPattern<'a> {
    pub span: Span,
    pub properties: Vec<'a, BindingProperty<'a>>,
    #[estree(append_to_previous)]
    pub rest: Option<Box<'a, BindingRestElement<'a>>>,
}

This is the tricky/interesting part. The idea is to create a kind of domain-specific language (DSL) to cover the various transformations needed to go from Rust AST to JS ESTree AST. That DSL is the #[estree(...)] attributes.

The advantage of a DSL which is static is that we can generate multiple things from it:

  1. Serialize impls.
  2. Deserialize impls (so we can provide an oxc-codegen NPM package).
  3. TS type defs.
  4. "raw" transfer serializer/deserializer.

I'm not completely sure how far we can get with the DSL approach. "append to previous" is a pattern that's used in several types, so it makes sense to make an #[estree(append_to_previous)] attr for it.

But for odd transforms which are only used in one place, we may prefer something like this:

#[ast]
#[generate_derive(ESTree)]
#[estree(via(MyTypeShim))]
pub struct MyType {
    one: u32,
    two: u32,
}

struct MyTypeShim {
    sum: u32,
}

impl From<&MyType> for MyTypeShim {
    fn from(mt: &MyType) -> Self {
        MyTypeShim { sum: mt.one + mt.two }
    }
}

#[estree(via(...))] is analogous to #[serde(from)] and #[serde(into)]. But I'm hoping we can use just 1 "intermediary" type to go in both directions.
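
A sketch of what the generated code for the serialization direction might boil down to, assuming the shim type gets its own impl and the original type just delegates to it. `ToJson` is a made-up trait for this example, and using "MyType" as the `type` tag of the output is also an assumption.

```rust
// Illustrative only: `ToJson` is a hypothetical trait standing in for
// whatever the generated ESTree impl ends up being.

pub struct MyType {
    pub one: u32,
    pub two: u32,
}

pub struct MyTypeShim {
    pub sum: u32,
}

impl From<&MyType> for MyTypeShim {
    fn from(mt: &MyType) -> Self {
        MyTypeShim { sum: mt.one + mt.two }
    }
}

pub trait ToJson {
    fn to_json(&self) -> String;
}

// Impl for the shim: the JSON shape we actually want.
impl ToJson for MyTypeShim {
    fn to_json(&self) -> String {
        format!("{{\"type\":\"MyType\",\"sum\":{}}}", self.sum)
    }
}

// What the codegen could emit for `#[estree(via(MyTypeShim))]`:
// convert to the shim, then serialize that.
impl ToJson for MyType {
    fn to_json(&self) -> String {
        MyTypeShim::from(self).to_json()
    }
}
```

Going the other way (deserialization) would need a conversion from the shim back to `MyType`, which is not possible for a lossy shim like this `sum` example, so whether one intermediary type can serve both directions depends on the transform.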

Is this a good plan?

The above "mini-roadmap" is a suggestion rather than a list of demands! Am totally open to different ways to split up the work.

But I do think we should split it up into multiple steps somehow, because (a) smaller PRs are easier to review and (b) if the effort doesn't reach the finish line, we'll at least get part of the way, and others can continue it later on.

overlookmotel commented 1 day ago

Where we're up to

#6404 and the smaller PRs that followed it have got us to this stage:

derive_estree.rs

We still use #[derive(Serialize)] on a few custom Serialize impls in serialize.rs. Tsify is completely gone.

Next steps

In my opinion the next steps are:

1. Generate TS type defs for oxc-parser package

The reason I think we should do this first is it'd be great to get all the type defs checked into git as a single file, so we'll notice if the types mistakenly get changed during further work.

2. Fix serialization of RegExpLiteral

Currently JSON AST for RegExpLiteral contains the entire parsed regexp Pattern. This is a huge deviation from ESTree, and the serialization of RegExps is generally a mess.

JSON AST should just contain strings for pattern and flags, as ESTree does.

We can remove the EmptyObject hack. That type only exists to produce a value field in the JSON AST, and is otherwise a pointless annoyance!
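
A minimal sketch of the target output, assuming the regexp is carried as two plain strings. The exact JSON shape (a nested `regex` object as in ESTree, and "RegExpLiteral" as the `type`) is an assumption here, and `escape_json` only handles the few characters needed for the example.

```rust
// Illustrative only: simplified `RegExpLiteral` holding the source text of
// the pattern and flags, rather than the parsed `Pattern`.
pub struct RegExpLiteral {
    pub pattern: String,
    pub flags: String,
}

/// Minimal JSON string escaping: quotes, backslashes and newlines only.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            other => out.push(other),
        }
    }
    out
}

/// Serialize the literal with `pattern` and `flags` as plain strings.
pub fn serialize_regexp(lit: &RegExpLiteral) -> String {
    format!(
        "{{\"type\":\"RegExpLiteral\",\"regex\":{{\"pattern\":\"{}\",\"flags\":\"{}\"}}}}",
        escape_json(&lit.pattern),
        escape_json(&lit.flags)
    )
}
```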

3. Improve TS type defs

Previously, type defs were in this style:

export interface BooleanLiteral extends Span {
    type: "BooleanLiteral";
    value: boolean;
}

Now they're like this:

export type BooleanLiteral = ({
    type: 'BooleanLiteral';
    value: boolean;
}) & Span;

I am no TypeScript expert, but I understand from Boshen that the two are almost equivalent, but that there is a slight difference - the interface style gives nicer error messages.

Our ts types came from typescript-eslint, I would model them as such https://github.com/typescript-eslint/typescript-eslint/blob/main/packages/ast-spec/src/expression/ArrayExpression/spec.ts

Are we able to go back to interface?

4. Clean up #[estree] attrs

5. Reduce #[estree(flatten)] boilerplate

See "phase 2" in previous comment.

Primarily I'm talking about Span here. It'd be great not to need #[estree(flatten)] on every single span: Span field.

Note: TSThisParameter has a this_span: Span field which should not be flattened. typescript-eslint doesn't include a span for this, so we can just skip serializing that field, rather than needing an #[estree(no_flatten)] workaround.

6. DSL

🤷 Open to suggestions on how to approach this one!