oxc-project / backlog

backlog for collborators only
0 stars 0 forks source link

Make AST `#[repr(C)]` #53

Open overlookmotel opened 2 weeks ago

overlookmotel commented 2 weeks ago

I am in favour of making all AST types #[repr(C)].

The effect of #[repr(C)] is to make the memory layout of types predictable.

Why?

1: Treat Rust types as a serialization format

(copied from https://github.com/oxc-project/oxc/pull/3115)

Oxc's direction appears to be increasingly to build up control over the fundamental primitives we use, in order to unlock performance and features. We have our own allocator, our own custom implementations for Box and Vec, our own IndexVec. The AST is the central building block of Oxc, and taking control of its memory layout feels like a step in this same direction.

Oxc has a major advantage over other similar libraries in that it keeps all the AST data in an arena. This opens the door to treating the AST either as Rust types or as pure data (just bytes). That data can be moved around and manipulated beyond what Rust natively allows.

However, to enable that, the types need to be well-specified, with completely stable layouts. #[repr(C)] is the only tool Rust provides to do this.

Once the types are #[repr(C)], various features become possible:

Allowing the AST to be treated as pure data will likely unlock other "next level" features further down the track (caching for "edge bundling" comes to mind).

2: AST transfer

This isn't necessary for AST transfer, but it's desirable, as then we can get rid of a bunch of code in current implementation which has to do introspection of the type layouts at runtime. If types are #[repr(C)], it can be done in build script instead.

3: Memory access optimizations

Control of in-memory type layouts can also allow some runtime optimizations.

For example, if all span field was always placed at the same offset in every struct, then GetSpan::span could be reduced to a single CPU instruction with no branching.

The problem with #[repr(C)]

#[repr(C)] does have the annoying requirement that you have to order the fields of structs in the exact order that they'll be laid out in memory. Without #[repr(C)], Rust automatically re-orders the fields to minimize padding (broadly speaking, order fields in descending order of type alignment). But with #[repr(C)] you have to do this manually.

e.g. BinaryExpression needs to be defined as:

#[repr(C)]
pub struct BinaryExpression<'a> {
    pub left: Expression<'a>,
    pub right: Expression<'a>,
    pub span: Span,
    pub operator: BinaryOperator,
}

This is not ideal. We want field order as written in *.rs files to be:

  1. The order of visitation.
  2. The order that makes sense semantically - memory layout should be an invisible implementation detail.

Proposed solutions

#[repr_stable(field_order(left, right, span, operator))]
pub struct BinaryExpression<'a> {
    pub span: Span,
    pub left: Expression<'a>,
    pub operator: BinaryOperator,
    pub right: Expression<'a>,
}

repr_stable is a proc macro which re-orders the fields in the memory layout. But the field order as written can be the same as it is now - a semantically sensible order.

Alternatively: Rather than specifying memory field order in the #[repr_stable] attribute, codegen (https://github.com/oxc-project/oxc/pull/3815) can figure out what memory layout should be and pass this info to the macro via "side channel" of the AST schema. That way, all types will be optimized for ideal memory layout without any manual work.

That also opens the door to doing most of the work in the codegen, and making the proc macro "dumb", so minimizing its compile-time impact. Pushing this to its furthest extent, something like "Option 2d: Cache macro expansion output" in #14 would make the macro basically free (almost 0 compile time cost).

rzvxa commented 2 weeks ago

Allowing the AST to be treated as pure data will likely unlock other "next level" features further down the track (caching for "edge bundling" comes to mind).

Just a note to myself and others, If we want to go down the binary serialization route, Make sure to add a static versioning to types so we can iterate on it and assert layout compatibility across versions(when the layout changes).