overlookmotel opened 1 month ago
JS-only AST (as discussed in point 3 above) has been requested by a user: https://github.com/oxc-project/oxc/issues/6284
Personally, I think it's a completely reasonable ask.
@Boshen we have a contributor (@ottomated - see #6284) keen to work on this. Before he gets going, do you see any problem with my proposal above?
I spoke to Boshen. He's happy with the direction of this PR. Sounds like @ottomated is ready to get stuck in to implementation.
I suggest doing this in phases:
Phase 1: Serialize impls with oxc_ast_tools
Replace:
#[ast]
#[cfg_attr(feature = "serialize", derive(Serialize))]
#[serde(tag = "type", rename = "RestElement")]
pub struct AssignmentTargetRest<'a> {
    #[serde(flatten)]
    pub span: Span,
    #[serde(rename = "argument")]
    pub target: AssignmentTarget<'a>,
}
with:
#[ast]
#[generate_derive(Serialize)]
#[serde(tag = "type", rename = "RestElement")]
pub struct AssignmentTargetRest<'a> {
    #[serde(flatten)]
    pub span: Span,
    #[serde(rename = "argument")]
    pub target: AssignmentTarget<'a>,
}
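For context, a hedged sketch of roughly what such a generated impl would contain (simplified, not the actual codegen output; it mirrors what serde's derive expands to for this struct):

use serde::ser::{Serialize, SerializeMap, Serializer};

impl Serialize for AssignmentTargetRest<'_> {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        let mut map = serializer.serialize_map(None)?;
        // `tag = "type"` + `rename = "RestElement"`
        map.serialize_entry("type", "RestElement")?;
        // `#[serde(flatten)]` on `span`
        self.span.serialize(serde::__private::ser::FlatMapSerializer(&mut map))?;
        // `rename = "argument"`
        map.serialize_entry("argument", &self.target)?;
        map.end()
    }
}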
In other words, for phase 1:
- Generate the Serialize impls alone.
- Don't change the Serialize output to conform to ESTree yet.
- Keep the existing #[serde] attributes.
- Leave Tsify alone for now.
- Keep the derive feature on the serde dependency (for now).

Phase 2: Remove #[serde] attrs boilerplate

#[ast]
#[generate_derive(Serialize)]
#[serde(rename = "RestElement")] // <-- `tag = "type"` removed
pub struct AssignmentTargetRest<'a> {
    // `#[serde(flatten)]` removed - `#[serde(flatten)]` on `Span` struct instead
    pub span: Span,
    #[serde(rename = "argument")]
    pub target: AssignmentTarget<'a>,
}
Handle these in oxc_ast_tools codegen instead.
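A hedged sketch of that idea (the exact attribute syntax would be decided by the codegen work): declare the flattening once, on Span itself, and let oxc_ast_tools apply it wherever a span: Span field appears.

#[ast]
#[generate_derive(Serialize)]
#[serde(flatten)] // hypothetical placement; the real mechanism may differ
pub struct Span {
    pub start: u32,
    pub end: u32,
}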
Phase 3: Tsify
[x] Completed in #6755.
oxc_ast_tools generates the type def file itself. Remove the manual const TS_APPEND_CONTENT: &'static str = "..."; type defs.
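For context, the manual type defs being removed follow roughly this shape (a sketch assuming the standard wasm-bindgen typescript_custom_section mechanism; the interface body here is purely illustrative):

use wasm_bindgen::prelude::wasm_bindgen;

// Appends a hand-written chunk of TypeScript to the generated .d.ts.
#[wasm_bindgen(typescript_custom_section)]
const TS_APPEND_CONTENT: &'static str = r#"
export interface BooleanLiteral extends Span {
    type: "BooleanLiteral";
    value: boolean;
}
"#;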
Phase 4: ESTree
#[ast]
#[generate_derive(ESTree)]
#[estree(rename = "RestElement")]
pub struct AssignmentTargetRest<'a> {
    pub span: Span,
    #[estree(rename = "argument")]
    pub target: AssignmentTarget<'a>,
}
I'm actually not quite sure how to do this while still using serde::Serialize under the hood.

Phase 5: Custom Serialize impls

#[ast]
#[generate_derive(ESTree)]
pub struct ObjectPattern<'a> {
    pub span: Span,
    pub properties: Vec<'a, BindingProperty<'a>>,
    #[estree(append_to_previous)]
    pub rest: Option<Box<'a, BindingRestElement<'a>>>,
}
This is the tricky/interesting part. The idea is to create a kind of domain-specific language (DSL) to cover the various transformations needed to go from the Rust AST to the JS ESTree AST. That DSL is the #[estree(...)] attributes.
The advantage of a DSL which is static is that we can generate multiple things from it:
- Serialize impls.
- Deserialize impls (so we can provide an oxc-codegen NPM package).

I'm not completely sure how far we can get with the DSL approach. "append to previous" is a pattern that's used in several types, so it makes sense to make an #[estree(append_to_previous)] attr for it (see the sketch below).
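For illustration, a minimal sketch of what a generated impl honouring #[estree(append_to_previous)] could look like, assuming serde; the AppendTo helper type is hypothetical, not existing Oxc code.

use serde::ser::{Serialize, SerializeMap, SerializeSeq, Serializer};

// Hypothetical helper: serializes `properties` and `rest` as one JSON array.
struct AppendTo<'a, 'b>(
    &'b Vec<'a, BindingProperty<'a>>,
    &'b Option<Box<'a, BindingRestElement<'a>>>,
);

impl Serialize for AppendTo<'_, '_> {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        let mut seq = serializer.serialize_seq(None)?;
        for property in self.0 {
            seq.serialize_element(property)?;
        }
        if let Some(rest) = self.1 {
            seq.serialize_element(rest)?;
        }
        seq.end()
    }
}

impl Serialize for ObjectPattern<'_> {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        let mut map = serializer.serialize_map(None)?;
        map.serialize_entry("type", "ObjectPattern")?;
        map.serialize_entry("start", &self.span.start)?;
        map.serialize_entry("end", &self.span.end)?;
        // `rest`, if present, is appended to the end of `properties`,
        // matching ESTree's single `properties` array.
        map.serialize_entry("properties", &AppendTo(&self.properties, &self.rest))?;
        map.end()
    }
}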
But for odd transforms which are only used in one place, we may prefer something like this:
#[ast]
#[generate_derive(ESTree)]
#[estree(via(MyTypeShim))]
pub struct MyType {
    one: u32,
    two: u32,
}

struct MyTypeShim {
    sum: u32,
}

impl From<&MyType> for MyTypeShim {
    fn from(mt: &MyType) -> Self {
        MyTypeShim { sum: mt.one + mt.two }
    }
}
#[estree(via(...))] is analogous to #[serde(from)] and #[serde(into)]. But I'm hoping we can use just 1 "intermediary" type to go in both directions.
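For illustration, a minimal sketch (assuming MyTypeShim itself implements or derives Serialize) of what the codegen could emit for #[estree(via(MyTypeShim))]:

use serde::{Serialize, Serializer};

impl Serialize for MyType {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        // Convert through the intermediary type, then serialize that.
        MyTypeShim::from(self).serialize(serializer)
    }
}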
The above "mini-roadmap" is a suggestion rather than a list of demands! Am totally open to different ways to split up the work.
But I do think we should split it up into multiple steps somehow, because (a) smaller PRs are easier to review and (b) if the effort doesn't reach the finish line, we'll at least get part of the way, and others can continue it later on.
Serialize impls are now codegen-ed by oxc_ast_tools (phase 1 on "roadmap" above).
We still use #[derive(Serialize)] on a few custom Serialize impls in serialize.rs.
Tsify is completely gone.
In my opinion the next steps are:
Step 1: Type defs in oxc-parser package
[x] Completed in #6755.
Generate all the types as a single .d.ts file. Include it in the oxc-parser package (not sure how to combine them with the types generated by napi-rs). Continue to use wasm-bindgen for WASM, but replace all the short const TS_APPEND_CONTENT statements with one giant one including all the types as a single string.
The reason I think we should do this first is that it'd be great to get all the type defs checked into git as a single file, so we'll notice if the types mistakenly get changed during further work.
Step 2: RegExp serialization
Currently the JSON AST for RegExpLiteral contains the entire parsed regexp Pattern. This is a huge deviation from ESTree, and the serialization of RegExps is generally a mess. The JSON AST should just contain strings for pattern and flags, as ESTree does.
We can remove the EmptyObject hack. That type only exists to produce a value field in the JSON AST, and is otherwise a pointless annoyance!
Previously, type defs were in this style:
export interface BooleanLiteral extends Span {
    type: "BooleanLiteral";
    value: boolean;
}
Now they're like this:
export type BooleanLiteral = ({
    type: 'BooleanLiteral';
    value: boolean;
}) & Span;
I am no TypeScript expert, but I understand from Boshen that the two are almost equivalent, but that there is a slight difference - the interface
style gives nicer error messages.
Our ts types came from typescript-eslint, I would model them as such https://github.com/typescript-eslint/typescript-eslint/blob/main/packages/ast-spec/src/expression/ArrayExpression/spec.ts
Are we able to go back to interface
?
Step 4: #[estree] attrs
- Remove #[estree(rename_all = "camelCase")] from enums. Just make camelCase the default. It only seems to be present on fieldless enums.
- Remove #[estree(untagged)]. I think we can decide whether to tag based on if the enum is fieldless or not?
- Replace #[tsify(type = "...")] with #[estree(type(...))] (or #[estree(type = "...")]). That's just a renaming thing.

Step 5: #[estree(flatten)] boilerplate
See "phase 2" in previous comment. Primarily I'm talking about Span here. It'd be great not to need #[estree(flatten)] on every single span: Span field.
Note: TSThisParameter has a this_span: Span field which should not be flattened. typescript-eslint doesn't include a span for this, so we can just skip serializing that field, rather than needing an #[estree(no_flatten)] workaround.
🤷 Open to suggestions on how to approach this one!
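To illustrate the TSThisParameter note above, a hypothetical sketch (the skip attribute name and the exact field list are assumptions, not existing oxc_ast_tools features):

#[ast]
#[generate_derive(ESTree)]
pub struct TSThisParameter<'a> {
    pub span: Span,
    #[estree(skip)] // don't serialize; typescript-eslint has no span for `this`
    pub this_span: Span,
    pub type_annotation: Option<Box<'a, TSTypeAnnotation<'a>>>,
}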
@overlookmotel re: RegExp serialization - this could be a pretty big issue. oxc_codegen handles printing regex patterns by accessing the source text, but I don't know of a way to do this from within a Serialize impl.
Yes, that's not easy within a Serialize impl. It's possible to do with a custom Serializer which holds a reference to the source text, but it'd be a pain.
But personally, I don't think codegen is doing the right thing anyway. It should be converting RegExpPattern back to a string with pattern.to_string(). Ditto flags.to_string().
And that we can do in serialize.
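A minimal sketch of that idea (the regex/pattern/flags field paths are assumptions, and RegexValue is a hypothetical helper): build plain strings with to_string() and serialize those.

#[derive(serde::Serialize)]
struct RegexValue {
    pattern: String,
    flags: String,
}

fn regex_value(lit: &RegExpLiteral<'_>) -> RegexValue {
    RegexValue {
        pattern: lit.regex.pattern.to_string(),
        flags: lit.regex.flags.to_string(),
    }
}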
RegExp parsing was only added fairly recently, so there are still some kinks in how it integrates / doesn't integrate with the rest of Oxc - that's why it's a bit messy at present.
What are our next steps?
I've updated comments above to tick off the items which are now complete. I believe next steps are:
Step 2 – I'm not sure where the code for that should live. In oxc_codegen, I guess? Or do we want to avoid creating an entire codegen context for serialization, and it should live in oxc_regular_expression?
Step 4 – I can get on that, should be pretty quick
Step 5 – also should be pretty easy, wonder if any field named span: Span
should just be hardcoded.
@ottomated feel free to work on 4 and 5, I can take a look at 2 later.
@ottomated feel free to work on 4 and 5, I can take a look at 2 later.
Sounds good. Let me know if there's anything else I need to do for the initial npm release before then, I think I got all the CI right but not sure.
@overlookmotel The cases where enums don't have rename_all = "camelCase" are Type and Kind enums: FunctionType, FormalParameterKind, ClassType, MethodDefinitionType, PropertyDefinitionType, and AccessorPropertyType.
The cases where enums don't have rename_all = "camelCase" are Type and Kind enums
Ah ha! I hadn't realized there are so many. If they're still in the minority, could we make camelCase the default, and add #[estree(no_rename_variants)]
to these odd ones?
This isn't so important either way, just a thought. It's just always nice to remove repeated #[estree]
boilerplate where we can.
I noticed that we still have tsify
, we can try and get rid of it.
Step 5 – also should be pretty easy, wonder if any field named span: Span should just be hardcoded.
Personally I think it's preferable to make everything explicit in attributes on the AST types, rather than hiding away what's happening in ast_tools
. ast_tools
contains the implementation of generating Serialize
impls, but how it does that is guided by the #[estree]
attrs. Ideally we don't want anyone reading the code to have to delve into ast_tools
. We should (at some point) document what all the #[estree]
attributes mean, and just reading those docs should be sufficient to understand.
This will be particularly important if we introduce other JSON output "flavors" in future e.g. #[estree_js]
(i.e. without TS fields) or #[babel]
. At that point, being able to read the AST types and see "oh in ESTree this type is flattened, but in Babel it's not" will be helpful.
I noticed that we still have
tsify
, we can try and get rid of it.
You mean the #[tsify]
attrs? tsify
crate has been fully removed, and ast_tools
now processes those attrs. It's included in "step 4" above to rename them to #[estree]
attrs. At present the word "tsify" in these attrs is misleading.
Steps 4 and 5 should be completed by #6933, #6934, and #6935. I think the next step should be some kind of testbed for identifying differences between our serialization and acorn's, to help identify what features the DSL needs.
Those 3 PRs are merged (great!). I agree next step is to add conformance tests.
However, I think probably we should start with checking that our AST conforms to typescript-eslint
, rather than Acorn. Obviously our AST won't conform to Acorn's because we have all the additional TS fields. And to make it so, we're going to have to mark all the TS fields with #[ts]
and build a 2nd serializer which skips them.
So, in my view, it'd be preferable to get the existing AST conforming to standard, and then tackle the "plain JS" serializer after. What do you think?
The odd thing looking at typescript-eslint
is that I couldn't find a lot of tests. I was expecting to find tons of test fixtures like e.g. Babel has. Any idea where they're hiding?
One problem we'll run into with comparing to Acorn/typescript-eslint
is that our spans are counted in UTF-8 bytes, instead of UTF-16 characters. Probably this will affect very few tests, as the two are equivalent when source text is all ASCII, which is the common case. But it's something we'll need to tackle to pass all tests (related: #959).
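A tiny standalone illustration of the difference (plain Rust, nothing Oxc-specific):

fn main() {
    let source = "let é = 1;";
    // Byte offset of `=` (UTF-8, what Oxc spans use): "é" is 2 bytes in UTF-8.
    let byte_offset = source.find('=').unwrap();
    assert_eq!(byte_offset, 7);
    // UTF-16 code unit offset (what Acorn/typescript-eslint report): "é" is 1 unit.
    let utf16_offset = source[..byte_offset].encode_utf16().count();
    assert_eq!(utf16_offset, 6);
}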
Obviously our AST won't conform to Acorn's because we have all the additional TS fields.
Once these are marked with an attribute macro, we can test against acorn by ignoring them and running on pure js only to begin with. No need for a second serializer yet, we can just make the current serializer turn None
into undefined
instead of null
, which should be implemented anyway.
Are the typescript-eslint tests here?
Once these are marked with an attribute macro, we can test against acorn by ignoring them and running on pure js only to begin with. No need for a second serializer yet, we can just make the current serializer turn None into undefined instead of null, which should be implemented anyway.
I'm not sure that approach is ideal, for performance reasons.
If None
fields are null
in JSON AST, then every object representing an AST node of a certain kind always has the same "shape" i.e. the same fields, in same order.
But if None
translates to undefined, that has to mean that field is not present at all in the object when it's None
, because JSON does not support {"foo":undefined}
. JSON.stringify({foo: undefined}) === '{}'
. So then the shape of those objects is variable. For example, a CallExpression
will sometimes have a typeParameters
field, and sometimes it won't. As far as JS engine is concerned, those are 2 completely different "shapes".
So then a function which processes CallExpression
s will not be monomorphic, because it can get called with 2 different "shapes". This impedes the JS engine's ability to optimize that function, which can have quite a large negative effect on performance. https://mathiasbynens.be/notes/shapes-ics
We can produce 2 different "flavors" of AST, each of which consistently has TS fields, or doesn't have TS fields. But producing an AST which flips between the two patterns is going to make writing performant code using the AST impossible.
So... the question, in my opinion, is which "flavor" of AST do we want to test first? I was proposing the with-TS version, since that's what we already have.
If your particular interest is the TS-less AST, and that's what you want to work on, then it's totally legitimate to say you want to do that first. But I do think we'd need a 2nd serializer for that.
Are the typescript-eslint tests here?
Yes, those are the ones I found. But there aren't very many!
I was hoping to find something more like this with loads of .js
files and a corresponding .json
file containing the parsed AST for each: https://github.com/babel/babel/tree/main/packages/babel-parser/test/fixtures
It seems hard to believe that there isn't a large set of fixtures for typescript-eslint
somewhere, but I just can't locate them!
If there aren't any, we do have another option. We have all the Test262 and TypeScript fixtures already. We could parse them all with typescript-eslint
and dump the ASTs to JSON files. Then we run Oxc's parser on all the same fixtures, serialize each to JSON using our serializer, and diff against the typescript-eslint
JSON ASTs. If our AST matches typescript-eslint
's, then they should all be identical.
Hmm, the monomorphism argument is interesting. However, I think it might be overoptimizing at this point:
const raw_ast = new Uint8Array(...);
const types = ["Program", "Expression", "Literal", ...];
const node = {
  get type() {
    return types[raw_ast[0]];
  }
  // ...
}
If the optimization you're worried about is JIT rather than transfer speeds, then it would be more effective to serialize to a compact binary format and fully deserialize on the JS side, rather than raw transfer.
Looking at acorn, they don't seem to have a lot of tests either. I think Test262 is the way to go.
Right now, while we're using JSON, serializing these fields as undefined saves a LOT of deserialization time and FFI overhead by making the JSON smaller
You might be right. But I think you also might not be! It may well be that JSON.parse
also benefits from predictable object shapes. I don't know the details of V8's internals well enough to say for sure either way. But, while working on the initial POC impl of raw transfer (#2457), I found that consistent object shapes had a very significant effect - removing a single ...
spread operation from the deserializer code yielded a 40% speed-up.
So personally I don't think we can assume necessarily that shorter JSON = faster deserialization.
This (overly simple) benchmark seems to hint that my guess may be correct: https://jsbench.me/1sm2tbh3lb/1
JSON.parse on the JSON with unpredictable object shapes (3rd test case) is 27% slower than the first two. Even the one where the JSON string is shorter (4th case) is still 19% slower.
When we're at the raw transfer stage...
The initial implementation of raw transfer will just deserialize the whole AST to JS objects, so it replaces JSON.parse
, but should produce identical output to current JSON-based serializer. So essentially it is a compact binary format, just using Rust's native type layouts as the binary format instead of e.g. protobuff.
Later on, we'll produce an AST visitor which performs lazy deserialization. I'd imagine that some fields (e.g. type
) would be eagerly deserialized, and only more expensive fields (other nodes) would be behind getters. Either way, as you say, at that point the issue goes away - it will always produce objects with the same shape, and JIT optimizer will get to work at least some of its magic. But all of that is some way off, and raw transfer will be introduced initially behind an "experimental" flag, with the current JSON-based scheme still the default. So current JSON-based scheme will be around for some time, and I'm not keen to hobble its performance.
It sounds like we're maybe not going to agree on this. I feel quite strongly about consistent object shapes, and it sounds like you are coming from a different direction.
So... can we put that aside for a minute and talk about other ways to crack this nut?
If what you're really interested in working on is the TS-less AST, let's consider how hard it'd be to build a 2nd serializer, which can then be tested against Acorn's fixtures. I am guessing we'd need to:
- Mark all TS-only fields with #[ts].
- Create ESTreeJsSerializer and ESTreeTsSerializer, which both implement serde::Serializer. They'd both be just thin wrappers around serde's original serializer.
- Generate impl Serialize for T { fn serialize(&self, serializer: ESTreeTsSerializer) { ... } } and impl Serialize for T { fn serialize(&self, serializer: ESTreeJsSerializer) { ... } }. The 2nd would skip serializing fields marked #[ts].

Do you think that'd work? If so, do you think it's achievable without a huge amount of effort?
Wow, very unintuitive result to me but I verified it by parsing checker.ts
, naively replacing all null fields with undefined, and benchmarking those
It looks like even though the json is a million characters longer, it still deserializes faster. Of course, this doesn't measure FFI transfer or serde's serialization, but it looks like your point is correct. I still believe my point in #6284 about working with the AST objects is valid but that's less of a priority.
The thing we absolutely will need to do is marking the fields with #[ts]. After that, not sure what the best way to have multiple serialization options is. Feels like having double the impls would increase binary size by quite a lot. Other possibilities:
- Some kind of global context like how SWC does it
- Custom json serializer (shouldn't be that hard, right?)
Thanks loads for testing out my hunch properly. Unintuitive results are what keeps life interesting!
You are right that the extra fields may cost something on serialization side (although also sometimes doing more work unconditionally is cheaper than an unpredictable branch). And maybe larger strings cost more to pass from Rust to JS. On the other hand, I am as sure as I can be that for the consuming code on JS side, consistent object shapes is a definite perf win. Whether all these various factors add up to a gain or loss in total is hard to say, and may depend on the use case, size of ASTs, and number of ASTs processed in a session (do the functions receiving AST nodes get called enough for JS engine to optimize them?). But my guess is that on balance consistent object shapes probably wins.
Some kind of global context like how SWC does it
You're right that 2 serializers would affect binary size. How much I don't know. If we want to reduce that, we could instead use a runtime flag e.g.:
struct ESTreeSerializer {
    inner: serde_json::ser::Serializer<Vec<u8>>,
    include_ts_fields: bool,
}

impl<'s> Serializer for &'s mut ESTreeSerializer {
    fn serialize_bool(self, value: bool) -> Result<()> {
        self.inner.serialize_bool(value)
    }
    // ... etc ...
}

impl Serialize for CallExpression {
    fn serialize(&self, serializer: &mut ESTreeSerializer) -> Result {
        let mut map = serializer.serialize_map(None)?;
        // ...
        if serializer.include_ts_fields {
            map.serialize_entry("typeParameters", &self.type_parameters)?;
        }
        // ...
    }
}
Is that what you meant by "global context"? I'm not familiar with SWC's serialization method.
Custom json serializer (shouldn't be that hard, right?)
Serde's implementation is designed to be flexible, supporting multiple serialization formats, so not particularly optimized for JSON. So yes, we could very likely produce a more efficient serializer designed specifically for JSON format, and with static knowledge of the object shapes, to condense what is currently many individual writes to the buffer down to a single write for chunks of statically knowable text e.g. {"type":"CallExpression","start":
. And yes, maybe it's not so difficult - you already did the hardest part.
This is definitely worth trying, and might be a significant speed-up. But... I think we should leave that until later. I suggest we stick with serde for now - as the old saying goes: get it right first, then optimize.
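As a rough sketch of that idea (not Oxc code; a plain String stands in for whatever buffer type is used): statically-known chunks are pushed in one go, and only dynamic values are formatted.

fn write_call_expression_prefix(buf: &mut String, start: u32, end: u32) {
    // One push for the statically-knowable chunk...
    buf.push_str(r#"{"type":"CallExpression","start":"#);
    // ...then only the dynamic parts are formatted individually.
    buf.push_str(&start.to_string());
    buf.push_str(r#","end":"#);
    buf.push_str(&end.to_string());
    // callee, arguments etc. would follow here
}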
I still believe my point in #6284 about working with the AST objects is valid but that's less of a priority.
Can you explain this a bit more please?
I like the runtime flag more than multiple impls. I think SWC uses a global static cell that configures serialization options.
We actually should be able to solve this by generating multiple type definition files, i.e. ast-js.d.ts and ast-ts.d.ts.
Another idea: we could type the fields like this:
interface Foo {
    typeAnnotation?: TypeAnnotation | null;
}
notice the ?:
- we would still send null
over the wire but if the consumer is building the nodes they wouldn't have to provide every field. Would take a bit more effort during deserialization maybe, but that's way later.
I like the runtime flag more than multiple impls.
OK cool. Let's go with that.
I think SWC uses a global static cell that configures serialization options.
Why on earth do they do that? That's a dreadful idea! How then do you serialize multiple ASTs on different threads simultaneously with different options? Plus the cost of synchronization on every access. Ahem... I'll stop being rude about SWC now...
#6284 is more related to typescript DX when doing codegen in JS.
OK, now I get it. As I think I mentioned, I don't use TypeScript myself, so I forget that this kind of thing is a pain when you're not in wild west anything-goes type-less JS land.
We actually should be able to solve this by generating multiple type definition files, i.e.
ast-js.d.ts
andast-ts.d.ts
.
I'm in favour of that. The JS and TS "flavoured" ASTs are different structures, and should have separate types.
typeAnnotation?: TypeAnnotation | null;
I'm less keen on this, but maybe we could do it too if it makes a huge difference to ergonomics. As you say, getting data back into Rust side to run codegen is a way off.
We could also at some point introduce utility builder methods like @babel/types. They put the optional params last, and if you leave out those params they default to null
/false
. e.g. t.arrayPattern(elements)
returns {type: "ArrayPattern", elements, decorators: [], optional: false, typeAnnotation: null}
.
Of course, this is not ideal from perspective of writing monomorphic code, but it's certainly convenient.
@overlookmotel Running into a few issues trying to take the approach we talked about. The Serialize
trait requires the serialize
function to take a type parameter, making it impossible to restrict the type of serializer
to &mut ESTreeSerializer
. If I make a custom ESTreeSerialize
trait, we run into issues with nested serialization:
let mut map = serializer.serialize_map(None)?;
map.serialize_entry("language", &self.language)?;
// ^ the trait bound `source_type::Language: Serialize` is not satisfied
There might be a way to make this work within serde's infrastructure, but it's beyond me if so.
Damn it! I thought I did this before, but now I realize it was rkyv that I built a custom serializer for, not serde. rkyv helpfully defines their Serialize trait as trait Serialize<S: Serializer>, which allows this kind of thing.
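To make the contrast concrete, a hypothetical sketch (loosely modelled on rkyv's approach, not its actual API): putting the serializer type on the trait itself lets impls be written against one concrete, stateful serializer.

// serde: `fn serialize<S: Serializer>(&self, serializer: S)` - the caller picks
// `S`, so an impl cannot demand `&mut ESTreeSerializer` specifically.
//
// With the serializer as a trait parameter, it can:
trait StatefulSerialize<S> {
    fn serialize(&self, serializer: &mut S);
}

// e.g. `impl StatefulSerialize<ESTreeSerializer> for CallExpression<'_> { ... }`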
I got some of the way here.
The sticking points are:
- FlatMapSerializer
- SerializerMap::serialize_entry requiring the value param to implement Serialize.

I can see a few possible ways forwards:
1. Wrap SerializerMap and get rid of FlatMapSerializer.
2. Fork serde_json and alter it to make the Serializer have state.
3. Smuggle state into serialize methods in values.
4. Use thread-local storage.
5. Ditch serde altogether!

To expand a bit on these options, as I see them:
Option 1: Wrap SerializerMap
A good first step would be to get rid of FlatMapSerializer. This is mostly used for Span, and because our codegen already knows what Span's fields are, we could replace this:
self.span.serialize(serde::__private::ser::FlatMapSerializer(&mut map))?;
with:
map.serialize_entry("start", &self.span.start)?;
map.serialize_entry("end", &self.span.end)?;
Then we'd expand on my starting point (or probably you did much the same). We'd need to add a WrappedSerializerMap
, and probably implement WrappedSerialize
on Option
, and some other bits.
Option 2: Fork serde_json
Maybe this isn't so bad. We could use the existing serde_json::ser::Formatter, which contains some of the harder stuff, rather than forking it. Maybe this option ends up quite similar to option (1).
Option 3: Smuggle state into serialize methods in values, instead of storing it in the Serializer
Generate wrapper types for all AST types and implement Serialize on them:
#[derive(Clone, Copy)]
struct State { include_ts_fields: bool }

struct UnaryExpressionWithState<'a, 'b>(&'b UnaryExpression<'a>, State);
struct ExpressionWithState<'a, 'b>(&'b Expression<'a>, State);
struct UnaryOperatorWithState<'b>(&'b UnaryOperator, State);
struct SpanWithState<'b>(&'b Span, State);

impl<'a, 'b> Serialize for UnaryExpressionWithState<'a, 'b> {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        let value = self.0;
        let state = self.1;
        let mut map = serializer.serialize_map(None)?;
        map.serialize_entry("type", "UnaryExpression")?;
        let span = SpanWithState(&value.span, state);
        span.serialize(serde::__private::ser::FlatMapSerializer(&mut map))?;
        let operator = UnaryOperatorWithState(&value.operator, state);
        map.serialize_entry("operator", &operator)?;
        let argument = ExpressionWithState(&value.argument, state);
        map.serialize_entry("argument", &argument)?;
        if state.include_ts_fields {
            // Different logic for TS
        }
        map.end()
    }
}

impl<'a, 'b> Serialize for ExpressionWithState<'a, 'b> { /* ... */ }
impl<'b> Serialize for UnaryOperatorWithState<'b> { /* ... */ }
impl<'b> Serialize for SpanWithState<'b> { /* ... */ }
Advantage: No need to reimplement any of serde_json
.
Option 4: Thread-local storage
Similar to what you said SWC does, but using thread-local storage rather than a global static.
Option 5: Ditch serde
Completely reimplement serde. As you said above, maybe this isn't so difficult. And probably we can make it more efficient anyway by combining a bunch of separate string pushes into one. So it's tempting.
However, I can see a large downside: Eventually we'll want to implement Deserialize
on types too (for getting AST back into Rust from JS), and that's a lot more complex than Serialize
. Not having serde's help with that might be pretty daunting.
serde_state crate looks interesting, though appears unmaintained.
On balance, I think my preference is option 3 (pass state in values). This would be a real chore if we had to write it by hand, but luckily we can generate all the wrappers with the codegen, and it's the least intrusive.
I think I would lean towards option 1, which seems like the least amount of work for something that won't impact performance. 2 seems like a lot of work, 3 feels ugly to me, and 4 is I think what swc does actually use.
Alternatively, option 5 is also attractive. It's the kind of thing that could start as simple code and be incrementally optimized. Deserialization could even still use serde — it's not like we've implemented any Deserialize traits yet.
A good first step would be to get rid of
FlatMapSerializer
Harder than it sounds because of BindingPatternKind (an enum) being flattened.
(3) is indeed ugly, but I thought it'd be the easiest to do.
I'm not a fan of using thread local storage - it has a cost every time you access it. But, beyond that, whichever method you prefer.
If you do want to go for option (5) and remove serde_json
entirely, then go for it. It'd probably be good to use CodeBuffer
as the string builder, which is fairly well-optimized (and I intend to optimize it further in future). If you want to do that, please move CodeBuffer
into oxc_data_structures
crate (in a separate PR).
Yes BindingPatternKind
is a pain! But maybe we could at least simplify flattening structs. That'd cover Span
, which is the majority of cases, and would avoid having to deal with BindingPatternKind
.
We may need to get other external state (source text) into the serializer - https://github.com/oxc-project/oxc/pull/7211#issuecomment-2470544434. Option (3) was not so bad when State
was just a single bool
, but looks much less attractive if we also need it to contain a &str
(16 bytes).
We currently use serde's derive macros to implement Serialize on AST types. We could use #[generate_derive] to generate these impls instead.

Why is that a good thing?

1. Reduce compile time

serde's macro is pretty expensive at compile time for the NAPI build. We can remove it.

2. Reduce boilerplate

serde's derive macro is less powerful than ast_tools. Because Serialize is a macro, all it knows about is the type that #[derive(Serialize)] is on. Whereas ast_tools builds a schema of the entire AST, so it knows not just about the type it's deriving an impl for, but also all the other types too, and how they link to each other.

Currently we have to put #[serde] attributes everywhere. Instead, we can use ast_tools in 2 ways to remove this boilerplate, e.g. use ast_tools's knowledge of the whole AST to move the instruction to flatten Span onto the Span type itself. The "flatten this" instruction does not need to be repeated on every type that contains Span.

I think this is an improvement. How types are serialized is not core to the function of the AST. I don't see moving the serialization logic elsewhere as "hiding it away", but rather a nice separation of concerns.

3. Open the door to different serializations

In the example above, Serialize has been replaced by ESTree. This is to allow for different serialization methods in future. For example:

Different serializers for plain JS AST and TS AST
When serializing a plain JS file, we could produce JSON which skips all the TS fields, to make an AST which exactly aligns with canonical ESTree. We'd add a #[ts] attribute to all TS-related fields, and the ESTreeJS serializer would skip those fields. This would make the AST faster to deserialize on the JS side.
The other advantage is that the TS-less AST should perfectly match classic ESTree, so we can test it in full using Acorn's test suite.
Users who are not interested in type info can also request the cheaper JS-only AST, even when parsing TS code.

Serialize to other AST variants
e.g. #[generate_derive(Babel)] to serialize to a Babel-compatible JSON AST.
Not sure if this is useful, but this change makes it a possibility if we want to.

4. Simplify implementation of custom serialization

Currently we have pretty complex custom Serialize impls for massaging Oxc's AST into ESTree-compatible shape in oxc_ast/src/serialize.rs. We can remove most of them if we use ast_tools to generate Serialize impls for us, guiding it with attributes on the AST types themselves.

5. Simplify AST transfer code

AST transfer's JS-side deserializer (and eventually serializer too) can be simplified in the same way, generating code for the JS-side deserializer which matches the Rust-side one exactly, without writing the same logic twice and having to keep them in sync.

6. TS type generation

What "massaging" of the Rust AST we do to turn it into an ESTree-compatible JSON AST is now encoded as static attributes. We can use this to generate TS types, and we can get rid of Tsify.

How difficult is this?

serde's derive macro looks forbiddingly complex. But this is because it handles every conceivable case, almost all of which we don't use. The output it generates for our AST types is actually not so complicated. So I don't think creating a codegen for impl Serialize would be too difficult.