Node type for Julia syntax

julia-vscode / JuliaWorkspaces.jl

Underlying engine for LanguageServer.jl

Other

5 stars 2 forks source link

Node type for Julia syntax #7

Open davidanthoff opened 1 year ago

davidanthoff commented 1 year ago

At the moment JuliaSyntax produces SyntaxNode for us, and we use that to detect test items and friends. The question, though, is whether that is the right node type for us? My understanding is that it doesn't fully round-trip, which seems like something we need.

I guess one alternative would be to parse into CSTParser.EXPR instead, i.e. reuse our old data structure but use the parsing logic from JuliaSyntax. Not clear to me whether that is a good idea? Would probably be useful if those that have worked more with CSTParser chimed in on that question. Things that to me look not super ideal are: 1) I'd prefer it to be an immutable, 2) I dislike the meta field, I think we should move to a world where any state from the linter/semantic analysis is stored outside of the syntax tree, 3) the parent field also doesn't strike me as ideal, as it means that one can never reuse a child-subtree in a new tree, at least if we move to immutable, 4) having val as String is probably also not ideal? Doesn't that mean a lot of allocations that one could potentially avoid by just using say a SubString that is based on the full original doc, or something like that? 5) I think one of the design goals of EXPR was that it is somewhat similar to Expr with the idea that at some point it might become the official parser for Julia. I think that goal is no longer relevant, so I'm wondering whether one would actually design the data structure differently today if that goal is no longer?

So maybe we should create a new type?

Another consideration is how this plays with the JuliaFormatter types, my understanding is that it uses yet another node type internally, somehow? Maybe there is an opportunity to design one node type that can power both JuliaWorkspace and JuliaFormatter?

davidanthoff commented 1 year ago

I've been reading a bit more Roslyn stuff, in particular https://github.com/KirillOsenkov/Bliki/wiki/Roslyn-Immutable-Trees and https://www.dotnetcodegeeks.com/2015/05/inside-the-net-compiler-platform-performance-considerations-during-syntax-analysis-speakroslyn.html.

It seems to me that they chose this whole green/red node design for memory and performance reasons, but that one really only gets those benefits if one expects the red tree to only be materialized sometimes, and then partially. I am wondering whether that is a) actually ever the case in the LS, and b) still necessary 20 years later? I should say that this has been Zac's position all along :)

On a) it seems to me that there is always some analysis that we will want to run after every single edit. In particular, it seems to me that we would want to always run our diagnostics/linting, and that we will always need a full tree for that. So right now I have a hard time seeing any scenario where we would have a fully materialized green tree, but no red tree. If that is so, then this whole two-part design is probably only creating more memory pressure, right?

On b), it is probably good to remember that the Roslyn design was probably created about 20 years ago... So this was a design that allowed real-time editing of source code 20 years ago. To me it seems that the complexity of code hasn't really gone up, whereas compute power of course has, so maybe this is just all complexity that is no longer needed?

So right now I am tempted to not go with the JuliaSyntax green/red design, but instead just create a new type structure that mimics the Roslyn red tree. If we were, some of the points from my previous post here wouldn't fully apply anymore.

But a key question is whether there are other benefits to this red/green design? So far I haven't found any, but maybe others have?

CC @pfitzseb and @c42f

davidanthoff commented 1 year ago

Another observation: Roslyn's red tree has different types for every kind of syntax node... Have we thought about that? Is there a downside to that?

c42f commented 1 year ago

A few comments on this.

Firstly, I think the design space for the "right tree" design in Julia is wide open, and I don't think JuliaSyntax's GreenNode / SyntaxNode necessarily hits the right places. Mainly because the "right tree" depends on collecting a bunch of use cases and that's not really been done. Not by me at least. Whether there's even "one right tree type" is something I'm skeptical about. Maybe we want a set of functions for tree traversal and mutation instead, and allow different representations as required by use case.

One good thing about JuliaSyntax's parser is that it doesn't use any tree type at all: the tree is represented in a token and range array. This is good for low GC pressure during parsing. Given this, one can opt out of JuliaSyntax.GreenNode as the most raw representation and instead implement JuliaSyntax.build_tree() for your own tree type.

I do have a soft spot for GreenNode because it's got a very well-defined and simple semantic. However it's cumbersome to use directly because one doesn't have direct access to the source text given a green node.

SyntaxNode is more convenient there, but it's also basically AST-like and discards all the trivia. So it's not quite right either.

IIUC rust-analyzer has something called SyntaxNode which is more like the Roslyn red tree - it includes trivia and acts as a convenient cursor onto rust-analyzer's green tree. They still have a green tree, though. (I think rust may include strings for the tokens in their green tree?)

In terms of generalizing the JuliaSyntax tree types to fit other use cases, @timholy has already started this with TypedSyntax.TypedSyntaxNode which is just a kind of JuliaSyntax.TreeNode with different TreeData - see

https://github.com/JuliaDebug/Cthulhu.jl/pull/345 https://github.com/JuliaLang/JuliaSyntax.jl/pull/193

Roslyn's red tree has different types for every kind of syntax node... Have we thought about that? Is there a downside to that?

I've thought about this but I don't really like it. It makes all children abstractly typed and forced dynamic dispatch in a lot of places where we'd otherwise have static dispatch if we used a single node type like SyntaxNode.

An apparent benefit of having many node types is that each type can have named child accessors specifically for that syntactic construct. Then users don't need to remember what node[i] means for each node kind and each i! It's kind of crazy to have this detailed knowledge embedded in user code far and wide. But I think a good getproperty with manual dispatch on the kind might work just as well for this.

Maybe the only place multiple node types would be really neat would be if we wanted the AST to be extensible with new node types. In a crude sense Expr is extensible like this, because the head field can be any arbitrary Symbol and this has some benefits over the fixed JuliaSyntax.Kind. Perhaps this can be improved.

Another observation related to GC, is that for certain use cases one doesn't need to build pointer-based tree data structure at all. It's possible and efficient to represent the entire thing as a small number of arrays and just use cursors to access nodes rather than a pointer to an individual node struct. (Possible caveats around ad-hoc mutation of such trees apply to the obvious implementations.) Anyway, it may be worth looking at the graph data structures elsewhere in the Julia ecosystem and see how they make these things efficient.

davidanthoff commented 1 year ago

Whether there's even "one right tree type" is something I'm skeptical about.

Yep, that completely resonates. I want to use this issue here to figure out what the right tree type for the language server would look like.

Mainly because the "right tree" depends on collecting a bunch of use cases and that's not really been done. Not by me at least.

Nor by me ;) So that is why I am tempted to just follow the design/lead of something else. The Roslyn design is clearly motivated to enable exactly the kind of functionality that we want to provide with the LS, and it was built by some of the most experienced folks in that space, so in my mind that makes it a good starting point. In particular the red tree, after all they managed to built an amazing diagnostic, refactoring etc. functionality on top of that, so presumably it is a useful API for that kind of stuff.

One good thing about JuliaSyntax's parser is that it doesn't use any tree type at all

Yep, that is really fantastic! I do think that presumably the most useful way to do this is to utilize the parser from JuliaSyntax, but maybe have our own tree node types for the LS in the end.

but it's also basically AST-like and discards all the trivia

Yep, that then definitely doesn't make it the right type for the LS.

I do have a soft spot for GreenNode because it's got a very well-defined and simple semantic.

I feel the same way. I am just wondering what benefit it would have to go through two layers of tree types for the LS... The Roslyn rational is simple and clear: memory and GC pressure. But I just don't see how that applies if one builds a red tree for every green tree as well.

Do you have a sense why Rust went for the green/red tree design? Are they building the red trees lazily/on demand?

It makes all children abstractly typed and forced dynamic dispatch in a lot of places

Julia dynamic dispatch is presumably a lot slower than a vtable call say in C#, right? In that case it would presumably be a really bad idea :)

c42f commented 1 year ago

Yep, that is really fantastic! I do think that presumably the most useful way to do this is to utilize the parser from JuliaSyntax, but maybe have our own tree node types for the LS in the end.

Yea I suggest you create a tree type which works for you, and if it makes sense for general use we can incorporate it back into JuliaSyntax at some point. We may need to generalize JuliaSyntax.build_tree() or related machinery to be able to better support your custom tree type without you needing to use the internal fields of JuliaSyntax.ParseStream or reimplement the tree traversal based on the range/token arrays here:

https://github.com/JuliaLang/JuliaSyntax.jl/blob/bd19475bb616c8344d5212a64cc434ecdd1fb5e4/src/parse_stream.jl#L970

but it's also basically AST-like and discards all the trivia

Yep, that then definitely doesn't make it the right type for the LS.

I should add, as a caveat, that a link to the green tree is always available (as node.raw) but this probably isn't super helpful :)

The Roslyn rational is simple and clear: memory and GC pressure. But I just don't see how that applies if one builds a red tree for every green tree as well.

I suppose it depends on whether you need to materialize the full tree all at once to do linting. (eg, does linting top-level statements one at a time work?)

Do you have a sense why Rust went for the green/red tree design?

I don't know why exactly. But I get a sense that the rust-analyzer people are quite obsessed with performance and compactness of these data structures. So they probably thought carefully about it.

One possibility is that typical source code contains a lot of reusable green tree fragments (leaves and near-leaves). So practically this compresses the green tree a fair bit and they don't materialize their red tree. I haven't experimented with this though.

Are they building the red trees lazily/on demand?

Yes, I think (vague memory!) that the rust-analyzer red tree is more like a cursor onto the green tree data, and the precise lifetime/ownership control which Rust provides means that the cursor generally lives allocation free on the program stack.

IIRC rust provides access to nontrivia children by iterating the green tree child list on demand and ignoring trivia. They also have a typed accessor interface for the fields of the various syntax node kinds built on top of this. But this isn't based on dynamic dispatch, it's based on pattern matching the node kind, I think. (Somewhat like I'm suggesting with getproperty above)

Julia dynamic dispatch is presumably a lot slower than a vtable call say in C#, right?

Presumably yes. I haven't read the generic function dispatch code in detail though. I think there might be optimizations for common special cases and a lookup cache (but tree traversal with may different node types is likely to thrash the cache).

timholy commented 1 year ago

In cases where all the dispatch types are concrete it's pretty well optimized; it's when you have to do subtyping that things slow down. But if everything is concrete then it's really hardly any different from an @enum; I'm a big fan of how it's currently designed and wouldn't recommend changing to dispatch via the type system.

c42f commented 1 year ago

Yes I don't intend to change to language-based dynamic dispatch. I think pattern matching is the right way to do dispatch on these kind of data structures, in the sense that it can be both expressive and maximally efficient. The rust tooling agrees :)

The main problem right now is that we don't have a neat/standard way to express pattern matching over our trees (writing a pile of if statements based on kind and children() is not neat!)

@benchung has used MLStyle for this in SemanticAST.jl, for example, you can see this is very succinct here: https://github.com/BenChung/SemanticAST.jl/blob/master/src/analyzer.jl I think we could have something like that eventually (though user code needing to know the meaning of the child ordering of SyntaxNode ... still feels problematic to me)

BenChung commented 1 year ago

Regarding meta, one thing that I've done for JuliaTypechecker that has worked out fairly well is to use Overseer.jl to attach entities to AST nodes and then drop random components onto those entities that store the various metadata that I'm creating while I'm doing analysis. I'm not sure how practical this is (I think that there's a lot of challenges in keeping them up to date/incrementally updating them) but it might be something to think about.

c42f commented 1 year ago

might be something to think about.

Absolutely! I vaguely remember discussing using ECS for annotating nodes once on Zulip? Cool that you're trying it out :)

For attaching metadata sparsely to nodes ECS would be great. If sparsity is typical for analyses it's probably the way to go.

If non-sparse, a straightforward set of attribute arrays looked up based on node id would be fine instead. I think this is how the graphs ecosystem tends to treat things?

c42f commented 1 year ago

The other option is to just embed the annotations/attributes directly in the tree nodes which is the approach we've got with TreeData in JuliaSyntax.TreeNode{TreeData}. But this makes composition of different analysis passes much harder without recreating the entire tree each time.

c42f commented 1 year ago

One possibility is that typical source code contains a lot of reusable green tree fragments (leaves and near-leaves). So practically this compresses the green tree a fair bit and they don't materialize their red tree.

I've done some experiments with this now. The fact that GreenNode is immutable has the interesting side effect that we never heap-allocate most of the green nodes individually: all the green node leaves are stored inline in the child array of their parents.

Some numbers for base/abstractarray.jl which is a moderately large (~100 kb / 3000 line) Julia file:

There are 26534 tokens and 32731 nodes but only 6299 allocations in parsing this file to a green tree. In some sense, we are already winning by a factor of 5x here in terms of GC pressure by having GreenNode be immutable.
81% are leaf nodes which are never separately allocated
8% of nodes have only leaves as children. 73 % of these can be reused leading to only 4466 allocations. (But the lookup to do the reuse is slower than just allocating in this case and the win here doesn't seem amazing.)
For perspective, parsing without tree construction via JuliaSyntax.parse!() costs only 90 allocations.

We see that fully materializing a green tree (or any tree of immutable nodes with high branching factor) is quite GC friendly in Julia! If we used an ECS (or parallel arrays) for attributes and tagged the nodes with IDs, we'd be able to have this even for trees with attachable metadata.

None of this applies to SyntaxNode which is about as bad as you might expect - parsing the same file costs 64108 allocations (but we currently do a lot of extra work here, including converting all strings into Julia values).

c42f commented 1 year ago

I've summarized and recorded some of these thoughts here:

https://github.com/JuliaLang/JuliaSyntax.jl/issues/230

davidanthoff commented 1 year ago

Oh, this is very interesting! So essentially our situation is completely different from Roslyn: there they get rid of GC and memory pressure by aggressively caching and reusing leave nodes (and a smaller number of non-leave nodes). But for us the leave nodes actually don't end up on the heap in the first place. So that is probably great in terms of GC, but at the same time we presumably use more memory than a Roslyn like design? That is probably a good tradeoff?

Thinking about this a bit more, this might on the flipside make a design with more structured node types (say the Roslyn red story where there is a SyntaxNode that has SyntaxNode and SyntaxToken children, and SyntaxToken that has SyntaxTrivia children) less problematic than in Roslyn because we might in general end up with way less allocations because some of these structs could be stored inline in arrays, rather all ending up individually on the heap.

c42f commented 1 year ago

But for us the leave nodes actually don't end up on the heap in the first place.

Yep :)

I don't think we should worry too much more about allocation here - it seems we're in quite a different situation from the C# runtime back when Roslyn was being designed.

we presumably use more memory than a Roslyn like design?

Unclear, but I suspect not: if Julia separately allocated each node we would need memory for the type tag pointer - that lives in the 8 bytes of heap memory prior to the node itself. GreenNode is currently only 16 bytes in size, so an additional 8 bytes of type tag is quite a lot of overhead. As I showed above, even trying to cache the second level nodes doesn't save a whole lot of allocation overall. I suspect the two approaches are similar. But without needing to worry about caching we've avoided some complexity.

c42f commented 1 year ago

We may need to generalize JuliaSyntax.build_tree() or related machinery to be able to better support your custom tree type

https://github.com/JuliaLang/JuliaSyntax.jl/pull/268 should help with this - I generalized build_tree to be able to produce Expr directly.