rust-analyzer / rowan

Apache License 2.0
689 stars 57 forks source link

Add SyntaxPtr type #70

Closed CAD97 closed 3 years ago

CAD97 commented 4 years ago

This adds a "trace" style SyntaxPtr "pointer". The pointer can be used to store and resolve paths from the root node.

Paths of up to 15 indices can be stored inline, if all indexes are < 240. Indices > 2095 are unrepresentable (but having that many children seems unlikely), and indices inbetween 240 and 2095 are stored as two "unit"s. Up to 255 units can be stored outlined. That is just a limit of the current implementation rather than the design, though; it would be fairly simple to restructure the outline case to use a u16 length rather than sharing the u8 length. (The single length is simpler, however.)

Unfortunately, this does not change the asymptotic behavior of resolution of SyntaxPtr from the previous (Kind, TextRange) pointers, as we still have to walk w × d nodes to create the red node with stored cumulative offset. However, this pointer design will immediately be O(d) when the .children_with_tokens() iterator has O(1) .nth(), which becomes possible with sorbus.

Closes https://github.com/rust-analyzer/rust-analyzer/issues/1185,k but only receives benefit once the sorbus green tree gives us a random access iterator with cumulative offsets.

matklad commented 4 years ago

Indices > 2095 are unrepresentable

This is an uncomfortably low limit. We have a real-world struct with 800 fields in Rust:

https://github.com/rust-analyzer/rust-analyzer/blob/master/crates/ra_syntax/test_data/accidentally_quadratic#L3-L812

matklad commented 4 years ago

Hm, can we use SmalVec<u8> to represent traces, together with varint encoding?

https://github.com/matklad/varint/blob/master/src/lib.rs

I think I've actually tried this (that's why matklad/varint is a thing), but I don't remember why I didn't make the change in the end...

CAD97 commented 4 years ago

The reason I didn't use SmallVec directly is that it's (ptr: *mut u8, cap: usize, len: usize) where we only need ~SmallSlice<u8> (ptr: *mut u8, len: usize) (and a u16 depth is clearly more than enough).

The 0x0FFF upper bound can be fairly easily extended by taking the simple 0xF_ extender byte and allowing more of them. Allowing two extender bytes would allow encoding 0x00..0xF0 in one byte, 0xF0..0x0FF0 in two bytes, and 0x0FF0..0xFFFF in three bytes (if we stop there, 0xFFF0 if we want to keep going).

I chose this simple encoding because it is conceptually simple (just moving around nybbles) but we could go all the way and use UTF-8's encoding if we are fine with paying the higher decoding cost. (But I don't think it carries any benefit over this adhoc solution until we store indices larger than u16... and sorbus doesn't actually support children arrays longer than u16::MAX anyway.)

Why [matklad] didn't make the change in the end

Probably because the change is strictly negative right now. Pointer resolution still has to touch exactly the same number of nodes, so you're just storing bigger pointers for no reason (until .children_with_tokens() has sublinear .nth()).

CAD97 commented 4 years ago

(I've marked this as a draft as it's explicitly not useful until sorbus (or an equivalent tree rewrite) is implemented.)

CAD97 commented 3 years ago

Closing as outdated for now.