rust-analyzer / rowan

Apache License 2.0
676 stars 57 forks source link

Consider using succinct data structures for read-only trees #107

Open matklad opened 3 years ago

matklad commented 3 years ago

I've recently been reading on succinct data structures literature, and I am wondering if they might be applicable to rowan. The tagline of SD is to represent data using approximately as few bits as entropy allows, but with keeping all the operations fast.

In particular, for trees, we generally use 2 * n * sizeof<usize> bytes (parent/children pointers). SD allows using roughly 2n bits, while still allowing for efficient parent/child queries. This works due to the following tricks:

It would be interesting to see if it makes sense to use something like this for rowan.

Some reasons to do this:

Some reasons not to do this:

CAD97 commented 2 years ago

Rowan is such nerd catnip for me.

https://doi.org/10.1145/2601073

Fully Functional Static and Dynamic Succinct Trees

For the dynamic case, where insertion/deletion (indels) of nodes is allowed, the existing data structures support a very limited set of operations. Our data structure builds on the range min-max tree to achieve 2n + O(n/log n) bits of space and O(log n) time for all operations supported in the static scenario, plus indels. We also propose an improved data structure using 2n + O(nlog log n/log n) bits and improving the time to the optimal O(log n/log log n) for most operations. We extend our support to forests, where whole subtrees can be attached to or detached from others, in time O(log1+ϵ n) for any ϵ > 0. Such operations had not been considered before.

The forest operations seem especially relevant to rust-analyzer/rowan use cases, since the main mutation use case is in grafting subtrees. It's definitely worth looking into... and I'm going to think about this until I experiment some with it... I sense a sorbus 0.2 coming some time in the future 😅

An interesting constraint that Rowan has over pure academic trees is the actual data storage on each node (in a side table for succinct data structures, typically, AIUI). Both interior nodes (for length offsets of children, for sublinear position search) and terminal nodes (for the actual text) want to themselves be dynamically sized, so size analysis is obviously more complicated, along with node deduplication.

CAD97 commented 2 years ago

One property of Rowan that I'm fairly sure sure succinct trees would lose is the independence of nodes. Currently, if you have GreenNode, you keep just that node and its children alive. With a succinct tree, however, I'm fairly certain that the tree is one unit w.r.t. ownership. (Though I need to dig into how exactly the forest operations work.)

Is this restriction problematic for what rust-analyzer wants? Node payloads would still be able to be cached and deduplicated separately from the trees themselves.

If we're okay giving up partial sub-ownership of a green tree, we can reduce memory usage even without a succinct tree by using indices into an arena rather than pointers (as well as unlocking parent pointers in the green tree, if we want those).