mozilla / mentat

UNMAINTAINED A persistent, relational store inspired by Datomic and DataScript.
https://mozilla.github.io/mentat/
Apache License 2.0
1.65k stars 115 forks source link

[edn] Investigate whether or not a LinkedList is better than a Vec or VecDeque for edn::Value::List #231

Open victorporof opened 7 years ago

victorporof commented 7 years ago

@rnewman Do we really want a LinkedList for this? https://github.com/mozilla/mentat/blob/rust/edn/src/types.rs#L39

Having a VecDeque would make things so much easier, especially for https://github.com/mozilla/mentat/issues/345 because then we could use slices.

rnewman commented 7 years ago

It's unlikely that EDN values will be extensively mutated in place; I think the main cases will be parsing, construction for serialization, and accessing parsed data. That eliminates the main reason for a vec-like data structure: efficient random access.

The data structure in use doesn't really matter for parsers and constructors. If anything, using an EDN list (parens) over a vec (square brackets) implies that perhaps the construction is also list-like — constant-time append is expected. It's unkind to impose the same time and space complexity on two well understood and different data representations. For access to EDN, using a list-like data structure makes the most sense, because it exposes a list-like API. Again, if you wanted a vec, you'd use [].

I assume your question is based on splitting at .... Perhaps you should consider working with iterators instead of slices?

Or if you really need slices, why not convert to whatever representation makes your life easier? The differ is not performance-critical code, being intended for use in testing.

victorporof commented 7 years ago

Agreed. Converting to a sliceable structure would indeed work, but that means slight memory overhead (which as you said, doesn't matter that much). Iterators also work, but that means we'd have to create new iterators for every slice, which also has a bit of overhead. Basically, representing lists as contiguous memory makes just as much sense. And when it comes to pattern matching specifically, vecs and lists are the same.

My point is I've yet to see a need anywhere for distinguishing between vecs and lists in any way other than some trivial surface API which we don't actually use. Changing stuff to fit our needs makes more sense to me than adhering to some abstract definition.

Maybe a better question is: why are edn vecs and lists different? Is the constant-time append the only surfaced implementation detail?

rnewman commented 7 years ago

Iterators also work, but that means we'd have to create new iterators for every slice, which also has a bit of overhead.

You should act as if iterators are free in Rust, because most of the time they are. All an iterator is is some definition of how to get the next item; for a Vec, an iterator compiles down to the same machine instructions as a for loop in C… perhaps even better, because the array bounds checks can be eliminated. Indeed, Rust's for loop over a vector is exactly a call to into_iter and iterator methods. An iterator over a linked list is the same pointer dereferencing that you'd write by hand.

Maybe a better question is: why are edn vecs and lists different?

Simply because Clojure has both lists and vectors, and has syntax for both. (It also has sets and maps, and syntax for both, even though you can implement a set with a map!)

Is the constant-time append the only surfaced implementation detail?

There is an implication, by association with Clojure and a long line of languages, that vectors are contiguous, random-access, expensive to insert into, and have non-constant append time complexity. There's an implication that lists are expensive to index into, are iterable, and have constant time prepend and len. In Java you'd use ArrayList for EDN vectors and LinkedList for EDN lists. In Rust and modern programming there's the implication that vectors are more likely to exhibit beneficial memory layout and cache characteristics, but that's a choice the consumer can make: use []!

Building up lists back-to-front, and using them as a stack, is cheap, with guaranteed constant time operations (right up until you consider the magic that modern processors do). Building up vectors of a known size is cheap.

Implementing both via a Vec would be surprising to someone familiar with a language like Java. Implementing them both via a VecDeque is less surprising, in that inserting at the front is amortized-cheap — an important characteristic of a list — but it's not really constant-time, because the ring buffer needs to be resized periodically.

The only use Mentat has for lists is to express code-like forms:

[:find ?x :where [(fulltext ?x "foo")]]

These are always processed sequentially, they're always small and we could use any data structure we like. The question is whether our public EDN interface, which someone might reuse, behaves as someone might expect lists and vectors to behave, or if we leak internal (simplifying) choices.

rnewman commented 7 years ago

Additional data point: in our parser we turn these into a Vec so that we can get a slice out.