Open jeromekelleher opened 3 years ago
@molpopgen, do you think my statement about the struct-of-arrays quintuply linked tree model being more efficient than standard linked tree structures is true, at least for naive implementations? I'm imaging comparing against a few other implementations:
children
, and allocate these naively (so presumably they don't have much locality)My intuition would be that for large trees we should see performance for (say) the parsimony algorithm implemented with these different ways increase - I may be entirely wrong though! It's probably a good idea to keep parsimony as a running real-world example.
Any thoughts?
So, if your hypothetical linked-list of trees has the form, Container[Tuple(TreeThing, PrevTreeThing, NextTreeThing)]
, then my intuitions align with yours. Even if both containers (Container
and TreeThing
) are flattened linked lists, meaning contiguous arrays as opposed to "pointers all over the machine", I'm guessing that it is still true. At the very least, by coupling the edge table to the N-tuply-linked-lists removes a whole ton of redundancy that a tree-first implementation needs to store.
The main way to improve the current tskit setup, I think, may be to collect the "right" parts of those lists into structs, and have some smaller number of arrays of structs. Any time you access more than one feature of node u
's role in the tree (say, its parent and its left sibling), you've got cache misses. But knowing exactly what to change here would mean loads of time with perf
to record misses on big trees/tree sequences on algorithms that we think most people care about most of the time.
Almost as if by magic, an array-of-structs implementation of this stuff is out there because I was curious myself...
I guess my goal here is to convince people that the way we're laying things out in memory is a good idea, and I want to challenge latent assumptions that linked structures in memory are the only/best way to do trees. I'm sure there's lots of people out there who automatically assume a simple C++ linked object representation of a tree is the best possible way of doing things (by virtue of being written in C++!).
So, I'm not really bothered about improving tskit, or about having anything that could realistically work well in the context of having the tree changing as we go along the genome. This is just a single tree - how well does tskit stack up against textbook implementations?
So, I'm not really bothered about improving tskit, or about having anything that could realistically work well in the context of having the tree changing as we go along the genome. This is just a single tree - how well does tskit stack up against textbook implementations?
Yeah, I'm on board here--my point was more that I think the current way must be quite close to optimal, and that improving is tough.
I guess my goal here is to convince people that the way we're laying things out in memory is a good idea, and I want to challenge latent assumptions that linked structures in memory are the only/best way to do trees. I'm sure there's lots of people out there who automatically assume a simple C++ linked object representation of a tree is the best possible way of doing things (by virtue of being written in C++!).
Yep, agreed. Hopefully no one thinks that pointer-based linked lists would be the way to go anymore, but who knows.
Edit: deleted my idea, as it is "changing along the genome"...
Something like this is what I had in mind @molpopgen:
from __future__ import annotations
import dataclasses
from typing import List
import msprime
@dataclasses.dataclass
class Node:
time: float
flags: int
parent: Node | None = None
children: List[Node] = dataclasses.field(default_factory=list)
ts = msprime.sim_ancestry(5, random_seed=234)
tree = ts.first()
object_map = {}
for u in tree.nodes(order="postorder"):
tsk_node = ts.node(u)
node_obj = Node(tsk_node.time, tsk_node.flags)
object_map[u] = node_obj
for v in tree.children(u):
child_obj = object_map[v]
child_obj.parent = node_obj
node_obj.children.append(child_obj)
root = node_obj
print(root)
I'd imagine the starting point would be an idiomatic modern C++ version of this, and then we'd sequentially make it less idiomatic to gain performance (presumably).
So, if I'm following you here, we need to:
I don't have a good sense of how current code bases are doing 1 in practice. I certainly hope no one is using the built-in linked list at this point in time--its performance impacts should be well-known by now.
I don't think the textbook pointer-based tree structure is a straw-man @molpopgen - have a look at the Usher code for example. In the first instance, what I'd like to do is this:
I'm happy to do most of the work on this, but it would be super helpful if you could set up the C++ infrastructure needed to do it. I guess we can set tskit as a git submodule here?
The goal here is to analyse the effect of different ways of laying out tree structures in memory when we have BIG trees on a real world application. Parsimony seems like a good example, since it requires pre and postorder traversal, and has a fairly typical dynamic programming structure.
I just read the usher header and I see what you mean. I'll put something together.
See also #25
Following on from the section describing the "on file" data entities in the model, I think we should have a section describing the "in memory" data structures.
This is the quintuply linked tree, which we should explain with an example.
Some things to discuss: