tskit-dev / tskit-paper

2 stars 4 forks source link

In-memory data structures and trees API #2

Open jeromekelleher opened 3 years ago

jeromekelleher commented 3 years ago

Following on from the section describing the "on file" data entities in the model, I think we should have a section describing the "in memory" data structures.

This is the quintuply linked tree, which we should explain with an example.

Some things to discuss:

jeromekelleher commented 3 years ago

@molpopgen, do you think my statement about the struct-of-arrays quintuply linked tree model being more efficient than standard linked tree structures is true, at least for naive implementations? I'm imaging comparing against a few other implementations:

My intuition would be that for large trees we should see performance for (say) the parsimony algorithm implemented with these different ways increase - I may be entirely wrong though! It's probably a good idea to keep parsimony as a running real-world example.

Any thoughts?

molpopgen commented 3 years ago

So, if your hypothetical linked-list of trees has the form, Container[Tuple(TreeThing, PrevTreeThing, NextTreeThing)], then my intuitions align with yours. Even if both containers (Container and TreeThing) are flattened linked lists, meaning contiguous arrays as opposed to "pointers all over the machine", I'm guessing that it is still true. At the very least, by coupling the edge table to the N-tuply-linked-lists removes a whole ton of redundancy that a tree-first implementation needs to store.

The main way to improve the current tskit setup, I think, may be to collect the "right" parts of those lists into structs, and have some smaller number of arrays of structs. Any time you access more than one feature of node u's role in the tree (say, its parent and its left sibling), you've got cache misses. But knowing exactly what to change here would mean loads of time with perf to record misses on big trees/tree sequences on algorithms that we think most people care about most of the time.

Almost as if by magic, an array-of-structs implementation of this stuff is out there because I was curious myself...

jeromekelleher commented 3 years ago

I guess my goal here is to convince people that the way we're laying things out in memory is a good idea, and I want to challenge latent assumptions that linked structures in memory are the only/best way to do trees. I'm sure there's lots of people out there who automatically assume a simple C++ linked object representation of a tree is the best possible way of doing things (by virtue of being written in C++!).

So, I'm not really bothered about improving tskit, or about having anything that could realistically work well in the context of having the tree changing as we go along the genome. This is just a single tree - how well does tskit stack up against textbook implementations?

molpopgen commented 3 years ago

So, I'm not really bothered about improving tskit, or about having anything that could realistically work well in the context of having the tree changing as we go along the genome. This is just a single tree - how well does tskit stack up against textbook implementations?

Yeah, I'm on board here--my point was more that I think the current way must be quite close to optimal, and that improving is tough.

molpopgen commented 3 years ago

I guess my goal here is to convince people that the way we're laying things out in memory is a good idea, and I want to challenge latent assumptions that linked structures in memory are the only/best way to do trees. I'm sure there's lots of people out there who automatically assume a simple C++ linked object representation of a tree is the best possible way of doing things (by virtue of being written in C++!).

Yep, agreed. Hopefully no one thinks that pointer-based linked lists would be the way to go anymore, but who knows.

Edit: deleted my idea, as it is "changing along the genome"...

jeromekelleher commented 3 years ago

Something like this is what I had in mind @molpopgen:

from __future__ import annotations
import dataclasses
from typing import List

import msprime

@dataclasses.dataclass
class Node:
    time: float
    flags: int
    parent: Node | None = None
    children: List[Node] = dataclasses.field(default_factory=list)

ts = msprime.sim_ancestry(5, random_seed=234)
tree = ts.first()

object_map = {}
for u in tree.nodes(order="postorder"):
    tsk_node = ts.node(u)
    node_obj = Node(tsk_node.time, tsk_node.flags)
    object_map[u] = node_obj
    for v in tree.children(u):
        child_obj = object_map[v]
        child_obj.parent = node_obj
        node_obj.children.append(child_obj)
root = node_obj

print(root)

I'd imagine the starting point would be an idiomatic modern C++ version of this, and then we'd sequentially make it less idiomatic to gain performance (presumably).

molpopgen commented 3 years ago

So, if I'm following you here, we need to:

  1. Define a tree data structure
  2. The preorder iterator method should follow naturally from the tree.
  3. mimic the above operations?

I don't have a good sense of how current code bases are doing 1 in practice. I certainly hope no one is using the built-in linked list at this point in time--its performance impacts should be well-known by now.

jeromekelleher commented 3 years ago

I don't think the textbook pointer-based tree structure is a straw-man @molpopgen - have a look at the Usher code for example. In the first instance, what I'd like to do is this:

  1. Define a C++ equivalent of the above Python chunk to create a pointer-based tree from an input tskit trees file, using the tskit C API.
  2. Use this structure to implement Hartigan parsimony, like here. We should use recursion in the first instance for the postorder and preorder traversals needed, but if this analysis turns out to be interesting we could do less naive things as well.
  3. Plot the timing of this for increasing tree size. To make things interesting I guess we should use "sensible" input genotypes which require a fairly small number of mutations to explain. The easiest way to do this would be to take the genotypes we want to match as input in the specified tskit trees sequence, and let the tskit API generate them. Random genotypes wouldn't be a good input I think, as it would end up putting emphasis on the code for creating and storing the mutations, which wouldn't be typical (the nominal case being, we should expect a small number of mutations for each input set of genotypes).

I'm happy to do most of the work on this, but it would be super helpful if you could set up the C++ infrastructure needed to do it. I guess we can set tskit as a git submodule here?

The goal here is to analyse the effect of different ways of laying out tree structures in memory when we have BIG trees on a real world application. Parsimony seems like a good example, since it requires pre and postorder traversal, and has a fairly typical dynamic programming structure.

molpopgen commented 3 years ago

I just read the usher header and I see what you mean. I'll put something together.

jeromekelleher commented 2 years ago

See also #25