tskit-dev / tskit

Population-scale genomics
MIT License
153 stars 72 forks source link

Method to create brick table #1345

Open awohns opened 3 years ago

awohns commented 3 years ago

One helpful enhancement to tskit is to create a "brick table" from a tree sequence. A brick table is an augmented edge table where edges are bifurcated when they have different descendants across adjacent trees. These bifurcated edges can be used to trace tracts of genome which are inherited as units. This is useful in applications where one is interested in patterns of descent where position matters, for instance if looking at which bits of an ancestral haplotype are inherited by its descendants. The term "brick" refers to how the result can be visualised as stacked bricks.

For example, consider the following tree sequence: image

Note that the edge from node 6 to node 7 exists in both trees, but has different descendants. The brick version of this tree sequence would bifurcate this edge, resulting in two edges with 6 as the child and 7 as the parent, the first from 0-0.62 and the second from 0.62-1.

I've implemented this in a straightforward manner using ts.trees() and ts.edge_diffs() where we identify nodes with different parents across adjacent trees. We then climb the adjacent trees from the node with different parents, bifurcating edges along this path which straddle both trees.

I'm not sure what the best name for this method is... one idea is to simply call it split_edges. It would also make sense add an option map_edges, along the lines of map_nodes in simplify. If this parameter is specified, the method would returns a tuple containing the resulting tree sequence and a numpy array mapping edge IDs in the current tree sequence to their corresponding edge IDs in the returned tree sequence.

If this sounds interesting to the community, I can create a PR with my implementation.

petrelharp commented 3 years ago

Yes! And, I like split_edges. Question: is it different descendant nodes or different descendant samples? I think the latter?

jeromekelleher commented 3 years ago

Sounds good to me too - split_edges sounds right. I guess we should consider how we can parameterise this so that we can specify how much to split. You could imagine a version of split edges where you split every edge in each tree, so that the tree sequence becomes a JBOT (I don't know if this is useful, just thinking through the possibilities). I guess we could have a by="tree"|"node"|"sample" argument, defaulting to "sample"?

We don't need to actually implement these, I just want to make sure we're not painting ourselves into a corner API wise.

petrelharp commented 3 years ago

Thinking ahead, perhaps the next thing to implement is the analogue of genetic_relatedness but for edges, and so if we've first split up the edge table then we'll get it for bricks. (need to write down the details though)

hyanwong commented 3 years ago

I like the description of a "brick" table. It makes it much clearer what the split_edges function is doing. It would probably be even clearer if we could visualise the edges as horizontal bars (i.e. really like Tetris bricks). There's an argument that there should be some viz method that shows this somehow (perhaps ts.tables.draw_svg(), but then I'm not sure how we represent the mutations table).

We should mention that tables.edges.squash_edges provides what is essentially the opposite operation (but note that we might want to implement tables.squash_edges to produce a valid tree sequence if nodes are not indexed in time order (see https://github.com/tskit-dev/tskit/issues/808)

p.s. another way to think of a "brick table" is that if you pick any edge, all the edges below it can be dropped downwards out of the brick wall, creating a pyramid-shaped hole. There's an inverted version where all the edges above any focal edge can be pulled upwards and out off the brick wall creating a hole shaped like an inverted pyramid. And the case where all the edges above can be pulled up and below can be dropped down is (I think) the JBOT.

awohns commented 3 years ago

@hyanwong, I have a manually created visualisation of the "tetris blocks" corresponding to the example above. I've coloured each brick (edge) in the example and created bricks corresponding to them below:

Screenshot 2021-05-19 at 11 50 32

I guess this is the kind of thing you were thinking of?

I think that tables.edges.squash would undo a split_edges operation, but without considering descendants of an edge. split_edges seems like it should be a method of a tree sequence rather than the edges table.

I like your pyramid-based mental viz of the brick table!

hyanwong commented 3 years ago

Wow. Cool. I didn't think of it in terms of a 3d wall. I was thinking more like Screenshot 2021-05-19 at 18 28 58

where each line is a long thin "brick". But that doesn't capture the dependencies correctly, so you are right - it requires more dimensions.