tskit-dev / tskit

Population-scale genomics
MIT License
151 stars 70 forks source link

Ways to visualize diploid tree sequences that aren't sequences of trees #621

Closed petrelharp closed 2 years ago

petrelharp commented 4 years ago

In particular, we need pictures of the "first-generation" individuals that trip people up.

Edit: see below; this issue has evolved from the initial intention.

bhaller commented 4 years ago

Excerpted from an email I sent to Peter just now:

Well, yes, the notes do already state that the first generation is retained but is not part of the sample, and that that is necessary for recapitation. But I guess I had not really thought about / realized that first-generation individuals that are not ancestral to the sample would (a) also be present (of course), and (b) would still be there even after recapitation, but without any simulated histories (again, of course, but I never thought about it), and so (c) would need to be excluded or otherwise dealt with during subsequent analysis to avoid getting confused. That’s not an issue that I remember ever coming up, but I can see it being surprising/confusing for people. It’s one of those things where, if the tree sequences were visualized in some way, it would be immediately obvious, but when just working with them as a black-box object in Python it is easy not to realize or to think through the implications.

I know tskit currently provides some visualization tools that show a given tree graphically. I think I’m thinking of something a little different. I’d maybe like to see a visualization tool that is more just about the times and relationships of individuals, nodes, and samples, without worrying about the particular tree structures that connect them through time. This representation would let me see things like:

I’m imagining a depiction that is a sort of summary of the entire tree sequence, rather than a picture of a specific tree. The y-axis would be time, with the more distant past at the top I suppose (that’s what would be intuitive to me, anyway). The x-axis is meaningless and used only for visual separation of elements; ideally x positions would be chosen to minimize the number of line crossings in the drawn graph, but that’s a frill. Each node would be shown as a small dot, and each individual would be a (gray) circle that encloses two dots (for diploids); nodes that do not have a recorded individual associated with them would be shown as standalone dots not enclosed by a “individual circle”. Lines would connect a later-time dot to an earlier-time dot if any portion of the genome was inherited from that earlier-time dot (without passing through an intervening dot), so it would be more of a directed graph than a tree, and would represent an overlay of all of the tree sequences onto a single diagram. Samples would be shown as red dots, while non-samples ancestral to the sample would be black dots, and nodes not ancestral to the sample would be white dots. I’ve attached a sort of mock-up of how I imagine a model like Jean’s (a SLiM haploid clonal model – i.e., diploids where every second genome is empty and has no ancestral connections – with occasional horizontal gene transfer) would look before and after recapitation; I don’t know that it makes any actual sense as a diagram of a real tree sequence, but it illustrates what I mean, I hope. Possibly the lines of the graph could be colored to show something useful, too – perhaps the weighting of that line across all of the tree sequences (how common of a path of inheritance it is for the sample), or perhaps how many mutations occurred along all of the tree-sequences branches represented by it, or something of that sort (perhaps things like this could be options chosen by the user).

With a visualization tool like this, I could write a python script that would read in a .trees file from SLiM, recapitate, change the sample in some way, simplify, overlay mutations, etc., and I could call the visualization function at each step of the process to confirm that things look the way I intend them to look. I think this would help me immensely, since I am a very visual thinker, and I bet many users would love it.

(Addendum from a later email:) Expanding upon this idea a bit more: first-generation individuals/nodes should also use a distinctive color, and remembered individuals should be designated in some orthogonal fashion so they are also visible (bigger dots?). :->

I have attached illustrations here:

PastedGraphic-1 PastedGraphic-2
bhaller commented 4 years ago

@jeromekelleher and @molpopgen, comments would be welcome. I'm quite excited about this idea. @petrelharp wrote back with a caveat:

As for writing a tool that does this automatically - you'd like to draw the pedigree, sort of. Whenever I've thought about this before I've decided that it would be way too messy for anything except for extremely small examples, so it wasn't worth writing the (probably quite tricky) code. But, it might be worth a try - visualizing the tree sequence somehow would be quite a useful thing.

I agree that for typical "real" models with thousands or millions of individuals these diagrams would be very large and might not be terribly useful (although I'm not sure they would never be useful; I'd like to see what the diagram for, say, a 10-population stepping-stone model with a soft selective sweep would look like :->). But I think people could still find them very useful for developing their Python tree-seq workflow, by simulating with a downscaled model so the diagrams aren't too big, writing out their Python steps and visualizing them to make sure everything in that pipeline is working as intended, and then upscaling their model to production size (and commenting out the visualizations).

I also think it would be very useful for generating pictures to use in the documentation, for explaining concepts like the retention of first-generation ancestors, simplification, and recapitation.

jeromekelleher commented 4 years ago

Well, there is ts.draw_text() and ts.draw_svg():

import msprime

ts = msprime.simulate(5, recombination_rate=0.3, random_seed=2)
print(ts.draw_text())
ts.draw_svg("tmp.svg")

This is the text (might get mangled by some browsers):

5.77┊    10     ┊           ┊           ┊
    ┊  ┏━━┻━┓   ┊           ┊           ┊
4.41┊  ┃    ┃   ┊           ┊   9       ┊
    ┊  ┃    ┃   ┊           ┊ ┏━┻━━┓    ┊
2.18┊  ┃    ┃   ┊     8     ┊ ┃    8    ┊
    ┊  ┃    ┃   ┊  ┏━━┻━┓   ┊ ┃  ┏━┻━┓  ┊
0.89┊  ┃    7   ┊  ┃    7   ┊ ┃  ┃   7  ┊
    ┊  ┃  ┏━┻┓  ┊  ┃  ┏━┻┓  ┊ ┃  ┃  ┏┻┓ ┊
0.81┊  ┃  ┃  6  ┊  ┃  ┃  6  ┊ ┃  ┃  ┃ ┃ ┊
    ┊  ┃  ┃ ┏┻┓ ┊  ┃  ┃ ┏┻┓ ┊ ┃  ┃  ┃ ┃ ┊
0.04┊  5  ┃ ┃ ┃ ┊  5  ┃ ┃ ┃ ┊ ┃  5  ┃ ┃ ┊
    ┊ ┏┻┓ ┃ ┃ ┃ ┊ ┏┻┓ ┃ ┃ ┃ ┊ ┃ ┏┻┓ ┃ ┃ ┊
0.00┊ 0 4 1 2 3 ┊ 0 4 1 2 3 ┊ 3 0 4 1 2 ┊
  0.00        0.10        0.74        1.00

and SVG:

tmp

I guess this isn't what you're looking for though?

bhaller commented 4 years ago

Right; what I'm looking for is a diagram that combines/summarizes all of those tree diagrams into a single diagram. As Peter wrote, more like a pedigree, except that (a) only the nodes present in the tree sequence are shown, so it is not a full pedigree in the genealogical sense, and (b) lines are shown between every pair of nodes that are connected by a branch in any tree in the tree sequence, not just between parents/offspring. And the other big thing I'm looking for is color-coding and other visualization of things like which nodes are samples, which nodes are first-generation ancestors, which nodes are remembered individuals, which pairs of nodes are grouped together to form diploid individuals in the individuals table, etc., which is not visible in those existing visualizations. So this is a much higher-level view of what is going on than is provided by ts.draw_text() and ts_draw_svg(). See the diagrams I posted above, which hopefully illustrate what I'm talking about.

molpopgen commented 4 years ago

I don't have a concrete technical suggestion, but this seems like something that a Python package capable of drawing a network might handle?

bhaller commented 4 years ago

I don't have a concrete technical suggestion, but this seems like something that a Python package capable of drawing a network might handle?

Yes; if there were a package that handled choosing x-positions so as to minimize line-crossing that would be particularly useful. The rest of it is probably pretty easy to do oneself, really.

molpopgen commented 4 years ago

Yes; if there were a package that handled choosing x-positions so as to minimize line-crossing that would be particularly useful. The rest of it is probably pretty easy to do oneself, really.

Python (and R) have a seemingly limitless supply of options. I've been (casually) interested in this same issue, so I'll be looking for something soon.

molpopgen commented 4 years ago

This gets pretty close, but isn't perfect depending on how picky one is, but it took some effort to figure out.

It needs a lot of the graphviz stack on your system, including the C runtime library, as pygraphviz is a dependency in here somewhere.

import msprime
import pandas as pd
import networkx
from collections import namedtuple

Datumz = namedtuple('Datumz', ['parent', 'child'])
ts = msprime.simulate(5, recombination_rate=6, random_seed=666)

D = []
for t in ts.trees():
    for n in t.nodes():
        for c in t.get_children(n):
            D.append(Datumz(n, c))

df = pd.DataFrame(D, columns=Datumz._fields)
df['time'] = ts.tables.nodes.time[df.parent]

G = networkx.from_pandas_edgelist(
    df, 'parent', 'child', create_using=networkx.DiGraph())

A = networkx.nx_agraph.to_agraph(G)

times = df.groupby(['time'])
all_samples = [i for i in ts.samples()]
for n, g in times:
    up = [i for i in g.parent.unique() if i not in all_samples]
    A.add_subgraph(up, rank='same')

A.add_subgraph(all_samples, rank='same')
A.draw('A.png', prog='dot')

A

molpopgen commented 4 years ago

To extend this to remembered individuals, etc., you need to manipulate the horizontal groupings, too, add color, etc.. dot can do all of that.

molpopgen commented 4 years ago

This will change the border color and shape of nodes 0 and 1. The UI semantics are the same as the low-level program dot. In one of the loops above, I am excluding sample nodes from the sub-graph. For this simulation, that has no effect, as there is only 1 parent possible in a tree at exactly one time. For remembered nodes, you'd need to process both parents and children, ask if they are samples, and then do something fun with their color/shape/fill.

for i in [0, 1]:
    n = A.get_node(i)
    n.attr['color'] = 'red'
    n.attr['shape'] = 'box'
molpopgen commented 4 years ago

Seems likely that everything can be done entirely in pygraphviz: https://pygraphviz.github.io/documentation/latest/tutorial.html

I used networkx to start simply b/c I'd heard of it recently in an msprime PR.

molpopgen commented 4 years ago

This version respects the temporal order of the root nodes on each tree.

import msprime
import pandas as pd
import networkx
from collections import namedtuple

Datumz = namedtuple('Datumz', ['parent', 'child'])
ts = msprime.simulate(5, recombination_rate=6, random_seed=666)

D = []
for t in ts.trees():
    for n in t.nodes():
        for c in t.get_children(n):
            D.append(Datumz(n, c))

df = pd.DataFrame(D, columns=Datumz._fields)
df['time'] = ts.tables.nodes.time[df.parent]
df.sort_values(['time'], inplace=True, ascending=False)
G = networkx.from_pandas_edgelist(
    df, 'parent', 'child', create_using=networkx.DiGraph())

A = networkx.nx_agraph.to_agraph(G)

# Make a subgraph of roots, where our A->B edges
# are A is older than B.
roots = []
for t in ts.trees():
    for r in t.roots:
        if r not in roots:
            roots.append(r)

roots = sorted(roots, key=lambda x: -ts.tables.nodes.time[x])

pg = A.add_subgraph([], level="same", name="roots")
for i in range(1, len(roots)):
    pg.add_edge(roots[i-1], roots[i], style="invis")

A.add_subgraph([i for i in ts.samples()], rank='same', name="samples")

A.draw('A.png', prog='dot')

A

petrelharp commented 4 years ago

wow, love it!

molpopgen commented 4 years ago

With the root order preserved, at this point it should be relatively straightforward to create subgraphs representing actual samples taken at different time points. The hard thing, maybe, is to have nested subgraphs of pairs of nodes representing diploids.

petrelharp commented 4 years ago

Ping everyone here - please have a look at my pull request with figures. Let me know if you can't build the docs locally (if you have sphinx set up, you just need to run make in the docs/ directory, and browse to docs/_build to look at the result. Are the figures useful? Are they in the right place? Do we need more/less? I opted to do them by hand (inkscape) rather than figure out how to do these things in python (especially the labeling!).

petrelharp commented 4 years ago

I've got lots of feedback from @bhaller - thanks! This is addressed by tskit-dev/pyslim#70; closing this now.

bhaller commented 4 years ago

This should not be closed, I don't think! For me, this thread has been about making some kind of visualization tool that can display the user's tree sequence in a particular manner. The utility of that for the doc was just one angle; as I wrote originally, "I’d maybe like to see a visualization tool that is more just about the times and relationships of individuals, nodes, and samples, without worrying about the particular tree structures that connect them through time." The docs have had pretty pictures added now (and other improvements), which is great, but the need for this visualization tool remains.

bhaller commented 4 years ago

GitHub doesn't seem to offer me the option of reopening this issue. I could just open a new issue and copy/paste my original comment over, but there has been lots of further good discussion and work from @molpopgen and others, so it would be better if we could leave this issue open (and perhaps change its title)...?

petrelharp commented 4 years ago

Sorry! Re-opened! The issue should be renamed, though, if it's still open, and we should get some specific proposals for what to implement...

bhaller commented 4 years ago

@hyanwong I gather you've been doing work on tree sequence visualization lately, so maybe this issue will interest you! :->

petrelharp commented 4 years ago

@bhaller suggests moving this to tskit; I'm going to change the title somewhat and move it.

hyanwong commented 4 years ago

@bhaller Yes, it does. Recently I've been entirely working on trees (and animating SVGs) but I completely agree with you about the need for a pedigree-style viz. I tried this a few years ago with some of the dynamic graph viz tools, in an attempt to get something exactly like the examples you posted above, but never found the right graphic toolkit, which allow you to specify a Y position for each node but then works out the best X positioning so as to avoid line crossing. It's not clear to me that @molpopgen 's graphviz example allows continuous Y positions - merely a pre-defined stacking order. This was the barrier I hit too.

It seemed strange to me that no-one had come up with a decent heuristic algorithm for the case of a DAG with nodes that have user-specified-floating-point-y, algorithmically-chosen-x. I think we should probably ask some network-viz people? I don't know any off-hand, though. The problem seems well characterised to me.

bhaller commented 4 years ago

It seemed strange to me that no-one had come up with a decent heuristic algorithm for the case of a DAG with nodes that have user-specified-floating-point-y, algorithmically-chosen-x. I think we should probably ask some network-viz people? I don't know any off-hand, though. The problem seems well characterised to me.

I don't know any such people either. I'd be tempted to find a semi-optimal solution through some process like simulated annealing, but it might be a solved problem, which would obviously be better!

molpopgen commented 4 years ago

It's not clear to me that @molpopgen 's graphviz example allows continuous Y positions - merely a pre-defined stacking order. This was the barrier I hit too.

Right, there are only "levels". As with all things graphviz, I find the first two steps tricky, the third very hard, and the next impossible.

hyanwong commented 2 years ago

Also see https://github.com/tskit-dev/msprime/issues/1851

hyanwong commented 2 years ago

By putting a few other pieces I've been working on recently, I think we can use the graphviz "dot" algorithm, called within networkx, to get quite a nice ARG visualisation, and it looks much nice with recombination nodes. Here I plot recombination nodes in red, and use the functions convert_to_single_rec_node and to_networkx_graph (from here and here )

import msprime
import networkx as nx
import numpy as np
import tskit

ts = msprime.simulate(10, record_full_arg=True, random_seed=1, recombination_rate=1.6)
ts_arg = convert_to_single_rec_node(ts)
# I think we are assuming that the nodes are listed in time order in this script
G = to_networkx_graph(ts_arg)

import collections
nodes_at_time = collections.defaultdict(list)
colour_map = []
for nd in G.nodes(data=True):
    nodes_at_time[nd[1]["time"]].append(nd[0])
    colour_map.append("red" if nd[1]["flags"] & msprime.NODE_IS_RECOMB else "blue")

# Turn into graphviz
A = nx.nx_agraph.to_agraph(G)
# First cluster all nodes at the same times (probably mostly samples)
for t, nodes in nodes_at_time.items():
    if len(nodes) > 1:
        A.add_subgraph(nodes, level="same", name=f"cluster_t{t}")
# We could also cluster nodes from a single individual together here
A.layout(prog="dot")
pos = {n: [float(x) for x in A.get_node(n).attr["pos"].split(",")] for n in G.nodes()}
nx.draw_networkx(G, pos, node_color=colour_map, with_labels=True)

image

That plot uses the rank time for the Y axis. We can then use the graphviz -> networkx code in https://github.com/tskit-dev/msprime/issues/1851#issuecomment-991688749 to change the Y positions:

new_pos = {}
for node_id, attrs in G.nodes(data=True):
    # positions sorted by dot are obtained via .get_node(u).attr["pos"]
    x_pos = int(A.get_node(node_id).attr["pos"].split(",")[0])
    new_pos[node_id] = [x_pos, attrs["time"]]

nx.draw_networkx(G, new_pos, node_color=colour_map, with_labels=True)

image

It's annoying that pushing the nodes to the correct times makes them overlap. My suggestion here is to rank the largest gaps between the times in the node table, and split the nodes into time bins, then use the .add_subgraph method to group the nodes into time bands. I'll post the results below (didn't work - you don't get good results if there are links within a group)

hyanwong commented 2 years ago

For posterity (i.e. so I don't lose it), here's the code and an example from some of the stuff I've been posting on slack:

import itertools

from IPython.display import display, SVG
import msprime
import networkx as nx
import numpy as np
import pandas as pd
import tskit

msprime.NODE_IS_RECOMB = 2  # Use the "reserved" space in the bitfield, as we will eventually port this to tskit
msprime.NODE_IS_SOMETIMES_UNARY = int(2**24)

def convert_to_single_rec_node(ts):
    """
    Should be unnecessary when https://github.com/tskit-dev/msprime/issues/1942 is fixed
    """
    tables = ts.dump_tables()
    tables.nodes.clear()
    node_mapping = np.zeros(ts.num_nodes, dtype=tables.edges.child.dtype)
    node_id = 0
    while node_id < ts.num_nodes:
        node_mapping[node_id] = tables.nodes.num_rows
        node = ts.node(node_id)
        if (node.flags & msprime.NODE_IS_RE_EVENT):
            node_id += 1
            assert node_id < ts.num_nodes
            assert ts.node(node_id).flags & msprime.NODE_IS_RE_EVENT
            assert ts.node(node_id).time == node.time
            node_mapping[node_id] = tables.nodes.num_rows
            node.flags &= ~msprime.NODE_IS_RE_EVENT  # Unset the NODE_IS_RE_EVENT bit
            node.flags |= msprime.NODE_IS_RECOMB
        tables.nodes.append(node)
        node_id += 1
    tables.edges.parent = node_mapping[tables.edges.parent]
    tables.edges.child = node_mapping[tables.edges.child]
    tables.mutations.node = node_mapping[tables.mutations.node]
    tables.sort()
    return tables.tree_sequence()

def flag_unary_nodes(ts):
    """
    Return a tree sequence in which any nodes which are sometimes unary nodes in the TS (i.e.
    there is a tree in which they node has only one child) is flagged with msprime.NODE_IS_SOMETIMES_UNARY
    """
    # This is very inefficient - find a proper incremental appraoch
    n = []
    tables = ts.dump_tables()
    for tree in ts.trees():
        for u in tree.nodes():
            if tree.num_children(u) == 1:
                n.append(u)
    flags = tables.nodes.flags
    flags &= np.full_like(flags, ~msprime.NODE_IS_SOMETIMES_UNARY)  # unset this bit
    flags[n] |= msprime.NODE_IS_SOMETIMES_UNARY
    tables.nodes.flags = flags
    return tables.tree_sequence()

def to_networkx_graph(ts):
    edges=ts.tables.edges
    G = nx.from_pandas_edgelist(
        pd.DataFrame({'source': edges.parent, 'target': edges.child, 'left': edges.left, 'right': edges.right}),
        edge_attr=True,
        create_using=nx.MultiDiGraph
    )
    nx.set_node_attributes(G, {n.id: {'flags':n.flags, 'time': n.time, 'labels': "foo"} for n in ts.nodes()})
    return G

def add_individuals_to_coalescence_nodes(ts):
    tables = ts.dump_tables()
    nodes_individual = tables.nodes.individual
    for node in ts.nodes():
        if (node.flags & msprime.NODE_IS_RECOMB) or (node.flags & msprime.NODE_IS_CA_EVENT):
            continue
        if node.individual == tskit.NULL:
            nodes_individual[node.id] = tables.individuals.add_row()
    tables.nodes.individual = nodes_individual
    return tables.tree_sequence()

def draw_with_curved_multi_edges(G, pos, colour_map, curve_scale=1):
    """
    networkx matlib plots show all edges as straight lines by default.
    This function curves the edges if there is more than one edge with identical parent/child values
    At the moment curve_scale is a hack to adjust the fact that we don't know the measurement scale on
    the x or the y axis.
    """
    nx.draw_networkx_nodes(G, pos, node_color=colour_map)
    nx.draw_networkx_labels(G, pos)

    for (start, end), edge_iter in itertools.groupby(G.edges()):
        # identical node/neighbour pairs placed together according to docs
        edges = list(edge_iter)
        dist = ((pos[start][0] - pos[end][0]) ** 2 + (pos[start][1] - pos[end][1]) ** 2) ** 0.5
        # FIXME - dist won't be accurate if X and Y axes are on different scales
        curvature = curve_scale/dist
        # FIXME - calculate curve_scale from the plot dimensions, rather than requiring it
        max_edge_index = len(edges) - 1
        for i, e in enumerate(edges):
            curve_prop = (2 * i / max_edge_index - 1) if max_edge_index > 0 else 0
            nx.draw_networkx_edges(
                G,
                pos,
                edgelist=[e],
                connectionstyle=f'arc3, rad = {curve_prop * curvature}')

# SIMULATE
ts = msprime.sim_ancestry(8, sequence_length=1e4, population_size=1e4, record_full_arg=True, random_seed=12, recombination_rate=1e-8)
ts_arg = convert_to_single_rec_node(ts)
ts_arg = add_individuals_to_coalescence_nodes(ts_arg) # optional - can help if simplifying away rRE and CA nodes
ts_arg = flag_unary_nodes(ts_arg)  # So we can plot nodes which are partially coalescent in blue-green

# PLOT WITH RANK TIMES
import collections
import itertools
import matplotlib.pyplot as plt

G = to_networkx_graph(ts_arg)

nodes_at_time = collections.defaultdict(list)
colour_map = []
for nd in G.nodes(data=True):
    nodes_at_time[nd[1]["time"]].append(nd[0])
    colour = "lightgreen"
    if nd[1]["flags"] & msprime.NODE_IS_RECOMB:
        colour = "red"
    elif nd[1]["flags"] & msprime.NODE_IS_CA_EVENT:
        colour = "cyan"
    elif nd[1]["flags"] & msprime.NODE_IS_SOMETIMES_UNARY:
        colour = "lightseagreen"
    colour_map.append(colour)

# Turn into graphviz
A = nx.nx_agraph.to_agraph(G)
# First cluster all nodes at the same times (probably mostly samples)
for t, nodes in nodes_at_time.items():
    if len(nodes) > 1:
        A.add_subgraph(nodes, level="same", name=f"cluster_t{t}")
# We could also cluster nodes from a single individual together here

# Get the positions from graphviz
A.layout(prog="dot")
pos = {n: [float(x) for x in A.get_node(n).attr["pos"].split(",")] for n in G.nodes()}

fig = plt.figure(1, figsize=(10, 15))
draw_with_curved_multi_edges(G, pos, colour_map, 20)
plt.show()

image

# PLOT WITH CORRECT TIMES
import matplotlib.pyplot as plt
fig = plt.figure(1, figsize=(10, 15))

pos_with_proper_times = {}
for node_id, attrs in G.nodes(data=True):
    # positions sorted by dot are obtained via .get_node(u).attr["pos"]
    x_pos = int(float(A.get_node(node_id).attr["pos"].split(",")[0]))
    pos_with_proper_times[node_id] = [x_pos, np.log(attrs["time"]+100000)]
draw_with_curved_multi_edges(G, pos_with_proper_times, colour_map, 0.01)
plt.show()

image