tskit-dev / tutorials

A set of tutorials for msprime and tskit.
Creative Commons Attribution 4.0 International

Add a graph plot into "What is a tree sequence" #264

Open hyanwong opened 10 months ago

hyanwong commented 10 months ago

To encourage people to think of tree sequences as graph objects, I think it would be helpful to add the graph representation to the "What is a Tree Sequence" tutorial, round about here. This is how you might do it:

import tskit_arg_visualizer as viz

arg = viz.D3ARG.from_ts(ts=ts)
arg.set_node_labels({k: (v if k in ts.samples() else "") for k, v in labels.items()})
arg.draw(
    variable_edge_width=True,
    y_axis_scale="time",
    sample_order=sorted(
        (k for k in labels if k in ts.samples()),
        key=lambda k: labels[k],
    ),
)

Currently this gives a plot like this:

[Image: screenshot of the current ARG plot, 2023-11-16]
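
For anyone reading without the notebook open, the labelling and `sample_order` logic can be tried in isolation. This is a minimal sketch in plain Python: `labels` and `sample_ids` are made-up stand-ins for the notebook's `labels` dict and for `ts.samples()`, so the actual values in the tutorial will differ.

```python
# Hypothetical stand-ins: in the notebook, `labels` maps node IDs to display
# labels and `ts.samples()` returns the sample node IDs.
labels = {0: "a", 1: "e", 2: "b", 3: "c", 4: "d", 5: "", 6: ""}
sample_ids = [0, 1, 2, 3, 4]

# Keep labels only for sample nodes; internal nodes get an empty label.
node_labels = {k: (v if k in sample_ids else "") for k, v in labels.items()}

# Order the sample node IDs alphabetically by their label.
sample_order = sorted(
    (k for k in labels if k in sample_ids),
    key=lambda k: labels[k],
)

print(node_labels)
print(sample_order)
```

The result is a list of sample node IDs, which is the form passed to `sample_order` in the call above.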

I think a few things would be helpful to make this look simpler. In particular, if we could change the node sizes & shapes such that the internal nodes are (very) small circles and the sample nodes are square, that would match the tree-by-tree plot above it (https://github.com/kitchensjn/tskit_arg_visualizer/issues/30). Allowing the y-axis ticks to be set to user-chosen values would also be helpful, I think.

Perhaps @kitchensjn has some ideas about how to make the plot friendly to a newcomer in this context?

Note that ts has been produced by code in the notebook, like that below:

import msprime
import demes

def whatis_example():
    demes_yml = """\
        description:
          Asymmetric migration between two extant demes.
        time_units: generations
        defaults:
          epoch:
            start_size: 5000
        demes:
          - name: Ancestral_population
            epochs:
              - end_time: 1000
          - name: A
            ancestors: [Ancestral_population]
          - name: B
            ancestors: [Ancestral_population]
            epochs:
              - start_size: 2000
                end_time: 500
              - start_size: 400
                end_size: 10000
        migrations:
          - source: A
            dest: B
            rate: 1e-4
        """
    graph = demes.loads(demes_yml)
    demography = msprime.Demography.from_demes(graph)
    # Choose seed so num_trees=3, tips are in same order,
    # first 2 trees are topologically different, and all trees have the same root
    seed = 12581
    ts = msprime.sim_ancestry(
        samples={"A": 2, "B": 3},
        demography=demography,
        recombination_rate=1e-8,
        sequence_length=1000,
        random_seed=seed)
    # Mutate
    # Choose seed to give 12 muts, last one above node 14
    seed = 1476
    return msprime.sim_mutations(ts, rate=1e-7, random_seed=seed)

kitchensjn commented 10 months ago

Added the changes you mentioned to the tskit_arg_visualizer 0.0.2 milestone; they should be pretty straightforward to implement!

I personally like the node labels when mapping between the trees and the ARG. Without them, it might be difficult for newcomers to grasp how (and why) the trees are woven together. Something like this paragraph

A major benefit of “tree sequence thinking” is the close relationship between the tree sequence and the underlying biological processes that produced the genetic sequences in the first place, such as those pictured in the demography above. For example, each branch point (or “internal node”) in one of our trees can be imagined as a genome which existed at a specific time in the past, and which is a “most recent common ancestor” (MRCA) of the descendant genomes at that position on the chromosome. We can mark these extra “ancestral genomes” on our tree diagrams, distinguishing them from the sampled genomes (a to j) by using circular symbols.

from lower on the page seems critical to understanding why the trees are correlated, including the fact that specific nodes/edges are found across multiple trees. The tree highlighting and variable edge width within the ARG help to show this correlation but don't convey the biological reasoning behind it. Maybe we move that paragraph up above this figure?
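
To make the point about shared edges concrete, here is a toy sketch in plain Python. The edge list is invented for illustration (it is not the simulated `ts`), with each edge stored as a `(left, right, parent, child)` tuple in the style of a tskit edge table. In this toy example the tree boundaries are taken to be the distinct edge endpoints.

```python
# Toy edge list, invented for illustration: (left, right, parent, child).
edges = [
    (0, 1000, 4, 0),    # spans the whole sequence
    (0, 1000, 4, 1),
    (0, 400, 5, 2),     # only present in the first tree
    (400, 1000, 6, 2),  # child 2 switches parent at position 400
    (0, 400, 5, 4),
    (400, 1000, 6, 4),
]

# Tree boundaries are the distinct edge endpoints along the genome.
breakpoints = sorted({pos for left, right, _, _ in edges for pos in (left, right)})
trees = list(zip(breakpoints[:-1], breakpoints[1:]))


def trees_containing(edge):
    """Return the tree intervals that the given edge is present in."""
    left, right, _, _ = edge
    return [(a, b) for a, b in trees if left <= a and b <= right]


for edge in edges:
    print(edge, "appears in", len(trees_containing(edge)), "tree(s)")
```

An edge that appears in more than one tree is the kind of shared structure that the variable edge width in the ARG plot draws attention to.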

kitchensjn commented 10 months ago

With the latest commit to the visualizer, users can now control the size and symbol of the nodes. Here's your example from above with smaller nodes and square sample nodes.

import msprime
import demes
import tskit_arg_visualizer as viz

def whatis_example():
    demes_yml = """\
        description:
          Asymmetric migration between two extant demes.
        time_units: generations
        defaults:
          epoch:
            start_size: 5000
        demes:
          - name: Ancestral_population
            epochs:
              - end_time: 1000
          - name: A
            ancestors: [Ancestral_population]
          - name: B
            ancestors: [Ancestral_population]
            epochs:
              - start_size: 2000
                end_time: 500
              - start_size: 400
                end_size: 10000
        migrations:
          - source: A
            dest: B
            rate: 1e-4
        """
    graph = demes.loads(demes_yml)
    demography = msprime.Demography.from_demes(graph)
    # Choose seed so num_trees=3, tips are in same order,
    # first 2 trees are topologically different, and all trees have the same root
    seed = 12581
    ts = msprime.sim_ancestry(
        samples={"A": 2, "B": 3},
        demography=demography,
        recombination_rate=1e-8,
        sequence_length=1000,
        random_seed=seed)
    # Mutate
    # Choose seed to give 12 muts, last one above node 14
    seed = 1476
    return msprime.sim_mutations(ts, rate=1e-7, random_seed=seed)

ts = whatis_example()
arg = viz.D3ARG.from_ts(ts=ts)

labels = {}
for node in arg.nodes:
    if node["flag"] == 1:  # 1 == tskit.NODE_IS_SAMPLE
        labels[node["id"]] = node["label"]
    else:
        labels[node["id"]] = ""
arg.set_node_labels(labels=labels)

arg.draw(
    variable_edge_width=True,
    y_axis_scale="time",
    node_size=50,
    sample_node_symbol="d3.symbolSquare",
    sample_order=[0, 2, 3, 4, 8, 9, 5, 6, 7, 1],
)

[Image: example ARG with new symbols and node sizes]
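
As a side note, the label-building loop in the snippet above could probably be collapsed into a single dict comprehension. This is a sketch only: `nodes` below is a hypothetical list of dicts standing in for `arg.nodes`, using the same `id`, `flag`, and `label` keys that the loop reads. The real object may be structured differently between versions of the visualizer.

```python
# Hypothetical stand-in for arg.nodes; flag 1 marks a sample node,
# as in the loop above.
nodes = [
    {"id": 0, "flag": 1, "label": "0"},
    {"id": 1, "flag": 1, "label": "1"},
    {"id": 2, "flag": 0, "label": "2"},
]

# Keep the label for sample nodes, blank it for everything else.
labels = {n["id"]: (n["label"] if n["flag"] == 1 else "") for n in nodes}
print(labels)
```

The resulting dict has the same shape as the `labels` passed to `arg.set_node_labels(labels=labels)`.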