tskit-dev / tutorials

A set of tutorials for msprime and tskit.
Creative Commons Attribution 4.0 International
18 stars 15 forks source link

Convert forward simulation notebooks. #14

Open jeromekelleher opened 3 years ago

jeromekelleher commented 3 years ago

There's two notebooks (wfforward.ipynb and wfcython.ipynb) in the old-content/notebooks directory that need to be converted to jupyterbook format. Hopefully this can be reasonably automatic using jupytext.

The only concern is the environment requirements - what is need to run these notebooks? It would be great if we keep this a no-conda affair.

@molpopgen, would you mind taking a look please?

jeromekelleher commented 3 years ago

Also, we want to make sure these pages don't take to long to run.

molpopgen commented 3 years ago

I doubt they need conda--apt should be fine. The one with cython took quite a while to run, in order to demonstrate correctness. I have previously suggested that we delete it.

jeromekelleher commented 3 years ago

Your call on deleting stuff. Demonstrating correctness is beyond the scope of these tutorials now I think.

molpopgen commented 3 years ago

Looking at it now. It was written back when the access to tskit was via msprime. It'll take some time to convert. Also, there's an annoyance--if you convert with jupytext, you can get YAML frontmatter that somehow is incompatible with what is in other md files, causing the whole thing to fail to build.

jeromekelleher commented 3 years ago

I've not bothered modernising the content in the first pass @molpopgen, if that's any help. I've basically just been getting things working, and going to make notes on the outstanding issues so we can follow up in smaller chunks afterwards.

molpopgen commented 3 years ago

yeah, I'm going for "working" right now, which it doesn't!

molpopgen commented 3 years ago

I'm going to shorten it considerably. Looking at it now, I really don't like the length.

molpopgen commented 3 years ago

This will take a bit more time than I thought. Teaching, etc.,..

hyanwong commented 2 years ago

Hi @molpopgen - I'm just looking at our missing tutorials, and see there's a space for forward simulations. Did you get anywhere with this?

molpopgen commented 2 years ago

No, totally off of my radar. I still vote to delete the Cython example and then minimally port over the other to markdown.

hyanwong commented 2 years ago

No, totally off of my radar. I still vote to delete the Cython example and then minimally port over the other to markdown.

That sounds ideal to me. You mentioned something about shortening though? Is that still relevant?

molpopgen commented 2 years ago

That sounds ideal to me. You mentioned something about shortening though? Is that still relevant?

Yes, IMO a tutorial should strive to be only a "screen or 2 long". To that end, something like cell 3 modified to do exactly 1 crossover / meiosis in the "pseudo-haploid" style of the C example will be sufficient. Updating to include periodic simplification would be a good thing, too, I think.

hyanwong commented 1 year ago

IMO a tutorial should strive to be only a "screen or 2 long"

@molpopgen - I am looking at this with a view to showing a simple forwards simulator for a workshop. I rethought the python code from scratch, and came up with this. I would be interested in how clear or otherwise you think it is (I am guessing it isn't very efficient though). The nice thing about picking parents, rather than genomes, at random is not just that it is more obvious to a biologist, but that we can relatively easily modify the add_inheritance_paths function below to allow recombination between the two genomes within a parent (EDITED to make nicer code):

import tskit
import random

# First building block: make an individual containing 2 genomes. Note that
# we flag up the nodes as "sample nodes", which helps when plotting.
def make_individual(tables, time):
    """
    Creates a new individual containing 2 new genomes (nodes), and returns the ID
    of the individual and a tuple of the IDs of the individual's two genomes 
    """
    indivID = tables.individuals.add_row()
    mat_nodeID = tables.nodes.add_row(tskit.NODE_IS_SAMPLE, time, individual=indivID)
    pat_nodeID = tables.nodes.add_row(tskit.NODE_IS_SAMPLE, time, individual=indivID)
    return indivID, (mat_nodeID, pat_nodeID)

# Second building block: add "inheritance paths" between the
# genome the child inherits from one parent (say the mother) and
# the two genomes contained in that parent. This is where "meiosis"
# comes in. However the moment, we ignore recombination and simply pick
# one of the two genomes at random

def add_inheritance_paths(
    tables, parent_genomes, child_genome, recombination_rate=None
):
    """
    Given two parent genome (node) IDs from the same individual, add inheritance
    paths from those genomes to a child genome (node) ID. 

    """
    left = 0
    right = tables.sequence_length
    inherit_from = random.randint(0, 1)  # randomly chose 0 or 1
    tables.edges.add_row(left, right, parent_genomes[inherit_from], child_genome)

# Last building block: take a collection of parents 
# and generate children from them by picking two parents at random
# Here we use `random.choices` to sample with replacement, but we could
# use `random.sample` to exclude the possibility of selfing
def make_new_generation(tables, time, num_children, parents):
    """
    Returns a Python dictionary of children (i.e. length `num_children`) mapping
    {individual_ID: (maternal_genome_ID, paternal_genome_ID), ...}
    The `parents` parameter is a dictionary of the same form, listing possible parents
    for the children. If the dictionary is empty, no inheritance paths are set.
    """
    children = {}
    parent_IDs = list(parents.keys())
    for _ in range(num_children):
        child_ID, child_genomes = make_individual(tables, time)
        children[child_ID] = child_genomes
        if parents:
            # pick IDs for two parents at random with replacement (allows selfing)
            random_parent_IDs = random.choices(parent_IDs, k=2)
            for child_genome, parent_ID in zip(child_genomes, random_parent_IDs):
                add_inheritance_paths(tables, parents[parent_ID], child_genome)
    return children

def setup_simulation(sequence_length, random_seed):
    "Returns a TableCollection where we can store our growing genealogy"
    random.seed(random_seed)
    tables = tskit.TableCollection(sequence_length)
    tables.time_units = "generations"
    return tables

def simple_diploid_sim(popsize, generations, sequence_length, random_seed=123):
    tables = setup_simulation(sequence_length, random_seed)

    # tskit time is in units ago: we want the final generation to be at time 0
    generations_ago = generations - 1  # start with the oldest generation ago

    # Initialise an empty population of parents
    parents = make_new_generation(tables, generations_ago, popsize, parents={})

    # Run the simulation
    while generations_ago > 0:
        generations_ago = generations_ago - 1
        parents = make_new_generation(tables, generations_ago, popsize, parents)

    # Sort the tables, e.g. to put edges in the required order
    tables.sort()
    return tables.tree_sequence()

ts = simple_diploid_sim(5, generations=15, sequence_length=1000)

Here's how you might swap in a function that allows recombination:

def add_inheritance_paths(tables, parent_nodes, child_node, recombination_rate=0):
    """
    Add inheritance paths from parent node IDs (i.e. parent genomes)
    to a child node ID (i.e. a child genome)
    """
    breakpoints = ...  # e.g. pick breakpoints with probability rho per integer position
    inherit_from = random.randint(0, 1)
    for lft, rgt in zip([0] + breakpoints, breakpoints + [tables.sequence_length]):
        tables.edges.add_row(
            left=lft, right=rgt, parent=parent_nodes[inherit_from], child=child_node)
        inherit_from = 1 - inherit_from  # switch: inherit from the other parent genome
molpopgen commented 1 year ago

@hyanwong I find it a bit hard to read for a few reasons:

molpopgen commented 1 year ago

I think I'd also go straight to the case with recombination, including assertions on the invariants that need to be upheld.

hyanwong commented 1 year ago

Thanks. Really useful. I will try to standardise variable names (distinguishing between the individual IDs and the genome (node) IDs is the main problem). Type hints are a good idea. I want to turn this into a workbook, so I'll make changes there and post a link.

hyanwong commented 1 year ago

I think I'd also go straight to the case with recombination

ISWYM. It is nice to be able to simulate without recombination, so you get a single tree for viz purposes. But I could do that simply by setting recombination_rate to 0.

molpopgen commented 1 year ago

As much as Python type hints can bother me, I do think they'd help here. :)

hyanwong commented 1 year ago

I should also say that having make_new_generation as a function that is called each time means it should be easy to both add more generations onto the end of the simulation, and show how to simplify e.g. every 10 generations.

hyanwong commented 1 year ago

I have worked this up into a (very) draft workbook at https://hyanwong.github.io/workshop-pages/lab?path=ForwardSimulation.ipynb, which has a bit of explanation between the function definitions, etc, and examples of what simplification does. Type hints are only present for return values of functions (and there are fewer functions). I have not talked about incorporating simplification into the actual simulation, however - perhaps I should do so. I have also not mentioned mutation.

The idea would be to gradually port @molpopgen 's nice text into this notebook somehow, which could perhaps form the nucleus of a new tutorial. But if we want to keep the existing code, that's fine too (as I am wanting to use mine for a separate "workbook" -type file for workshops rather than the standard tskit tutorials.

More comments gratefully accepted. I know it's a bit rough at the moment.

Todo:

hyanwong commented 1 year ago

Thinking about this, I wonder if it is a good idea to split it into two tutorials. The first can be something like the workbook I put up, and is a "simple simulator" (but with recombination). The second can contain all the gotchas, details about mutations, repeated simplification (and invariance), recapitation, validation, etc. What do you think @molpopgen ?

molpopgen commented 1 year ago

It think that the level of detail depends a bit on your target audience. For static tutorials served up on a website, I think that several short pages is okay.

hyanwong commented 1 year ago

It think that the level of detail depends a bit on your target audience. For static tutorials served up on a website, I think that several short pages is okay.

Yes, I agree. Are you happy for me to take your wfforward notebook, cur it into two, and graft a modified version of my simulation code into the first part, @molpopgen ? Some comments about the first part of your text below:


A node is used to label the birth time of a lineage. A node can be described by a tuple, (id, time), where id is unique and time reflects the birth time of that id.

I think we should describe a node as a genome rather than an "event" (e.g. the birth of a lineage). For instance, a coalescent node might not actually mark the exact time of coalescence: rather it is the neareast representative genome we have to that event. See the eARG vs gARG distinction in the ARG preprint.

For the case of the discrete-time Wright-Fisher (WF) model, if $N$ individuals currently exist at time point $t$, then there are $2N$ nodes, $[i,i+2N)$, and the diploids are defined as tuples grouping pairs of adjacent nodes, $D \in [(i,i+1),(i+2,i+3),\dots,(2N-2,2N-1)]$, where each $(i,j)$ pairing is a diploid.

I don't think we need to require diploids to be adjacent nodes any more, do we? Instead we can group them as belonging to the same individual. Indeed, once we have simplified, the internal nodes can no longer be thought of as adjacent diploids anyway (ancestral nodes are unlikely to also have their matching diploid pair in the genealogy)

All of our simulations will follow the parameter scaling laid out by Dick Hudson's ms software,

I have not followed that scaling in my example, preferring e.g. to have a genome of length e.g. 50kb. Using base pairs seems easier for beginners to me, but I can also see the benefits of using 0..1, so am willing to be convinced that the Hudson scaling is better as an introduction.

We also need to define what we mean by time, which is simple for the discrete-time model. A simulation will start at generation 0,

I find it easier to code time as generations ago (i.e. standard tskit direction), rather than distract the reader by having to reverse the time scale on export. The time-reversal technique has the down-side that the tables are invalid all the way through the simulation, up until the point that the times are reversed. My suggestion is that when we simply count for gen in range(ngens-1, 0, -1): (another option would be start at generation zero but count down, creating negative generation times, and adding ngens to the node times on export). The only issue is when simulations do not have a fixed generation time, or are resumed (in this case, one option is to add ngens to all the existing node (and mutation) times).

I suggest that we combine both counting-down techniques. I.e. we set the initial generation to be time ngens-1 (so that in the simple case, the final generation is at time zero). But once we get to more advanced code, we check the current value of gens and deduct it from the node (and mutation / migration) times, see https://github.com/tskit-dev/tskit/issues/2809

molpopgen commented 1 year ago

It think that the level of detail depends a bit on your target audience. For static tutorials served up on a website, I think that several short pages is okay.

Yes, I agree. Are you happy for me to take your wfforward notebook, cur it into two, and graft a modified version of my simulation code into the first part, @molpopgen ? Some comments about the first part of your text below:

A node is used to label the birth time of a lineage. A node can be described by a tuple, (id, time), where id is unique and time reflects the birth time of that id.

I think we should describe a node as a genome rather than an "event" (e.g. the birth of a lineage). For instance, a coalescent node might not actually mark the exact time of coalescence: rather it is the neareast representative genome we have to that event. See the eARG vs gARG distinction in the ARG preprint.

I defined a node intentionally with respect to the birth of a genome. I think this definition is appropriate in the context.

For the case of the discrete-time Wright-Fisher (WF) model, if N individuals currently exist at time point t, then there are 2N nodes, [i,i+2N), and the diploids are defined as tuples grouping pairs of adjacent nodes, D∈[(i,i+1),(i+2,i+3),…,(2N−2,2N−1)], where each (i,j) pairing is a diploid.

I don't think we need to require diploids to be adjacent nodes any more, do we? Instead we can group them as belonging to the same individual. Indeed, once we have simplified, the internal nodes can no longer be thought of as adjacent diploids anyway (ancestral nodes are unlikely to also have their matching diploid pair in the genealogy)

Here, you need to think about computer architecture. Even with "individuals", nodes are likely to be kept adjacent for performance reasons. The "data oriented" paradigm from, e.g. computer games, would have "nodes" by an array within an "individuals" structure.

All of our simulations will follow the parameter scaling laid out by Dick Hudson's ms software,

I have not followed that scaling in my example, preferring e.g. to have a genome of length e.g. 50kb. Using base pairs seems easier for beginners to me, but I can also see the benefits of using 0..1, so am willing to be convinced that the Hudson scaling is better as an introduction.

Either way.

We also need to define what we mean by time, which is simple for the discrete-time model. A simulation will start at generation 0,

I find it easier to code time as generations ago (i.e. standard tskit direction), rather than distract the reader by having to reverse the time scale on export. The time-reversal technique has the down-side that the tables are invalid all the way through the simulation, up until the point that the times are reversed. My suggestion is that when we simply count for gen in range(ngens-1, 0, -1): (another option would be start at generation zero but count down, creating negative generation times, and adding ngens to the node times on export). The only issue is when simulations do not have a fixed generation time, or are resumed (in this case, one option is to add ngens to all the existing node (and mutation) times).

I suggest that we combine both counting-down techniques. I.e. we set the initial generation to be time ngens-1 (so that in the simple case, the final generation is at time zero). But once we get to more advanced code, we check the current value of gens and deduct it from the node (and mutation / migration) times, see tskit-dev/tskit#2809

Here it is a matter of taste. I'm not sure that "generations" ago generalizes to a forward simulation. For example, there could be a termination condition such that the end time is not tskit's "zero". While that doesn't matter for the data model, it is not necessarily true that the final time is zero. It is, in many cases, more natural for a forward simulation to increment time using positive integers and then do the conversion.

hyanwong commented 1 year ago

I defined a node intentionally with respect to the birth of a genome. I think this definition is appropriate in the context.

How about e.g.

"A node represents a genome at a point in time (often we imagine this as the "birth time" of the genome)".

I think imagining it as the birth point of a genome is less event-y than imagining it as the birth point of a lineage (which is what was written before)

Here, you need to think about computer architecture. Even with "individuals", nodes are likely to be kept adjacent for performance reasons. The "data oriented" paradigm from, e.g. computer games, would have "nodes" by an array within an "individuals" structure.

Yes, I think that we could say that, for efficiency, you might want that. But I don't think we need any more to define a diploid as two successive node IDs. And certainly, two successive node IDs don't have to correspond to a diploid, right? (otherwise all ancestors would have to be diploids too)

Here it is a matter of taste. I'm not sure that "generations" ago generalizes to a forward simulation. For example, there could be a termination condition such that the end time is not tskit's "zero". While that doesn't matter for the data model, it is not necessarily true that the final time is zero.

Yes, you are right here. I was suggesting that we could get around this by subtracting the "current generation time", if not zero.

It is, in many cases, more natural for a forward simulation to increment time using positive integers and then do the conversion.

I agree that it feels more natural, but that is offset by the complication of having to explain that the times need to be reversed later. The fact that by using forward generations, we would be creating ts-invalid tables during simulation was what tipped me toward using "generations_ago" instead. Maybe it's worth asking a newcomer?

Thanks for the replies, by the way.

molpopgen commented 1 year ago

Here, you need to think about computer architecture. Even with "individuals", nodes are likely to be kept adjacent for performance reasons. The "data oriented" paradigm from, e.g. computer games, would have "nodes" by an array within an "individuals" structure.

Yes, I think that we could say that, for efficiency, you might want that. But I don't think we need any more to define a diploid as two successive node IDs. And certainly, two successive node IDs don't have to correspond to a diploid, right? (otherwise all ancestors would have to be diploids too)

I think there's a miscommunication here. My text refers to what forward simulations generate, which is individuals born with some ploidy. The text has that audience in mind. What tskit does with ancestors is an internal detail of tskit.

molpopgen commented 1 year ago

Maybe it's worth asking a newcomer?

I think it depend on the target audience. For example, is familiarity with the tskit data model a strong prereq for this tutorial?

hyanwong commented 1 year ago

I think there's a miscommunication here. My text refers to what forward simulations generate, which is individuals born with some ploidy. The text has that audience in mind. What tskit does with ancestors is an internal detail of tskit.

Ah yes, that is a very good point. I am probably thinking too much in terms of tskit definitions, rather than the (independent) simulation code.

hyanwong commented 1 year ago

I think there's a miscommunication here. My text refers to what forward simulations generate, which is individuals born with some ploidy. The text has that audience in mind. What tskit does with ancestors is an internal detail of tskit.

Ah yes, that is a very good point. I am probably thinking too much in terms of tskit definitions, rather than the (independent) simulation code.

However, if, at any point in the simulation, we want to remove (or renumber) a node independently of the individual in which it resides, then using adjacent IDs to define an individual becomes problematic. Most of the time we leave the details of removing / reallocating IDs to tskit, but I suppose I could imagine cases where this might be required in the simulation itself?

molpopgen commented 1 year ago

However, if, at any point in the simulation, we want to remove (or renumber) a node independently of the individual in which it resides, then using adjacent IDs to define an individual becomes problematic. Most of the time we leave the details of removing / reallocating IDs to tskit, but I suppose I could imagine cases where this might be required in the simulation itself?

But again, I am talking about births. Birth happen at once, "in bulk", adding rows to the node table. That's consecutive, right?

hyanwong commented 1 year ago

But again, I am talking about births. Birth happen at once, "in bulk", adding rows to the node table. That's consecutive, right?

Right. One of the confusions is coming from the fact that I am using tskit to allocate (successive) node IDs, rather than e.g. resetting internal simulation IDs to zero each generation. That means, if we are simplifying part-way, there could be an odd number of nodes in the node table (i.e. after simplification), so that individuals start being allocated nodes such as (101,102), (103, 104). Although these groupings are consecutive, they do not start with even numbers, implying that the lower IDs do not conform to the defined (2-nodes = one individual) grouping.

molpopgen commented 1 year ago

Right. One of the confusions is coming from the fact that I am using tskit to allocate (successive) node IDs, rather than resetting internal simulation IDs to zero each generation. That means, if we are simplifying part-way, there could be an odd number of nodes in the node table (i.e. after simplification), so that individuals start being allocated nodes such as (101,102), (103, 104). Although these groupings are consecutive, they do not start with even numbers, implying that the lower IDs do not conform to the standard 2-nodes grouping.

There's no requirement that things start with an even number. For a diploid, they are simply the next 2 available numbers.

hyanwong commented 1 year ago

There's no requirement that things start with an even number. For a diploid, they are simply the next 2 available numbers.

Sure, but the implication would then be that nodes exist (or have existed) in the simulation that are not grouped in pairs. Without going into the gory details of dropping single nodes from the ID list, it could potentially be a bit confusing to the reader?

molpopgen commented 1 year ago

There's no requirement that things start with an even number. For a diploid, they are simply the next 2 available numbers.

Sure, but the implication would then be that nodes exist (or have existed) in the simulation that are not grouped in pairs. Without going into the gory details of dropping single nodes from the ID list, it could potentially be a bit confusing to the reader?

At some point, you have to assume some background in population genetics or else you will go crazy. This is domain-specific stuff you're talking about.

hyanwong commented 1 year ago

At some point, you have to assume some background in population genetics or else you will go crazy. This is domain-specific stuff you're talking about.

Sure - anyway, I think it's useful to mention the tskit way of recording individuals (because we can store those from the simulation too), so I'll see if I can put together some sort of wording that details both cases, without going into too much detail for the beginner (wish me luck!).

molpopgen commented 1 year ago

Oh, now I see the confusion! For me, an individual has nothing to do with the "individuals table" until a final export happens. Even when keeping ancient samples, it is often cleaner to manage those nodes yourself, remapping them each time you simplify, then deal with the individuals table.

hyanwong commented 1 year ago

Ah yes. Personally, I think it is helpful for a forwards simulation to record individuals in the individual table as they are created (rather than generate them on export). Then you also keep all the individuals associated with ancestral non-sample genomes too, which can be helpful. For example, it means you can e.g. store the geographical location of each individual in the table, and hence gain access to the location of ancestral genomes.

(simplify should take care of the node+individual remapping anyway, right?)

molpopgen commented 1 year ago

That works, but has drawbacks. For example, having the locations only in the tskit table means that they aren't in, say, an R-tree or some other means of efficiently looking up things in "space".

hyanwong commented 1 year ago

That works, but has drawbacks. For example, having the locations only in the tskit table means that they aren't in, say, an R-tree or some other means of efficiently looking up things in "space".

Yes, although I'm not saying that they should only be stored in tskit. In practice, you might keep the locations for the parent and the child generations in some efficient structure too, but throw them away when you go to the next generation. Having the locations permanently stored for all ancestors is generally useful (and if you wanted to do anything time consuming with those locations, you might well extract the coordinates from the individual table and put them into a more efficient structure before analysis).

Either way, I'm pretty sure that it is helpful to store individuals as-we-go, recording any metadata about them as necessary. Individuals associated with non-sample nodes can be discarded later if need be, but they can contain useful information which can be worth storing and accessing from the output tree sequence. I don't think this is easily done by constructing the individuals at the end, just before ts output.

hyanwong commented 1 year ago

Here's what I have so far (a hybrid of your texts (mostly) and mine):

Forward-time simulation with tskit

This tutorial shows the basics of creating a basic forward-time simulator that stores the evolving genealogy as a tskit tree sequence. We will focus on the case of diploids, in which each individual contains 2 genomes, but the concepts used here generalize to any ploidy, if you are willing to do the book-keeping!

:::{note} The distinction between an individual and the genomes it contains is an important one. In fact, individuals are not strictly necessary for representing genetic genealogies (it's the genomes which are important), but a simulator needs to account for individuals, at least temporarily. :::

Definitions

Before we can make any progress, we require a few definitions.

A node represents a genome at a point in time (often we imagine this as the "birth time" of the genome). It can be described by a tuple, (id, time), where id is a unique integer, and time reflects the birth time of that id. When generating a tree sequence, this will be stored in a row of the node table.

A diploid individual is a group of two nodes. During simulation, a simple and efficient grouping assigns sequential pairs of node IDs to an individual. It can be helpful (but not strictly necessary) to store individuals within the tree sequence as rows of the the individual table (a node can then be assigned to an individual by storing that individual's id in the appropriate row of the node table).

An edge reflects a transmission event between nodes. An edge is a tuple (Left, Right, Parent, Child) whose meaning is "Parent genome $P$ passed on the genomic interval $[L, R)$ to child genome $C$". In a tree sequence this is stored in a row of the edge table

The time, in the discrete-time Wright-Fisher (WF) model which we will simulate, is measured in integer generations. To match the tskit notion of time, we record time in generations ago: i.e. for a simple simulation of G generations, we start the simulation at generation $G-1$ and count down until we reach generation 0 (the current-day).

The population consists of $N$ diploid individuals ($2N$ nodes) at a particular time $t$. At the start, the population will have no known ancestry, but subsequently, each individual will be formed by choosing (at random) two parent individuals from the population in the previous generation.

Approach

The goal of this tutorial is to work through the book-keeping required to generate edges, nodes, and individuals forwards in time, adding them to the relevant tskit tables. To aid efficiency, we will also see how to "simplify" the tables into the minimal set of nodes and edges that describe the history of the sample. Finally, these tables can be exported into an immutable tree sequence for storing or analysis.

hyanwong commented 1 year ago

@molpopgen - I suggest that material in your wfforward notebook that deals with "Starting with a prior history" (e.g. matching time units, etc) should be ported into the (already existing) Completing forwards simulations tutorial. Does that seem sensible?

molpopgen commented 1 year ago

Please just go ahead and do whatever you think is needed. I don't have any attachment to anything these docs and don't need to approve any changes.