tskit-dev / msprime

Simulate genealogical trees and genomic sequence data using population genetic models
GNU General Public License v3.0
177 stars 88 forks source link

sim_ancestry with initial state can produce invalid ts when mutations exist above the roots #2319

Open ShyamieG opened 2 months ago

ShyamieG commented 2 months ago

Hi all,

I'm running into a "TSK_ERR_MUTATION_TIME_OLDER_THAN_PARENT_NODE" error when trying to recapitate a truncated tree sequence. I've attached some code and a tree sequence to walk through the issue.

The original tree sequence can be recapitated without a problem, but I run into the error when I try to truncate my tree sequence first using the 'delete_older' method:

# truncate, then recapitate
max_age = 20000
ts_tables = tree_seq.dump_tables()
ts_tables.delete_older(max_age)
trunc_ts = ts_tables.tree_sequence()

msprime.sim_ancestry(initial_state=trunc_ts, recombination_rate = 1e-9, demography = demography, random_seed = 1234)

the above code gave me the following error: Traceback (most recent call last): File "", line 1, in File "/Users/shyamag/miniconda3/envs/slim-env/lib/python3.9/site-packages/msprime/ancestry.py", line 1226, in sim_ancestry return _wrap_replicates( File "/Users/shyamag/miniconda3/envs/slim-env/lib/python3.9/site-packages/msprime/ancestry.py", line 660, in _wrap_replicates deque = collections.deque(iterator, maxlen=1) File "/Users/shyamag/miniconda3/envs/slim-env/lib/python3.9/site-packages/msprime/ancestry.py", line 1518, in run_replicates ts = tables.tree_sequence() File "/Users/shyamag/miniconda3/envs/slim-env/lib/python3.9/site-packages/tskit/tables.py", line 3341, in tree_sequence return tskit.TreeSequence.load_tables(self) File "/Users/shyamag/miniconda3/envs/slim-env/lib/python3.9/site-packages/tskit/trees.py", line 4110, in load_tables ts.load_tables(tables._ll_tables, build_indexes=build_indexes) _tskit.LibraryError: A mutation's time must be < the parent node of the edge on which it occurs, or be marked as 'unknown'. (TSK_ERR_MUTATION_TIME_OLDER_THAN_PARENT_NODE)

I found this strange because I'm not simulating any new mutations during recapitation. I also checked for cases where a mutation has an older time than the parent node of its associated edge and found none:

# check for bad mutations
for mutation in trunc_ts.mutations():
    mut_time = mutation.time
    parent_node_time = trunc_ts.node(trunc_ts.edge(mutation.edge).parent).time
    if mut_time >= parent_node_time:
        print(mutation)

I went ahead and marked all my mutation times as unknown, as suggested by the error message. I don't really need mutation times for my application, so that is a viable workaround for me. However, I feel like this should not be happening and might indicate a bug. Any ideas what might be going on?

Thanks!

ShyamieG commented 2 months ago

Sorry, I just realized that there are in fact two mutations that violate the rule. It seems they ended up on edges that no longer exist after I truncate my tree. I can simply remove them and it works.

Here is the function I wrote:

def truncate_tree(tree_seq, max_age):
    ts_tables = tree_seq.dump_tables()
    ts_tables.delete_older(max_age) # deletes mutations and edges
    # delete nodes that are too old
    too_old_nodes = set(np.where(tree_seq.nodes_time > max_age)[0])
    ts_tables.nodes.clear()
    for node in tree_seq.nodes():
        if node.id not in too_old_nodes:
            ts_tables.nodes.add_row(flags = node.flags, \
                                    time = node.time, \
                                    population = node.population, \
                                    individual = node.individual, \
                                    metadata = node.metadata)
    trunc_ts = ts_tables.tree_sequence()
    # delete mutations whose edge no longer exists
    ts_tables = trunc_ts.dump_tables()
    ts_tables.mutations.clear()
    for mutation in trunc_ts.mutations():
        if mutation.edge != -1:
            ts_tables.mutations.add_row(site = mutation.site, \
                                        node = mutation.node, \
                                        derived_state = mutation.derived_state, \
                                        parent = mutation.parent, \
                                        metadata = mutation.metadata)
    new_tree_seq = ts_tables.tree_sequence()
    return new_tree_seq
bhaller commented 2 months ago

Hmm, should this be closed? Having a mutation positioned on an edge that no longer exists sounds like the truncation operation has left the user with an invalid tree sequence, no? The mutation should either be removed, or re-positioned onto an edge that does still exist, no? I'm gonna reopen this for now, just to make sure this gets seen – although perhaps I'm wrong and there is no bug here.

petrelharp commented 2 months ago

Gee, thanks, @ShyamieG - this is an issue. I think that either

  1. this is an issue with msprime.sim_ancestry(initial_state=...), in that it needs to check for mutations above the roots and either deal with them properly (either put them on the right branches or error?), or
  2. we should decide that this is a User Beware situation and maybe add documentation.

My inclination is (1), and in fact it'd be a pretty easy thing to fix up (ie to put the mutations where they should go?) but I haven't thought about it hard. I'm going to transfer this to msprime.

petrelharp commented 2 months ago

Tangentially, we have the code to do the needed operation over in https://github.com/tskit-dev/tskit/pull/2938:

static int
tsk_treeseq_slide_mutation_nodes_up(
    const tsk_treeseq_t *self, tsk_mutation_table_t *mutations)
{
    int ret = 0;
    double t;
    tsk_id_t c, p, next_mut;
    const tsk_table_collection_t *tables = self->tables;
    const tsk_size_t num_nodes = tables->nodes.num_rows;
    double *sites_position = tables->sites.position;
    double *nodes_time = tables->nodes.time;
    tsk_tree_t tree;

    ret = tsk_tree_init(&tree, self, TSK_NO_SAMPLE_COUNTS);
    if (ret != 0) {
        goto out;
    }

    next_mut = 0;
    for (ret = tsk_tree_first(&tree); ret == TSK_TREE_OK; ret = tsk_tree_next(&tree)) {
        while (next_mut < (tsk_id_t) mutations->num_rows
               && sites_position[mutations->site[next_mut]] < tree.interval.right) {
            t = mutations->time[next_mut];
            if (tsk_is_unknown_time(t)) {
                ret = TSK_ERR_DISALLOWED_UNKNOWN_MUTATION_TIME;
                goto out;
            }
            c = mutations->node[next_mut];
            tsk_bug_assert(c < (tsk_id_t) num_nodes);
            p = tree.parent[c];
            while (p != TSK_NULL && nodes_time[p] <= t) {
                c = p;
                p = tree.parent[c];
            }
            tsk_bug_assert(nodes_time[c] <= t);
            mutations->node[next_mut] = c;
            next_mut++;
        }
    }
    if (ret != 0) {
        goto out;
    }

out:
    tsk_tree_free(&tree);

    return ret;
}

and python code:


def _slide_mutation_nodes_up(ts, mutations):
    # adjusts mutations' nodes to place each mutation on the correct edge given
    # their time; requires mutation times be nonmissing and the mutation times
    # be >= their nodes' times.

    assert np.all(~tskit.is_unknown_time(mutations.time)), "times must be known"
    new_nodes = mutations.node.copy()

    mut = 0
    for tree in ts.trees():
        _, right = tree.interval
        while (
            mut < mutations.num_rows and ts.sites_position[mutations.site[mut]] < right
        ):
            t = mutations.time[mut]
            c = mutations.node[mut]
            p = tree.parent(c)
            assert ts.nodes_time[c] <= t
            while p != -1 and ts.nodes_time[p] <= t:
                c = p
                p = tree.parent(c)
            assert ts.nodes_time[c] <= t
            if p != -1:
                assert t < ts.nodes_time[p]
            new_nodes[mut] = c
            mut += 1

    # in C the node column can be edited in place
    new_mutations = mutations.copy()
    new_mutations.clear()
    for mut, n in zip(mutations, new_nodes):
        new_mutations.append(mut.replace(node=n))

    return new_mutations
petrelharp commented 1 month ago

Someone else has hit this, FWIW: https://github.com/tskit-dev/pyslim/issues/352 - tldr; the input trees were simplified (they shouldn't have been, or should have had keep_input_roots=True).