tskit-dev / tskit

Population-scale genomics
MIT License
151 stars 70 forks source link

Returning the full ARG with ts.simplify() #2765

Open kitchensjn opened 1 year ago

kitchensjn commented 1 year ago

Is there a method when simplifying a tree sequence to remove all unary nodes except the recombination nodes (a middle ground between ts.simplify(keep_unary=False) versus ts.simplify(keep_unary=True))? We are working with the tree sequence output from a SLiM simulation with initializeTreeSeq(retainCoalescentOnly=F), which contains lots of unary nodes, and we want to simplify it down to just the nodes that affect the ARG structure, the full ARG. As the output tree sequence from SLiM does not have marked recombination nodes, these would first need to be identified before simplifying. Copying @pderaje as he is working on this with me.

(See MesserLab/SLiM#376 for the initial post before determining it was better suited here.)

jeromekelleher commented 1 year ago

This isn't very easy right now all right - @hyanwong I think we have some hacks for this in sc2ts? Something like, identify the recombinants as the nodes that have greater than 1 parent, and then mark feed them in as samples or something?

hyanwong commented 1 year ago

Yes, I think it would have to be a hack. You could either mark them as samples (and then unmark them afterwards, or maybe even use update_sample_flags=False), or associate them with an individual and use simplify(keep_unary_in_individuals=True) (and then remove those individuals afterwards).

The advantage to the "mark samples" method is that it is reasonably clean. The advantage to the individuals method is that it will correctly remove portions of edges that lead to the recombination node but do not lead on to other samples (e.g. "nonancestral" recombination events). My (untested) guess is that the samples method will not remove nonancestral recombination nodes.

I'm not aware of any hacks in sc2ts that keep these in (apart from here, where we don't reset the sample flags). Mostly we just use keep_unary=True.

Here's some (semi-untested) code for the two methods. The new "filter_nodes=False" option to simplify is very handy to check that the plots are sensible.

import msprime
import numpy as np

arg = msprime.sim_ancestry(2, sequence_length=1e3, recombination_rate=0.001, record_full_arg=True)
re_nodes = np.where(arg.nodes_flags & msprime.NODE_IS_RE_EVENT)[0]
style = "".join(f".n{u} > .sym {{fill: red}}" for u in re_nodes)
arg.draw_svg(style=style)

image

# "sample" method
s1_arg = arg.simplify(np.concatenate((arg.samples(), re_nodes)), update_sample_flags=False, filter_nodes=False)
s1_arg.draw_svg(style=style)

image

# "individual" method
tables = arg.dump_tables()
individual_arr = tables.nodes.individual
for u in re_nodes:
    individual_arr[u] = tables.individuals.add_row()
tables.nodes.individual = individual_arr
tables.simplify(keep_unary_in_individuals=True, filter_nodes=False)
tables.nodes.individual = arg.nodes_individual  # set the individuals back to the original
s2_arg = tables.tree_sequence()
s2_arg.draw_svg(style=style)

image

hyanwong commented 1 year ago

A keep_multiple_parent_nodes option would be a reasonable idea IMO (note that these might not always be strict "recombination nodes": they could be summaries of an earlier Re event. @GertjanBisschop has been thinking about rationalising the various simplify options, so he might have an opinion.

pderaje commented 1 year ago

Thanks, @hyanwong! We were doing something similar to the "samples" method, except SLiM doesn't automatically flag the recombination nodes, so we identified nodes with multiple parents, which raised two questions

  1. Is there a way to automatically flag those nodes with multiple parents, either while recording the tree sequence in SLiM or while simplifying in tskit (something like keep_multiple_parent_nodes should do the trick)? Otherwise going through each nd takes a while in larger ARGs.

  2. While adding the recombination nodes to be sampled, we weren't sure what we should be adding to the list. In the following code, is it the two parents or the nd itself (or maybe it doesn't matter) that should be added to the sample list?

`def ts_to_ARG(ts): ts_tables = ts.dump_tables() node_table = ts_tables.nodes flags = node_table.flags recomb_nodes = []

for nd in ts_sim.nodes(): 
    parents = np.unique(ts_tables.edges.parent[np.where(ts_tables.edges.child == nd.id)[0]])
    children = np.unique(ts_tables.edges.child[np.where(ts_tables.edges.parent == nd.id)[0]])
    if len(parents) == 2: 
        recomb_nodes += list(parents)
        flags[parents] = [131072,131072]

    if len(parents) > 2:
        raise TypeError('Error',nd.id)

node_table.flags = flags
ts_tables.sort() 
ts_new = ts_tables.tree_sequence()

keep_nodes = list(ts_new.samples()) + list(np.unique(recomb_nodes))
ts_final, maps = ts_new.simplify(samples=keep_nodes, map_nodes = True, keep_input_roots=False, keep_unary=False, update_sample_flags = False)

return ts_final, maps, recomb_nodes`
hyanwong commented 1 year ago
  1. Is there a way to automatically flag those nodes with multiple parents, either while recording the tree sequence in SLiM or while simplifying in tskit (something like keep_multiple_parent_nodes should do the trick)? Otherwise going through each nd takes a while in larger ARGs.

There isn't. But it should be possible from the edge table arrays, more-or-less in a single pass right? You could do an np.unique on the 2 x num_edges 2D array, keeping unique pairs of (child, parent), sorted by child, then look for duplicate child IDs, I think?

  1. While adding the recombination nodes to be sampled, we weren't sure what we should be adding to the list. In the following code, is it the two parents or the nd itself (or maybe it doesn't matter) that should be added to the sample list?

I guess you want the node itself

hyanwong commented 1 year ago

I guess you want the node itself

Ah, I was wrong. It depends how you represent the recombination event. In msprime, we create 2 nodes per recombination event (because it helps us to calculate the likelihood under the Hudson coalescent, see https://github.com/tskit-dev/msprime/issues/1942). Quoting from there:

I can't see how storing the recombination as two nodes helps with this. Surely the position of the breakpoint is lost in this case too?

The exact position of the breakpoint is lost, but you retain information about which non-ancestral gap contains the breakpoint, even when there are several gaps. That's enough to work out how many links are available for an effective recombination, and hence the rate of effective recombinations, in the next time interval.

Ah, thanks for the explanation. So if (in the original msprime code) you have a pair of RE nodes (either of which can be a parent in multiple edges, you take the rightmost edge position from the first node, and the leftmost from the second node, and (because you only have one breakpoint per recombination event) you are guaranteed that the "hidden" breakpoint lies between these? Is that right? I can see how that might work, but I can't see on earth how you would generalise this to multiple breakpoints. Perhaps this is an indications that the system we are switching to, with the breakpoints for the RE node stored in metadata, is more correct, if a little more bespoke.

In this case you might want to keep the parents.

In the SLiM case, my guess is that you have one recombination node per event, so you want the children. This is all a bit messy!

hyanwong commented 1 year ago

it should be possible from the edge table arrays, more-or-less in a single pass right? You could do an np.unique on the 2 x num_edges 2D array, keeping unique pairs of (child, parent), sorted by child, then look for duplicate child IDs,

@kitchensjn : I think this finds the nodes with multiple parents, doesn't it? Could you check my logic, and if it's correct, I can add it as a Q&A to the discussions forum.

uniq_child_parent = np.unique(np.column_stack((ts.edges_child, ts.edges_parent)), axis=0)
nd, count = np.unique(uniq_child_parent[:, 0], return_counts=True)
multiple_parents = nd[count > 1]
print(f"Nodes with multiple parents are {multiple_parents}")
kitchensjn commented 1 year ago

Yup, this will return all of the nodes with more than one parent. Then checking that it identifies the parents with recombination flags should be something like:

import numpy as np
import msprime

ts = msprime.sim_ancestry(
    samples=4,
    recombination_rate=1e-8,
    sequence_length=2_000,
    population_size=10_000,
    record_full_arg=True
)

uniq_child_parent = np.unique(np.column_stack((ts.edges_child, ts.edges_parent)), axis=0)
nd, count = np.unique(uniq_child_parent[:, 0], return_counts=True)
multiple_parents = nd[count > 1]
recomb_nodes = ts.edges_parent[np.in1d(ts.edges_child, multiple_parents)]
print(f"Recombination nodes: {recomb_nodes}. Flags match properly: {np.all(ts.nodes_flags[recomb_nodes]==msprime.NODE_IS_RE_EVENT)}")
hyanwong commented 1 year ago

Yes, although the recombination nodes created by msprime.sim_ancestry(... record_full_arg=True) are not the ones with multiple parents, but (confusingly, and for technical reasons) the nodes above the node with multiple parents. That was the point of my digression above. It should probably work in SLiM though, assuming that SLiM doesn't use the 2-RE-node encoding used by msprime.

kitchensjn commented 1 year ago

Just to clarify, so please correct me if I've misunderstood:

My code should work for tree sequences with the 2-RE-node encoding (msprime) as it returns the parents of the nodes with multiple parents. It is equivalent to np.where(ts.nodes_flags==msprime.NODE_IS_RE_EVENT)[0] for a tree sequence generated using msprime.sim_ancestry(... record_full_arg=True).

The final two lines of my code would not be needed for a tree sequence that uses a 1-RE-node encoding (SLiM, most likely). For those tree sequences, the array multiple_parents would contain all of the recombination nodes.

jeromekelleher commented 1 year ago

There may not be a good reason for doing things this way in msprime with the two re nodes, now that we can keep unary nodes more flexibly.

@GertjanBisschop can you comment? It would be good to make a decision here regarding how we record recombs before we release the new additional nodes API

(This is an msprime issue though - can someone open an issue on msprime to discuss potentially changing how we record re nodes for the new additional nodes API please?)

hyanwong commented 1 year ago

Just to clarify, so please correct me if I've misunderstood:

No you are right - I didn't read your code fully, sorry!

GertjanBisschop commented 1 year ago

(This is an msprime issue though - can someone open an issue on msprime to discuss potentially changing how we record re nodes for the new additional nodes API please?)

I opened an issue. As @hyanwong already mentioned, the 2-nodes vs 1-node recombination event encoding has been discussed before. Not entirely sure yet how the more flexible node recording would help resolve why we stuck with the 2-node encoding.

hyanwong commented 2 months ago

Just to note in passing that a large number of the ARG nodes that are not in a tree sequence are not recombination nodes, but common-ancestor-non-coalescent nodes. You would probably want to keep these too.

I think a flexible thing would be to be able to pass a bit array of flags to simplify, and keep any nodes with those flags set. We could have a separate routine for flagging up nodes in the correct way, which would be different for a 1-RE node vs 2-RE node TS.