tskit-dev / pyslim

Tools for dealing with tree sequences coming to and from SLiM.
MIT License
26 stars 23 forks source link

edges are lost from tree sequence after SLiM step #331

Closed ShyamieG closed 7 months ago

ShyamieG commented 9 months ago

I wasn't entirely sure if this should be a SLiM issue or tskit issue - please let me know if I your think I should post this on the SLiM github instead.

I am finding that edges are being dropped from my tree sequence after passing it through a SLiM simulation. I have attached a full example here. The SLiM code is complicated, but it runs in just a few minutes on my machine, and I wanted to provide a real subset of the problem I am facing 'in the wild'.

In this example, running the script 'run_hhar.sh' passes the tree sequence file 'inf-1602_V8865_to_H265_on_day_49.ts' to a SLiM script 'hhar.slim' and produces the output 'inf-1686_H265_to_V4533_on_day_39.ts', 'inf-1686_H265_to_V4861_on_day_162.ts', and 'inf-1686_H265_to_sample_on_day_168.ts'. Within SLiM, I have set simplify = F.

Next, the check_trees.py script loads the input and output trees and checks the history of 3 nodes (referenced by their SLiM ids) that should exist in all the trees. I simplify all the trees based on those 3 nodes, keeping all roots and internal nodes. Below are examples of the input and output trees across two intervals.

Input tree:

input_tree

One of the output trees:

output_tree1

From this, I can see that edges (e.g. the one connecting 41047010 and 45478074) and internal nodes (e.g. 41047010, 49821434) that existed in the input tree have been dropped in subsequent trees. Of note, the node 41047010 exists in the unsimplified output trees, but its relationship to its parent and child nodes is lost. The other two internal nodes (49821434, 45478102) are dropped entirely. It is not my intention for this to happen - if an edge exists at any point in the history of the sample nodes, I want that edge to be remembered across all subsequent SLiMulations that the tree sequences are passed through.

Any ideas about why this might be happening and how I can fix it?

edited to say that I'm using SLiM version 3.7.1 and tskit version 0.5.2

petrelharp commented 9 months ago

Hi, @ShyamieG! This looks like a very interesting simulation, and thanks very much for the very complete set-up. I'm not sure if this is SLiM or tskit or what yet. And, let's see: I've not got my head around everything that's happening here yet, but can you give a bit more explanation of why you expect those edges to still be there? I'm a bit unclear on what's supposed to be simplified, etcetera.

Two observations:

  1. SLiM is doing simplification as it goes along (or at least potentially is?) - to turn off the 'as you go along' simplification you need to adjust the simplificationRatio' in the call toinitializeTreeSeq`, although you have turned it off when you write the tree sequence out.
  2. The nodes you're choosing to check up on are not all samples - only the first two are samples, and then only in the 'ancestral' tree sequence (flags=1 means samples):
    >>> for n in slim_ids_to_check:
    ...   print('anc', anc_tree.node(anc_tree_node_map[n]))
    ...   print('1', tree1.node(tree1_node_map[n]))
    ...   print('2', tree2.node(tree2_node_map[n]))
    ...   print('3', tree3.node(tree3_node_map[n]))
    ... 
    anc Node(id=50, flags=1, time=46.0, population=7, individual=25, metadata={'slim_id': 50170400, 'is_null': False, 'genome_type': 0})
    1 Node(id=357144, flags=0, time=85.0, population=7, individual=-1, metadata={'slim_id': 50170400, 'is_null': False, 'genome_type': 0})
    2 Node(id=86130, flags=0, time=208.0, population=7, individual=-1, metadata={'slim_id': 50170400, 'is_null': False, 'genome_type': 0})
    3 Node(id=86130, flags=0, time=214.0, population=7, individual=-1, metadata={'slim_id': 50170400, 'is_null': False, 'genome_type': 0})
    anc Node(id=10, flags=1, time=63.0, population=2, individual=5, metadata={'slim_id': 48365944, 'is_null': False, 'genome_type': 0})
    1 Node(id=357147, flags=0, time=102.0, population=2, individual=-1, metadata={'slim_id': 48365944, 'is_null': False, 'genome_type': 0})
    2 Node(id=86133, flags=0, time=225.0, population=2, individual=-1, metadata={'slim_id': 48365944, 'is_null': False, 'genome_type': 0})
    3 Node(id=86133, flags=0, time=231.0, population=2, individual=-1, metadata={'slim_id': 48365944, 'is_null': False, 'genome_type': 0})
    anc Node(id=8044, flags=0, time=106.0, population=6, individual=-1, metadata={'slim_id': 45478074, 'is_null': False, 'genome_type': 0})
    1 Node(id=357148, flags=0, time=145.0, population=6, individual=-1, metadata={'slim_id': 45478074, 'is_null': False, 'genome_type': 0})
    2 Node(id=86134, flags=0, time=268.0, population=6, individual=-1, metadata={'slim_id': 45478074, 'is_null': False, 'genome_type': 0})
    3 Node(id=86134, flags=0, time=274.0, population=6, individual=-1, metadata={'slim_id': 45478074, 'is_null': False, 'genome_type': 0})

You wrote that

if an edge exists at any point in the history of the sample nodes, I want that edge to be remembered across all subsequent SLiMulations that the tree sequences are passed through.

This is quite concrete and clear, I'm just wondering: which sample nodes? The nodes alive at the end of the last simulation? The nodes alive at the end of each run through SLiM? (If you'd like to make sure the nodes alive at the end of each run through SLiM remain, you need to explicitly Remember them, with a call to treeSeqRememberIndividuals.)

ShyamieG commented 9 months ago

Hi @petrelharp, and thanks again for the crucial lead! I looked into the 'simplificationRatio' argument, and setting this to INF gave me the expected result in my example, which was four identical trees when subsetted on those nodes of interest.

To back up a little and answer some of the questions you asked:

can you give a bit more explanation of why you expect those edges to still be there? I'm a bit unclear on what's supposed to be simplified, etcetera.

I guess I don't understand how some of these shared nodes (e.g. 45478074) can become completely untethered from the rest of the tree, and it seems that this behaviour is causing problems down the line when I attempt to merge related tree sequences. This probably reflects some conceptual misunderstanding on my part, but I would expect that if a node exists in the ancestral tree sequence and is retained in the descendant trees it should also keep all edges to related nodes, provided they are also retained.

I'm just wondering: which sample nodes? The nodes alive at the end of the last simulation? The nodes alive at the end of each run through SLiM?

In theory, only the former (bolded) - the samples alive at the end of the last set of SLiM simulations are the only ones I really care about. But maybe I need to remember some of these intermediate individuals as well, for practical reasons? My broader goal is to simulate multiple populations that share a common origin (a single 'root' SLiM simulation) and evolve in parallel along semi-independent chains formed by additional SLiM simulations. This is illustrated in the figure below, with each coloured node essentially representing a SLiM simulation (a non-cartoon version of this can be seen here).

Fig3_network_figure

Ultimately, I want to merge the tree sequences generated by the terminal set of SLiM simulations (the 'sampled infections', or leaves on the 'meta-tree'). The problem I'm having is that tskit.union() sometimes complains about the shared portions of a pair of trees not being identical, as in the example I initially posted here. So my goal is to retain all the information needed to correctly merge these 'leaf' tree sequences and generate the 'meta-tree' tree sequence. But I don't want to retain too much unnecessary information, either. I have played around with different ways of combining simplification and 'remembering' at strategic points along my pipeline, but it seems that I haven't gotten it quite right yet.

Your tip about the 'simplificationRatio' option is really helpful, as I had not come across that before. I'm trying it on my full pipeline now. The only problem is that it makes things significantly slower - currently, it looks like it will take ~9-10x longer to run a full simulation. I can live with that if it fixes all my merge issues (will keep you posted on that), but if you have any other ideas on how I could optimize these functions across SLiM and tskit to keep the information I need, I'd be very open to hearing them. I feel like what I want is the option to 'keep_unary' and 'keep_input_roots' when I output the tree sequence at the end of my SLiM script...

petrelharp commented 9 months ago

Let's see - first off, I think I'll move this to pyslim.

And - I'm guessing you've read this tutorial on union-ing tree sequences together. It says how to do it, but doesn't really go into the how or why. Have you had a look at these docs? Those might help.

But I think the short answer to your question is: suppose you want to simulate from times t0->t1; then take the result and simulate independently several different times from t1->t2; and later merge those together. To make sure that the history from t0->t1 in all the resulting tree sequences is identical, you should (in SLiM) use treeSeqRememberIndividuals at the end of that first t0->t1 simulation. That will ensure that everyone alive at time t1 is marked as a sample, and will remain marked as a sample henceforth; then simplify will not remove any of their history.

But, maybe I misunderstand and that's not getting at your problem? Let me know.

You shouldn't need to use simplificationRatio - you're right, turning off simplify is a terrible thing that uses way too much time and memory.

ShyamieG commented 9 months ago

I think this is what I was missing - thank you! I am trying this out now.

petrelharp commented 7 months ago

Say - can we close this issue, @ShyamieG ?

ShyamieG commented 7 months ago

Yes, I think so. I haven't finished testing this out because of all the other issues I've been running in to, but I will be sure to update once I've sorted them out.