tskit-dev / tskit

Population-scale genomics
MIT License
147 stars 69 forks source link

edges lost after merging two trees #2929

Closed ShyamieG closed 1 week ago

ShyamieG commented 2 months ago

Hi all,

I've been observing some unexpected behaviour when merging tree sequences using tskit.union(). I've tried to pare it down to a reasonable example that can be examined by eye. My code and starting tree sequences are attached.

I have two tree sequences (tree1.txt, tree2.txt) that share some of their evolutionary history, but diverge and evolve independently starting at tick 371. Both tree sequences comprise unbroken chains of nodes going all the way back to the root, across the entire length of the sequence. I'm using the SLiM ids to identify equivalent nodes across trees for tskit.union().

The problem is that when I merge these two trees, I end up with missing edges such that some of these node chains end up 'broken' (merged.txt - see middle tree between bp 9758 and 15477, where tree1's chain of nodes ends at tick 1). This also leads to the merged tree having more roots overall (specifically, 274815470 becomes a new root, even though it had a parent in the original tree).

Weirdly, this does NOT happen when I simplify both tree sequences before merging. This gives me a tree sequence that is more like what I would expect to see (merged_sim.txt). Both tree1 and tree2's chains of nodes carry on all the way back to their shared root node.

The nodes in the attached tree drawings are labelled by SLiM id to facilitate comparison across the starting trees and merged tree. I was trying to create an even smaller example, but I found that simplifying the tree sequences past a certain point I can no longer reproduce the issue.

I'm really not sure what might be going on here. Any ideas would be appreciated!

mufernando commented 2 months ago

Hi, can you tell me more about how the SLiM tree sequences were generated?

I see that there are no coalescences in tree1 and tree2. It's just a few lineages with a bunch of unary nodes along them.

ShyamieG commented 2 months ago

Hi Murillo,

Thanks for your help! I am simulating the evolution of a pathogen as infections are transmitted from one individual to another through time. My SLiM scripts simulate pathogen reproduction within an infected individual and output the resulting tree sequence, which becomes input for the next iteration of the SLiM script (i.e. the individual to whom the infection is transmitted to next). One contagious individual can infect more than one susceptible individual, so the tree sequences can diverge and continue in parallel at various points in the infection scenario. The result of my simulations are a set of related tree sequences that have been 'sampled' from a population of infected individuals. My goal now is to merge these tree sequences to get a unified tree sequence representing the genetic ancestry of all the sampled pathogens. Here is an older discussion that explains my context and overall goal in a bit more detail.

I hope that helps. Happy to go into more detail if you need.

mufernando commented 2 months ago

Hi Shyamie, I made some progress here.

tskit.union adds the non-shared bits of other onto self. The way union knows what bits of other are not already in self is with the node_map. node_map is an array of length other.num_nodes which relates a node in other (by the position in this array) to a node in self (the value at each position). If the value at a position is tskit.NULL, then it adds that node/edge onto self.

The main complaint here is that tskit.union seems to not be working because it doesn't add the lineage starting from node SLiM id 274815470 back to its root in the interval [9758, 15477). But it does work when you first simplify the two tree sequences.

I think the problem is in the node_map. I saved the node_map from the un-simplified union as node_map1 and for the simplified union as node_map2. If you inspect these arrays, you can see that they both contain tskit.NULL for the extant node 274815504, indicating that this node must be added. However, starting on the node 274815470 and back, the node_map for the un-simplified tree sequences contain a value other than tskit.NULL, thus they are not being added to the union-ed tree sequence.

> # SLiM node 274815504
> x = node_map1[tree1_slim_id_map[274815504]]
> y = node_map2[tree1_slim_id_map2[274815504]]
> x,y
(-1, -1)
> # SLiM node 274815470
> x = node_map1[tree1_slim_id_map[274815470]]
> y = node_map2[tree1_slim_id_map2[274815470]]
> x,y
(1, -1)

Thus, I don't think there is an issue with tskit.union itself, but on how you are defining these maps.

Now, I am not sure how to fix this because it isn't clear to me how we should match the nodes in your case...

ShyamieG commented 2 months ago

Hi Murillo,

Ah, yes, you're right. I'd already simplified my trees a bit to post a smaller example, and I think that is introducing some complications (ironically).

The node 274815470 actually does exist in both my original trees - however, it only exists for tree1 in this particular interval. Here is a larger interval for tree1 and tree2 to illustrate this. Again, the merged tree sequence has 'lost' edges and stranded nodes. Sorry for the large text files. Here are the tree sequence files, too. You can see the last interval in tree2 contains the node 274815470.

Your explanation is helpful, though. It makes sense that if a node exists in both trees, there would be no need to add it. I guess my naive assumption was that, even if that is the case, tskit.union would add any edges present in either tree1 or tree2 to the merged tree. In the image below, node 1 corresponds to SLiM id 274815470:

Screenshot 2024-04-16 at 3 10 08 PM

Clearly, that isn't what is happening - the edges unique to tree2 are kept and those unique to tree1 are dropped. If I reverse the order of the tree sequences in the tskit.union() call, the reverse happens. Is there some way to force tskit.union to retain both sets of edges?

mufernando commented 2 months ago

There is no way currently to retain both sets of edges with union. We assume that the history is completely identical up to a point. If this node exists and is evolving independently in each of the two simulations, how can you reconcile the independent histories?

ShyamieG commented 2 months ago

I have to think about this a bit more, but to take a stab at answering your question - I don't think the shared nodes are evolving independently in these two trees. The history of any node that is present in both tree1 and tree2 should be identical. However, not all nodes will be present in all intervals because the two trees represent different sample sets.

(As a side note, I said earlier that the trees diverged at tick 371, but this isn't accurate - that is just when the lineages leading to samples 274815504 and 274815508 diverge.)

Does this make any sense? Again, I think I need to do a bit more investigating in order to address your point.

This has been a helpful discussion so far. Thanks for taking the time!

mufernando commented 2 months ago

Are you remembering all the individuals at the end of each pathogen reproduction cycle? If not then you would lose some edges due to simplification in each subsequent simulation independently and then we can't union them.

In the pyslim vignette, I recommend you remember all the individuals at the end and these can be used to create the node_map afterwards.

ShyamieG commented 2 months ago

I will look at that more carefully - thanks!

ShyamieG commented 3 weeks ago

Update - I think I finally solved this problem. It did indeed have to do with certain crucial individuals not being retained/remembered in the final trees. Thanks for the tip, @mufernando!

petrelharp commented 3 weeks ago

That's very exciting! Good work!!!

ShyamieG commented 3 weeks ago

UGH - I'm sorry, but I spoke to soon. This problem still persists 😭. See example below - trees attached here and code here.

As before, I have two tree sequences that I want to merge together. They share a common origin, but diverge at some point in my simulation sequence and continue evolving independently. Drawings of these trees are attached here (tree1) and here (tree2). Note that the nodes are labelled by their SLiM id, so they are directly comparable.

Over the first interval, it looks like the merge has been performed correctly. However, over the second interval, where tree1 and tree2 share a different ultimate root (6 and 0, respectively), the problem I described before still pops up: several edges that exist in tree2 are dropped when it is merged with tree1 (the converse happens if I flip the order in which I merge the trees). See merged tree drawing here.

I feel like I'm so close to solving this - I have two parallel trees over the second subtree, which is what I would expect in this situation, but they don't both trace all the way back to their respective ultimate roots. All the relevant individuals are now being permanently remembered throughout my simulation, so I'm not sure what else to try.

Any help or ideas would once again be greatly appreciated. Thank you!

petrelharp commented 3 weeks ago

UGH indeed. I'll have a closer look, but: if you remember all the individuals, does the problem go away?

ShyamieG commented 3 weeks ago

Do you mean remembering everyone that is created during my sim, or everyone that is alive at the end of a sim step? Or something else?

I can try playing with different strategies, but since I'm generating hundreds of thousands of individuals per sim step, I have had to come up with some way of dropping irrelevant individuals, or else things get untenable quickly 😅. I guess the trick is figuring out who is actually irrelevant!

ShyamieG commented 3 weeks ago

I regenerated these trees with some modifications to my simulation pipeline - now, no individuals are dropped between simulation steps, although they could still be lost through simplification within SLiM. Each tree now has >2 million individuals (before each had around 1000). Both these sample trees still have only 'true' roots (nodes with SLiM ids from 0-39). I have not attached these trees here, but I can if you would like me to.

tree1_roots = []
for ts in tree1.trees():
    tree1_roots.append([tree1.node(x).metadata['slim_id'] for x in ts.roots])
tree2_roots = []
for ts in tree2.trees():
    tree2_roots.append([tree2.node(x).metadata['slim_id'] for x in ts.roots])

"tree1's roots: " + str(tree1_roots);"tree2's roots: " + str(tree2_roots)
>> "tree1's roots: [[6]]"
>> "tree2's roots: [[6], [0]]"

When I merge them, the resulting tree has even more 'false' roots than before.

merged_tree = merge_ts(tree1, tree2)
merged_tree_roots = []
for ts in merged_tree.trees():
    merged_tree_roots.append([merged_tree.node(x).metadata['slim_id'] for x in ts.roots])

"merged tree's roots: " + str(merged_tree_roots)
>> "merged tree's roots: [[63771849, 63771851, 63771845, 63771847, 55585957, 55585961, 55585963, 55585965, 55585967, 55585969, 55585971, 55585973, 55585975, 55585977, 55585979, 55585983, 55585985, 55585987, 55585981, 55585959, 52183151, 52183153, 52183157, 52183155, 47237755, 47237757, 47237759, 47237761, 47237763, 47237765, 47237767, 47237769, 47237771, 47237773, 47237777, 47237775, 44773867, 44773875, 44773877, 44773879, 44773881, 44773871, 44773869, 44773873, 42597301, 42597307, 42597311, 42597315, 42597317, 42597319, 42597321, 42597323, 42597327, 42597329, 42597331, 42597325, 42597305, 42597309, 42597313, 42597303, 40188117, 40188121, 40188123, 40188127, 40188129, 40188131, 40188133, 40188135, 40188137, 40188139, 40188119, 40188125, 38464819, 38464821, 38464823, 38464825, 33998265, 33998267, 33998269, 33998271, 25710419, 25710421, 25710423, 25710425, 25710427, 25710431, 25710433, 25710429, 18361981, 18361983, 18361985, 18361987, 14570945, 14570947, 14570949, 14570951, 14570953, 14570957, 14570943, 14570955, 6675059, 6675061, 6675063, 6675065, 6194335, 6194337, 6194341, 6194339, 1749413, 1749415, 1749417, 1749419, 614751, 614753, 614757, 614755, 6, 7, 14, 15, 16, 17, 20, 21, 28, 29, 32, 33, 38, 39, 4, 5, 10, 11, 26, 27, 30, 31, 24, 25, 0, 1, 2, 3, 8, 9, 12, 13, 18, 19, 22, 23, 34, 35, 36, 37, 63786233, 63786235, 63786237, 63786239, 63786241, 63786243, 63786245, 63786247, 63786251, 63786253, 63786255, 63786259, 63786261, 63786263, 63786249, 63786257, 58012803, 58012805, 58012807, 58012809, 51579297, 51579299, 51579301, 51579303, 51579305, 51579307, 51579311, 51579313, 51579315, 51579317, 51579321, 51579323, 51579325, 51579327, 51579329, 51579333, 51579309, 51579295, 51579331, 51579319, 41740147, 41740149, 41740151, 41740153, 39948301, 39948303, 39948305, 39948307, 38119141, 38119143, 38119145, 38119149, 38119151, 38119153, 38119155, 38119157, 38119159, 38119161, 38119163, 38119147, 37487163, 37487165, 37487167, 37487171, 37487175, 37487177, 37487173, 37487169, 34146291, 34146293, 34146297, 34146299, 34146301, 34146303, 34146305, 34146307, 34146309, 34146313, 34146315, 34146317, 34146319, 34146321, 34146295, 34146311, 31669367, 31669371, 31669373, 31669369, 28107153, 28107155, 28107157, 28107159, 20113449, 20113451, 20113453, 20113455, 20113459, 20113457, 20113447, 20113445, 13340389, 13340391, 13340393, 13340395, 9695671, 9695675, 9695677, 9695673, 614751, 614753, 614757, 614755, 7, 14, 15, 17, 21, 29, 33, 39, 5, 10, 11, 27, 31, 25, 1, 3, 9, 12, 13, 19, 23, 35, 36, 37], [63771849, 63771851, 63771845, 63771847, 55585957, 55585961, 55585963, 55585965, 55585967, 55585969, 55585971, 55585973, 55585975, 55585977, 55585979, 55585983, 55585985, 55585987, 55585981, 55585959, 52183151, 52183153, 52183157, 52183155, 47237755, 47237757, 47237759, 47237761, 47237763, 47237765, 47237767, 47237769, 47237771, 47237773, 47237777, 47237775, 44773867, 44773875, 44773877, 44773879, 44773881, 44773871, 44773869, 44773873, 42597301, 42597307, 42597311, 42597315, 42597317, 42597319, 42597321, 42597323, 42597327, 42597329, 42597331, 42597325, 42597305, 42597309, 42597313, 42597303, 40188117, 40188121, 40188123, 40188127, 40188129, 40188131, 40188133, 40188135, 40188137, 40188139, 40188119, 40188125, 38464819, 38464821, 38464823, 38464825, 33998265, 33998267, 33998269, 33998271, 25710419, 25710421, 25710423, 25710425, 25710427, 25710431, 25710433, 25710429, 18361981, 18361983, 18361985, 18361987, 14570945, 14570947, 14570949, 14570951, 14570953, 14570957, 14570943, 14570955, 6675059, 6675061, 6675063, 6675065, 6194335, 6194337, 6194341, 6194339, 1749413, 1749415, 1749417, 1749419, 614751, 614753, 614757, 614755, 6, 7, 14, 15, 16, 17, 20, 21, 28, 29, 32, 33, 38, 39, 4, 5, 10, 11, 26, 27, 30, 31, 24, 25, 0, 1, 2, 3, 8, 9, 12, 13, 18, 19, 22, 23, 34, 35, 36, 37, 63786233, 63786235, 63786237, 63786239, 63786241, 63786243, 63786245, 63786247, 63786251, 63786253, 63786255, 63786259, 63786261, 63786263, 63786249, 63786257, 58012803, 58012805, 58012807, 58012809, 51579297, 51579299, 51579301, 51579303, 51579305, 51579307, 51579311, 51579313, 51579315, 51579317, 51579321, 51579323, 51579325, 51579327, 51579329, 51579333, 51579309, 51579295, 51579331, 51579319, 41740147, 41740149, 41740151, 41740153, 39948301, 39948303, 39948305, 39948307, 38119141, 38119143, 38119145, 38119149, 38119151, 38119153, 38119155, 38119157, 38119159, 38119161, 38119163, 38119147, 37487163, 37487165, 37487167, 37487171, 37487175, 37487177, 37487173, 37487169, 34146291, 34146293, 34146297, 34146299, 34146301, 34146303, 34146305, 34146307, 34146309, 34146313, 34146315, 34146317, 34146319, 34146321, 34146295, 34146311, 31669367, 31669371, 31669373, 31669369, 28107153, 28107155, 28107157, 28107159, 20113449, 20113451, 20113453, 20113455, 20113459, 20113457, 20113447, 20113445, 13340389, 13340391, 13340393, 13340395, 9695671, 9695675, 9695677, 9695673, 614751, 614753, 614757, 614755, 7, 14, 15, 17, 21, 29, 33, 39, 5, 10, 11, 27, 31, 25, 1, 3, 9, 12, 13, 19, 23, 35, 36, 37], [63771849, 63771851, 63771845, 63771847, 55585957, 55585961, 55585963, 55585965, 55585967, 55585969, 55585971, 55585973, 55585975, 55585977, 55585979, 55585983, 55585985, 55585987, 55585981, 55585959, 52183151, 52183153, 52183157, 52183155, 47237755, 47237757, 47237759, 47237761, 47237763, 47237765, 47237767, 47237769, 47237771, 47237773, 47237777, 47237775, 44773867, 44773875, 44773877, 44773879, 44773881, 44773871, 44773869, 44773873, 42597301, 42597307, 42597311, 42597315, 42597317, 42597319, 42597321, 42597323, 42597327, 42597329, 42597331, 42597325, 42597305, 42597309, 42597313, 42597303, 40188117, 40188121, 40188123, 40188127, 40188129, 40188131, 40188133, 40188135, 40188137, 40188139, 40188119, 40188125, 38464819, 38464821, 38464823, 38464825, 33998265, 33998267, 33998269, 33998271, 25710419, 25710421, 25710423, 25710425, 25710427, 25710431, 25710433, 25710429, 18361981, 18361983, 18361985, 18361987, 14570945, 14570947, 14570949, 14570951, 14570953, 14570957, 14570943, 14570955, 6675059, 6675061, 6675063, 6675065, 6194335, 6194337, 6194341, 6194339, 1749413, 1749415, 1749417, 1749419, 614751, 614753, 614757, 614755, 6, 7, 14, 15, 16, 17, 20, 21, 28, 29, 32, 33, 38, 39, 4, 5, 10, 11, 26, 27, 30, 31, 24, 25, 0, 1, 2, 3, 8, 9, 12, 13, 18, 19, 22, 23, 34, 35, 36, 37, 63786233, 63786235, 63786237, 63786239, 63786241, 63786243, 63786245, 63786247, 63786251, 63786253, 63786255, 63786259, 63786261, 63786263, 63786249, 63786257, 58012803, 58012805, 58012807, 58012809, 51579297, 51579299, 51579301, 51579303, 51579305, 51579307, 51579311, 51579313, 51579315, 51579317, 51579321, 51579323, 51579325, 51579327, 51579329, 51579333, 51579309, 51579295, 51579331, 51579319, 41740147, 41740149, 41740151, 41740153, 39948301, 39948303, 39948305, 39948307, 38119141, 38119143, 38119145, 38119149, 38119151, 38119153, 38119155, 38119157, 38119159, 38119161, 38119163, 38119147, 37487163, 37487165, 37487167, 37487171, 37487175, 37487177, 37487173, 37487169, 34146291, 34146293, 34146297, 34146299, 34146301, 34146303, 34146305, 34146307, 34146309, 34146313, 34146315, 34146317, 34146319, 34146321, 34146295, 34146311, 31669367, 31669371, 31669373, 31669369, 28107153, 28107155, 28107157, 28107159, 20113449, 20113451, 20113453, 20113455, 20113459, 20113457, 20113447, 20113445, 13340389, 13340391, 13340393, 13340395, 9695671, 9695675, 9695677, 9695673, 614751, 614753, 614757, 614755, 7, 14, 15, 17, 21, 29, 33, 39, 5, 10, 11, 27, 31, 25, 1, 3, 9, 12, 13, 19, 23, 35, 36, 37], [63771849, 63771851, 63771845, 63771847, 55585957, 55585961, 55585963, 55585965, 55585967, 55585969, 55585971, 55585973, 55585975, 55585977, 55585979, 55585983, 55585985, 55585987, 55585981, 55585959, 52183151, 52183153, 52183157, 52183155, 47237755, 47237757, 47237759, 47237761, 47237763, 47237765, 47237767, 47237769, 47237771, 47237773, 47237777, 47237775, 44773867, 44773875, 44773877, 44773879, 44773881, 44773871, 44773869, 44773873, 42597301, 42597307, 42597311, 42597315, 42597317, 42597319, 42597321, 42597323, 42597327, 42597329, 42597331, 42597325, 42597305, 42597309, 42597313, 42597303, 40188117, 40188121, 40188123, 40188127, 40188129, 40188131, 40188133, 40188135, 40188137, 40188139, 40188119, 40188125, 38464819, 38464821, 38464823, 38464825, 33998265, 33998267, 33998269, 33998271, 25710419, 25710421, 25710423, 25710425, 25710427, 25710431, 25710433, 25710429, 18361981, 18361983, 18361985, 18361987, 14570945, 14570947, 14570949, 14570951, 14570953, 14570957, 14570943, 14570955, 6675059, 6675061, 6675063, 6675065, 6194335, 6194337, 6194341, 6194339, 1749413, 1749415, 1749417, 1749419, 614751, 614753, 614757, 614755, 6, 7, 14, 15, 16, 17, 20, 21, 28, 29, 32, 33, 38, 39, 4, 5, 10, 11, 26, 27, 30, 31, 24, 25, 0, 1, 2, 3, 8, 9, 12, 13, 18, 19, 22, 23, 34, 35, 36, 37, 63786233, 63786235, 63786237, 63786239, 63786241, 63786243, 63786245, 63786247, 63786251, 63786253, 63786255, 63786259, 63786261, 63786263, 63786249, 63786257, 58012803, 58012805, 58012807, 58012809, 51579297, 51579299, 51579301, 51579303, 51579305, 51579307, 51579311, 51579313, 51579315, 51579317, 51579321, 51579323, 51579325, 51579327, 51579329, 51579333, 51579309, 51579295, 51579331, 51579319, 41740147, 41740149, 41740151, 41740153, 39948301, 39948303, 39948305, 39948307, 38119141, 38119143, 38119145, 38119149, 38119151, 38119153, 38119155, 38119157, 38119159, 38119161, 38119163, 38119147, 37487163, 37487165, 37487167, 37487171, 37487175, 37487177, 37487173, 37487169, 34146291, 34146293, 34146297, 34146299, 34146301, 34146303, 34146305, 34146307, 34146309, 34146313, 34146315, 34146317, 34146319, 34146321, 34146295, 34146311, 31669367, 31669371, 31669373, 31669369, 28107153, 28107155, 28107157, 28107159, 20113449, 20113451, 20113453, 20113455, 20113459, 20113457, 20113447, 20113445, 13340389, 13340391, 13340393, 13340395, 9695671, 9695675, 9695677, 9695673, 614751, 614753, 614757, 614755, 7, 14, 15, 17, 21, 29, 33, 39, 5, 10, 11, 27, 31, 25, 1, 3, 9, 12, 13, 19, 23, 35, 36, 37]]"

I still might not be remembering all the individuals I need to, but I am still a bit surprised to see that keeping more information around throughout my sim actually makes the problem worse...

I tried a couple other things, too. I noticed before that when I switch the order of the trees in the merge I end up with a different subtree being cut off before it reaches its true root. But I thought that if I could merge these two reciprocal merges to each other, I might end up with what I want. However this doesn't work - I get a TSK_ERR_BAD_EDGES_CONTRADICTORY_CHILDREN, and doing a .sort() or .build_index() on the merged tree's table collection doesn't fix it. I would guess that this is because of the edge being dropped by the first merge steps.

# try merging reciprocal merges
merged_tree_0 = merge_ts(tree1, tree2)
merged_tree_1 = merge_ts(tree2, tree1)
draw_tree(merged_tree_0, input_dir + "merged_tree_0.txt")
draw_tree(merged_tree_1, input_dir + "merged_tree_1.txt")
meta_merge_ts = merge_ts(merged_tree_1, merged_tree_0) # this doesn't work

>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>>   File "<stdin>", line 28, in merge_ts
>>   File "/Users/shyamag/miniconda3/envs/slim-env/lib/python3.9/site-packages/tskit/trees.py", line 7020, in union
>>     tables.union(
>>   File "/Users/shyamag/miniconda3/envs/slim-env/lib/python3.9/site-packages/tskit/tables.py", line 4119, in union
>>     self._ll_tables.union(
>> _tskit.LibraryError: Bad edges: contradictory children for a given parent over an interval, or indexes need to be rebuilt. 
(TSK_ERR_BAD_EDGES_CONTRADICTORY_CHILDREN)

I also tried simplifying my trees, as before, and found I could generate the expected tree if I kept only a few samples (although it does depend on which samples). Again, this only works if I leave filter_nodes as True.

merged_tree_0_sim = merged_tree_0.simplify(samples=[x.id for x in merged_tree_0.nodes() if x.metadata['slim_id'] in [64168318, 64226720, 64203428, 63940526]], keep_input_roots=True, keep_unary=True, filter_nodes=True)
merged_tree_1_sim = merged_tree_1.simplify(samples=[x.id for x in merged_tree_1.nodes() if x.metadata['slim_id'] in [64168318, 64226720, 64203428, 63940526]], keep_input_roots=True, keep_unary=True, filter_nodes=True)
meta_merge_ts = merge_ts(merged_tree_0_sim, merged_tree_1_sim) # this works!
draw_tree(meta_merge_ts, input_dir + "meta_merge_ts.txt") # this is what I would expect it to look like

I will keep investigating this, but wanted to provide an update in case it is helpful.

petrelharp commented 3 weeks ago

I meant "what if you keep all individuals at all times" - I know this is totally infeasible, but I'm just trying to eliminate possible issues. But, never mind that.

I'm looking at your example from this comment. There, an issue is that in the merged tree there is an odd, new root on the second interval, node 619416. What's up with that? Well, here's some code I used to investigate:

slim_map1 = SLiM_to_node_dict(tree1)
slim_map2 = SLiM_to_node_dict(tree2)

def get_context(ts, node):
    print("--> Context for: ",  ts.node(node))
    for t in ts.trees():
        print("-----", t.interval)
        if node not in t.nodes():
            print(" not in tree.")
        else:
            p = t.parent(node)
            if p == tskit.NULL: # -1
                print("parent: None")
            else:
                print("parent: ", ts.node(p))
            print("  children:")
            for c in t.children(node):
                print("     ", ts.node(c))

focal_node = 619416
get_context(tree1, slim_map1[focal_node]) 
get_context(tree2, slim_map2[focal_node]) 

This produces:

--> Context for:  Node(id=1125, flags=0, time=993.0, population=0, individual=441, metadata={'slim_id': 619416, 'is_null': False, 'genome_type': 0})
----- Interval(left=0.0, right=35389.0)
parent:  Node(id=1127, flags=0, time=995.0, population=0, individual=439, metadata={'slim_id': 614754, 'is_null': False, 'genome_type': 0})
  children:
      Node(id=1123, flags=0, time=960.0, population=0, individual=454, metadata={'slim_id': 630168, 'is_null': False, 'genome_type': 0})
>>> get_context(tree2, slim_map2[focal_node]) 
--> Context for:  Node(id=1003, flags=0, time=993.0, population=0, individual=424, metadata={'slim_id': 619416, 'is_null': False, 'genome_type': 0})
----- Interval(left=0.0, right=1.0)
parent:  Node(id=1005, flags=0, time=995.0, population=0, individual=422, metadata={'slim_id': 614754, 'is_null': False, 'genome_type': 0})
  children:
      Node(id=1000, flags=0, time=961.0, population=0, individual=436, metadata={'slim_id': 630022, 'is_null': False, 'genome_type': 0})
----- Interval(left=1.0, right=35389.0)
 not in tree.

Okay, so the node is present in both trees, but only on the first interval of tree2. And it seems like the node is the same in both trees - it has the same (SLiM-ID'ed) parent. I pulled out your code to make the node_map, and this tells me that

node_map[slim_map1[focal_node]]

is 1003, as expected.

Okay, so what's the situation here? (I'm talking out loud.) This is a node that

  1. is in both trees but
  2. in tree2 is only there over the interval 0 to 1.0, having an edge above it to SLiM ID 614754 and below it to SLiM ID 630022
  3. in tree1 is over the whole interval 0 to 35389.0, having an edge above it to SLiM ID 614754 and below it to SLiM ID 630168

Furthermore, that parental node is correctly mapped between trees:

node_map[slim_map1[614754]]

is indeed 1005, as it should be. However, the child node from tree1 had SLiM ID 630168; and this one is not mapped:

node_map[slim_map1[630168]]

... as expected, this is -1.

Now, turning to help(tree2.union): which edges are added to the result? These are:

    2. Edges whose parent or child are new to ``self``.

(Note: I'm not even trying to be pedantic here, I'm just talking through to myself what's going on. =) Okay - so, the result is As Expected: tree2.union(tree1, ...) is not supposed to add new information between two nodes that already in tree2, so it does not mess with the relationships beteween 619416 and its parent, since both are in both trees. However, there is a node in tree1 over the entire interval that is below 619416; so as a result 619416 is now found in the entire merged tree. Note that this discrepancy is something contributing to the error we get if we check shared equality.

Now, it would be totally reasonable in this case to do what we want, i.e., to add the new information (here, to extend the edge above 619416), since that doesn't contradict anything in tree2. However, that's now how we wrote union(). Getting it to work properly was a pain already, but it'd be pretty neat to get it to do so.

I guess the real question is how to avoid this sort of situation. One way to avoid it for-sure-I-think is that if every time a tree sequence is written out, everyone alive at the time is Remembered. I totally forget what exactly you're doing and if that is feasible, however. It seems like if you don't do that, you'll run into these situations.

Another question would be: what would it take to union these tree sequences in the way that you want here. That's certainly do-able, but doing it efficiently would be a fair bit of work, I think (but I haven't thought through it).

(Side note: it's maybe a good idea to modify your code to assert that IDs are unique:

def SLiM_to_node_dict(tree_seq):
    id_map = {}
    for i in range(tree_seq.num_nodes):
        slim_id = tree_seq.node(i).metadata['slim_id']
        assert slim_id not in id_map
        id_map[slim_id] = i
    return id_map

since otherwise the code silently ignores this fact. If this is in fact expected, maybe this needs to return lists instead of ints? There are a few SLiM IDs in the merged sequence that occur in multiple nodes, but not in the originals.)

ShyamieG commented 2 weeks ago

Thank you so much for walking me through your thought process! This is super helpful. And thanks for your last note, I will have to re-think how I account for multiple nodes having the same SLiM id.

So it seems like the two options are 1) modify my simulations to remember individuals that aren't currently being remembered or 2) add an additional step that checks for these cases and manually adds the information I want to the tree post hoc.

Because of how union() has been designed, I am leaning towards option 2. Because of the way my simulations are set up, I'm not sure that I'd ever be able to properly implement option 1. I did try a version where I keep everyone that is alive at a branch point, but that didn't fix the issue.

I will play around with manually adding edges that are present in the initial tree but missing from the merged tree and let you know how I get on with that.

petrelharp commented 2 weeks ago

I'm not sure that I'd ever be able to properly implement option 1. I did try a version where I keep everyone that is alive at a branch point, but that didn't fix the issue.

Hm - my strong inclination is to try (1) first, since it really ought to Just Work. I am probably forgetting something when I say that, but unless you know a concrete reason why it should not work, then I think we should give it a try. The only things I can think of right now that would get in the way are (a) SLiM metadata might not agree (but that can be fixed); and (b) remembering more individuals might cause performance issues. But, it's going to require a lot less work getting (1) working to assess whether performance will be an issue than it will be to get (2) working.

petrelharp commented 1 week ago

Update from off github: (1) does indeed work, but it also makes things a lot slower than a version that only kept what we really needed would be. However, that'd require a more sophisticated version of .union( ). Brainstorming continues.

Anyhow - I think we can close this issue, @ShyamieG? And we can open an diifferent issue if needed? Re-open this if I'm missing something, though.

ShyamieG commented 1 week ago

Yes, agreed! Remembering everyone + not simplifying the trees at all before merging is the solution.

petrelharp commented 6 days ago

(for clarity: remembering everyone who is alive at save points, not everyone alive ever)