tskit-dev / pyslim

Tools for dealing with tree sequences coming to and from SLiM.
MIT License
26 stars 23 forks source link

recapitating merged trees #304

Closed ShyamieG closed 1 year ago

ShyamieG commented 1 year ago

Hi there! I'm trying to implement recapitation in my project, which involves merging tree sequences coming from parallel SLiM simulations. My overall aim is conceptually very similar to what is outlined in this tskit vignette.

I have been able to figure out how to generate a merged tree sequence object (thanks to a lot of help from @jeromekelleher and @petrelharp!), but now I am running into problems with recapitation.

I have created a minimal example derived from my simulation results that reproduces the error I am running into - please see attached for tree sequences and code. Specifically, I get a "ValueError: Duplicate population name: 'p2'" whenever I attempt to recapitate a merged tree.

I have tried re-assigning all my population IDs to be '0' (population IDs are meaningless for my application), but this does not fix the problem. I'm kind of stumped as to where to go from here. Any help would be appreciated!

petrelharp commented 1 year ago

Hi, Shyamalika! Let's see - I haven't got into the details here, but adding the filter_populations=False flag to both calls to .simplify( ) makes this error go away. (I haven't found if your script finishes running, as it started to make my computer run slowly).

For further debugging, if necessary I suggest putting in some calls like for p in ts.populations(): print(p)

to see what's going on with the populations.

-- peter

On Fri, Feb 10, 2023 at 1:27 PM Shyamalika Gopalan @.***> wrote:

Hi there! I'm trying to implement recapitation in my project, which involves merging tree sequences coming from parallel SLiM simulations. My overall aim is conceptually very similar to what is outlined in this tskit vignette.

I have been able to figure out how to generate a merged tree sequence object (thanks to a lot of help from @jeromekelleher and @petrelharp!), but now I am running into problems with recapitation.

I have created a minimal example derived from my simulation results that reproduces the error I am running into - please see attached for tree sequences and code. Specifically, I get a "ValueError: Duplicate population name: 'p2'" whenever I attempt to recapitate a merged tree.

I have tried re-assigning all my population IDs to be '0' (population IDs are meaningless for my application), but this does not fix the problem. I'm kind of stumped as to where to go from here. Any help would be appreciated!

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: @.***>

ShyamieG commented 1 year ago

Thanks for looking into this, Peter! Adding the filter_populations=False flag to .simplify() doesn't fix the issue for me. The error still occurs at the 'pyslim.recapitate()' line later on in the script.

Sorry that my code is lagging up your computer. Is it one of the lines in particular that is causing a problem?

petrelharp commented 1 year ago

Darn, okay. Well, I can have a look, but first: could you see if you can make a smaller example (e.g., make the sequence length much shorter, and see if the error still occurs?)? And, did you try doing print( ) of the populatoins, to see if the problem becomes obvious?

ShyamieG commented 1 year ago

Okay, I produced a smaller example and attached it here.

I did try printing the the populations. Nothing obviously wrong there.

petrelharp commented 1 year ago

Thanks. But, hm, that script doesn't run for me, not even after I change it to say input_dir = ".". Could you fix it up so it does run, quickly? A good way to make sure is to put all the files in their own directory - just like I would have them, starting from nothing, and test it yourself. Thanks.

To be clear, the error I get is

$ python3 test.py 
Traceback (most recent call last):
  File "/home/peter/debug/pyslim/shyamieg/test.py", line 13, in <module>
    tree1_slim_ids = np.array([i.metadata['slim_id'] for i in tree1_sim.nodes()])
                                                              ^^^^^^^^^

which I'm sure is an easy thing to fix, but it's really much better if you can send me an already-working script.

ShyamieG commented 1 year ago

Sorry about that! See here - this should work once you change the input_dir

petrelharp commented 1 year ago

Ah, thanks, much better. Let's see:

First, the two tree sequences have the same populations:

>>> for p in tree1.populations():
...   print(p)
... 
Population(id=0, metadata={'name': 'p0', 'slim_id': 0})
Population(id=1, metadata={'name': 'p1', 'slim_id': 1})
Population(id=2, metadata={'name': 'p2', 'slim_id': 2})
Population(id=3, metadata={'name': 'p3', 'slim_id': 3})
Population(id=4, metadata={'name': 'p4', 'slim_id': 4})
Population(id=5, metadata={'name': 'p5', 'slim_id': 5})
Population(id=6, metadata={'name': 'p6', 'slim_id': 6})
Population(id=7, metadata={'name': 'p7', 'slim_id': 7})
Population(id=8, metadata={'name': 'p8', 'slim_id': 8})
Population(id=9, metadata={'name': 'p9', 'slim_id': 9})
>>> for p in tree2.populations():
...   print(p)
... 
Population(id=0, metadata={'name': 'p0', 'slim_id': 0})
Population(id=1, metadata={'name': 'p1', 'slim_id': 1})
Population(id=2, metadata={'name': 'p2', 'slim_id': 2})
Population(id=3, metadata={'name': 'p3', 'slim_id': 3})
Population(id=4, metadata={'name': 'p4', 'slim_id': 4})
Population(id=5, metadata={'name': 'p5', 'slim_id': 5})
Population(id=6, metadata={'name': 'p6', 'slim_id': 6})
Population(id=7, metadata={'name': 'p7', 'slim_id': 7})
Population(id=8, metadata={'name': 'p8', 'slim_id': 8})
Population(id=9, metadata={'name': 'p9', 'slim_id': 9})

Next, after you union and then simplify we get

>>> for p in merged_tree.populations():
...  print(p)
... 
Population(id=0, metadata={'name': 'p0', 'slim_id': 0})
Population(id=1, metadata={'name': 'p2', 'slim_id': 2})
Population(id=2, metadata={'name': 'p2', 'slim_id': 2})

Why'd it do this? Well, it's easier to see if we remove the .simplify():

>>> for p in merged_tree.populations():
...  print(p)
... 
Population(id=0, metadata={'name': 'p0', 'slim_id': 0})
Population(id=1, metadata={'name': 'p1', 'slim_id': 1})
Population(id=2, metadata={'name': 'p2', 'slim_id': 2})
Population(id=3, metadata={'name': 'p3', 'slim_id': 3})
Population(id=4, metadata={'name': 'p4', 'slim_id': 4})
Population(id=5, metadata={'name': 'p5', 'slim_id': 5})
Population(id=6, metadata={'name': 'p6', 'slim_id': 6})
Population(id=7, metadata={'name': 'p7', 'slim_id': 7})
Population(id=8, metadata={'name': 'p8', 'slim_id': 8})
Population(id=9, metadata={'name': 'p9', 'slim_id': 9})
Population(id=10, metadata={'name': 'p2', 'slim_id': 2})
Population(id=11, metadata={'name': 'p0', 'slim_id': 0})
Population(id=12, metadata={'name': 'p7', 'slim_id': 7})

What happened there is that union by default adds new populations for any nodes in the second tree sequence, and copies over the metadata. I guess that only p2 and p0 and p7 had individuals in them in tree1.

Now, you could disable this by passing add_populations=False to union:

merged_tree = tree2.union(tree1, node_map, check_shared_equality=False, add_populations=False)

... which gives a tree sequence that recapitates just fine. However, maybe that's not what you want? Is p2 in tree1 and p2 in tree2 the same population, or not?

petrelharp commented 1 year ago

If it is not what you want, then there's at least two ways to deal with it. One is to arrange your SLiM scripts so that they don't use the same subpop numbers. (So, after you simplify() you won't end up with duplicate IDs.) Another would be to edit the population table to change the names to not coincide.

ShyamieG commented 1 year ago

Great, this fixes my issue! Thank you so much for walking me through it.