mufernando / graft

0 stars 0 forks source link

check for inconsistencies due to nodes being alive in both branches #5

Open petrelharp opened 4 years ago

petrelharp commented 4 years ago

Check the assumption that all shared nodes have the same mutations. Can use a small nonWF sim to test that this error is thrown.

(site note: to do with with a nonWF sim, should split off a new subPop at the start of each branch to avoid problems like this (or maybe at the end of each branch?).)

mufernando commented 4 years ago

Any idea how to do this efficiently? Do you mean mutations that only appeared at that node or all mutations inherited by a node?

petrelharp commented 4 years ago

One way would be to make for each a table collection empty except for (a) the shared nodes and (b) the mutations appearing on those nodes, set the mutation.parent column to tskit.NULL; sort them, then check for equality?

mufernando commented 4 years ago

I tried doing that but something is off. https://github.com/mufernando/graft/blob/master/graft/__init__.py#L86

Can't figure out where the problem is. But whenever I run this function in our nonWF sims still the mutation tables differ.

petrelharp commented 4 years ago

Hm, all the tests pass at the moment. Could you make a test that does something wrong?

mufernando commented 4 years ago

Ok, I implemented a test to make sure the tree sequences are the same above the shared nodes. I am just simplifying on those nodes and checking that the tables are equal. See here.

Still need to write a test with nonWF to make sure the check would catch it I think.

mufernando commented 4 years ago

I'm trying to trigger this error in

https://github.com/mufernando/graft/blob/master/tests/test_graft.py#L165

What I'm currently doing is checking the mutations BEFORE the split, but here you are worried about non-shared mutations AFTER the split (I think). In sum, my test is not doing what you asked, because by simplifying on the shared nodes I only get stuff before split.

mufernando commented 4 years ago

One way to test what you are saying is that grafting ts2 onto ts1 should be the same as ts1 onto ts2. At least for some attributes (e.g., mutation table)

petrelharp commented 4 years ago

Wouldn't an easy way to check this be to check that their genotypes are the same?

mufernando commented 4 years ago

I guess the issue here is that any nodes that survive longer than split time are not gonna be matched, because the match_nodes function takes the split time as input and only matches nodes at least as old as the split.

petrelharp commented 4 years ago

Well, deciding which nodes to match is a job for pyslim, not tskit. But a node's time is their birth time, so I think that pyslim should match them?

mufernando commented 4 years ago

well, then I don't see how simplifying on the nodes wouldn't catch the non-shared mutations. maybe mutation rate is not high enough?

petrelharp commented 4 years ago

I'm not sure what you mean about simplifying. I think that genotypes of the mapped nodes should be equal at all shared sites, and that all sites in ts2 should be present in tsg? If so, here's code to check it (untested):

variants2 = ts2.variants()
v2 = next(variants2)
for vg in tsg.variants():
    while v2.position < vg.position:
        v2 = next(variants2)
    # all sites in ts2 should also be in tsg:
    self.assertEqual(v2.position, vg.position)
    for n2, ng in nodemap2new.items():
        g2 = v2.genotypes[n2]
        gg = vg.genotypes[ng]
        a2 = v2.site.mutations[g2]
        ag = vg.site.mutations[gg]
        self.assertEqual(a2, ag)
mufernando commented 4 years ago

Sorry, I'll try to be as verbose as possible:

You were worried that if a node is still alive at the split time, then they may accumulate mutations after the split, and we should trigger an error, because we are assuming the nodes experienced the exact same history. I wrote this code that is called within the graft to catch situations like this.

The idea of the check is that if the matched nodes were still alive, they would accumulate different mutations in each branch and so the tree sequences returned by simplifying on those nodes should be different. I think this is does all that checking the genotypes of the nodes (and more).

Now my issue is that I am trying to trigger the error using nonwf sims in SLiM. I am using a simple recipe with overlapping generations that hopefully would trigger the check_shared_nodes. But I cannot get that error thrown, at least not with this recipe and parameters.

The question then is: (a) would the test I wrote really be triggered if nodes were alive in both branches?, and (b) if so, then is the problem in the sims or their parameters? if not, where is my logic failing?

petrelharp commented 4 years ago

Ah, gotcha; that makes sense. And I do agree that is a good check. To make sure the error should be triggered, you could turn up the mutation rate so it's sure to happen (eg 10 individuals, genome length of 10 nucleotides, for 10 generation, mutation rate 0.1)?

mufernando commented 4 years ago

I tried this, and even more bigger genomes/pops, but that error is never triggered.

I'm using this recipe here within this test here.

petrelharp commented 4 years ago

Oh! I'm sorry. In SLiM, mutations only occur at birth, unless you add them manually later. We can test this with msprime, though!