tskit-dev / tskit

Population-scale genomics
MIT License
153 stars 72 forks source link

Test failure in test_ibd_finder_samples_are_descendants #1678

Closed jeromekelleher closed 3 years ago

jeromekelleher commented 3 years ago

The C test test_ibd_finder_samples_are_descendants is interesting, as it checks for IBD relationships in the following case:

const char *multi_root_tree_ex_nodes = "1  0   -1  -1\n" /*  4     5 */
                                       "1  0   -1  -1\n" /*  |     | */
                                       "1  1   -1  -1\n" /*  2     3 */
                                       "1  1   -1  -1\n" /*  |     | */
                                       "0  2   -1  -1\n" /*  0     1 */
                                       "0  2   -1  -1\n";

const char *multi_root_tree_ex_edges = "0   1   2   0\n"
                                       "0   1   3   1\n"
                                       "0   1   4   2\n"
                                       "0   1   5   3\n";

Now, we if we specify the set within = {0, 1, 2, 3, 4, 5} we get:

===                                            
IBD Result
===
num_keys = 4
2       (0,2) n=1 total_span=1.000000   (0.000000, 1.000000)->2, 
9       (1,3) n=1 total_span=1.000000   (0.000000, 1.000000)->3, 
16      (2,4) n=1 total_span=1.000000   (0.000000, 1.000000)->4, 
23      (3,5) n=1 total_span=1.000000   (0.000000, 1.000000)->5, 

So, it's only finding in the relationships between samples with parent/child relationships and not also the ones between, e.g., 0 and 4.

I'm not sure this test ever worked btw - some of the loops in the C tests were preventing code from being run in a non-obvious way.

Any thoughts here @gtsambos? There's no point in looking at this in detail until #1660 is merged, but good to get any immediate inputs.

jeromekelleher commented 3 years ago

@gtsambos - I think this is what's hitting us over in #1809. Any quick ideas on what might be the issue here?

gtsambos commented 3 years ago

Hmm. There was one Python IBD test that I was marking as an expected failure bc I couldn't get to the bottom of it at the time, and I think it was this one, but I've just seen that it seems to have been updated with all your recent changes. Let me go back and see if I can find it in the history.

gtsambos commented 3 years ago

Note that the corresponding Python test doesn't actually test for this set of pairs (it only looks for parent-child pairs), so this may well fail in both implementations.

gtsambos commented 3 years ago

My guess is there's a bug to do with when the parent_should_be_added boolean is invoked and modified. It's perhaps a little easier to see what it does in the Python code, here in lines 219-237

This was always perhaps not the best way to implement this edge case -- basically, what it does is add an extra segment to the list of ancestral segments if the parent is also a sample, which allows the calculate_ibd_segs function to process it as usual. Basically, calculate_ibd_segs assumes that the MRCA will be a polytomous node, so by default it won't do anything if there is just one edge underneath (as there is in this example). By adding an extra segment in, we 'trick' it into thinking there are two.

It might be simpler to just call add_segment directly in this loop, because this would avoid some of the extra bookkeeping needed to stop lots of duplicate segments being added into the lists (which is why parent_should_be_added is needed in the first place)

gtsambos commented 3 years ago

I'll be working tonight if you want to have more of a chat about it then @jeromekelleher! I need to do some thesis work now, but can explain in more detail exactly what parts of the code might need to be changed, and to what

jeromekelleher commented 3 years ago

Ah, sounds about right, thanks @gtsambos. I'll see if I can get my head around it. FWIW, internal samples has always been a major pain in these simplify-esque algorithms.