tskit-dev / tskit

Population-scale genomics
MIT License
151 stars 71 forks source link

General segment accumulation for find_ibd #1663

Open jeromekelleher opened 3 years ago

jeromekelleher commented 3 years ago

Assuming that we agree on filtering candidate segments based on the among and between segments within the IBD finding algorithm as proposed in #1662, the interface between the IBD finding and storage becomes very simple and clean. Really, all we need is some class that implements the following interface:

class IbdResult:
    def add_segment(self, left, right, sample_a, sample_b, ancestor, time):
           raise NotImplemented

I've added time here because this seems like a useful thing to provide the result object. Otherwise, we'd have to keep a reference to the tree sequence to find out the ancestor node's time (which we'd definitely want, for computing stats).

In C, we'd do this by supplying a function pointer with a similar interface.

The point of this is that is decouples the definition of finding IBD from how the results are used. We can imagine doing things like accumulating time-binned statistics or just writing out to a file, all of which are independent of the IBD finding method. We just supply it a function pointer (in C), and it'll call this function pointer with any qualifying segments.

I'd imagine this could be useful for computing a range of different statistics (a bit like general_stat)?

Any thoughts @gtsambos , @petrelharp?

I'm not suggesting we change the public interface for find_ibd, which would continue to return an IbdResult object which provides basic IBD summaries and segment lookup - this is something that would happen behind the scenes, possibly enabling a range of other IBD based functionality.

gtsambos commented 3 years ago

Is the idea that this would be used like this?

class IbdResult(self, write_to_file=True, total_segment_count=True):
    self.outfile = something
    self.total_segment_count = 0

    def add_segment(self, left, right, sample_a, sample_b, ancestor, time):
           if write_to_file:
                 outfile.add_row(...)
           if total_segment_count:
                 total_segment_count += 1

res = ts.find_ibd(...)
res.total_segment_count
 # 3
gtsambos commented 3 years ago

I've added time here because this seems like a useful thing to provide the result object. Otherwise, we'd have to keep a reference to the tree sequence to find out the ancestor node's time (which we'd definitely want, for computing stats).

This makes sense. In theory the IBD segments could inherit any properties of the ancestral node (population, flags etc), but since time is the one people will almost certainly want to use, it makes sense to provide this by default.

jeromekelleher commented 3 years ago

Yes, that's the sort of thing we'd be able to support. It's probably better to think of it as a function pointer, so we'd have something like

with open("ibd_segs.txt", "w") as out:
    def dump_segments(left, right, sample_a, sample_b, ancestor):
          print(left, right, sample_a, sample_b, ancestor, file=out)
    ts.find_ibd(segment_func=dump_segments)

This would make it very general.

The example here would be inefficient because the segment_func would be calling back into Python for every segment, but we can imagine building up a set of operations that can be done efficiently in C using the low-level infrastructure.

jeromekelleher commented 3 years ago

I've change my mind about the time argument actually --- in practise we'd probably need the time for sample_a and sample_b as well for this to be generally useful, and it's not that big a deal keeping a reference to the tree sequence to compute these on the fly in the segment_func.

gtsambos commented 3 years ago

in practise we'd probably need the time for sample_a and sample_b as well for this to be generally useful, and it's not that big a deal keeping a reference to the tree sequence to compute these on the fly in the segment_func.

I agree on the second point, but I'm not sure about the first. I think the time of the common ancestor is almost always important, unlike the times of the sample nodes. That's both because (a) usually we're trying to study something interesting that happened in an epoch in which those ancestors existed, and (b) because people will often just be in a situation where all of their samples come from the same time.

jeromekelleher commented 3 years ago

Yeah, that's a fair point. I'd probably aim to keep things minimal though in the interest of clarity/simplicity at the C level, and keeping a reference to tables.nodes.time isn't too hard. Let's see how it looks when we get to implementing it.

petrelharp commented 3 years ago

The point of this is that is decouples the definition of finding IBD from how the results are used.

🤯 Great idea!!

petrelharp commented 3 years ago

I agree about keeping a reference to the tree sequence - it'd be nice to not, but pretty soon we'd come across a use case where we needed something not already in the struct and would need to add more.