summary statistics on find_ibd results

tskit-dev / tskit

Population-scale genomics

MIT License

153 stars 72 forks source link

summary statistics on find_ibd results #1645

Closed gtsambos closed 3 years ago

gtsambos commented 3 years ago

See #1639 . In the future, we may wish to expand find_ibd so that summary statistics are calculated and updated internally as the segments are discovered. This would prevent the need for large numbers of IBD segments to be stored in memory or written to disk. Here we can discuss the appropriate way to implement this.

@jeromekelleher @petrelharp @gbinux

jeromekelleher commented 3 years ago

One idea would be to have a statistics_only argument to find_ibd, which would result in the ibd_result_t struct not storing the actual segments. So, we might have something like

result = ts.find_ibd(statistics_only=True)
print(result.total_segment_length)
print(result.total_segments)
# Lots of other stats available through properties/methods

then, if we tried something like

print(result[(0, 1)]) # Fails with SegmentsNotAvailableError or something

I guess the question then would be, should the default be to compute the segments or not? I.e., should the parameter be store_segments=False by default, so that we only store the segments if explicitly asked to? This is probably a better default to be honest, because I can easily see people not reading the documentation here and complaining that they run out of memory when then run find_ibd and are just interested in the summary stats.

gtsambos commented 3 years ago

Presumably there would be some other arguments (summary_stats?) that users would supply to indicate which statistics they are interested in? And that they could leave as None if they just wanted to use the method to get the current full output. We could require store_segments=False whenever summary_stats is not None, or something.

jeromekelleher commented 3 years ago

Depends on what we end up computing I think - but I think there's some basic stats that we should always track.

bguo068 commented 3 years ago

Some possible stats or functions to consider: 1) Number of IBD segments 2) (Histogram of) IBD length/time distribution 3) IBD coverage/rate with step size, such as 1kb 4) Chromosome-wise or genome-wise total IBD for each sample pair (return a scipy.sparse or NumPy matrix) 5) (For comparison) percentage of overlapping of IBD result with existing IBD results either from an external program like hap-ibd, or from tskit tree sequence, or tskit ibd result object/file 6) Merge/flatten nearby IBD records (not sure this is needed for tskit find-ibd, but it is recommended by hap-ibd, see the "Removing breaks and gaps in IBD segments" section from here)

petrelharp commented 3 years ago

I don't think we can store enough in the results object to answer all possible queries, so maybe we just need a different method, like

ts.summarise_ibd( ... )

that returns something appropriate. For cases 1-4 in @gbinux's examples, doing

ts.summarise_ibd(
  windows=None,
  time_windows=None,
  length_bins=None,
  pairs=None,  # defaults to all pairs
  count_segments=True, # alternative would add up total length instead of counting segments
)

and returning a numpy array would do it?