Closed gtsambos closed 3 years ago
One idea would be to have a statistics_only argument to find_ibd, which would result in the ibd_result_t struct not storing the actual segments. So, we might have something like
result = ts.find_ibd(statistics_only=True)
print(result.total_segment_length)
print(result.total_segments)
# Lots of other stats available through properties/methods
then, if we tried something like
print(result[(0, 1)]) # Fails with SegmentsNotAvailableError or something
I guess the question then would be: should the default be to compute the segments or not? I.e., should the parameter be store_segments=False by default, so that we only store the segments if explicitly asked to? This is probably the better default, to be honest, because I can easily see people not reading the documentation here, running find_ibd when they are just interested in the summary stats, and then complaining that they ran out of memory.
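To make the trade-off concrete, here is a minimal sketch of a result object that keeps running totals and only stores segments when asked. Everything in it is hypothetical: IbdSummary, add_segment, and SegmentsNotAvailableError are illustrative names, not part of tskit, and the real implementation would live in the C library.

```python
from dataclasses import dataclass, field


class SegmentsNotAvailableError(Exception):
    """Hypothetical error raised when segments were not stored."""


@dataclass
class IbdSummary:
    """Toy stand-in for a statistics-only IBD result (not tskit's API)."""

    store_segments: bool = False
    total_segments: int = 0
    total_segment_length: float = 0.0
    _segments: dict = field(default_factory=dict)

    def add_segment(self, pair, left, right):
        # Update the running statistics as each segment is discovered.
        self.total_segments += 1
        self.total_segment_length += right - left
        # Only pay the memory cost when the caller asked for segments.
        if self.store_segments:
            self._segments.setdefault(pair, []).append((left, right))

    def __getitem__(self, pair):
        if not self.store_segments:
            raise SegmentsNotAvailableError(
                "segments were not stored; rerun with store_segments=True"
            )
        return self._segments[pair]


result = IbdSummary(store_segments=False)
result.add_segment((0, 1), 0.0, 10.0)
result.add_segment((0, 2), 5.0, 20.0)
print(result.total_segments)
print(result.total_segment_length)
```

With store_segments=False the memory footprint stays constant however many segments are found, which is the point of making it the default.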
Presumably there would be some other argument (summary_stats?) that users would supply to indicate which statistics they are interested in, and that they could leave as None if they just wanted to use the method to get the current full output? We could require store_segments=False whenever summary_stats is not None, or something.
Depends on what we end up computing, I think, but there are some basic stats that we should always track.
Some possible stats or functions to consider:
1) Number of IBD segments
2) (Histogram of) IBD length/time distribution
3) IBD coverage/rate with a step size, such as 1kb
4) Chromosome-wise or genome-wise total IBD for each sample pair (return a scipy.sparse or NumPy matrix)
5) (For comparison) percentage overlap of the IBD result with existing IBD results, whether from an external program like hap-ibd, from a tskit tree sequence, or from a tskit IBD result object/file
6) Merge/flatten nearby IBD records (not sure this is needed for tskit find-ibd, but it is recommended by hap-ibd, see the "Removing breaks and gaps in IBD segments" section from here)
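As a rough illustration of item 6, the sketch below merges segments for a single sample pair when the gap between them is at most max_gap. This is a plain gap-based interval merge written for illustration; it is not the hap-ibd procedure, and merge_segments and max_gap are hypothetical names.

```python
def merge_segments(segments, max_gap):
    """Merge (left, right) intervals separated by at most max_gap."""
    merged = []
    for left, right in sorted(segments):
        if merged and left - merged[-1][1] <= max_gap:
            # Extend the previous segment across the gap (or overlap).
            merged[-1][1] = max(merged[-1][1], right)
        else:
            merged.append([left, right])
    return [tuple(seg) for seg in merged]


segments = [(0, 100), (105, 200), (500, 600)]
merged = merge_segments(segments, max_gap=10)
print(merged)
```

The first two segments are joined because they are only 5 units apart, while the third is left alone.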
I don't think we can store enough in the results object to answer all possible queries, so maybe we just need a different method, like
ts.summarise_ibd( ... )
that returns something appropriate. For cases 1-4 in @gbinux's examples, doing
ts.summarise_ibd(
    windows=None,
    time_windows=None,
    length_bins=None,
    pairs=None,  # defaults to all pairs
    count_segments=True,  # alternative would add up total length instead of counting segments
)
and returning a numpy array would do it?
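A minimal sketch of what the genomic-window part of such a summary might compute, assuming segments are available as (a, b, left, right) tuples. The function name summarise_segments and its input format are assumptions for illustration, and nested lists stand in for the NumPy array to keep the example dependency-free; time_windows and length_bins are left out.

```python
def summarise_segments(segments, pairs, windows, count_segments=True):
    """
    Summarise IBD segments per genomic window and sample pair.

    segments: iterable of (a, b, left, right) tuples
    pairs: list of (a, b) sample pairs to report
    windows: increasing list of window breakpoints
    Returns a nested list indexed as out[window][pair].
    """
    pair_index = {pair: j for j, pair in enumerate(pairs)}
    num_windows = len(windows) - 1
    out = [[0.0] * len(pairs) for _ in range(num_windows)]
    for a, b, left, right in segments:
        j = pair_index.get((a, b))
        if j is None:
            continue  # this pair was not requested
        for i in range(num_windows):
            # Length of the overlap between the segment and window i.
            overlap = min(right, windows[i + 1]) - max(left, windows[i])
            if overlap > 0:
                out[i][j] += 1 if count_segments else overlap
    return out


segments = [(0, 1, 0, 15), (0, 1, 18, 20), (0, 2, 5, 10)]
pairs = [(0, 1), (0, 2)]
windows = [0, 10, 20]
counts = summarise_segments(segments, pairs, windows)
lengths = summarise_segments(segments, pairs, windows, count_segments=False)
print(counts)
print(lengths)
```

Note that with count_segments=True a segment spanning several windows is counted once in each window it overlaps; whether that is the desired semantics is one of the things to decide.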
See #1639. In the future, we may wish to expand find_ibd so that summary statistics are calculated and updated internally as the segments are discovered. This would remove the need to store large numbers of IBD segments in memory or write them to disk. Here we can discuss the appropriate way to implement this. @jeromekelleher @petrelharp @gbinux