tskit-dev / tskit

Population-scale genomics
MIT License
153 stars 72 forks source link

Add ``store_segments=True`` and ``store_pairs=True`` to the ibd_segments method #1712

Closed jeromekelleher closed 3 years ago

jeromekelleher commented 3 years ago

See #1708 for more background on the ibd_segments method (currently find_ibd)

The idea here is that the IbdSegments gives us access to useful simple summaries of the IBD segments, which may be all we need for some purposes. Then, storing the segments (or indeed, even the pairs) would be unnecessary and we'd save a lot of time an memory by not storing them.

If store_pairs is false, store_segments is also necessarily false. We'd have to be careful about the semantics of the IdbSegments class in these cases, but I think the simplest and most logical thing to do would be to raise an error if someone tries to access information that hasn't been stored.

Then, we could argue about what the defaults should be.

Any thoughts @gtsambos @petrelharp ?

gtsambos commented 3 years ago

This also sounds good to me, though I'll probably have a few more questions about the semantics too. I'm not totally clear on how you could have store_segments=True with store_pairs=False -- don't you need to know the pair to lookup the segments?

jeromekelleher commented 3 years ago

Yes, you're right, store_segments implies store_pairs.

gtsambos commented 3 years ago

Ah sorry, I think I meant this the other way around -- but i can actually see some applications when just the pairs and not the segments might be wanted.

jeromekelleher commented 3 years ago

If store_segments is False and store_pairs is True, then we keep a segment_list for each pair (we find an IBD segment for), but we only update the running totals of num_segments and total_span instead of actually storing the segments themselves. This would save a lot of memory.

petrelharp commented 3 years ago

Sounds good to me! Since it'll be qutie easy to get a very big object, I'd suggest that store_segments default to False, so that people can first run the method to see how many segments they'd get at a given threshold and then turn on storing segments to actually get them.

jeromekelleher commented 3 years ago

Great, that's exactly what I was thinking.

gtsambos commented 3 years ago

Can we guesstimate (or give a sensible lower bond) for the size the object based on the number of pairs and segments? That might be more useful than just the number of segments since I imagine that's the constraint that users will be thinking about.

petrelharp commented 3 years ago

That seems like a very tricky thing to do in general...

jeromekelleher commented 3 years ago

That's easy I think - we know how many segments and pairs there are and we know how much memory each one takes. We could do this in Python as part of the __str__ implementation. Probably best do it as a follow up though.