tskit-dev / tskit

Population-scale genomics
MIT License
153 stars 72 forks source link

Making find_IBD more efficient by pruning ancestral segments #1636

Open gtsambos opened 3 years ago

gtsambos commented 3 years ago

find_ibd relies on creating 'master' lists of segments underneath ancestral node. However, these lists can take up a lot of memory, especially if they are high up in the tree.

However, it isn't necessary to keep all of these lists. Once we have processed all of the edges where some node u is a child, we will not need to refer to the list of segments underneath u any more, and can prune these segments from memory.

This could be done by keeping a list that counts the number of relevant edges for each ancestral node, and calls a 'prune' function when that number is decremented to 0 for any node.

jeromekelleher commented 3 years ago

This isn't as straightforward as it might seem because we're using the tsk_blkalloc allocator to allocate segments, and it doesn't support freeing. We would either have to adapt the "heap" allocator we use in msprime (which supports freeing) or do the malloc/freeing manually segment-by-segment. Both options have downsides, so we should probably get some idea of how much memory this would save in typical cases first before embarking on this. Probably best to get the IbdResult class fully nailed down before doing this (#1639)