tskit-dev / tskit

Population-scale genomics
MIT License
153 stars 72 forks source link

Add ``left`` and ``right`` arguments to TreeSequence.trees() #24

Open jeromekelleher opened 6 years ago

jeromekelleher commented 6 years ago

These should be seen as analogous to the start and stop arguments to Python's builtin range, and be specified in tree sequence coordinates. We should initially ensure that start < stop, but in future we could allow start > stop, and enable reverse iteration in this way.

It would be best to do this with a low-level sparse_tree_seek() method to position the tree at the correct location.

jeromekelleher commented 5 years ago

After discussion on #121, I think the best approach here is to add start and stop arguments in sequence coordinates (as above). The trees in the returned sequence would guarantee that the first tree has left == start and the last tree would have right == stop. To do this, we'd need to 'split' the underlying trees so that we guarantee this property (that is, the end points of the trees we return don't necessarily align with the underlying edges). The reason for doing this is that if we're looking for the trees in a particular genomic interval, we almost certainly only want sites also within that interval. If we don't split the trees, we'll have sites arbitrarily far away which would have to be filtered manually, making client code much uglier and error prone.

This isn't hard to do, but will require implementation at the C level.

hyanwong commented 5 years ago

Copied from https://github.com/tskit-dev/tskit/pull/121:

Note that some of the docs use the words [left,right) rather than [start, stop), e.g. https://tskit.readthedocs.io/en/latest/python-api.html#tskit.TreeSequence.edge_diffs . "left" and "right" have the advantage that it's clearer (to me anyway) that we are talking about a length of genome and not e.g. tree indexes, or node times. But it has the disadvantage that it seems weird to specify e.g. (left=20.5, right=10.5) in order to get an iterator that goes right to left. And there's definitely the advantage that start and stop match the range operator.

jeromekelleher commented 5 years ago

Thinking about how to implement this, I think the right way to do it is to add a tsk_tree_set_interval(left, right) method which can shrink the interval for a tree. This would mainly involve updating the list of sites. The tsk_tree_advance method would need to be updated to take this into account, so that if we call next() on a tree that has had the right coordinate moved leftwards, the next tree state is the remainder of the original tree.

The rest of the functionality can be built on this then.

jeromekelleher commented 3 years ago

Adding this to the 1.0 milestone for now as something to review and plan ahead for.

hyanwong commented 1 year ago

To match the variants() iterator, we should use left and right here.

jeromekelleher commented 1 year ago

Also we can implement this quiet efficiently since #2661 was merged