tskit-dev / tskit

Population-scale genomics
MIT License
155 stars 73 forks source link

Allow Tree.mrca() to have no arguments #2610

Open hyanwong opened 2 years ago

hyanwong commented 2 years ago

A number of times now (particularly for tsinferred trees), I have wanted to get the tMRCA of all the sample nodes. This is not equivalent to the root in tsinferred trees, as there can be unary nodes above the root.

I could get that by doing Tree.mrca(*ts.samples()), but that's going to be very inefficient. Nearly all the time it will be easier to walk down from the root until you find the first node with 2 children. It strikes me that this operation would be very reasonably described by Tree.mrca() (with no arguments), meaning the MRCA in the whole tree.

I suggest that we should allow no arguments to Tree.mrca() and state that this uses a different algorithm to find the mrca of all the samples. We would still error out if there is only one argument, though.

benjeffery commented 2 years ago

Nearly all the time it will be easier to walk down from the root until you find the first node with 2 children.

How do you know that the first bifurcation splits the sample nodes?

hyanwong commented 2 years ago

Yes, you are right, you need to descend from the root, following the child at each point that has tree.num_samples(child_id) == ts.num_samples until you find a node whose child has < ts.num_samples, and that's the MRCA.

jeromekelleher commented 2 years ago

This seems reasonable. We should throw an error if the tree has more than one root, but doing something like

u = tree.root
n = tree.tree_sequence.num_samples
while tree.num_samples(u) == n:
      for v in tree.children(u):
             if tree.num_samples(v) == n:
                    u = v
                    break
return u

Or something along those lines. We need to be careful in corner cases here like a stick tree though as we'd end up in an infinite loop with this probably.

hyanwong commented 2 years ago

Or something along those lines. We need to be careful in corner cases here like a stick tree though as we'd end up in an infinite loop with this probably.

Yes, there are some odd edge cases, like a stick tree with 2 internal samples etc. But I think the principle is sound, and the meaning is very clear. Thanks

jeromekelleher commented 2 years ago

OK, just to check the semantics here:

tree.mrca() == tree.mrca(*ts.samples()) 
# Implying ->
tree.mrca(*[]) == tree.mrca(*ts.samples())
tree.mrca(u) == u # We should implement this case also so any tree.mrca(*x) is valid
# So, we could do something like:
samples = ts.samples()
for j in range(ts.num_samples()):
    subset_mrca = tree.mrca(*samples[:j])

It's slightly weird semantically that we go like this

tree.mrca(*[]) == all_sample_mrca
tree.mrca(*[1]) == 1

It would probably make more sense in this "give me the mrca of this list of nodes" interpretation to have the empty list map to -1.

I think it would be better actually to have a keyword only argument to do this, so we'd have something like

def mrca(*args, all_samples=None):
     if len(args) > 0 and all_samples is not None:
          raise ValueError(...)
     if all_samples:
         # do the thing above
     if len(args) == 1:
          return args[1] # but checking it's a node I guess
     else:
          # do what we do now

Which is a bit convoluted.

Perhaps it would be better to add a separate method all_samples_mrca which just returns the value we want and sidesteps these tricky semantics issues?

benjeffery commented 2 years ago

Just throwing in my two cents here - I think tree.mrca(*[]) should be -1. I think this because it being all_sample_mrca is a bit of a hidden feature and could be a subtle bug. How about we detect tree.mrca(*ts.samples()) then special case it to fast branch? Adding it to the docs for that method as an example?

jeromekelleher commented 2 years ago

Just throwing in my two cents here - I think tree.mrca(*[]) should be -1. I think this because it being all_sample_mrca is a bit of a hidden feature and could be a subtle bug.

:thumbsup:

How about we detect tree.mrca(*ts.samples()) then special case it to fast branch? Adding it to the docs for that method as an example?

That would cost O(n) time though, which is much more expensive than the operation itself.

jeromekelleher commented 2 years ago

Having played with mrcas a bit over the last few days I'm now strongly -1 on making Tree.mrca(*[]) == all samples mrca.

Let's just add a function called Tree.all_samples_mrca(). We could make it a property, but probably a function is more appropriate as it entails some computation.