Open hyanwong opened 2 years ago
Nearly all the time it will be easier to walk down from the root until you find the first node with 2 children.
How do you know that the first bifurcation splits the sample nodes?
Yes, you are right, you need to descend from the root, following the child at each point that has tree.num_samples(child_id) == ts.num_samples until you find a node whose child has < ts.num_samples, and that's the MRCA.
This seems reasonable. We should throw an error if the tree has more than one root, but doing something like
u = tree.root
n = tree.tree_sequence.num_samples
while tree.num_samples(u) == n:
for v in tree.children(u):
if tree.num_samples(v) == n:
u = v
break
return u
Or something along those lines. We need to be careful in corner cases here like a stick tree though as we'd end up in an infinite loop with this probably.
Or something along those lines. We need to be careful in corner cases here like a stick tree though as we'd end up in an infinite loop with this probably.
Yes, there are some odd edge cases, like a stick tree with 2 internal samples etc. But I think the principle is sound, and the meaning is very clear. Thanks
OK, just to check the semantics here:
tree.mrca() == tree.mrca(*ts.samples())
# Implying ->
tree.mrca(*[]) == tree.mrca(*ts.samples())
tree.mrca(u) == u # We should implement this case also so any tree.mrca(*x) is valid
# So, we could do something like:
samples = ts.samples()
for j in range(ts.num_samples()):
subset_mrca = tree.mrca(*samples[:j])
It's slightly weird semantically that we go like this
tree.mrca(*[]) == all_sample_mrca
tree.mrca(*[1]) == 1
It would probably make more sense in this "give me the mrca of this list of nodes" interpretation to have the empty list map to -1.
I think it would be better actually to have a keyword only argument to do this, so we'd have something like
def mrca(*args, all_samples=None):
if len(args) > 0 and all_samples is not None:
raise ValueError(...)
if all_samples:
# do the thing above
if len(args) == 1:
return args[1] # but checking it's a node I guess
else:
# do what we do now
Which is a bit convoluted.
Perhaps it would be better to add a separate method all_samples_mrca
which just returns the value we want and sidesteps these tricky semantics issues?
Just throwing in my two cents here - I think tree.mrca(*[])
should be -1
. I think this because it being all_sample_mrca
is a bit of a hidden feature and could be a subtle bug.
How about we detect tree.mrca(*ts.samples())
then special case it to fast branch? Adding it to the docs for that method as an example?
Just throwing in my two cents here - I think tree.mrca(*[]) should be -1. I think this because it being all_sample_mrca is a bit of a hidden feature and could be a subtle bug.
:thumbsup:
How about we detect tree.mrca(*ts.samples()) then special case it to fast branch? Adding it to the docs for that method as an example?
That would cost O(n) time though, which is much more expensive than the operation itself.
Having played with mrcas a bit over the last few days I'm now strongly -1 on making Tree.mrca(*[]) == all samples mrca.
Let's just add a function called Tree.all_samples_mrca()
. We could make it a property, but probably a function is more appropriate as it entails some computation.
A number of times now (particularly for tsinferred trees), I have wanted to get the tMRCA of all the sample nodes. This is not equivalent to the root in tsinferred trees, as there can be unary nodes above the root.
I could get that by doing
Tree.mrca(*ts.samples())
, but that's going to be very inefficient. Nearly all the time it will be easier to walk down from the root until you find the first node with 2 children. It strikes me that this operation would be very reasonably described byTree.mrca()
(with no arguments), meaning the MRCA in the whole tree.I suggest that we should allow no arguments to
Tree.mrca()
and state that this uses a different algorithm to find the mrca of all the samples. We would still error out if there is only one argument, though.