kgori / treeCl

Clustering phylogenetic trees with python
MIT License

Working with get_inter_tree_distances() #16

Closed fbemm closed 6 years ago

fbemm commented 6 years ago

I am trying to work with inter-tree distances on a set of variable trees with different leaf/branch overlap patterns. As far as I understand it, TreeCl's distance results contain two kinds of zeros: those for identical trees (e.g., the ones on the diagonal) and those where the calculation "failed". At the moment there is no way of telling the two apart. Is there any way to set the failed ones to -1 or NaN?

kgori commented 6 years ago

Could you provide an example pair of trees for which the calculation fails?

fbemm commented 6 years ago

I misused the word "fail". If there is no overlap between two trees, TreeCl returns 0, right?

fbemm commented 6 years ago

,1_3,1_4,1_5,1_7,
1_3,0.0,0.0,0.0,0.0,
1_4,0.0,0.0,0.0,0.0,
1_5,0.0,0.0,0.0,0.0,
1_7,0.0,0.0,0.0,0.0,

The 0 for 1_3 vs. 1_3 makes sense. The off-diagonal zeros do not, because those four trees share almost no leaves.

Here is some data:

https://drive.google.com/open?id=1ALZCcoYRmhwjbSxuO2xnMnj1U6V59jDR
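To make the "share almost no leaves" observation concrete, here is a minimal sketch that counts shared leaves for every pair of trees. It does not use treeCl at all: the leaf sets, the leaf names, and the `MIN_SHARED` threshold are made up for illustration, and `shared_leaf_counts` is a hypothetical helper, not part of any library.

```python
from itertools import combinations

# Hypothetical leaf sets; in practice these would be read from the tree files.
leaf_sets = {
    "1_3": {"a", "b", "c", "d"},
    "1_4": {"d", "e", "f", "g"},
    "1_5": {"h", "i", "j", "k"},
    "1_7": {"a", "b", "c", "k"},
}


def shared_leaf_counts(leaf_sets):
    """Map each unordered pair of tree names to the number of leaves they share."""
    return {
        (n1, n2): len(leaf_sets[n1] & leaf_sets[n2])
        for n1, n2 in combinations(sorted(leaf_sets), 2)
    }


counts = shared_leaf_counts(leaf_sets)

# Assumed threshold: pairs sharing fewer leaves than this have no meaningful distance.
MIN_SHARED = 3
low_overlap = sorted(pair for pair, n in counts.items() if n < MIN_SHARED)

for pair in low_overlap:
    print(pair, "shares only", counts[pair], "leaves")
```

A 0.0 distance reported for any pair in `low_overlap` is then probably a placeholder rather than a real measurement.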

kgori commented 6 years ago

I've added a parameter, overlap_fail_value, to all tree distance functions. For pairs of trees whose overlap is smaller than min_overlap, no distance is calculated and overlap_fail_value is returned instead. By default overlap_fail_value is zero, but it can be anything, e.g. -1, numpy.NaN, None...

In []: coll.get_inter_tree_distances('rf', overlap_fail_value=np.nan)
Out[]:
          class1_1  class1_2  class1_3  class1_4
class1_1       0.0       NaN       NaN       NaN
class1_2       NaN       0.0       NaN       NaN
class1_3       NaN       NaN       0.0       NaN
class1_4       NaN       NaN       NaN       0.0
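
Once failed pairs are marked with NaN, they can be separated from genuine zeros downstream. The sketch below uses only the standard library and a hand-written toy matrix; the 0.5 entry is a made-up value added for illustration, and `split_pairs` is a hypothetical helper, not part of treeCl. Note that NaN never compares equal to anything, so the check has to use `math.isnan` rather than `==`.

```python
import math

nan = float("nan")
names = ["class1_1", "class1_2", "class1_3", "class1_4"]

# Toy symmetric distance matrix. NaN marks pairs with too little overlap;
# the 0.5 between class1_1 and class1_2 is invented for this example.
matrix = [
    [0.0, 0.5, nan, nan],
    [0.5, 0.0, nan, nan],
    [nan, nan, 0.0, nan],
    [nan, nan, nan, 0.0],
]


def split_pairs(names, matrix):
    """Split the upper triangle into computed distances and failed pairs."""
    computed, failed = {}, []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            d = matrix[i][j]
            if math.isnan(d):  # `d == nan` would always be False
                failed.append((names[i], names[j]))
            else:
                computed[(names[i], names[j])] = d
    return computed, failed


computed, failed = split_pairs(names, matrix)
print("computed:", computed)
print("failed pairs:", len(failed))
```

With NumPy arrays the same test is available as `numpy.isnan`, which returns a boolean mask over the whole matrix.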

fbemm commented 6 years ago

Great! Thanks for implementing that so fast!