Closed richardreeve closed 7 years ago
Sorry! I believe the legacy version of the package was fairly feature complete, and without bugs, and you are certainly welcome to branch it.
The reasoning for this move was because the motivation for this package was to build towards us having phylogenetic tree inference tools (using MCMC) in julia. It is inefficient for us to duplicate node and/or branch data for each permutation/iteration of a Tree. For myself, I'm dealing with tens of millions of iterations often, which, with the old implementation results in me copying the sequence data for each. What I've found to be a better approach is to create a Dict{Int64, N}
with node data N
. The Int64
keys would match the node indices in the Tree. Does that make sense? You can see a simple example here: https://github.com/jangevaare/PhyloModels.jl
Okay, I see where you're going with that, but if your simulations have the same tips (presumably if you're inferring tree structure?), then couldn't this be efficiently folded into PhyloTrees to provide a coherent interface to everything rather than separating it out?
A tree copy constructor could just pass the same Dict{Int64, N}
(or whatever) reference between all of the trees without copying it...
I'm not necessarily suggesting you do this, but as I offered before, I'm happy to have a go if you might be interested? Otherwise I'll look at other options, but this is the most developed lightweight tree implementation, so I'm happy to have a play if it would be welcome.
A tree copy constructor could just pass the same Dict{Int64, N} (or whatever) reference between all of the trees without copying it...
You are right, a deepcopy
is not necessary for Tree
iterations. Something like you suggest would work, and I think is actually preferable to the old method that was used for parametric trees because it is more straight forward to share node data amongst tree iterations.
A couple things I'm hesitant about:
For your common user, there can be can be surprising consequences of having something like a Dict
shared amongst several Trees. If that Dict
is modified in one Tree
, it is modified in them all. These implications are more obvious if node data is maintained as an independent object.
For those that are not using parametric trees especially, the interface should be kept as simple as possible. I do have a couple ideas on how this might work, we may just have an outer construction method for when node and branch data types are not specified, that creates a Tree{Void, Void}
. A bunch of new functions would have to be created, which would essentially be wrappers for various Dict
utilities. Not a huge deal, but increases the code that has to be maintained here as Julia matures.
There has to be clear workflow/ease of use advantages to having parametric trees vs. maintaining independent Dict
s for node data. Various error checking could be built in, which could be advantageous on one hand (possibly annoying when constructing a Tree though - use warnings instead of errors). Would love to hear more thoughts on how this solution is better to the alternative.
Lets keep bouncing ideas on this.
For your common user, there can be can be surprising consequences of having something like a Dict shared amongst several Trees. If that Dict is modified in one Tree, it is modified in them all. These implications are more obvious if node data is maintained as an independent object.
I agree, but then I would argue copy constructors are very unlikely to be used by common users. You could easily make a constructor that is passed a Dict
do a deep copy, and only a true copy constructor Tree(::Tree)
pass by reference (I have to admit I find the opaque way this is handled in Julia a bit confusing myself coming from C++!). Or even further, you could say that only a specific function would do it - e.g. shallowcopy(::Tree)
or indeed makeempty(::Tree)
to create a new Tree
object without a tree in it (which you would then add your new tree to).
For those that are not using parametric trees especially, the interface should be kept as simple as possible. I do have a couple ideas on how this might work, we may just have an outer construction method for when node and branch data types are not specified, that creates a Tree{Void, Void}. A bunch of new functions would have to be created, which would essentially be wrappers for various Dict utilities. Not a huge deal, but increases the code that has to be maintained here as Julia matures.
I don't see this as a massive issue so long as the code is as simple as possible.
There has to be clear workflow/ease of use advantages to having parametric trees vs. maintaining independent Dicts for node data. Various error checking could be built in, which could be advantageous on one hand (possibly annoying when constructing a Tree though - use warnings instead of errors). Would love to hear more thoughts on how this solution is better to the alternative.
I have to admit I think there's no possible advantage to independent Dicts precisely because of the lack of error checking. It would be easy to have unsafe_xxx()
functions that don't do it if you are critically speed-limited, but by default people should have the error checking built in.
The other possibility is to have an AbstractTree
supertype with an interface which is then shared by Tree
and something else (ParameterisedTree
?).
Anyway, just a first thought...
PS Are all of your Tree
s one or more strict binary trees? I was about to add in some checking for that, and I also wasn't sure why Node
's in
field was a vector? It could easily be a Nullable{[U]Int}
or indeed just a [U]Int
set to NaN
, Inf
or 0
when it had no parent...
PS Are all of your
Tree
s one or more strict binary trees? I was about to add in some checking for that, and I also wasn't sure whyNode
'sin
field was a vector? It could easily be aNullable{[U]Int}
or indeed just a[U]Int
set toNaN
,Inf
or0
when it had no parent...
I deal with strict binary trees, but I did leave the code more general than that. I also wanted to keep things fairly consistent in how connections were represented in Node
s and Branch
es. A zero length Vector{Int64}
works okay for indicating a Node
without a parent. I think I prefer this to the alternatives.
Currently, there is an error if a user tries to add a branch which results in an in degree > 1. e.g. https://github.com/jangevaare/PhyloTrees.jl/blob/master/src/construction.jl#L83-L85
The only problem I was thinking of is efficiency - copying and accessing vectors is less efficient that manipulating integers...
True. I don't think it's such a hit that I'd consider it a problem, but with staying true to the goals of this package to be a lightweight implementation of phylogenetic trees in Julia it might be something we'd want to change. But lets perhaps talk about that in a separate issue!
I'm going to think more on your comments about implementing parametric trees. I'm especially interested right now in whether the best approach is to have separate types for parametric and nonparametric trees. Also, if there is to be parametric trees, Node
/Branch
labelling will have to be rethought as well. That was removed because it was thought that it could be handled the same way as Node
/Branch
data.
@jangevaare - I've reimplemented PhyloTrees in #14 to provide an abstract interface to nodes and trees (and to an extent to branches) that allows your base Tree type to roughly survive with the same API, but our more complex requirements to be satisfied too... see what you think... the only significant difference that I've deliberately introduced to your API is that the xxx!()
functions return the name of the thing (e.g. node, branch) created, rather than the object being altered, which is already known since it's one of the arguments.
You kindly implemented parametric nodes after we (@claireh93 and I) brought it up in #11, but it now seems to have gone away again... is that a permanent loss, or are you planning on bringing them back, since we were using them?