Open JonasDoe opened 3 years ago
Does the Dendrogram
type not do what you need? It is a structured representation of the tree. It's an array of SubCluster
s, each subcluster having a node ID and two descendant leaves (and distances), which can either represent terminal branches or another internal node. If your input matrix has three rows, then leaves 0-2 represent rows 1, 2 and 3 respectively, and IDs > 2 represent parent nodes. The leaves don't have names, as the generating function just takes a matrix, but the assumption is those can be easily assigned from the leaf IDs.
The Tree
function, which is a bit poorly named, was designed just to convert the Dendrogram
structure into the common Newick format, and were I refactoring this package I would probably implement a method called Newick
on the Dendrogram
type instead, that might be called likedendrogram.Newick(names)
.
I may be misunderstanding what you are requesting, so feel free to clarify.
Hey, thanks for your response! My hassle with the dendrogram is that I had problems to match the leafs to my original item names/ids. I understood there was some logic to match those, but I really could not wrap my head around it, with indices of indices being handed over and all that ;). Matching the names/ids to those nodes is what the Tree()
method actually did for me, so I ended up more or less with copy-pasting Tree()
but build a structure with named nodes from it instead of a Newick string. (That said, after your recent explanation I've got more understanding about how node numbers relate to leaf names. Thanks for that!)
But you're right that Dendrogram
is more of a tree than Tree
itself, though, and that Dendrogram
might be the better place to apply tree-logic (like a visitor pattern or anything else). As I said, it's a bit hard to work with that in its current form, especially without your additional explanations. Maybe an solution would be to establish a Distance
type (with the names/ids from column and row plus the distance value) and let Distance()
return a [][]Distance
matrix instead of just [][]float64
. Those infos could be directly used in Dendrogram()
(there could be even a method receiver (myDistanceMatrix.Dendogram())
for a type DistanceMatrix = [][]Distance
), so the result of the Dendrogram
could just provide information about the name/id by itself, and all that index shuffling in Tree()
to find out the actual name/id of a node could be avoided. Another advantage would be that one does not mix up in the input array for Distance()
with that for, say, Cluster()
.
That said, it's really just a proposal at this point, if you feel that this is just a particular custom request from a random user which means more work than anything else, feel free to ignore it! I've already built my own workaround for now, and if I feel like improving it I could just rely on your explanation on how to match nodes with their names manually (which, if I understand correctly, is that all node numbers can be used as indices to read the corresponding name from the names
array, and that all node numbers exceeding this array aren't leaves anyway).
I think I follow now and it sounds like a good idea. Since you have a working solution I won't tackle this at the moment, and maybe there will be some related feedback from others that would be good to consider. So I'll leave this open and come back to it when I've got some time.
The Problem
You libraries proofs really useful to create a dendrogram, but at the moment it is not really possible iterate through the tree structure and colleced the leaf names (e.g. ids) by their distance: The dendogram data structure itself doesn't know the names (the labels are only node numbers), and when processed with the
Tree()
method, I only receive astring
(Newick) representation of the tree with names/ids assigned. I would have to parse thisstring
representation, which is error-prone and not really desirable performance-wise.In more general terms, the use of the generate Tree is quite limited at the moment.
My Proposal
Let the
Tree()
method return a typedTree
(which doesn not just hold some strings but nodes and leafs (with distance as well as names/ids). The returnedTree
type could provide a methodString()
for the Newick representation, as well as anOrder()
method and maybe some others.The advantage of this approach is not limited to my special use case: in general a real tree structure allows you (and all users of the library basically) to provide more and more methods on that very
Tree
, not justString()
andOrder()
, and could be even abstracted into a visitor pattern (which might become more relevant once Generics finally make it onto Go, which should happen in the nearer future). With the visitor pattern, you'ld unlock huge flexibility without the need add new methods.If I find time I could try to provide a PR myself, but I'ld like to hear your opinion about that first.