knightjdr / hclust

Hierarchical clustering in Go
MIT License
20 stars 6 forks source link

Feature Request: Improved handling of node names #2

Open JonasDoe opened 3 years ago

JonasDoe commented 3 years ago

The Problem

You libraries proofs really useful to create a dendrogram, but at the moment it is not really possible iterate through the tree structure and colleced the leaf names (e.g. ids) by their distance: The dendogram data structure itself doesn't know the names (the labels are only node numbers), and when processed with the Tree() method, I only receive a string (Newick) representation of the tree with names/ids assigned. I would have to parse this string representation, which is error-prone and not really desirable performance-wise.

In more general terms, the use of the generate Tree is quite limited at the moment.

My Proposal

Let the Tree() method return a typed Tree (which doesn not just hold some strings but nodes and leafs (with distance as well as names/ids). The returned Tree type could provide a method String() for the Newick representation, as well as an Order() method and maybe some others.

The advantage of this approach is not limited to my special use case: in general a real tree structure allows you (and all users of the library basically) to provide more and more methods on that very Tree, not just String() and Order(), and could be even abstracted into a visitor pattern (which might become more relevant once Generics finally make it onto Go, which should happen in the nearer future). With the visitor pattern, you'ld unlock huge flexibility without the need add new methods.

If I find time I could try to provide a PR myself, but I'ld like to hear your opinion about that first.

knightjdr commented 3 years ago

Does the Dendrogram type not do what you need? It is a structured representation of the tree. It's an array of SubClusters, each subcluster having a node ID and two descendant leaves (and distances), which can either represent terminal branches or another internal node. If your input matrix has three rows, then leaves 0-2 represent rows 1, 2 and 3 respectively, and IDs > 2 represent parent nodes. The leaves don't have names, as the generating function just takes a matrix, but the assumption is those can be easily assigned from the leaf IDs.

The Tree function, which is a bit poorly named, was designed just to convert the Dendrogram structure into the common Newick format, and were I refactoring this package I would probably implement a method called Newick on the Dendrogram type instead, that might be called likedendrogram.Newick(names).

I may be misunderstanding what you are requesting, so feel free to clarify.

JonasDoe commented 3 years ago

Hey, thanks for your response! My hassle with the dendrogram is that I had problems to match the leafs to my original item names/ids. I understood there was some logic to match those, but I really could not wrap my head around it, with indices of indices being handed over and all that ;). Matching the names/ids to those nodes is what the Tree() method actually did for me, so I ended up more or less with copy-pasting Tree() but build a structure with named nodes from it instead of a Newick string. (That said, after your recent explanation I've got more understanding about how node numbers relate to leaf names. Thanks for that!)

But you're right that Dendrogram is more of a tree than Tree itself, though, and that Dendrogram might be the better place to apply tree-logic (like a visitor pattern or anything else). As I said, it's a bit hard to work with that in its current form, especially without your additional explanations. Maybe an solution would be to establish a Distance type (with the names/ids from column and row plus the distance value) and let Distance() return a [][]Distance matrix instead of just [][]float64. Those infos could be directly used in Dendrogram() (there could be even a method receiver (myDistanceMatrix.Dendogram()) for a type DistanceMatrix = [][]Distance), so the result of the Dendrogram could just provide information about the name/id by itself, and all that index shuffling in Tree() to find out the actual name/id of a node could be avoided. Another advantage would be that one does not mix up in the input array for Distance() with that for, say, Cluster().

That said, it's really just a proposal at this point, if you feel that this is just a particular custom request from a random user which means more work than anything else, feel free to ignore it! I've already built my own workaround for now, and if I feel like improving it I could just rely on your explanation on how to match nodes with their names manually (which, if I understand correctly, is that all node numbers can be used as indices to read the corresponding name from the names array, and that all node numbers exceeding this array aren't leaves anyway).

knightjdr commented 3 years ago

I think I follow now and it sounds like a good idea. Since you have a working solution I won't tackle this at the moment, and maybe there will be some related feedback from others that would be good to consider. So I'll leave this open and come back to it when I've got some time.