talgalili / dendextend

Extending R's Dendrogram Functionality
Inconsistent attribution of individuals to clusters #120

Open aloboa opened 1 month ago

aloboa commented 1 month ago

Given

library(dendextend)  # provides the cutree() method for dendrograms used below
hc <- hclust(dist(USArrests[c(1, 6, 13, 20, 23), ]), "ave")
dend <- as.dendrogram(hc)
plot(dend)

I think the following difference should be considered as a bug:

a <- cutree(dend, h=50)
b <- cutree(dend, h=50, order_clusters_as_data = FALSE)
table(a)
a
1 2 3 
3 1 1 

table(b)
b
1 2 3 
1 1 3 

One thing is changing the order of the labels in the vector, and another one is changing the cluster to which a given element has been classified. In this example, Minnesota should be in the same cluster in both cases, and the number of individuals within each cluster should be the same.

talgalili commented 1 month ago

Thanks. I'm not likely to address this in the near future. But if you propose a fix, I'd be happy to review it.

Thanks.


aloboa commented 1 month ago

If you do not fix this issue, please clarify the documentation of dendextend::cutree() as soon as possible. It should be:

order_clusters_as_data
logical, defaults to TRUE. There are two ways by which to name and order the clusters: 1) By the order of the original data. 2) By the order of the labels in the dendrogram. TRUE: clusters are named and ordered according to their sequence in the data. FALSE: clusters are named and ordered according to their sequence in the dendrogram.

If you fix the issue, you probably want to create a new function named cutdend(dend), where dend is a dendrogram (e.g. dend <- as.dendrogram(hc)), to avoid confusion with base R.

The documentation would be: order_clusters_as_data
logical, defaults to TRUE. Clusters are always named according to their sequence in the dendrogram. There are two ways by which to order the clusters: 1) By the order of the original data. 2) by the order of the labels in the dendrogram. TRUE: clusters are ordered according to their sequence in the data. FALSE: clusters are ordered according to their sequence in the dendrogram.

In that case, b <- cutdend(dend, order_clusters_as_data = TRUE) would give the same cluster assignments as a <- cutdend(dend, order_clusters_as_data = FALSE), except that a would be reordered.

The fix is very simple; just look at this example:

d <- USArrests[c(1, 6, 13, 20, 23), ]
d
          Murder Assault UrbanPop Rape
Alabama     13.2     236       58 21.2
Colorado     7.9     204       78 38.7
Illinois    10.4     249       83 24.0
Maryland    11.3     300       67 27.8
Minnesota    2.7      72       66 14.9

hc <- hclust(dist(d), "ave")
dend <- as.dendrogram(hc)
a  <-  cutree(dend,h=50,order_clusters_as_data = FALSE)
y <- row.names(d)
x <- names(a)
y
[1] "Alabama"   "Colorado"  "Illinois"  "Maryland"  "Minnesota"

a
Minnesota  Maryland  Colorado   Alabama  Illinois 
        1         2         3         3         3 

a[order(match(x,y))]
  Alabama  Colorado  Illinois  Maryland Minnesota 
        3         3         3         2         1 
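
The reordering above can also be written as plain name indexing. The following is a minimal sketch in base R, assuming the named vector `a` printed above is typed in by hand so that the snippet runs without dendextend; `by_match` and `by_name` are illustrative names only.

```r
# Cluster vector as printed above (dendrogram order), typed in by hand
a <- c(Minnesota = 1, Maryland = 2, Colorado = 3, Alabama = 3, Illinois = 3)
# Row names of the data, in their original order
y <- c("Alabama", "Colorado", "Illinois", "Maryland", "Minnesota")

by_match <- a[order(match(names(a), y))]  # reordering used in this comment
by_name  <- a[y]                          # same reordering via name indexing

identical(by_match, by_name)
```

Both forms only change the order of the elements; the label attached to each state is untouched.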

jefferis commented 1 month ago

@aloboa Sorry to see that things did not behave as you expected, but I am slightly confused by your opening description of this issue.

One thing is changing the order of the labels in the vector, and another one is changing the cluster to which a given element has been classified. In this example, Minnesota should be in the same cluster in both cases, and the number of individuals within each cluster should be the same.

The whole point of the option (order_clusters_as_data = FALSE) is to number the clusters (labelled 1 ... k) in the order they appear in the dendrogram. This means that the numeric labels of the clusters must be different, and therefore that the membership of the cluster identified by a given label in 1 ... k will be different.

Now I agree that for some purposes you might wish to return the integer cluster membership vector for each individual observation ordered by the input data rather than by the dendrogram. But that is a choice, and because this is doing something different from base R, I don't think you can say that one behaviour or another is a bug. I suppose one could add yet another argument asking to return the cluster membership in data order (e.g. order_return_as_data, order_membership_as_data or similar).

In other words, Minnesota should carry a different cluster label in the two cases. But you could discuss the ordering of the return vector.
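
To make the distinction between cluster labels and cluster membership concrete, here is a minimal sketch in base R. It assumes that relabelling clusters by first appearance along `hc$order` imitates the dendrogram-order numbering discussed here; it does not call dendextend, and `ord`, `b` and `xt` are illustrative names.

```r
d  <- USArrests[c(1, 6, 13, 20, 23), ]
hc <- hclust(dist(d), "ave")

# Base R: cluster ids follow first appearance in the data
a <- cutree(hc, h = 50)

# Relabel by first appearance along the dendrogram (left to right).
# This imitates, but is not, dendextend's order_clusters_as_data = FALSE.
ord <- hc$order
b <- setNames(match(a[ord], unique(a[ord])), names(a)[ord])

# Cross-tabulate the two labelings in a common (data) order
xt <- table(data_order = a, dend_order = b[names(a)])
xt
```

If every row and every column of `xt` has exactly one non-zero cell, the two vectors describe the same partition and differ only in which integer names each cluster.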

jefferis commented 1 month ago

Also, although I understand the intent behind your suggestion to change the docs:

clusters are named and ordered according to their sequence in the data.

I don't think it works, because for clusters, naming and ordering are the same thing. What you want is to change the sort order of the returned membership vector, which has one entry per observation.

If you want some ideas for documentation changes that would avoid surprise, take a look at dendroextras::slice (https://rdrr.io/cran/dendroextras/man/slice.html), which does the same thing as cutree(order_clusters_as_data = FALSE). Note also the example

slice(hc,k=5)[order(hc$order)]

which gives the output you want. In that example the group membership vector is not just an ascending or descending run of cluster ids as in your smaller examples, which may help to highlight that observations != clusters.
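
The `[order(hc$order)]` idiom works because `order()` applied to a permutation returns its inverse. A minimal sketch in base R follows, assuming `stats::cutree` as a stand-in for `slice` so that no extra package is needed; `memb`, `in_dend_order` and `restored` are illustrative names.

```r
hc   <- hclust(dist(USArrests), "ave")
memb <- cutree(hc, k = 5)                   # membership in data order

in_dend_order <- memb[hc$order]             # same values, dendrogram order
restored <- in_dend_order[order(hc$order)]  # order() of a permutation is its inverse

identical(restored, memb)
```

The same indexing applies to any vector that is returned in dendrogram order.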

As a side note, I have to say that anyone I have ever tried to teach clustering methods to finds it very strange that clusters in base R are not assigned in the order they appear in the dendrogram.

talgalili commented 1 month ago

Thanks Aloboa. Related to what Gregory wrote, I don't think it's a bug but rather a behaviour which is not documented well enough to avoid all possible confusion.

I'll keep this issue open and take a look at it in the coming weeks (assuming nothing critical would stop me from taking a look).
