Open wangjiawen2013 opened 12 months ago
Hi @wangjiawen2013 - thanks for reporting this issue. I can see that it stems from the fact that the dataframe provided for clustering has identical data points:
import numpy as np
import pandas as pd

x, y = np.meshgrid(range(-5, 5), range(-5, 5))
z = x ** 2 + y ** 2
source = pd.DataFrame({'x': x.ravel(),
                       'y': y.ravel(),
                       'z': z.ravel()})
df = source.pivot(columns='y', index='x', values='z')
df.values
This yields the following array as the input for clustering. Several rows are identical, for example the 2nd and the 10th (x = -4 and x = 4):
array([[50, 41, 34, 29, 26, 25, 26, 29, 34, 41],
[41, 32, 25, 20, 17, 16, 17, 20, 25, 32],
[34, 25, 18, 13, 10, 9, 10, 13, 18, 25],
[29, 20, 13, 8, 5, 4, 5, 8, 13, 20],
[26, 17, 10, 5, 2, 1, 2, 5, 10, 17],
[25, 16, 9, 4, 1, 0, 1, 4, 9, 16],
[26, 17, 10, 5, 2, 1, 2, 5, 10, 17],
[29, 20, 13, 8, 5, 4, 5, 8, 13, 20],
[34, 25, 18, 13, 10, 9, 10, 13, 18, 25],
[41, 32, 25, 20, 17, 16, 17, 20, 25, 32]])
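As a quick check, the duplicated rows can be listed with pandas. This is only a minimal sketch that rebuilds the same `df` as in the snippet above; the names `dup_mask`, `duplicated_x` and `n_unique` are illustrative and not part of idendrogram.

```python
import numpy as np
import pandas as pd

# Rebuild the same pivoted grid as in the snippet above.
x, y = np.meshgrid(range(-5, 5), range(-5, 5))
z = x ** 2 + y ** 2
source = pd.DataFrame({'x': x.ravel(), 'y': y.ravel(), 'z': z.ravel()})
df = source.pivot(columns='y', index='x', values='z')

# keep=False flags every member of a duplicated group, not only the repeats.
dup_mask = df.duplicated(keep=False)
duplicated_x = df.index[dup_mask].tolist()

# Number of rows that remain once duplicates are dropped.
n_unique = len(df.drop_duplicates())

print(duplicated_x)
print(n_unique)
```

Because z depends only on x squared and y squared, rows for x and -x coincide whenever both are present in the grid, which is why only x = -5 and x = 0 have no twin here.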
I see that the native SciPy dendrogram handles this edge case by displaying these sorts of links as "flat links" with a distance of zero.
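To illustrate what a zero-distance link looks like in SciPy, here is a small sketch on toy data (the three points are made up for illustration and are not from the issue). It only uses `scipy.cluster.hierarchy.linkage` and `dendrogram` with `no_plot=True`, which returns the link coordinates instead of drawing them.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Toy observations (assumed for illustration): rows 0 and 2 are identical.
points = np.array([[0.0, 0.0],
                   [3.0, 4.0],
                   [0.0, 0.0]])

Z = sch.linkage(points, method='single', metric='euclidean')

# Each row of Z is [cluster_a, cluster_b, merge_distance, cluster_size].
zero_merges = Z[Z[:, 2] == 0]
print(zero_merges)

# no_plot=True returns the link coordinates without drawing anything.
ddata = sch.dendrogram(Z, no_plot=True)

# Links whose height never rises above zero are the "flat" ones.
flat_links = [d for d in ddata['dcoord'] if max(d) == 0]
print(flat_links)
```

The identical pair is merged first at distance 0, and the corresponding entry in `dcoord` stays at height zero.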
The idendrogram logic currently doesn't account for such cases. I'll see what I can do to make it work, but my immediate advice would be to pre-process the data frame to exclude duplicate rows before clustering, which avoids this issue, e.g.:
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as sch
import idendrogram

x, y = np.meshgrid(range(-5, 5), range(-5, 5))
z = x ** 2 + y ** 2
# Convert this grid to columnar data expected by Altair
source = pd.DataFrame({'x': x.ravel(),
                       'y': y.ravel(),
                       'z': z.ravel()})
df = source.pivot(columns='y', index='x', values='z')
# Drop duplicate rows so that no two clusters are merged at distance zero
column_linkage_matrix = sch.linkage(df.drop_duplicates(), method='single', metric='euclidean')
column_flat_clusters = sch.fcluster(column_linkage_matrix, t=0, criterion='distance')
column_cl_data = idendrogram.ClusteringData(
    linkage_matrix=column_linkage_matrix,
    cluster_assignments=column_flat_clusters
)
column_idd = idendrogram.idendrogram()
column_idd.set_cluster_info(column_cl_data)
column_dendrogram = column_idd.create_dendrogram(p=20, sort_descending=False).plot(
    backend='altair',
    orientation='top',
    height=30, width=200,
    show_nodes=False).interactive()
column_dendrogram
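One side effect of dropping duplicates is that the cluster labels then cover only the unique rows. If labels for every original row are needed, they can be mapped back by row content. The sketch below is one possible way to do that with plain pandas and SciPy; `label_by_row` and `all_labels` are hypothetical helper names, not part of idendrogram.

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as sch

x, y = np.meshgrid(range(-5, 5), range(-5, 5))
z = x ** 2 + y ** 2
source = pd.DataFrame({'x': x.ravel(), 'y': y.ravel(), 'z': z.ravel()})
df = source.pivot(columns='y', index='x', values='z')

# Cluster only the unique rows, as in the workaround above.
unique_rows = df.drop_duplicates()
Z = sch.linkage(unique_rows.to_numpy(), method='single', metric='euclidean')
labels = sch.fcluster(Z, t=0, criterion='distance')

# Map each unique row (as a tuple of values) to its cluster label.
label_by_row = {tuple(row): int(label)
                for row, label in zip(unique_rows.to_numpy(), labels)}

# Look up the label for every original row, duplicates included.
all_labels = pd.Series([label_by_row[tuple(row)] for row in df.to_numpy()],
                       index=df.index, name='cluster')
print(all_labels)
```

With this mapping, rows that were removed as duplicates simply inherit the label of the row they duplicate.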
Thanks. The way the SciPy dendrogram handles this case is more reasonable, because sometimes two objects have the same features, such as two students with the same scores in all subjects (science, mathematics, physics).
Hi, when I plot the column dendrogram of a dataframe, the following error occurs: LookupError: Link traversing failed - the linkage matrix is likely invalid
Here is my code:
import altair as alt
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as sch
import idendrogram

# Compute x^2 + y^2 across a 2D grid
x, y = np.meshgrid(range(-5, 5), range(-5, 5))
z = x ** 2 + y ** 2

# Convert this grid to columnar data expected by Altair
source = pd.DataFrame({'x': x.ravel(), 'y': y.ravel(), 'z': z.ravel()})

column_linkage_matrix = sch.linkage(source.pivot(columns='y', index='x', values='z'), method='single', metric='euclidean')
column_flat_clusters = sch.fcluster(column_linkage_matrix, t=0, criterion='distance')
column_cl_data = idendrogram.ClusteringData(
    linkage_matrix=column_linkage_matrix,
    cluster_assignments=column_flat_clusters
)
column_idd = idendrogram.idendrogram()
column_idd.set_cluster_info(column_cl_data)

column_dendrogram = column_idd.create_dendrogram(p=20, sort_descending=False).plot(
    backend='altair',
    orientation='top',
    height=30, width=200,
    show_nodes=False).interactive()
column_dendrogram