Duplicate data points in source data break link identification logic

wangjiawen2013 commented 12 months ago

Hi, when I plot the column dendrogram of a dataframe, I get the following error: LookupError: Link traversing failed - the linkage matrix is likely invalid

Here is my code:

import altair as alt
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as sch
import idendrogram

# Compute x^2 + y^2 across a 2D grid
x, y = np.meshgrid(range(-5, 5), range(-5, 5))
z = x ** 2 + y ** 2

# Convert this grid to columnar data expected by Altair
source = pd.DataFrame({'x': x.ravel(), 'y': y.ravel(), 'z': z.ravel()})

column_linkage_matrix = sch.linkage(source.pivot(columns='y',index='x',values='z'), method='single', metric='euclidean')
column_flat_clusters = sch.fcluster(column_linkage_matrix, t=0, criterion='distance')
column_cl_data = idendrogram.ClusteringData(
    linkage_matrix = column_linkage_matrix,
    cluster_assignments = column_flat_clusters
)
column_idd = idendrogram.idendrogram()
column_idd.set_cluster_info(column_cl_data)

column_dendrogram = column_idd.create_dendrogram(p=20, sort_descending=False).plot(
    backend='altair',
    orientation='top',
    height=30, width=200,
    show_nodes=False).interactive()
column_dendrogram

kamicollo commented 11 months ago

Hi @wangjiawen2013 - thanks for reporting this issue. I can see that it stems from the fact that the dataframe provided for clustering has identical data points:

x, y = np.meshgrid(range(-5, 5), range(-5, 5))
z = x ** 2 + y ** 2
source = pd.DataFrame({'x': x.ravel(),
                       'y': y.ravel(),
                       'z': z.ravel()})
df = source.pivot(columns='y',index='x',values='z')
df.values

This yields the following array as the input for clustering. Note that the rows for x = -k and x = k (k = 1 to 4) are identical:

array([[50, 41, 34, 29, 26, 25, 26, 29, 34, 41],
       [41, 32, 25, 20, 17, 16, 17, 20, 25, 32],
       [34, 25, 18, 13, 10,  9, 10, 13, 18, 25],
       [29, 20, 13,  8,  5,  4,  5,  8, 13, 20],
       [26, 17, 10,  5,  2,  1,  2,  5, 10, 17],
       [25, 16,  9,  4,  1,  0,  1,  4,  9, 16],
       [26, 17, 10,  5,  2,  1,  2,  5, 10, 17],
       [29, 20, 13,  8,  5,  4,  5,  8, 13, 20],
       [34, 25, 18, 13, 10,  9, 10, 13, 18, 25],
       [41, 32, 25, 20, 17, 16, 17, 20, 25, 32]])
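
The duplicates, and the way they surface in the linkage matrix, can be checked with a short script. This is only an illustrative sketch that does not use idendrogram; the names `n_duplicate_rows` and `n_zero_merges` are made up for the example.

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as sch

# Rebuild the pivoted frame from the report
x, y = np.meshgrid(range(-5, 5), range(-5, 5))
z = x ** 2 + y ** 2
source = pd.DataFrame({'x': x.ravel(), 'y': y.ravel(), 'z': z.ravel()})
df = source.pivot(columns='y', index='x', values='z')

# Count rows that repeat an earlier row (x = 1..4 mirror x = -1..-4)
n_duplicate_rows = int(df.duplicated().sum())

# Identical rows are merged at distance zero (third column of the linkage matrix)
linkage_matrix = sch.linkage(df.to_numpy(), method='single', metric='euclidean')
n_zero_merges = int((linkage_matrix[:, 2] == 0).sum())

print(n_duplicate_rows, n_zero_merges)
```

For this grid both counts are 4: one zero-distance merge per duplicated pair of rows.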

I see that the native SciPy dendrogram handles this edge case by displaying these sorts of links as "flat links" with a distance of zero.
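
SciPy's handling can be inspected without drawing anything by passing `no_plot=True` to `scipy.cluster.hierarchy.dendrogram`, which returns the link coordinates as a dictionary. A minimal sketch (the name `flat_links` is only for illustration):

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as sch

x, y = np.meshgrid(range(-5, 5), range(-5, 5))
z = x ** 2 + y ** 2
source = pd.DataFrame({'x': x.ravel(), 'y': y.ravel(), 'z': z.ravel()})
df = source.pivot(columns='y', index='x', values='z')

linkage_matrix = sch.linkage(df.to_numpy(), method='single', metric='euclidean')

# no_plot=True skips drawing and only returns the computed coordinates
result = sch.dendrogram(linkage_matrix, no_plot=True)

# Each entry of 'dcoord' holds the four heights of one U-shaped link;
# a link joining two identical points has all four heights equal to zero
flat_links = [d for d in result['dcoord'] if max(d) == 0]
print(len(flat_links))
```

Each flat link corresponds to one pair of identical rows, so SciPy keeps all ten leaves in the plot.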

idendrogram's logic currently doesn't account for such cases. I'll see what I can do to make it work, but my immediate advice would be to pre-process the data frame to exclude duplicates before clustering, which would avoid this issue, e.g.:

x, y = np.meshgrid(range(-5, 5), range(-5, 5))
z = x ** 2 + y ** 2
# Convert this grid to columnar data expected by Altair
source = pd.DataFrame({'x': x.ravel(),
                       'y': y.ravel(),
                       'z': z.ravel()})

df = source.pivot(columns='y',index='x',values='z')
# Drop duplicate rows before clustering so that no zero-distance links are created
column_linkage_matrix = sch.linkage(df.drop_duplicates(), method='single', metric='euclidean')
column_flat_clusters = sch.fcluster(column_linkage_matrix, t=0, criterion='distance')
column_cl_data = idendrogram.ClusteringData(
    linkage_matrix = column_linkage_matrix,
    cluster_assignments = column_flat_clusters
)
column_idd = idendrogram.idendrogram()
column_idd.set_cluster_info(column_cl_data)

column_dendrogram = column_idd.create_dendrogram(p=20, sort_descending=False).plot(
    backend='altair',
    orientation='top',
    height=30, width=200,
    show_nodes=False).interactive()
column_dendrogram
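
One side effect of dropping duplicates is that the removed rows no longer receive a cluster assignment. If that matters, the assignments computed on the unique rows can be copied back to every original row. The sketch below shows one possible way to do this with plain NumPy and pandas; it is independent of idendrogram, and `lookup` and `all_clusters` are hypothetical names introduced for the example.

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as sch

x, y = np.meshgrid(range(-5, 5), range(-5, 5))
z = x ** 2 + y ** 2
source = pd.DataFrame({'x': x.ravel(), 'y': y.ravel(), 'z': z.ravel()})
df = source.pivot(columns='y', index='x', values='z')

# Cluster only the unique rows, as in the workaround above
unique = df.drop_duplicates()
linkage_matrix = sch.linkage(unique.to_numpy(), method='single', metric='euclidean')
unique_clusters = sch.fcluster(linkage_matrix, t=0, criterion='distance')

# Map each unique row (as a tuple of values) to its cluster id
unique_rows = unique.to_numpy().tolist()
lookup = {tuple(row): int(c) for row, c in zip(unique_rows, unique_clusters)}

# Look up the cluster id for every original row, duplicates included
all_clusters = np.array([lookup[tuple(row)] for row in df.to_numpy().tolist()])
print(pd.Series(all_clusters, index=df.index))
```

With this mapping, rows such as x = -4 and x = 4 end up with the same cluster id, even though only one of them was passed to `sch.linkage`.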

wangjiawen2013 commented 11 months ago

Thanks. The way the SciPy dendrogram handles this case is more reasonable, because sometimes two objects have the same features, such as two students who have the same scores in all subjects (science, mathematics, physics).
