mwaskom / seaborn

Statistical data visualization in Python
https://seaborn.pydata.org
BSD 3-Clause "New" or "Revised" License
12.5k stars 1.92k forks source link

incomplete dendrograms in clustermap #2591

Closed tsp-kucbd closed 2 years ago

tsp-kucbd commented 3 years ago

With some dataframes we encounter strange behaviours of the clustermap, where a dendrogram on one axis does not align with the heat map cells as it has too few branches.

An example:

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'A': {0: 0.5, 1: 0.0, 2: 1.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0, 10: 0.0, 11: 0.5, 12: 0.0, 13: 0.5, 14: 0.0, 15: 0.0, 16: 0.0, 17: 0.0}, 'B': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 1.0, 5: 1.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.5, 10: 0.0, 11: 1.0, 12: 0.0, 13: 0.0, 14: 0.0, 15: 0.0, 16: 0.0, 17: 0.0}, 'C': {0: 1.0, 1: 0.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0, 6: 1.0, 7: 0.5, 8: 0.5, 9: 0.0, 10: 1.0, 11: 0.0, 12: 0.0, 13: 0.0, 14: 0.0, 15: 0.5, 16: 0.5, 17: 0.0}, 'D': {0: 0.0,1: 0.5, 2: 1.0, 3: 0.5, 4: 0.5, 5: 0.5, 6: 0.0, 7: 0.5, 8: 0.0, 9: 0.0, 10: 1.0, 11: 1.0, 12: 1.0, 13: 1.0, 14: 0.5, 15: 0.0, 16: 0.0, 17: 1.0}})

sns.clustermap(df, cmap='Greens', col_cluster = False, linewidth=0.5, linecolor='red', tree_kws={'colors':'blue'})
plt.show()

In the resulting figure one can see that the dendogram leaves (blue) are not aligned with the heatmap cells (red grid), and that the dendrogram has only 13 leaves whereas the dataframe has 18 rows. Any ideas why?

image

Versions Seaborn 0.11.1 Scipy 1.6.3 Pandas 1.2.4

mwaskom commented 3 years ago

It seems that the repeat rows are getting dropped from the dendrogram. I don't know whether that is expected behavior from scipy and something seaborn needs to account for, or an upstream bug.

tsp-kucbd commented 3 years ago

Interesting ... A temporary - quick and very dirty - fix which works for my cases, is to make slight changes to the rows to prevent duplicate droppings (dependent of the min values in the dataframe) df = (df + np.random.randint(1,10,size=df.shape)/100) if df.duplicated().any() else df

mwaskom commented 2 years ago

OK I have an answer to this. The duplicate rows have a distance of 0 from each other, so the connection between them is drawn at a height of 0 (or equivalently no spacing from the right edge of the plot). With thin lines they are hidden, but you can see the horizontal connection if you thicken the dendrogram lines (using tree_kws):

image