Clustergram - Request to cluster based on a selected aggregated "groupby" column

adkinsrs commented 2 years ago

Currently the Dash clustergram is restricted to clustering based on all row or column values. There are cases where I would like to sort my data based on a chosen metadata category, and then cluster based on the mean value of that metadata category. Right now I am forced to choose to preserve sorting without clustering, or cluster by the raw data values and lose the aesthetic grouping that came from pre-sorting the data. Below I have two pictures of Dash-Bio Clustergrams (with my own post-processing touches) that show the situation I am trying to convey.

Clustering by individual samples instead of category

Sorted by a category but no clustering

The functionality I am requesting is similar to the dendrogram option for Scanpy's heatmap function (see https://scanpy.readthedocs.io/en/stable/generated/scanpy.pl.heatmap.html).

I thought a potential solution would be to:

1. Group by the chosen category to get mean values for the data.
2. Run dashbio.Clustergram on the grouped means to get the dendrogram traces back.
3. Sort the original data so that its order matches the dendrogram traces.
4. Plug those traces back into dashbio.Clustergram using the sorted, non-grouped original data.
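
For illustration, the first three steps could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the dash-bio implementation: it calls SciPy's `linkage` directly instead of running dashbio.Clustergram, and the function name `order_by_clustered_groups`, the sample data, and the choice of average linkage with Euclidean distance are all hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, leaves_list


def order_by_clustered_groups(data, groups, method="average"):
    """Cluster category means, then sort samples to follow the leaf order.

    data:   DataFrame with one row per sample (hypothetical layout).
    groups: Series of category labels sharing data's index.
    """
    # Step 1: one mean profile per category.
    group_means = data.groupby(groups).mean()
    # Step 2: hierarchical clustering on the category means only.
    link = linkage(group_means.to_numpy(), method=method, metric="euclidean")
    ordered_groups = list(group_means.index[leaves_list(link)])
    # Step 3: sort samples so categories follow the dendrogram leaf order.
    # A stable sort keeps the original sample order within each category.
    rank = {label: i for i, label in enumerate(ordered_groups)}
    positions = np.argsort(groups.map(rank).to_numpy(), kind="stable")
    return data.iloc[positions], ordered_groups, link


# Hypothetical example: six samples in three categories of uneven size.
data = pd.DataFrame(
    {"gene1": [0.0, 0.1, 10.0, 10.2, 0.3, 5.0],
     "gene2": [0.2, 0.0, 9.8, 10.1, 0.1, 5.1]},
    index=["s0", "s1", "s2", "s3", "s4", "s5"],
)
groups = pd.Series(["A", "A", "B", "B", "A", "C"], index=data.index)
sorted_data, ordered_groups, link = order_by_clustered_groups(data, groups)
```

Because only the category means are clustered, the linkage has one leaf per category rather than one per sample.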

But I would be running the "clustergram" tool twice, and since the category groups have uneven counts of members, the traces from step 2 would not line up 1-to-1 with the sorted data and the x/y coords would need to be adjusted.
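
One possible way to handle the coordinate adjustment is sketched below. It relies on SciPy's `dendrogram` convention of placing leaf i at x = 5 + 10 * i, and stretches those positions so that each leaf sits at the center of its category's block of samples. The helper name `remap_dendrogram_x` is hypothetical, and the target axis (sample j occupying the interval [j, j + 1)) is an assumption; the axis that Clustergram actually uses may need further scaling.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage


def remap_dendrogram_x(icoord, group_sizes_in_leaf_order):
    """Map SciPy dendrogram x-coordinates onto uneven group blocks."""
    sizes = np.asarray(group_sizes_in_leaf_order, dtype=float)
    # SciPy's default leaf positions: 5, 15, 25, ...
    default_x = 5.0 + 10.0 * np.arange(len(sizes))
    # Block k covers [start_k, start_k + size_k); use its center.
    starts = np.concatenate(([0.0], np.cumsum(sizes)[:-1]))
    centers = starts + sizes / 2.0
    x = np.asarray(icoord, dtype=float)
    # Piecewise-linear, monotonic mapping: leaves land exactly on the
    # block centers and links stay connected, though internal legs are
    # not guaranteed to be centered over their children.
    return np.interp(x.ravel(), default_x, centers).reshape(x.shape)


# Hypothetical category means and member counts (in original group order).
means = np.array([[0.0, 0.0], [4.0, 4.0], [10.0, 10.0]])
sizes = [3, 1, 2]
tree = dendrogram(linkage(means, method="average"), no_plot=True)
sizes_in_leaf_order = [sizes[i] for i in tree["leaves"]]
new_icoord = remap_dendrogram_x(tree["icoord"], sizes_in_leaf_order)
```

The y-coordinates (`tree["dcoord"]`) are merge heights and would not need to change under this approach.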

Any thoughts on this enhancement?

adkinsrs commented 2 years ago

I just ran into a dataset with so many samples that SciPy hit a "maximum recursion depth exceeded" error when attempting to cluster them, so being able to optionally cluster by an aggregated category would also alleviate this issue.

nickmelnikov82 commented 2 years ago

Hi @adkinsrs.

The data reordering does not happen in the Clustergram component itself, but in the Dendrogram class from the plotly.figure_factory module, so we are not able to fix the main problem of this issue within the dash-bio project. We can create an issue about the reordering problem for the original Dendrogram component in figure_factory.

Best wishes, Nick.

plotly / dash-bio

Clustergram - Request to cluster based on a selected aggregated "groupby" column #645