plotly / dash-bio

Open-source bioinformatics components for Dash
https://dash-gallery.plotly.host/Portal/?search=Bioinformatics
MIT License

Clustergram - Request to cluster based on a selected aggregated "groupby" column #645

Open adkinsrs opened 2 years ago

adkinsrs commented 2 years ago

Currently the Dash-Bio Clustergram is restricted to clustering based on all row or column values. There are cases where I would like to sort my data by a chosen metadata category, and then cluster based on the mean value of each group in that category. Right now I have to choose between preserving the sort order without clustering, or clustering by the raw data values and losing the aesthetic grouping that came from pre-sorting the data. Below are two pictures of Dash-Bio Clustergrams (with my own post-processing touches) that show the situation I am trying to convey.

Clustering by individual samples instead of category

[Image: Screen Shot 2021-12-08 at 11 03 43 AM]

Sorted by a category but no clustering

[Image: Screen Shot 2021-12-08 at 11 03 31 AM]

The functionality I am requesting is similar to the dendrogram option for Scanpy's heatmap function (see https://scanpy.readthedocs.io/en/stable/generated/scanpy.pl.heatmap.html).

I thought a potential solution would be to

  1. Groupby the chosen category to get mean values for the data
  2. Run dashbio.Clustergram on this to get the dendrogram traces back
  3. Sort the original data to have the order match the dendrogram traces
  4. And then plug those traces back into dashbio.Clustergram using the sorted non-grouped original data.

But I would be running the "clustergram" tool twice, and since the category groups have uneven counts of members, the traces from step 2 would not line up 1-to-1 with the sorted data and the x/y coords would need to be adjusted.
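As a rough illustration of steps 1 to 3, here is a minimal sketch that calls SciPy directly instead of running dashbio.Clustergram twice. The function name `order_columns_by_group_clustering`, the toy data, and the choice of average linkage with Euclidean distance are all assumptions made for the example, not anything taken from dash-bio. It only computes the new sample order; it does not rescale the dendrogram x/y coordinates to account for uneven group sizes, which is the open problem mentioned above.

```python
import pandas as pd
from scipy.cluster.hierarchy import leaves_list, linkage


def order_columns_by_group_clustering(data, groups, method="average"):
    """Hypothetical helper: order samples by clustering their group means.

    data   : DataFrame, rows = features (e.g. genes), columns = samples
    groups : Series indexed by sample label, giving each sample's category
    """
    # Step 1: mean profile per category (rows = categories, columns = features)
    group_means = data.T.groupby(groups).mean()

    # Step 2: hierarchical clustering on the aggregated profiles only
    link = linkage(group_means.values, method=method, metric="euclidean")
    group_order = list(group_means.index[leaves_list(link)])

    # Step 3: sort samples so each category is contiguous and the categories
    # follow the dendrogram leaf order (mergesort is stable, so the original
    # sample order is preserved within each category)
    rank = {g: i for i, g in enumerate(group_order)}
    sample_rank = groups.loc[data.columns].map(rank)
    ordered = list(sample_rank.sort_values(kind="mergesort").index)
    return group_order, ordered


# Toy data with uneven group sizes: A has 3 samples, B has 1, C has 2
samples = ["s1", "s4", "s5", "s2", "s6", "s3"]
groups = pd.Series(["A", "B", "C", "A", "C", "A"], index=samples)
data = pd.DataFrame(
    [[0.0, 10.0, 1.0, 0.2, 1.1, 0.1],
     [0.1, 9.0, 1.2, 0.0, 0.9, 0.3]],
    index=["gene1", "gene2"],
    columns=samples,
)

group_order, ordered = order_columns_by_group_clustering(data, groups)
sorted_data = data[ordered]  # original (non-aggregated) data in the new order
print(group_order)
print(ordered)
```

Because the linkage is computed on one row per category rather than one per sample, the tree has only as many leaves as there are categories, which should also keep the clustering step small on very large datasets.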

Any thoughts on this enhancement?

adkinsrs commented 2 years ago

I just ran into a dataset with so many samples that SciPy hit a "maximum recursion depth exceeded" error when attempting to cluster them, so being able to optionally cluster by an aggregated category would also alleviate this issue.

nickmelnikov82 commented 2 years ago

Hi @adkinsrs.

The reordering of the data happens not in the Clustergram component itself, but in the Dendrogram class from the plotly.figure_factory module, so we are unable to fix the main problem of this issue within the dash-bio project. We can open an issue about the reordering problem against the original Dendrogram component in figure_factory.

Best wishes, Nick.