add automated cluster detection for Seurat object uploads

alexvpickering commented 11 months ago

Background

Currently, Seurat object upload uses scdata$seurat_clusters as the louvain clusters. It is common to have multiple clusterings of a dataset (e.g. from different resolutions, or from grouping multiple clusters into a single cluster). These additional clusterings of a Seurat object are not available to the user without manually overwriting seurat_clusters and uploading as a separate project. The goal of this ticket is to automatically detect these clusterings and make them available to the user.

Approach

Discover cluster columns by exclusion (may need to adjust):

exclude the samples column and any columns that are consistent with a division of samples into groups (currently auto-detected as sample-level metadata columns for between-group comparisons)
only consider columns that have a reasonable number of distinct values (2 to 1000)
exclude columns that are numeric but non-integer
exclude columns where the most common values are repeated only a few times (< 3)
skip boolean columns
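
The exclusion rules above can be sketched as a single filter over the metadata columns. This is a minimal illustration in Python over a plain dict of columns, not the R code from the PR. The function names, the parameter names, and the reading of "consistent with division of samples into groups" as "takes exactly one value per sample" are all assumptions made for the example, as is applying the `< 3` threshold to the single most common value.

```python
from collections import Counter


def is_sample_level(values, samples):
    """Assumed reading of 'sample-level': every sample maps to exactly one value."""
    seen = {}
    for sample, value in zip(samples, values):
        if seen.setdefault(sample, value) != value:
            return False
    return True


def find_cluster_columns(metadata, sample_col="samples",
                         min_distinct=2, max_distinct=1000, min_top_count=3):
    """Return the column names that survive every exclusion rule (hypothetical helper)."""
    samples = metadata[sample_col]
    candidates = []
    for name, values in metadata.items():
        # Exclude the samples column and columns that only split samples into groups.
        if name == sample_col or is_sample_level(values, samples):
            continue
        # Skip boolean columns (checked first, because bool is a subclass of int in Python).
        if all(isinstance(v, bool) for v in values):
            continue
        # Exclude numeric columns that hold non-integer values.
        if any(isinstance(v, float) and not v.is_integer() for v in values):
            continue
        counts = Counter(values)
        # Only consider columns with a reasonable number of distinct values.
        if not (min_distinct <= len(counts) <= max_distinct):
            continue
        # Exclude columns whose most common value occurs fewer than min_top_count times.
        if counts.most_common(1)[0][1] < min_top_count:
            continue
        candidates.append(name)
    return candidates


# Toy metadata table: six cells from two samples (made-up values).
metadata = {
    "samples": ["s1", "s1", "s1", "s2", "s2", "s2"],
    "condition": ["ctl", "ctl", "ctl", "trt", "trt", "trt"],
    "seurat_clusters": [0, 0, 1, 1, 0, 0],
    "percent_mt": [0.5, 1.2, 3.3, 0.1, 2.2, 4.0],
    "is_doublet": [True, False, False, False, True, False],
    "barcode": ["a", "b", "c", "d", "e", "f"],
    "louvain": ["A", "A", "A", "B", "A", "B"],
}

print(find_cluster_columns(metadata))  # ['seurat_clusters', 'louvain']
```

In the toy table, condition is dropped as sample-level, percent_mt as non-integer numeric, is_doublet as boolean, and barcode because no value repeats.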

gerbeldo commented 11 months ago

There's an issue with Seurat objects downloaded from Cellenics using the "download rds" button in the Data Management module.

The louvain slot was duplicated, and the values taken as louvain were actually the doublet predictions.

These objects use the cellset key to add the clustering information, which for clusters is "louvain".

In the dataset in question, "louvain" was not the first column in the metadata table, which results in the name being duplicated.

[Screenshots: "Original" and "Re-uploaded"]

gerbeldo commented 11 months ago

I tested two other datasets, with and without sample level metadata, but the issue does not reproduce, so it might be a particularly bad dataset.

alexvpickering commented 11 months ago

Thanks for the report @gerbeldo! It should be fixed now. The issue was the following:

Seurat objects downloaded from Cellenics have no cluster-based active.ident (it is just the project name) and no seurat_clusters, which are the first choices this PR uses for the default clusters
the second choice is just the first identified cluster column. For this particular dataset, that is doublet_class, which was being given the key louvain (we need this key to exist)
there was also a column called louvain that was being assigned the same key causing the issue

There is now an explicit check to make sure that the louvain key isn't used if there is a column with the same name.
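
One way to picture the collision and the check is the sketch below. It is written in Python rather than the PR's R code, and the function name, its parameters, and the choice to let an existing louvain column keep the louvain key are assumptions about how such a check could behave, not a description of the actual fix. The active.ident first choice is left out because it is not a metadata column.

```python
def assign_cellset_keys(cluster_columns, preferred=("seurat_clusters",),
                        default_key="louvain"):
    """Map each detected cluster column to a unique key (hypothetical helper).

    Exactly one column ends up under default_key, since that key must exist.
    """
    if not cluster_columns:
        return {}
    if default_key in cluster_columns:
        # Explicit check: a column already uses the reserved name, so it keeps
        # the key and no other column is renamed to it.
        default_col = default_key
    else:
        # First choice: a preferred column. Second choice: the first detected column.
        default_col = next((c for c in preferred if c in cluster_columns),
                           cluster_columns[0])
    return {col: (default_key if col == default_col else col)
            for col in cluster_columns}


# Dataset from the report: doublet_class is detected first, and louvain also exists.
print(assign_cellset_keys(["doublet_class", "louvain"]))
# {'doublet_class': 'doublet_class', 'louvain': 'louvain'}

# With no louvain column, the first detected column receives the key.
print(assign_cellset_keys(["doublet_class", "custom"]))
# {'doublet_class': 'louvain', 'custom': 'custom'}
```

Without the first branch, both doublet_class and louvain would be mapped to the key louvain, which is the duplication described in the report.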
