How to determine n_clusters for an unseen dataset #23

Winbuntu commented 3 years ago

I noticed that in kolod_pollen_bench.ipynb, cellbench.ipynb, and main_TM.py, the ground-truth number of cell types in the unlabeled dataset is passed to the n_clusters parameter. For example, in main_TM.py:

n_clusters = len(np.unique(unlabeled_data.y))
mars = MARS(n_clusters, params, labeled_data, unlabeled_data, pretrain_data[idx], hid_dim_1=1000, hid_dim_2=100)

In practice, however, the unlabeled dataset has usually not been seen before, so the number of cell types it contains is unknown. How do you recommend choosing n_clusters for an unlabeled dataset that is completely new? If the value of n_clusters is off, would this greatly influence the final cell type labeling?

mbrbic commented 3 years ago

As discussed in the paper, MARS expects the number of clusters as input, and n_clusters cannot be off. This parameter is similar to the resolution in Louvain clustering. You can experiment with different values of n_clusters and validate the resulting clusters by checking their differentially expressed genes. By varying the number of clusters, MARS can be used for a multi-resolution exploration of cell types.
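
The validation step suggested above could look roughly like the following sketch. It does not call MARS: the two label lists are hypothetical stand-ins for the cluster assignments you would get from runs with n_clusters=2 and n_clusters=3, and the expression matrix, the gene names, and the `marker_genes` helper are all made up for illustration. The score is a plain difference of mean expression, used here only as a crude proxy for a proper differential expression test.

```python
from statistics import mean

def marker_genes(expr, labels, gene_names, top_k=1):
    """Rank genes per cluster by (mean inside cluster) - (mean outside cluster).

    This is a simple stand-in for a real differential expression test.
    """
    markers = {}
    for cluster in sorted(set(labels)):
        inside = [row for row, lab in zip(expr, labels) if lab == cluster]
        outside = [row for row, lab in zip(expr, labels) if lab != cluster]
        if not outside:
            # Only one cluster: there is nothing to compare against.
            markers[cluster] = []
            continue
        scores = []
        for g, name in enumerate(gene_names):
            diff = mean(r[g] for r in inside) - mean(r[g] for r in outside)
            scores.append((diff, name))
        scores.sort(reverse=True)
        markers[cluster] = [name for _, name in scores[:top_k]]
    return markers

# Toy data (assumption): 6 cells x 4 genes, three obvious groups.
gene_names = ["geneA", "geneB", "geneC", "geneD"]
expr = [
    [9, 1, 0, 1], [8, 0, 1, 1],   # cells high in geneA
    [1, 9, 0, 1], [0, 8, 1, 1],   # cells high in geneB
    [0, 1, 9, 1], [1, 0, 8, 1],   # cells high in geneC
]

# Hypothetical cluster assignments from two runs with different n_clusters.
candidate_labels = {
    2: [0, 0, 0, 0, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
}

results = {k: marker_genes(expr, labels, gene_names)
           for k, labels in candidate_labels.items()}

for k, markers in results.items():
    print(f"n_clusters={k}: {markers}")
```

If a finer resolution splits a cluster into groups that each have their own distinct top genes, that split is more likely to be biologically meaningful. On real data, a dedicated differential expression tool (for example Scanpy's `rank_genes_groups`) would be a better choice than this mean-difference score.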

Winbuntu commented 3 years ago

I see. Thanks for your clarification!

snap-stanford / mars