hemberg-lab / SC3

A tool for the unsupervised clustering of cells from single cell RNA-Seq experiments
http://bioconductor.org/packages/SC3
GNU General Public License v3.0
118 stars 55 forks

Not clear how SC3 exactly works #75

Closed CodeInTheSkies closed 5 years ago

CodeInTheSkies commented 5 years ago

Hello SC3 authors,

Thanks a lot for a very nice package! I'm trying to use SC3 for 10X data, and although I have been able to successfully run the package and get results with some decent level of understanding of the steps, I'm still missing a thorough intuitive idea as to how exactly SC3 does its clustering. Put another way, if the biologists whose data I'm analyzing ask me to explain it in my upcoming presentation, I will be nervous.

I have browsed through the posts and replies here, and I have also read the paper and the online vignette many times. Still, a good intuitive understanding eludes me, to the point that I feel uncomfortable explaining the method to others.

So, can I please request somebody to explain how it works in simple language?

To make it less time-consuming for anybody wishing to reply to my question, I thought I would explain the current state of my understanding of SC3, just to start the discussion.

SC3 is a consensus clustering method that uses k-means clustering at its core and, by default, repeats the clustering 1000 times with different random starting points. Then it tries to merge all the clustering results and arrive at a consensus result. Here is where it starts becoming vague to me. How exactly do the clustering results get merged? What is the meaning of the consensus plot, the one with red and blue squares, where the ideal situation would be to have full red squares along the diagonal and blue everywhere else?

It would be wonderful if somebody can explain the method intuitively.

Many thanks.

mhemberg commented 5 years ago

Thanks for your interest in SC3 and your questions. Here is an attempt to explain the consensus step:

You can think of the merging as a voting procedure. For each of the individual clustering solutions we create an n x n symmetric binary matrix A_l, where n is the number of cells and l runs from 1 to 1000. If cells i and j are found in the same cluster, element (i, j) is set to 1; otherwise it is 0. We then add all of the A_l matrices to obtain the consensus matrix. This consensus matrix is subsequently clustered, and it is the normalized version that is displayed in red and blue. The matrix that you see has been clustered into k clusters using hierarchical clustering. Note that, in principle, it is not necessary to use the same k for the individual clusterings as for the consensus. In practice, however, the code does not allow the user to set these two parameters to different values.
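To make the voting idea concrete, here is a minimal sketch in plain Python. It is not SC3's actual code: the function names and the three toy clustering runs are made up for illustration, and the only thing it implements is the "one binary matrix per run, then add and normalize" step described above.

```python
def cooccurrence_matrix(labels):
    """Binary n x n matrix: entry (i, j) is 1 if cells i and j share a cluster."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)]
            for i in range(n)]


def consensus_matrix(clusterings):
    """Add the per-run binary matrices and divide by the number of runs."""
    n = len(clusterings[0])
    total = [[0] * n for _ in range(n)]
    for labels in clusterings:
        a = cooccurrence_matrix(labels)
        for i in range(n):
            for j in range(n):
                total[i][j] += a[i][j]
    runs = len(clusterings)
    return [[value / runs for value in row] for row in total]


# Three hypothetical clustering runs over five cells.
# Label values are arbitrary within a run; only co-membership matters.
runs = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
]
consensus = consensus_matrix(runs)

for row in consensus:
    print(" ".join(f"{value:.2f}" for value in row))
```

Each entry of the normalized matrix is the fraction of runs in which two cells were placed together, so values near 1 correspond to red and values near 0 to blue in the plot.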

The ideal situation is to have red squares along the diagonal and blue outside the diagonal. Such a situation suggests that each cell has a high degree of similarity to all other cells within the same block and low similarity to the remaining cells. If there is a lot of red in the off-diagonal region shared by two different squares, that suggests that the cells in the two clusters are similar and that it might be a good idea to merge the two clusters. On the other hand, if there is blue inside a square on the diagonal, it suggests that the cells in that cluster are not homogeneous and that a higher k should be used.
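One rough way to put numbers on "red on the diagonal, blue elsewhere" is to average the consensus values inside and between clusters. The sketch below is only an illustration in plain Python: the 4 x 4 matrix, the labels, and the helper `block_means` are hypothetical and are not part of SC3.

```python
def block_means(consensus, labels):
    """Mean consensus within clusters and between clusters (diagonal excluded)."""
    within, between = [], []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            target = within if labels[i] == labels[j] else between
            target.append(consensus[i][j])

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(within), mean(between)


# Hypothetical normalized consensus matrix for four cells and a final labeling.
consensus = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
final_labels = [0, 0, 1, 1]

within, between = block_means(consensus, final_labels)
print(f"within-cluster mean:  {within:.2f}")
print(f"between-cluster mean: {between:.2f}")
```

A within-cluster mean close to 1 together with a between-cluster mean close to 0 corresponds to the ideal picture; a high between-cluster value for a particular pair of clusters is the "red outside the diagonal" case.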

Hope this helps and please let us know if you have any further questions.

CodeInTheSkies commented 5 years ago

Thank you very much for your detailed explanation! That really helps, and I will carefully read through the material again with your explanation in view for further understanding.

Just another related question I have is about the input to SC3. In my first trial, I gave it the input by creating the SCE object from scratch using the Seurat object's @raw.data slot, as I am using Seurat for all other processing. I am also trying Seurat's clustering, and I found that the clustering was better in Seurat after I regressed out certain effects including cell cycle. Now, I want to compare how SC3 clusters the same regressed-out data. So, I would like to pass on to SC3 the normalized data from Seurat after regressing out certain effects using Seurat's regress-out functionality. As I understand, this data is in Seurat's @scale.data slot after processing. So, my question is: what exactly should I bring in to SC3? Is it OK if I bring in the log-normalized and regressed-out data directly from Seurat? Also then, would this go into the logcounts slot of the SCE object? In this case then, what would go into the "counts" slot? Can the "counts" actually be empty, with values in just the "logcounts" slot?

Thanks!

CodeInTheSkies commented 5 years ago

Just as a further clarification to my question, I now realize from the SC3 manual that SC3 requires both the "counts" and the "logcounts" slots to exist. This is fine, and so my question then is, does the relationship between the "counts" and the "logcounts" slots strictly have to be such that logcounts = log2(counts+1) ?

Or is it that the exact relationship does not matter, e.g., it can be any kind of custom log-transformation function? In my case, the relationship would then represent the "regressed-out log-normalization" according to how Seurat does it. I could then pass the @raw.data slot to the "counts" slot, and the @scale.data slot to the "logcounts" slot. Would this be correct?

Would appreciate any further responses!

CodeInTheSkies commented 5 years ago

Hello!

Would love to hear back on my above question.

Relatedly, I am also wondering, for a given dataset, if one was to compare the clusters obtained using Seurat and SC3, would it actually make more sense (or any sense at all) to input the dimensionality-reduced principal components (PCs) as input to SC3 instead of giving the matrix of expression counts? When I think of it, after all, this is what Seurat's Louvain clustering algorithm looks at, correct? So, to be fair, SC3 should also begin clustering from this point in the pipeline, isn't that so? That is, from the point after dimensionality reduction.

I am just thinking of ways to eliminate as many of the steps that differ between the two algorithms as possible, so that they have a good chance of agreeing on the obtained clusters.

Would appreciate thoughts and suggestions.

Many thanks.

wikiselev commented 5 years ago

Hi, if your regression algorithm does not change 0 values to non-zero values, then it is safe to use those values as your logcounts.
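The condition above can be checked directly before handing the data over. The following is a minimal sketch in plain Python on made-up matrices (rows as genes, columns as cells); `zeros_changed` is a hypothetical helper, not an SC3 or Seurat function, and with real data you would run the equivalent comparison on the exported raw and processed matrices.

```python
def zeros_changed(raw, processed):
    """Count entries that are 0 in the raw matrix but non-zero after processing."""
    changed = 0
    for raw_row, processed_row in zip(raw, processed):
        for raw_value, processed_value in zip(raw_row, processed_row):
            if raw_value == 0 and processed_value != 0:
                changed += 1
    return changed


# Toy raw counts.
raw = [
    [0, 3, 0],
    [5, 0, 1],
]

# A transformation that keeps zeros at zero (log2(count + 1) behaves this way).
kept = [
    [0.0, 2.0, 0.0],
    [2.6, 0.0, 1.0],
]

# A transformation that shifts every value, as a centering step would.
shifted = [
    [-0.4, 1.2, -0.4],
    [1.5, -0.7, 0.3],
]

print(zeros_changed(raw, kept))     # zeros preserved
print(zeros_changed(raw, shifted))  # zeros turned into non-zero values
```

If the count is greater than zero, the processed matrix does not meet the condition stated above.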

Regarding your second question: this is not going to work, since SC3 uses both PCA and spectral components for clustering. SC3 and Seurat are conceptually quite different, so it will be hard to make them look the same or provide identical results.