scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

tl.neighbors ; n_neighbors optimal range? #223

Closed · jayypaul closed this issue 6 years ago

jayypaul commented 6 years ago

Would you say there is an optimal range for n_neighbors in general? And maybe a maximum value that should rarely be exceeded?

I'm trying to optimize Louvain clustering for several datasets, and I'm aiming to automate at least part of the process by sweeping a range of neighbor values (for tl.neighbors) and resolution values (for tl.louvain) while keeping n_pcs constant. Most of my highest-scoring clustering arrangements (measured by the silhouette index) use neighbor values of roughly 22-30. I know these parameters depend on the dataset, but I'm wondering whether I should lower the upper limit (for now it's 30) and then try to optimize the clustering of specific clusters using the restrict_to parameter of the louvain function. The clustering arrangements I have don't seem adequate based on certain markers that I'm plotting across the cells.
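For concreteness, here's a minimal sketch of the sweep I'm running (the parameter grids and n_pcs=50 are just what I happen to use, and I'm calling the neighbors function under its current name, sc.pp.neighbors):

```python
import scanpy as sc
from sklearn.metrics import silhouette_score

# adata: a preprocessed AnnData with PCA already computed (sc.pp.pca)
best = None
for n_neighbors in range(5, 31, 5):  # neighbor values to try
    sc.pp.neighbors(adata, n_neighbors=n_neighbors, n_pcs=50)
    for resolution in (0.4, 0.6, 0.8, 1.0, 1.2):
        sc.tl.louvain(adata, resolution=resolution, key_added='louvain_sweep')
        # score the partition in PCA space with the silhouette index
        score = silhouette_score(adata.obsm['X_pca'], adata.obs['louvain_sweep'])
        if best is None or score > best[0]:
            best = (score, n_neighbors, resolution)

print('best (silhouette, n_neighbors, resolution):', best)
```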

Hope this makes sense.

Best

falexwolf commented 6 years ago

I've never used anything other than n_neighbors=5 for very small datasets (~1000 cells) up to n_neighbors=30 for large datasets. However, I rarely change the default n_neighbors=15 anyway. For very large datasets, in very rare cases, I can imagine that it pays off to go up to n_neighbors=50 or even more, but I've never done this...

I'd say it doesn't actually make a lot of sense to use the silhouette coefficient for evaluation: the Louvain algorithm optimizes modularity, which you can view as the graph-based analogue of the silhouette coefficient (a "ratio" of intra-cluster edges versus inter-cluster edges, as compared to intra-cluster distances versus inter-cluster distances in the silhouette coefficient). Once the graph is computed, there is no point in going back to the feature space to compute topological properties.
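To make the analogy concrete, the standard (Newman) definition of the modularity that Louvain optimizes is

$$Q = \frac{1}{2m}\sum_{ij}\left(A_{ij} - \frac{k_i k_j}{2m}\right)\delta(c_i, c_j),$$

where $A_{ij}$ is the (weighted) adjacency matrix of the knn graph, $k_i$ the degree of node $i$, $m$ the total edge weight, and $\delta(c_i, c_j) = 1$ exactly when cells $i$ and $j$ are in the same cluster. The resolution parameter mentioned below rescales the $k_i k_j / 2m$ null-model term.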

In the end, you're describing the common workflow: you start with some coarse clustering and recluster the parts of the graph in which you want higher resolution. It's not at all surprising that the clustering by default doesn't agree with marker genes: modularity clustering finds densely connected partitions of the graph, information that comes from averaging over all genes.
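A minimal sketch of that reclustering step (the cluster names and resolutions are just illustrative):

```python
import scanpy as sc

# coarse clustering first
sc.tl.louvain(adata, resolution=0.6, key_added='louvain_coarse')

# recluster only clusters '0' and '3' at higher resolution;
# restrict_to takes (obs key, list of categories to re-partition)
sc.tl.louvain(
    adata,
    restrict_to=('louvain_coarse', ['0', '3']),
    resolution=1.2,
    key_added='louvain_refined',
)
```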

jayypaul commented 6 years ago

This is very informative, thanks a lot for the detail.

Cheers

jayypaul commented 6 years ago

But is there a quantitative measure that gives an idea of how the graph is improving or worsening, modularity-wise, as you change a parameter? Or does it just make more sense to go based off marker genes and an apparent need to increase resolution for particularly coarse clusters?

LuckyMD commented 6 years ago

As the clustering is an optimization of modularity, I would argue it makes little sense to use modularity to evaluate the clustering again. Especially at different resolutions, the modularity values you obtain are not really comparable (the resolution parameter is introduced precisely so that you don't optimize pure modularity and get the same result you would otherwise get at resolution 1). A comparison of knn-graph modularity at the same resolution would tell you how inherently modular the graph is. Is that what you want to know? Or what does an 'improved' graph look like to you?
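If that comparison is what you're after, here's a rough sketch (assuming a recent scanpy, which stores the graph in adata.obsp['connectivities'], and a recent networkx):

```python
import networkx as nx
import scanpy as sc

# build the knn graph and cluster it at a fixed resolution
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.louvain(adata, resolution=1.0)

# convert scanpy's weighted connectivity matrix to a networkx graph
g = nx.from_scipy_sparse_array(adata.obsp['connectivities'])

# group cell indices by cluster label
communities = [
    {i for i, lab in enumerate(adata.obs['louvain']) if lab == c}
    for c in adata.obs['louvain'].cat.categories
]

# modularity of this partition on this graph; compare across graphs
# built with different n_neighbors, always at the same resolution
q = nx.algorithms.community.modularity(g, communities)
print(f'modularity at resolution 1.0: {q:.3f}')
```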

I agree with Alex that using the silhouette coefficient wouldn't be much more informative. It would just be an assessment of the overall approach of using a knn graph and modularity optimization. And as that approach has been shown to work quite well, evaluating it with something that works less well (clustering directly in the feature space) feels a bit uninformative.

I would go with what you suggested: evaluating based on marker gene expression. In the end the graph is a tool to describe the biology, so any graph structure means little without it.

If you want to evaluate how well the graph represents the biology, maybe the best way forward would be to infer cell-type labels (or use a dataset with labels) and look at the normalized mutual information between the clusters and the labels. The clusters would have to be obtained in the same way for each graph (e.g. modularity optimization at a fixed resolution). Depending on how specific the labels are, you will get different results though ;).
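For instance (assuming the labels live in adata.obs['cell_type'], which is just an illustrative key):

```python
import scanpy as sc
from sklearn.metrics import normalized_mutual_info_score

# cluster each candidate graph the same way...
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.louvain(adata, resolution=1.0)

# ...then compare the clusters to the known labels
nmi = normalized_mutual_info_score(adata.obs['cell_type'], adata.obs['louvain'])
print(f'NMI(clusters, labels) = {nmi:.3f}')
```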

jayypaul commented 6 years ago

I guess what I mean is a metric to describe how well the data points are clustered in their own cluster relative to every other cluster and/or data point. But what you both said makes sense. Looking at marker expression and cell type classification seems to be the most obvious, practical way to assign clusters. At the end of the day, it's the biology we care about.

Thanks for your detailed explanations, everyone. I'm new to this but am continuing to learn a lot.

Best