scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Louvain clustering on weighted network? #240

Closed ivirshup closed 5 years ago

ivirshup commented 5 years ago

I'm looking at network-based clustering for single-cell data and was wondering about the Louvain implementation used here. Is there a reason scanpy doesn't currently allow clustering on a weighted network (like the one generated by sc.pp.neighbors(adata, knn=False, method="gauss"))? It's my impression that this would be more similar to the cited Phenograph method.

I'd be interested in adding that functionality (and have a proof of concept on my fork), but wanted to hear if there was a reason this wasn't implemented currently first.
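
Concretely, a minimal sketch of what I have in mind (assuming python-igraph and louvain-igraph are installed; where the connectivities live on adata has varied across scanpy/anndata versions):

```python
import igraph as ig
import louvain
import scanpy as sc

adata = sc.datasets.pbmc68k_reduced()

# Dense Gaussian-kernel graph instead of a fixed-degree kNN graph
sc.pp.neighbors(adata, knn=False, method="gauss")

# Build a weighted igraph graph from the connectivity matrix
# (adata.obsp["connectivities"] in recent versions; older versions kept it
# under adata.uns["neighbors"]["connectivities"])
coo = adata.obsp["connectivities"].tocoo()
upper = coo.row < coo.col  # matrix is symmetric; keep each edge once
g = ig.Graph(n=coo.shape[0], edges=list(zip(coo.row[upper], coo.col[upper])))
g.es["weight"] = coo.data[upper]

# Louvain on the weighted graph via louvain-igraph
part = louvain.find_partition(
    g, louvain.RBConfigurationVertexPartition, weights="weight"
)
adata.obs["louvain_weighted"] = [str(m) for m in part.membership]
```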

Thanks for any input!

fidelram commented 5 years ago

Do you know what the advantages of using such a network are for Louvain clustering?

fidelram commented 5 years ago

As a quick test, I tried the weighted version of the Louvain method and it was able to identify small clusters that are not identified with the non-weighted Louvain. However, I did not use knn=False as this does not work well with the UMAP representation. Still, I could see differences (e.g. clusters 6 and 9 in the top figure):

[image: UMAP plots comparing the weighted and non-weighted Louvain clusterings]

LuckyMD commented 5 years ago

@fidelram Can you change the resolution of the non-weighted version to reproduce a clustering similar to the weighted case? Weighting can change the scale of the resolution parameter.
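
E.g., a quick sweep like this (just a sketch; resolution and key_added are existing sc.tl.louvain arguments, and adata is assumed to already have a neighbors graph):

```python
import scanpy as sc

# Match cluster counts between the weighted and non-weighted runs
# before comparing the partitions themselves
for res in (0.4, 0.6, 0.8, 1.0, 1.2):
    sc.tl.louvain(adata, resolution=res, key_added=f"louvain_res{res}")
    print(res, adata.obs[f"louvain_res{res}"].nunique())
```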

I would assume clustering on a weighted, fully connected network would be a lot more computationally expensive. I'm curious if it's worth the additional cost. It may also perform worse, as you are putting more emphasis on the Euclidean distances between transcriptomes than you may want (are the values themselves important, or only the order of the distances?).

ivirshup commented 5 years ago

Intuitively, I'd think having a more complete graph with weighted edges is more representative of the data than an arbitrary k neighbors. Even if you do use a hard cutoff on number of neighbors, I don't see how discounting all distance information would give a more accurate result. I would suspect using a weighted graph could perform better at identifying small subpopulations (where nearest neighbors from other cell types could be common), but that's just conjecture.

From some preliminary attempts of my own, it seems clustering solutions become more stable with respect to parameter choice when I use the method=gauss, knn=False weighted network. I'm still in the process of verifying this, however.

I would also note that the documentation for sc.tl.louvain references this paper (the Phenograph method), which uses the Louvain method on a weighted graph. If the method is cited, why not allow using it?

Just from a package design/usability perspective, I think it's nice to include. It would make the package more flexible and allow the user to take more advantage of the louvain-igraph library. If the user could also specify the kind of partition used, even better.

@LuckyMD, it's definitely more memory intensive, but I'm not sure it's prohibitively computationally expensive. Also, the weights don't have to be based on the Euclidean distance (Phenograph uses Jaccard distances between nodes' neighborhoods), and there's some evidence to suggest we should be using correlation-based distance metrics anyway.
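
For reference, Phenograph's Jaccard weighting is straightforward to sketch on top of a binary kNN adjacency (my own sketch of the idea, not Phenograph's actual code; it computes the overlap for every pair that shares a neighbor):

```python
import numpy as np
import scipy.sparse as sp

def jaccard_weights(adj: sp.csr_matrix) -> sp.csr_matrix:
    """Weight edges by the Jaccard overlap of kNN neighborhoods:
    |N(i) & N(j)| / |N(i) | N(j)|."""
    adj = (adj > 0).astype(np.float64)       # binarize the kNN adjacency
    shared = adj @ adj.T                     # shared[i, j] = |N(i) & N(j)|
    k = np.asarray(adj.sum(axis=1)).ravel()  # neighborhood sizes |N(i)|
    rows, cols = shared.nonzero()
    inter = np.asarray(shared[rows, cols]).ravel()
    union = k[rows] + k[cols] - inter
    return sp.csr_matrix((inter / union, (rows, cols)), shape=adj.shape)
```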

fidelram commented 5 years ago

@LuckyMD I changed the resolution of the method and find that the weighted version still produces different results. In other words, the difference in the clustering does not seem to come from differences in the resolution parameter.

Here is an example. In both cases, 13 clusters are found; however, only the weighted version identifies the small cluster 13:

[image: weighted vs. non-weighted clusterings at matched cluster counts]

Increasing the resolution of the non-weighted method eventually discerns the small cluster 13.

I would vote to add the modifications from @ivirshup

LuckyMD commented 5 years ago

I see no reason why it shouldn't be possible to run the weighted version on the full graph. I'm still curious about the quality of the outcome though.

Using protein-protein interaction data, I've noticed that similarity scores perform worse for clustering than network neighbourhoods based on cutoffs (this does not have to be the case for scRNA-seq, of course). In the latter case you require cells to be each other's nearest neighbours to create dense network regions, rather than relying on highly similar transcriptomes based on one calculation of similarity. I would have thought the cutoff approach is more robust to changing similarity metrics as well. It's definitely worth testing this though. Maybe I'm just too skeptical of similarity metrics overall.

@fidelram do you have labels on your data where you could verify the quality of those two partitions?

fidelram commented 5 years ago

@LuckyMD The data I am using is from the recent single-cell paper on the discovery of ionocytes (Plasschaert, L. et al. 2018). Based on the published markers I can label some clusters: the small cluster 11 contains the ionocytes, and cluster 10 is labeled PNEC/Brush.

LuckyMD commented 5 years ago

@fidelram So, based on that, could you say that the non-weighted method performs better for cluster 10 (the PNEC/Brush cluster), as it is identified in this partition but merged with other cells in cluster 3 in the weighted partition?

I guess it might take a more thorough analysis to make those types of conclusions though.

fidelram commented 5 years ago

@LuckyMD I would say it is the other way around.

However, the differences are not extremely large.

LuckyMD commented 5 years ago

Sorry, I thought you meant the clusters 10 and 11 from the non-weighted version. It would be interesting to see how the labels look for the clusters where the two methods don't agree.

ivirshup commented 5 years ago

So, I'll go ahead and start a pull request?

Something I think could be useful to include in implementing this is allowing multiple network representations with keyed access (similar to use_rep). This would be useful for the cases where you want to cluster on a network that would be inappropriate to use for UMAP.
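
I'm imagining something along these lines (hypothetical at the time of writing; scanpy did later add key_added / neighbors_key arguments that behave this way):

```python
import scanpy as sc

# Keep two graph representations side by side and cluster on either one
sc.pp.neighbors(adata, n_neighbors=15, key_added="knn15")
sc.pp.neighbors(adata, knn=False, method="gauss", key_added="gauss")
sc.tl.louvain(adata, neighbors_key="gauss", key_added="louvain_gauss")
```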

LuckyMD commented 5 years ago

@ivirshup I would wait for @falexwolf to get back from his break and see what he says.

As for your second point, I think there are reservations about making the anndata objects too large. But again, @falexwolf will be able to answer that better than me.

ivirshup commented 5 years ago

Oh, any idea when that is?

As an alternative to allowing more storage, I'd be happy to be able to get a graph returned and later pass it in as an argument; I'm just not sure that fits with the current API. Personally, I feel it's ultimately up to the user how much they store in an object.
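
For what it's worth, a sketch of that workflow using the adjacency argument that sc.tl.louvain exposes (the reweighting here is purely illustrative):

```python
import scanpy as sc

# Any weighted sparse matrix can be passed back in as the graph to cluster on
graph = adata.obsp["connectivities"].copy()
graph.data = graph.data ** 2  # hypothetical reweighting, for illustration only
sc.tl.louvain(adata, adjacency=graph, key_added="louvain_custom")
```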

falexwolf commented 5 years ago

Hi all,

thanks for the nice discussion!

Phenograph is cited not for the specific implementation but for suggesting the use of community detection for clustering in single-cell data. This is what the docs state: "The Louvain algorithm has been proposed for single-cell analysis by [Levine15]."

My opinion on clustering algorithms: use something that respects the topology of the data (points belonging to one cluster should be connected). Any graph clustering algorithm respects that, even spectral clustering. So, I'm not a big fan of trying 5 clustering algorithms to produce sensible results. Either a given representation of the data clusters clearly or it doesn't. If it doesn't, Louvain clustering just gives you one possible, representative partitioning of the data, but there are many others that are equally meaningful. The same holds for other graph clustering algorithms.

Now, running Louvain clustering on a fully connected graph is computationally prohibitive (both in memory and in CPU time).

> Intuitively, I'd think having a more complete graph with weighted edges is more representative of the data than an arbitrary k neighbors. Even if you do use a hard cutoff on number of neighbors, I don't see how discounting all distance information would give a more accurate result. I would suspect using a weighted graph could perform better at identifying small subpopulations (where nearest neighbors from other cell types could be common), but that's just conjecture.

That's just speculation to me. I never saw convincing benchmarks. No one claims that "discounting all distance information gives a more accurate result". It's just that it's computationally cheaper.

I acknowledge that a "non-fixed-degree knn graph", with degree varying between, say, 5 and 100, would be computationally tractable and would carry information about the sampling density of the data in the given representation. This information is only indirectly available in the fixed-degree knn graph (more loops etc. in high-density regions). I never investigated this, as I never saw fundamental results on such a non-fixed-degree knn graph. As it's also hard to benchmark, I'd be afraid of getting into this without the time to get the fundamentals right.

I want to note that even in the context of diffusion processes, we managed to obtain meaningful results with kNN graphs in practice. And this clearly contradicts the fundamental results found in all the Coifman papers.

Having said that: if the code is simple, I don't mind at all having the possibility that you suggest, @ivirshup. Please go ahead with a pull request and I'll see whether the changes are simple enough. The user will still use the plain knn version by default, which is also what is done in Seurat.
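
A sketch of the kind of call this would enable (use_weights, partition_type, and partition_kwargs are the names the feature eventually got in later scanpy releases):

```python
import louvain
import scanpy as sc

# Louvain using the stored edge weights
sc.tl.louvain(adata, use_weights=True, key_added="louvain_weighted")

# Or a custom partition type from louvain-igraph, e.g. CPM
sc.tl.louvain(
    adata,
    partition_type=louvain.CPMVertexPartition,
    partition_kwargs={"resolution_parameter": 0.05},
    key_added="louvain_cpm",
)
```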

But my philosophy stays the same: rather than engineering the clustering or any other aspect of the manifold analysis, one should engineer the representation.

Sorry that this got a bit lengthy and confused. I hope I addressed everything, but I feel I'm still missing something - I'm currently working through a lot of issues.

falexwolf commented 5 years ago

Ah, what I missed is the statement about metrics: I know that many people play around with different metrics. But then you mix "representation engineering" (preprocessing or machine learning) with manifold analysis. I'd say the cleanest approach is to always just use the Euclidean distance; all the other work should already be done before that point.

ivirshup commented 5 years ago

I'd agree with your statement that engineering the representation is more important than the analysis. I view my goal here as allowing more representations as input. Would you mind saying more about why you think using different metrics is less clean (simple?)? I would think that would depend on what representation you're calculating the distances on.
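
For context, sc.pp.neighbors already exposes this choice through its metric argument, e.g.:

```python
import scanpy as sc

# Correlation instead of Euclidean distance, computed on the PCA representation
sc.pp.neighbors(adata, use_rep="X_pca", metric="correlation")
```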

falexwolf commented 5 years ago

I guess we can close with this.

And sorry, I forgot to answer the above:

> I view my goal here as allowing more representations as input.

When I say "representation", I mean a feature space representation, which is directly amenable to differentiable mappings, hence to optimization and learning. When you say "graph representation", that might be a legitimate notion, too; and one can definitely think about learning different graph representations (by transforming them back to a vector space). But to me, it appears much more reasonable and straightforward to do all the inference on the feature space representation. And one should do it in a way such that the metric used to analyze the arising manifold in the learned representation makes sense.

Of course, if you work directly with raw data, using different metrics can capture a lot of what you'd otherwise need to learn or preprocess (invariance to scales, ...).

Hope this helps a little bit.