Intron7 closed 10 months ago
@Intron7 I don't think it's a bug. If the criteria for forming Leiden clusters are not met, the nodes are not merged into aggregated clusters, and we end up with many more clusters than with the Louvain algorithm.
The way the Leiden algorithm works: the nodes in a graph first move to form temporary clusters that maximize the modularity (similar to the modularity optimization phase of the Louvain algorithm). Afterwards, the nodes within each temporary cluster are checked to see whether they are strongly connected (according to the Leiden algorithm's criterion). If not, the nodes in that temporary cluster are not merged to form an aggregated cluster.
One can change the thresholds so that relatively weakly connected nodes still form Leiden clusters. If you have a specific input graph, we can look into it. cc @ChuckHastings
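To make the "maximize the modularity" step concrete, here is a minimal pure-Python sketch of the standard weighted modularity score with a resolution parameter. It is only an illustration of the quantity being optimized, not cuGraph's implementation; the function name `modularity` and the toy edge list are assumptions made for this example.

```python
from collections import defaultdict


def modularity(edges, membership, resolution=1.0):
    """Weighted modularity of an undirected graph for a given partition.

    edges: list of (u, v, weight) tuples, each undirected edge listed once.
    membership: dict mapping node -> cluster id.
    """
    total_weight = 0.0
    internal = defaultdict(float)    # edge weight fully inside each cluster
    degree_sum = defaultdict(float)  # summed weighted degree of each cluster
    for u, v, w in edges:
        total_weight += w
        degree_sum[membership[u]] += w
        degree_sum[membership[v]] += w
        if membership[u] == membership[v]:
            internal[membership[u]] += w
    if total_weight == 0:
        return 0.0
    two_m = 2.0 * total_weight
    return sum(
        internal[c] / total_weight - resolution * (degree_sum[c] / two_m) ** 2
        for c in degree_sum
    )


# Toy graph (hypothetical): two triangles joined by a single bridge edge.
EDGES = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
         (2, 3, 1.0)]
two_clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
singletons = {n: n for n in range(6)}

print(modularity(EDGES, two_clusters))  # about 0.357 (exactly 5/14)
print(modularity(EDGES, singletons))    # negative: every node left on its own
```

On this toy graph, leaving every node in its own cluster scores worse than grouping each triangle, which is the kind of comparison the node-moving phase makes.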
I know that Leiden and Louvain produce different clusterings. But when the number of Leiden clusters goes up from 50 to over 1000 between 23.04 and 23.06, I think there might be some unintentional changes. In my testing it looks like the issue has to do with max_iter. But that's just a hunch.
Thank you for sharing the nice plots. Could you try to run with a smaller resolution?
max_iter indicates the maximum number of graph aggregation steps. If the graph stops changing, say after 5 iterations, the algorithm terminates.
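The termination rule described above can be sketched as a generic loop. This is a hedged illustration only: `run_aggregation` and the toy `step` function are hypothetical names, and the code does not reproduce cuGraph's internals. It shows why raising max_iter past the point of convergence cannot change the result.

```python
def run_aggregation(step, state, max_iter=100):
    """Apply `step` until the state stops changing or `max_iter` passes are done.

    Returns (final_state, passes_executed).
    """
    for passes in range(1, max_iter + 1):
        new_state = step(state)
        if new_state == state:
            # Nothing changed in this pass, so further passes would be no-ops.
            return state, passes
        state = new_state
    return state, max_iter


def toy_step(n_clusters):
    # Hypothetical aggregation pass: halve the cluster count, but never below 4.
    return max(4, n_clusters // 2)


print(run_aggregation(toy_step, 64, max_iter=100))
print(run_aggregation(toy_step, 64, max_iter=10))  # same result as above
print(run_aggregation(toy_step, 64, max_iter=2))   # cut off before convergence
```

If the very first pass already leaves the state unchanged, the loop exits after one pass regardless of max_iter.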
Here is the plot you asked for. The original plot was taken at a resolution of 0.6. Now you can see how Leiden scales with resolution, and that it behaves very erratically.
Here you can see that max_iter doesn't change the plot. My hunch is that the weights don't get updated and stay the same, so the algorithm stops prematurely.
Thank you for sharing the additional plots. Is your dataset public? If so, we would like to run it on our end to investigate further. If the result doesn't change after the 1st iteration, it indicates that no Leiden refinement is happening.
https://github.com/Intron7/rapids_singlecell/blob/main/notebooks/demo_gpu-seuratv3.ipynb
It's this notebook. I can also create the anndata object needed (with all the preprocessing done), upload it to Google Drive, and provide the link.
It would be great if you could kindly upload the data (the input to the Leiden algorithm) to Google Drive. I would like to run it locally.
I also have the same issue with my data that is separate. Solution for leiden clustering would be super helpful!
@johnhickey22 - which version of cugraph are you using?
@ChuckHastings - I am using 23.06.02 - let me know if you need anything else.
@Intron7
I was trying to create a graph from cugraph.h5ad.
I wonder, would it be possible to create a smaller dataset (perhaps in the vicinity of 50 nodes) out of the data you sent where we would still see too many Leiden clusters compared to Louvain?
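One possible way to cut a large graph down to roughly 50 nodes is to take a breadth-first neighbourhood around a chosen node and keep only the edges inside it. The sketch below is pure Python and assumes the graph is available as a list of (source, destination, weight) tuples; `bfs_subgraph` is a hypothetical helper, not part of cuGraph, and whether the reduced graph still reproduces the behaviour would need to be checked.

```python
from collections import defaultdict, deque


def bfs_subgraph(edges, start, max_nodes=50):
    """Collect up to `max_nodes` nodes by BFS from `start`, then return the
    induced edges (relabelled 0..k-1) and the original ids of the kept nodes.

    edges: list of (u, v, weight) tuples for an undirected graph.
    """
    adjacency = defaultdict(list)
    for u, v, _ in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    order = [start]          # kept nodes, in visiting order
    seen = {start}
    queue = deque([start])
    while queue and len(order) < max_nodes:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
                if len(order) == max_nodes:
                    break

    new_id = {old: new for new, old in enumerate(order)}
    sub_edges = [(new_id[u], new_id[v], w)
                 for u, v, w in edges
                 if u in new_id and v in new_id]
    return sub_edges, order
```

A BFS neighbourhood is used here instead of uniform random sampling because it tends to keep the sampled nodes connected, which matters for a sparse neighbour graph.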
You can already see this behaviour with the scanpy.datasets.pbmc68k_reduced() dataset. It's 8 clusters for Louvain and 54 for Leiden clustering with resolution=0.6.
Edit:
Here is a test example
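import rapids_singlecell as rsc
import scanpy as sc
# Small public dataset bundled with scanpy
adata = sc.datasets.pbmc68k_reduced()
# GPU clustering; the plot below reads the results from the "leiden" and "louvain" keys
rsc.tl.leiden(adata, resolution=0.6)
rsc.tl.louvain(adata, resolution=0.6)
# CPU reference clustering from scanpy, stored under a separate key
sc.tl.leiden(adata, resolution=0.6, key_added="cpu_leiden")
sc.pl.umap(adata, color=["louvain", "leiden", "cpu_leiden"])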
The CPU version of Leiden gives me 9 clusters.
The bug still exists in RAPIDS 23.08.
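For comparing the labelings numerically instead of only by eye on the UMAP, a small helper along these lines could be used. It is a sketch in plain Python; `summarize_clusters` is a hypothetical name, and the AnnData usage in the trailing comment assumes the keys from the example above.

```python
from collections import Counter


def summarize_clusters(labels):
    """Return the number of clusters, the largest cluster size, and the
    number of single-member clusters for one labeling."""
    sizes = Counter(labels)
    return {
        "n_clusters": len(sizes),
        "largest": max(sizes.values(), default=0),
        "singletons": sum(1 for size in sizes.values() if size == 1),
    }


# Assumed usage with the AnnData object from the example above:
# for key in ["louvain", "leiden", "cpu_leiden"]:
#     print(key, summarize_clusters(adata.obs[key]))
```

A high singleton count alongside a large cluster total would suggest that many nodes were never merged into aggregated clusters.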
Any update here - was it solved in another thread? Just wanted to check in, in case I missed something. Would love to integrate this into my analysis pipeline and just waiting for this clustering issue to be solved. Thanks!
Hi, we are working on it. We will keep you updated. Thanks for your interest.
Hi, we have pushed new changes to branch-23.12 that should fix the issues.
Looks good to me now
@Intron7 Thank you for reporting.
Version
23.06 - 23.06.02
Which installation method(s) does this occur on?
Conda
Describe the bug.
Leiden clustering produces over 1000 clusters. This is in contrast to 23.04, where I got around 30-40 for the same test dataset. Louvain clustering gives me 23 clusters.
Minimum reproducible example
Relevant log output
No response
Environment details
Other/Misc.
This happens on 4 tested systems with both 3090s and A100s.