Closed by vtraag 8 months ago
Thanks for the update! I'm not sure if we'll be able to migrate very easily though. We allow users to choose the quality function, and use the leidenalg.RBConfigurationVertexPartition
as the default. We've also been considering using the multiplex partitioning methods.
The Leiden algorithm from igraph only implements two quality functions: CPM and modularity. Both indeed use resolution parameters, and are the equivalents of CPMVertexPartition and RBConfigurationVertexPartition. The reason the igraph implementation is more performant than leidenalg is exactly that it does not provide other quality functions or a multiplex approach. So if you need the other quality functions (such as significance or surprise), or a multiplex approach, you will want to stick with the leidenalg package after all.
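For reference, a minimal sketch of the correspondence between the two (the example graph and resolution value here are illustrative, not from scanpy):

import igraph as ig
import leidenalg as la

# Small example graph (Zachary karate club); the resolution value is illustrative only
g = ig.Graph.Famous("Zachary")

# leidenalg: modularity with a resolution parameter (RBConfigurationVertexPartition)
part_la = la.find_partition(g, la.RBConfigurationVertexPartition, resolution_parameter=1.0)

# python-igraph >= 0.8: built-in Leiden with the equivalent quality function
part_ig = g.community_leiden(objective_function="modularity", resolution_parameter=1.0)

print(len(part_la), len(part_ig))  # number of clusters found by each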
If we can cleanly switch to the igraph implementation for modularity with weights, it could make sense for that to be the default. Any chance you could point me to some benchmarks on performance? An initial test looks very impressive!
I haven't performed an in-depth benchmark comparison, but the results from a single run of modularity detection on an example (a Facebook graph) are sufficiently revealing, I think:
Running Leiden 0.7.0.post1+71.g14ba1e4.dirty
Running igraph 0.8.0
Read graph (n=63731,m=817035), starting community detection.
leidenalg: t=8.048258741036989, m=0.6175825273363675
igraph community_leiden: t=1.159165252931416, m=0.6298702028415605
This is only a relatively small graph, and the difference is likely to be even bigger for larger graphs. Perhaps the igraph
Leiden algorithm can indeed be the default, with leidenalg
being an optional choice or something?
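For anyone wanting to reproduce this kind of comparison, a rough sketch (the edge-list path is a placeholder, not the actual benchmark script used above):

import time
import igraph as ig
import leidenalg as la

# Placeholder: load an undirected graph from an edge list file
g = ig.Graph.Read_Edgelist("facebook_edges.txt", directed=False)

t0 = time.perf_counter()
part_la = la.find_partition(g, la.ModularityVertexPartition)
print("leidenalg:", time.perf_counter() - t0, "s, modularity =", part_la.quality())

t0 = time.perf_counter()
part_ig = g.community_leiden(objective_function="modularity")
print("igraph community_leiden:", time.perf_counter() - t0, "s, clusters =", len(part_ig))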
I have a large single cell dataset where I'd love to speed up the Leiden clustering.
Is the igraph version of Leiden implemented in scanpy? Just wondering if this might be a potential area for speed gains.
Thanks!
This hasn't been implemented yet, but a pull request would be welcome.
There would also have to be documentation about the change in results and how to get the previous behavior. Some performance benchmarks would also be great.
So there are definite speed gains to be had with the igraph implementation of Leiden clustering. But the results are not exactly the same.
I have run igraph straight from adata by piggybacking on scanpy utility functions with the following code...
import pandas as pd
import scanpy as sc

adjacency = sc._utils._choose_graph(adata, obsp=None, neighbors_key=None)
g = sc._utils.get_igraph_from_adjacency(adjacency)
# igraph's built-in Leiden; note: no edge weights passed here
clustering = g.community_leiden(objective_function='modularity', resolution_parameter=0.15)
adata.obs['leiden_igraph'] = pd.Series(clustering.membership, dtype='category', index=adata.obs.index)
On my spatial data (I only mention this to explain the poor cell-cell separation in the results!) igraph is 5.86x faster at clustering a 185,000 cell dataset vs. sc.tl.leiden(adata, resolution=0.15). The two methods output different numbers of clusters, but the results broadly make sense biologically for both.
On the PBMC 3k dataset from the scanpy tutorials there is practically no speed difference (too small a dataset to matter). Both methods output the same number of clusters, though there are a few differences in cell cluster assignments.
The Leiden algorithm is now included in the latest release of python-igraph, version 0.8.0. I believe this alleviates the need to depend on the leidenalg package. The Leiden algorithm provided in python-igraph is substantially faster than the leidenalg package. It is simpler, providing fewer options, but I believe the more extensive options of the leidenalg package are not necessarily needed for the purposes of scanpy. We provide binary wheels on PyPI, and binaries for conda are available from the conda-forge channel, also for Windows.
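A quick way to check that the installed python-igraph already ships this implementation (a sketch; 0.8.0 is the minimum version per the note above):

import igraph

print(igraph.__version__)
# Graph.community_leiden is available from python-igraph 0.8.0 onwards
print(hasattr(igraph.Graph, "community_leiden"))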
I have now done a speed comparison on an adata object of 1.85 million cells.
igraph on adata, as implemented above, ran in 33 minutes, vs. sc.tl.leiden(), which took ~14 hours.
@vtraag @ivirshup I accidentally ran the igraph community_leiden in a for loop, 4 times with the same settings, and noticed the Leiden output differed slightly each time. Is there a parameter to set to ensure the same output each time? A random seed maybe?
More generally, any advice on which settings to use with the igraph community_leiden call to replicate sc.tl.leiden? It is not clear to me what defaults sc.tl.leiden passes to leidenalg.
Thanks!
Good to see the large speed gains! Do note that g.community_leiden(objective_function='modularity', resolution_parameter=0.15) won't use any edge weights by default. Possibly sc.tl.leiden() does use edge weights? That might explain some of the discrepancies you see. Another difference is that community_leiden only runs on undirected graphs, while leidenalg will also work on directed graphs, which makes a difference when using modularity. This could perhaps also explain some of the discrepancies.
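A sketch of how both differences could be handled on the igraph side (assuming the graph carries a 'weight' edge attribute as in the snippets above; collapsing reciprocal edges by summing their weights is an assumption, not necessarily what leidenalg does):

# Collapse a directed graph to undirected before clustering, summing reciprocal edge weights
if g.is_directed():
    g = g.as_undirected(combine_edges="sum")

# Pass the edge weights explicitly; community_leiden does not use them unless asked
clustering = g.community_leiden(
    objective_function="modularity", weights="weight", resolution_parameter=0.15
)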
I accidentally ran the igraph community_leiden in a for loop, 4 times with the same settings, and noticed the Leiden output differed slightly each time. Is there a parameter to set to ensure the same output each time? A random seed maybe?
Yes, you can set a random seed. You can set the RNG in igraph using set_random_number_generator. When simply passing the standard Python random module, you can set the seed using random.seed.
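A minimal sketch of that, for anyone who wants repeated community_leiden calls to give identical output:

import random
import igraph

# Make igraph draw its random numbers from Python's random module, then fix the seed
igraph.set_random_number_generator(random)
random.seed(0)

clustering = g.community_leiden(objective_function="modularity", weights="weight")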
I'm not sure the objective should be to get the same output from two runs. I recall louvain having a low average NMI (below 0.3 or something like that?) for multiple runs at different random seeds. In the end this is an NP-hard problem with a good heuristic solution that is affected by the random ordering of nodes. Maybe it would be more useful to look at multiple runs of sc.tl.leiden and multiple runs of Leiden from python-igraph and compare the outputs in terms of NMI to one another, or to some baseline label assignment?
Getting identical output from two runs without using the same seed is indeed unlikely in many cases. In some cases you really do want identical output, so setting a seed can then be useful.
In addition, it would be good to ensure that the approach is at least conceptually the same for both the python-igraph implementation and sc.tl.leiden. But the suggestion of taking multiple runs and comparing them makes sense!
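A sketch of that comparison, using compare_communities from python-igraph to compute NMI between runs at different seeds (assuming g is the graph from the snippets above):

import random
from igraph import compare_communities

memberships = []
for seed in range(4):
    random.seed(seed)
    part = g.community_leiden(objective_function="modularity", weights="weight")
    memberships.append(part.membership)

# Pairwise NMI between runs; 1.0 means two runs produced identical partitions
for i in range(len(memberships)):
    for j in range(i + 1, len(memberships)):
        print(i, j, compare_communities(memberships[i], memberships[j], method="nmi"))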
@vtraag, could you comment on what about this implementation makes it faster? I'm wondering how much of the speed gain could be from the number of iterations.
Some other notes:
The leidenalg package was originally built for flexibility, and you can easily plug in new quality functions. As a result, some of the admin work is done less efficiently than it could be. In igraph there is less flexibility, so the implementation can be made more efficient.
Additionally, some of the iteration over neighbours in the leidenalg package is currently less efficient than how it is implemented in igraph. This could be improved, but it requires quite some rewriting, for which I would first need to find the time. I'm not sure how large the immediate speed gains would be.
For the 3k test dataset, introducing edge weights with
import random
import pandas as pd
import scanpy as sc

# Seed Python's random module, which python-igraph uses by default
random.seed(1234)
adjacency = sc._utils._choose_graph(adata, obsp=None, neighbors_key=None)
g = sc._utils.get_igraph_from_adjacency(adjacency)
# Same as before, but now passing the edge weights stored on the graph
clustering = g.community_leiden(objective_function='modularity', weights='weight')
adata.obs['leiden_igraph_edge_weighted'] = pd.Series(clustering.membership, dtype='category', index=adata.obs.index)
leads to a clustering more similar to that of sc.tl.leiden.
Setting the seed makes everything reproducible.
Including igraph edge weights does not seem to impact run times on my larger datasets vs. igraph without weights.
I suspect a lot of the runtime increase may come from the number of iterations.
Using the default value, n_iterations=2, g.community_leiden("modularity", weights="weight") took ~2.5 minutes.
Using n_iterations=-1, which is what scanpy does, g.community_leiden("modularity", weights="weight", n_iterations=-1) took ~1.5 hours.
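A sketch of that timing comparison, assuming g is the weighted igraph graph built as above:

import time

for n_iter in (2, -1):  # igraph's default vs. scanpy's "run until convergence"
    t0 = time.perf_counter()
    part = g.community_leiden(objective_function="modularity", weights="weight", n_iterations=n_iter)
    print(f"n_iterations={n_iter}: {time.perf_counter() - t0:.1f}s, {len(part)} clusters")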
I am unsure why we set n_iterations=-1
, since that's not what leidenalg
does by default. I'm not seeing much discussion of why it's done this way in the issue (#350) or PR (#361)...
Ah right, you meant those iterations, of course. Yes, that can definitely make a large difference! When comparing the speed of the two implementations, the number of iterations should of course be identical.
@vtraag @ivirshup I accidentally ran the igraph community_leiden in a for loop, 4 times with the same settings, and noticed the Leiden output differed slightly each time. Is there a parameter to set to ensure the same output each time? A random seed maybe? More generally, any advice on which settings to use with the igraph community_leiden call to replicate sc.tl.leiden? It is not clear to me what defaults sc.tl.leiden passes to leidenalg. Thanks!
As mentioned above, you should set the RNG. AFAIK scanpy does that by default.
It would probably be good to see how important the n_iterations parameter is for our data. Hopefully after the first couple of iterations it's only a few points out of the million shuffling around per iteration. I'll try to take a closer look when I can, but basically need to try something like:
from typing import Optional
import random
import igraph


def iteratively_cluster(
    g: igraph.Graph,
    *,
    n_iterations: int = 10,
    random_state: int = 0,
    leiden_kwargs: Optional[dict] = None,
) -> list:
    """Run community_leiden one iteration at a time, keeping each intermediate partition."""
    random.seed(random_state)
    _leiden_kwargs = {"objective_function": "modularity", "weights": "weight"}
    _leiden_kwargs.update(leiden_kwargs or {})
    # First iteration starts from the singleton partition
    partition = g.community_leiden(n_iterations=1, **_leiden_kwargs)
    steps = [partition]
    # Each further iteration starts from the previous membership
    for _ in range(n_iterations - 1):
        partition = g.community_leiden(
            n_iterations=1, initial_membership=partition.membership, **_leiden_kwargs
        )
        steps.append(partition)
    return steps
My suspicion (and hope) would be that unstable clusters / points are the ones that drag on the optimization process. E.g. groups that aren't maintained when you change the random seed also aren't maintained through later iterations.
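One way to check that, comparing consecutive partitions from the sketch above with python-igraph's compare_communities (an NMI of 1.0 means nothing moved between two iterations):

from igraph import compare_communities

steps = iteratively_cluster(g, n_iterations=10)
for i, (earlier, later) in enumerate(zip(steps, steps[1:]), start=1):
    nmi = compare_communities(earlier.membership, later.membership, method="nmi")
    print(f"iteration {i} -> {i + 1}: {len(later)} clusters, NMI vs previous = {nmi:.4f}")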
Setting n_iterations=-1 in g.community_leiden certainly impacts run time (vs. the default n_iterations=2), making runtimes more similar to sc.tl.leiden(). For large datasets though, run times with g.community_leiden still appear faster.
The average of 4 Leiden runs on my 185,000 cell subsampled dataset:
sc.tl.leiden: 11.5 minutes
g.community_leiden: 9.5 minutes
1 Leiden run on my 1,850,000 cell subsampled dataset:
sc.tl.leiden: 11 hours, 26 minutes
g.community_leiden: 7 hours, 30 minutes
@mezwick, I don't fully understand something. The timings you report seem to be lower for sc.tl.leiden
than for g.community_leiden
, suggesting that g.community_leiden
is slower than sc.tl.leiden
, but in your comments you say that g.community_leiden
appears faster. Am I misunderstanding something? Could you perhaps clarify?
Sorry, I made a critical typo in the time reports, where I listed the functions the wrong way round. I have updated the comment to correct this.
To be clear: g.community_leiden is faster than sc.tl.leiden in my case, particularly for large datasets.
Hi @vtraag @ivirshup,
I have been unable to get the leiden clustering to run on a large 18.5 million cell dataset with either sc.tl.leiden()
or the python-igraph
implementation outlined above. Smaller subsets of the data run fine. I think the problem lies somewhere with weights.
When I run
obs_name = f"leiden {leiden_suffix}"
sc.tl.leiden(adata, resolution=0.1, n_iterations=2, key_added=obs_name)
I get the following traceback
Traceback (most recent call last):
File "/opt/notebooks/output/../scripts/sc_cluster.py", line 92, in <module>
adjacency = sc._utils._choose_graph(adata, obsp=None, neighbors_key=None)
File "/opt/conda/envs/scanpy_py3pt9/lib/python3.9/site-packages/scanpy/_utils/__init__.py", line 219, in get_igraph_from_adjacency
g.es['weight'] = weights
SystemError: /opt/conda/conda-bld/python-split_1649141344976/work/Objects/listobject.c:138: bad argument to internal function
When I run the python-igraph Leiden implementation like so...
obs_name = f"leiden {leiden_suffix}"
g = sc._utils.get_igraph_from_adjacency(adjacency)
clustering = g.community_leiden(
objective_function="modularity",
resolution_parameter=0.1,
weights = 'weight',
n_iterations=2
)
adata.obs[obs_name] = (
pd.Series(
clustering.membership,
dtype='category',
index=adata.obs.index
)
)
I get the following, similar traceback
UMAP leiden clustering resolution 0.1 commenced 12:27:06.619473
running Leiden clustering
Traceback (most recent call last):
File "/opt/notebooks/output/../scripts/sc_cluster.py", line 88, in <module>
obs_name = f"leiden {leiden_suffix}"
File "/opt/conda/envs/scanpy_py3pt9/lib/python3.9/site-packages/scanpy/tools/_leiden.py", line 129, in leiden
g = _utils.get_igraph_from_adjacency(adjacency, directed=directed)
File "/opt/conda/envs/scanpy_py3pt9/lib/python3.9/site-packages/scanpy/_utils/__init__.py", line 219, in get_igraph_from_adjacency
g.es['weight'] = weights
SystemError: /opt/conda/conda-bld/python-split_1649141344976/work/Objects/listobject.c:138: bad argument to internal function
The script runs completely fine when I subsample the adata with sc.pp.neighbors. So far, I have managed to run a 0.5 fraction of the cells (9.25 million cells). On the cluster I am trying to run this on, memory and processing power are not limiting.
Any advice on what might be going wrong here?
Thanks!
This issue appears to be due to an overflow in the number of edges in my graph. Fixing it will probably need to wait until igraph 0.10, which should handle 64-bit integers and so be able to handle far larger graphs: https://github.com/igraph/python-igraph/issues/531#issuecomment-1102792526
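A quick sanity check along those lines (a sketch, assuming adjacency is the scipy sparse neighbour graph from the snippets above):

# How big is the graph we are asking igraph to build?
n_cells = adjacency.shape[0]
n_edges = adjacency.nnz  # stored non-zero entries of the sparse neighbour graph
print(f"{n_cells:,} cells, {n_edges:,} edges")
# igraph < 0.10 uses 32-bit integers internally, which caps how large a graph it can represent
print("fraction of 32-bit integer range used by edges:", n_edges / (2**31 - 1))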
When version 0.10 of the C core of igraph is released, and a new release of the Python interface builds on it (i.e. includes 64-bit support), I will also update the leidenalg implementation to follow suit. In principle, leidenalg already works with 64-bit integers, but since it builds on igraph, it is limited by that.
@ivirshup if it's still faster with the same value of n_iterations
, then why not make the switch?
Could also look at using https://igraph.org/c/doc/igraph-Layout.html#igraph_layout_umap_compute_weights while we're at it with python-igraph
@ivirshup The code diff is here but I'll explain the logic of the changes.
igraph's implementation does not allow directed graphs; we throw a ValueError if someone tries to do directed + igraph.
igraph's default number of iterations is 2, so that changed as well. I don't think we should swap it back to -1 for leidenalg.
use_igraph is now True (and a new argument).
I'll look into some larger datasets. Also there is no plotting test from what I see. Should we add that?