scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.92k stars 602 forks source link

Leiden now included in python-igraph #1053

Closed vtraag closed 8 months ago

vtraag commented 4 years ago

The Leiden algorithm is now included in the latest release of python-igraph, version 0.8.0. I believe this alleviates the need to depend on the leidenalg packages. The Leiden algorithm provided in python-igraph is substantially faster than the leidenalg package. It is simpler though, providing fewer options, but I believe the more extensive options of the leidenalg package are not necessarily needed for the purposes of scanpy. We provide binary wheels on PyPI and binaries for conda are available from the conda-forge channel, also for Windows.

ivirshup commented 4 years ago

Thanks for the update! I'm not sure if we'll be able to migrate very easily though. We allow users to choose the quality function, and use the leidenalg.RBConfigurationVertexPartition as the default. We've also been considering using the multiplex partitioning methods.

vtraag commented 4 years ago

The Leiden algorithm from igraph only implements two quality functions: CPM and modularity. Both indeed use resolution parameters, and are the equivalent of CPMVertexPartition and RBConfigurationVertexPartition. The reason the igraph implementation is more performant than leidenalg is exactly because it does not provide other quality functions or a multiplex approach. If you need the other quality functions (such as significance or surprise), you will need to stick to leidenalg. So if you need those other quality functions, or a multiplex approach, you might still want to stick to the leidenalg package after all.

ivirshup commented 4 years ago

If we can cleanly switch to the igraph implementation for modularity with weights, it could make sense for that to be the default. Any chance you could point me to some benchmarks on performance? An initial test looks very impressive!

vtraag commented 4 years ago

I haven't performed an in-depth benchmark comparison. But results from a single run of modularity detection on an example (a Facebook graph) is sufficiently revealing I think:

Running Leiden 0.7.0.post1+71.g14ba1e4.dirty
Running igraph 0.8.0
Read graph (n=63731,m=817035), starting community detection.
leidenalg: t=8.048258741036989, m=0.6175825273363675
igraph community_leiden: t=1.159165252931416, m=0.6298702028415605

This is only a relatively small graph, and the difference is likely to be even bigger for larger graphs. Perhaps the igraph Leiden algorithm can indeed be the default, with leidenalg being an optional choice or something?

mezwick commented 2 years ago

I have a large single cell set, where i'd love to speed up the Leiden clustering.

Is it the igraph version of leiden implemented in scanpy? Just wondering if this might be a potential area for speed gains.

Thanks!

ivirshup commented 2 years ago

This hasn't been implemented yet, but a pull request would be welcome.

There would also have to be documentation about changing results and how to get previous behavior. Some benchmarks of performance would also be great.

mezwick commented 2 years ago

So there are definite speed gains to be had with the igraph implementation of Leiden clustering. But the results are not exactly the same.

I have run igraph straight from adata by piggy backing scanpy utility functions with the following code...

adjacency = sc._utils._choose_graph(adata, obsp=None, neighbors_key=None)
g = sc._utils.get_igraph_from_adjacency(adjacency)
clustering = g.community_leiden(objective_function='modularity', resolution_parameter=0.15)
adata_analyse.obs['leiden_igraph'] = pd.Series(clustering.membership, dtype='category', index=adata_analyse.obs.index)

On my spatial data (only mention this to explain the poor cell-cell separation in the results!) igraph is 5.86x faster at clustering a 185,000 cell dataset vs. sc.tl.leiden(adata, resolution=0.15). The methods output different cluster numbers, but the results broadly make sense for both outputs biologically speaking. image

On the PBMC 3k dataset from scanpy tutorials, there is no practically speed difference (too small a dateset to matter). The cluster number output from both methods is the same for both methods. There are a few differences in cell cluster assignment though. image

mezwick commented 2 years ago

The Leiden algorithm is now included in the latest release of python-igraph, version 0.8.0. I believe this alleviates the need to depend on the leidenalg packages. The Leiden algorithm provided in python-igraph is substantially faster than the leidenalg package. It is simpler though, providing fewer options, but I believe the more extensive options of the leidenalg package are not necessarily needed for the purposes of scanpy. We provide binary wheels on PyPI and binaries for conda are available from the conda-forge channel, also for Windows.

I have now done a speed comparison with adata object of 1.85 million cells.

igraph on adata as implemented above ran in 33 minutes vs sc.tl.leiden() which took ~14 hours

mezwick commented 2 years ago

@vtraag @ivirshup I accidentally ran the igraph_community_leiden in a for loop, 4 times with the same settings and noticed the leiden output differed slightly each time. Is there a parameter to set to ensure the same output each time? randomseed maybe?

More generally, any advice on which settings to implement with the igraph_community_leiden call to replicate sc.tl.leiden? It is not clear to me what some of the sc.tl.leiden default calls are to leidenalg.

Thanks!

Thanks!

vtraag commented 2 years ago

Good to see the large speed gains! Do note that g.community_leiden(objective_function='modularity', resolution_parameter=0.15) won't use any edge weights by default. Possibly, sc.tl.leiden(), might use edge weights? This might explain some of the discrepancies you might see. Another difference is that community_leiden only runs on undirected graphs, while leidenalg will also work on directed graphs, which makes a difference when using modularity. This could perhaps also explain some of the discrepancies.

vtraag commented 2 years ago

I accidentally ran the igraph_community_leiden in a for loop, 4 times with the same settings and noticed the leiden output differed slightly each time. Is there a parameter to set to ensure the same output each time? randomseed maybe?

Yes, you can set a random seed. You can set the RNG in igraph using set_random_number_generator. When simply passing the standard random Python library, you can set the seed using random.seed.

LuckyMD commented 2 years ago

I'm not sure the objective should be to get the same output from two runs. I recall louvain having a low average NMI (below 0.3 or sth like that?) for multiple runs at different random seeds. In the end this is an NP-hard problem with a good heuristic solution that is affected by the random ordering of nodes. Maybe it would be more useful to look at multiple runs of sc.tl.leiden and multiple runs of leiden from python-igraph and compare the outputs in terms of NMI to one another or to some baseline label assignment?

vtraag commented 2 years ago

Getting identical output from two runs without using the same seed is unlikely in many cases indeed. In some cases, you really do want to get identical output, so then setting a seed could be useful.

In addition, it would be good to ensure that at least the approach is conceptually the same for both the python-igraph implementation and sc.tl.leiden. But the suggestion of taking multiple runs and compare them makes sense!

ivirshup commented 2 years ago

@vtraag, could you comment on what about this implementation makes it faster? I'm wondering how much the speed gains could be from number of iterations.

Some other notes:

vtraag commented 2 years ago

The leidenalg package was originally built for flexibility, and you can easily plugin new quality functions. As a result, some of the admin stuff is being done less efficiently than it could be done. In igraph, there is less flexibility, so that the implementation can be made more efficient.

Additionally, some of the iteration over neighbours in the leidenalg package is less efficient than how it is implemented in igraph at the moment. This could be made more efficient though, but it is something that requires quite some rewriting, for which I would first need to find the time. I'm not sure how large the speed gains of this would be immediately.

mezwick commented 2 years ago

For the 3k test dataset, introducing edge weights with

import random
random.seed(1234) 

adjacency = sc._utils._choose_graph(adata, obsp=None, neighbors_key=None)
g = sc._utils.get_igraph_from_adjacency(adjacency)
clustering = g.community_leiden(objective_function='modularity', weights='weight')
adata.obs['leiden_igraph_edge_weighted'] = pd.Series(clustering.membership, dtype='category', index=adata.obs.index)

Leads to a more similar clustering to sc.tl.leiden.

Setting the seed makes all reproducible. image

Including igraph edge weights does not seem to impact run times on my larger datasets vs. igraph without weights.

ivirshup commented 2 years ago

I suspect a lot of the runtime increase may come from the number of iterations.

Using the default value, n_interations=2, g.community_leiden("modularity", weights="weight") took ~2.5 minutes.

Using n_iterations=-1, which is what scanpy does, g.community_leiden("modularity", weights="weight", n_iterations=-1) took ~1.5 hours.

I am unsure why we set n_iterations=-1, since that's not what leidenalg does by default. I'm not seeing much discussion of why it's done this way in the issue (#350) or PR (#361)...

vtraag commented 2 years ago

Ah right, you meant those iterations, of course. Yes, that definitely can make a large difference! When comparing the speed of both implementations, indeed the number of iterations should of course be identical.

dawe commented 2 years ago

@vtraag @ivirshup I accidentally ran the igraph_community_leiden in a for loop, 4 times with the same settings and noticed the leiden output differed slightly each time. Is there a parameter to set to ensure the same output each time? randomseed maybe?

More generally, any advice on which settings to implement with the igraph_community_leiden call to replicate sc.tl.leiden? It is not clear to me what some of the sc.tl.leiden default calls are to leidenalg.

Thanks!

Thanks!

As mentioned below, you should set the RNG. AFAIK scanpy does that by default.

ivirshup commented 2 years ago

It would probably good to see how important the n_iterations parameter is for our data. Hopefully after the first couple iterations it's only a few points of the million shuffling around per iteration. I'll try to take a closer when I can, but basically need to try something like:

def iterativley_cluster(
    g: igraph.Graph,
    *,
    n_iterations: int = 10,
    random_state: int = 0,
    leiden_kwargs: dict = {}
) -> list:
    import random
    random.seed(random_state)

    _leiden_kwargs = {"objective_function": "modularity", "weights": "weight"}
    _leiden_kwargs.update(leiden_kwargs)

    partition = g.community_leiden(n_iterations=1, **_leiden_kwargs)

    steps = [partition]
    for _ in range(n_iterations-1):
        partition = g.community_leiden(n_iterations=1, initial_membership=partition.membership, **_leiden_kwargs)
        steps.append(partition)

    return steps

My suspicion (and hope) would be that unstable clusters / points are the ones that drag on the optimization process. E.g. groups that aren't maintained when you change the random seed also aren't maintained through later iterations.

mezwick commented 2 years ago

Setting n_iterations=-1 in g.community_leiden certainly impacts run time (vs. default n_iterations=2), making runtimes more similar to sc.tl.leiden(). For large datasets though, run times with g.comunity_leiden still appear faster.

The average of 4 leiden runs on my 185,000 cell subsampled dataset: sc.tl.leiden, 11.5 minutes g.community_leiden, 9.5 minutes

1 leiden run on my 1,850,000 cell subsampled dataset: sc.tl.leiden, 11 hours, 26 minutes
g.community_leiden, 7 hours, 30 minutes

vtraag commented 2 years ago

@mezwick, I don't fully understand something. The timings you report seem to be lower for sc.tl.leiden than for g.community_leiden, suggesting that g.community_leiden is slower than sc.tl.leiden, but in your comments you say that g.community_leiden appears faster. Am I misunderstanding something? Could you perhaps clarify?

mezwick commented 2 years ago

Sorry, i made a critical typo in the time reports, where i listed the functions the wrong way round. I have updated the comment to correct this.

To be clear. g.community_leiden is faster than sc.tl.leiden in my case, particulalrly for large datasets.

Setting n_iterations=-1 in g.community_leiden certainly impacts run time (vs. default n_iterations=2), making runtimes more similar to sc.tl.leiden(). For large datasets though, run times with g.comunity_leiden still appear faster.

The average of 4 leiden runs on my 185,000 cell subsampled dataset: sc.tl.leiden, 11.5 minutes g.community_leiden, 9.5 minutes

1 leiden run on my 1,850,000 cell subsampled dataset: sc.tl.leiden, 11 hours, 26 minutes g.community_leiden, 7 hours, 30 minutes

mezwick commented 2 years ago

Hi @vtraag @ivirshup,

I have been unable to get the leiden clustering to run on a large 18.5 million cell dataset with either sc.tl.leiden() or the python-igraph implementation outlined above. Smaller subsets of the data run fine. I think the problem lies somewhere with weights.

When i run

obs_name = f"leiden {leiden_suffix}"
sc.tl.leiden(adata, resolution=0.1, n_iterations=2, key_added=obs_name)

I get the following traceback

Traceback (most recent call last):
  File "/opt/notebooks/output/../scripts/sc_cluster.py", line 92, in <module>
    adjacency = sc._utils._choose_graph(adata, obsp=None, neighbors_key=None)
  File "/opt/conda/envs/scanpy_py3pt9/lib/python3.9/site-packages/scanpy/_utils/__init__.py", line 219, in get_igraph_from_adjacency
    g.es['weight'] = weights
SystemError: /opt/conda/conda-bld/python-split_1649141344976/work/Objects/listobject.c:138: bad argument to internal function

When i run the python-igraph leiden implementation like so...

obs_name = f"leiden {leiden_suffix}"
g = sc._utils.get_igraph_from_adjacency(adjacency)
clustering = g.community_leiden(
    objective_function="modularity", 
    resolution_parameter=0.1,            
    weights = 'weight',
    n_iterations=2
    )
adata.obs[obs_name] = (
    pd.Series(
        clustering.membership, 
        dtype='category', 
        index=adata.obs.index
        )
    )

I get the following, similar traceback

UMAP leiden clustering resolution 0.1 commenced 12:27:06.619473
running Leiden clustering
Traceback (most recent call last):
  File "/opt/notebooks/output/../scripts/sc_cluster.py", line 88, in <module>
    obs_name = f"leiden {leiden_suffix}"
  File "/opt/conda/envs/scanpy_py3pt9/lib/python3.9/site-packages/scanpy/tools/_leiden.py", line 129, in leiden
    g = _utils.get_igraph_from_adjacency(adjacency, directed=directed)
  File "/opt/conda/envs/scanpy_py3pt9/lib/python3.9/site-packages/scanpy/_utils/__init__.py", line 219, in get_igraph_from_adjacency
    g.es['weight'] = weights
SystemError: /opt/conda/conda-bld/python-split_1649141344976/work/Objects/listobject.c:138: bad argument to internal function

The script runs completely fine when I subsample the adata with sc.pp.neighbours. So far, i have managed to run 0.5 fraction of the cells (9.25 million cells). On the cluster i am trying to run this on, memory and processing power are not limiting.

Any advice on what might be going wrong here?

Thanks!

mezwick commented 2 years ago

This issue appears due to a memory overflow on the number of edges in my graph. Fixing will probably need to wait till igraph 0.10, which should handle 64-bit integers, and so be able to handle far larger graphs https://github.com/igraph/python-igraph/issues/531#issuecomment-1102792526

vtraag commented 2 years ago

When the C core igraph version 0.10 is released, including a new release of the Python interface building on this new version, i.e. including 64-bit support, I will also update the leidenalg implementation to follow suit. In principle, leidenalg is already working with 64-bit integers, but since it builds on igraph, it is limited by that.

adamgayoso commented 2 years ago

@ivirshup if it's still faster with the same value of n_iterations, then why not make the switch?

ilan-gold commented 10 months ago

Could also look at using https://igraph.org/c/doc/igraph-Layout.html#igraph_layout_umap_compute_weights while we're at it with python-igraph

ilan-gold commented 10 months ago

@ivirshup The code diff is here but I'll explain the logic of the changes.

  1. igraph's implementation does not allow directed graphs; we throw a ValueError if someone tries to do directed + igraph.
  2. igraph's default resolution parameter is 2, so that changed as well. I don't think we should swap it back to -1 for leidenalg.
  3. Of course use_igraph is now True (and a new argument).

I'll look into some larger datasets. Also there is no plotting test from what I see. Should we add that?