vtraag / leidenalg

Implementation of the Leiden algorithm for various quality functions to be used with igraph in Python.
GNU General Public License v3.0
596 stars · 78 forks

Speed issue and multiple cores #39

Closed · YubinXie closed this issue 4 years ago

YubinXie commented 4 years ago

Hi Leiden team,

Thank you so much for your great work. I am using Leiden to cluster 720K points with a 32-dimensional input. It has been running for 24 hours and is still going. I do notice that after a certain point my CPU usage is only 100% (a single core) instead of multiple cores. I was wondering if this is normal, and if leidenalg has multi-core support. Thank you again!

vtraag commented 4 years ago

The Leiden algorithm is not run in parallel. Such a network is reasonably small (at least in terms of the number of nodes), so I would have expected it to finish in a few minutes at most.

I am not sure what you mean by 32-dim input?

There is probably some other issue. What version of leidenalg are you running? Can you share the network you are running it on?

YubinXie commented 4 years ago

> The Leiden algorithm is not run in parallel. Such a network is reasonably small (at least in terms of the number of nodes), so I would have expected it to finish in a few minutes at most.
>
> I am not sure what you mean by 32-dim input?
>
> There is probably some other issue. What version of leidenalg are you running? Can you share the network you are running it on?

Oh, really?! It would be interesting to know if it can be sped up. My input data is a 720K x 32 array: 720K nodes with 32 features each. I first run KNN with k=30 to build the graph, then use leidenalg to do the clustering. All of this is done through the scanpy package (a wrapper). I will upload my 720K x 32 array to Google Drive soon if that is helpful!
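For reference, the KNN-graph construction step described above can be sketched with scipy alone. This is a hedged illustration, not scanpy's actual implementation: the feature array here is a small random stand-in for the real 720K x 32 data, and the leidenalg clustering step is left as a comment since it requires the python-igraph and leidenalg packages.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial import cKDTree

# Hypothetical small stand-in for the 720K x 32 feature array.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 32))

k = 30  # as in the issue: KNN with k=30

# Build the KNN graph: query k+1 neighbours, since the nearest
# neighbour of each point is the point itself (column 0).
tree = cKDTree(features)
_, idx = tree.query(features, k=k + 1)

n = features.shape[0]
rows = np.repeat(np.arange(n), k)
cols = idx[:, 1:].ravel()  # drop the self-neighbour in column 0
adj = csr_matrix((np.ones(n * k), (rows, cols)), shape=(n, n))

print(adj.shape, adj.nnz)  # (1000, 1000) 30000

# Sketch of the clustering step (requires python-igraph + leidenalg):
# import igraph as ig, leidenalg as la
# g = ig.Graph(n=n, edges=list(zip(*adj.nonzero())))
# part = la.find_partition(g, la.ModularityVertexPartition)
```

The adjacency matrix is directed (each node points to its 30 nearest neighbours); scanpy additionally converts this into symmetric weighted "connectivities" before clustering.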

vtraag commented 4 years ago

Parallelization is a tricky area, and some speedup is possible, but it remains challenging.

I can't work with the 720K x 32 array immediately, so please upload the resulting KNN graph instead.

YubinXie commented 4 years ago

> Parallelization is a tricky area, and some speedup is possible, but it remains challenging.
>
> I can't work with the 720K x 32 array immediately, so please upload the resulting KNN graph instead.

Got it. If a 720K-node graph can be clustered in minutes as you mentioned, that is fast enough. I will upload the KNN graph soon. Thank you so much for your help.

YubinXie commented 4 years ago

Hi, I saved the connectivities as a sparse matrix here: https://drive.google.com/file/d/1u7QPHggmERvI1lsaX9ZrWRJMXDiGvMzA/view?usp=sharing

I don't work with the connectivity matrix directly, so let me know if there is an issue. Thanks!
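A hedged sketch of how such an .npz sparse connectivity matrix could be loaded and turned into an undirected weighted edge list: the file name and the small random matrix below are stand-ins for the shared file, and the leidenalg call itself is left as a comment since it needs python-igraph and leidenalg installed.

```python
import os
import tempfile

from scipy import sparse

# Hypothetical small connectivity matrix standing in for the shared file;
# scanpy stores KNN connectivities as a scipy sparse matrix in .npz format.
conn = sparse.random(50, 50, density=0.1, format="csr", random_state=0)
path = os.path.join(tempfile.mkdtemp(), "connectivities.npz")
sparse.save_npz(path, conn)

# Load, symmetrize (connectivities can be asymmetric), and extract an
# undirected weighted edge list.
m = sparse.load_npz(path)
sym = m.maximum(m.T).tocoo()
mask = sym.row < sym.col          # upper triangle only: one entry per edge
edges = list(zip(sym.row[mask], sym.col[mask]))
weights = sym.data[mask]

print(len(edges), len(weights))

# Sketch of the clustering step (requires python-igraph + leidenalg):
# import igraph as ig, leidenalg as la
# g = ig.Graph(n=m.shape[0], edges=edges)
# g.es["weight"] = weights
# part = la.find_partition(g, la.ModularityVertexPartition, weights="weight")
```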

vtraag commented 4 years ago

Hi @YubinXie, I wanted to take a closer look at the graph, but I cannot download the file in any way. Perhaps you need to adjust the sharing settings, or perhaps you can share it in another way?

YubinXie commented 4 years ago

The sharing setting was, and still is, 'anyone on the internet can view'. Would you mind trying again?

vtraag commented 4 years ago

Sorry, I just tried again: the "Download" option appears on other files, but not on this one. There might be a specific Google Drive option that disallows "viewers" from downloading a file. Alternatively, you could simply attach the file here on GitHub; I believe that should also work.

YubinXie commented 4 years ago

Now it should work:

https://drive.google.com/file/d/1zM2hTOiPGOmj284accCghvDtRfCtuF5L/view?usp=sharing

Previously it was an npz file, which likely caused the problem. I zipped it. Sorry for the inconvenience.

vtraag commented 4 years ago

@YubinXie, I just tried to replicate the problem, but community detection finishes in a matter of minutes on my laptop with your dataset. There was a speed issue with version 0.8.0 (see issue #35). Were you perhaps using that version? Could you retry with version 0.8.1?
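One way to check which leidenalg version is installed and move past the 0.8.0 regression (assuming a pip-based install; a conda environment would use `conda list` / `conda update` instead):

```shell
# Show the installed leidenalg version
pip show leidenalg
# Upgrade to the latest release (0.8.1 at the time of this issue)
pip install --upgrade leidenalg
```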

YubinXie commented 4 years ago

Wow Vincent, you are right! For a 70K-node graph the running time drops from 6 min to 1 min. This is really amazing. Thank you soooo much for optimizing the speed for the huge community behind it! It will greatly reduce the carbon dioxide emissions of our single-cell biomedical research community 😝.

vtraag commented 4 years ago

Well, actually I just messed up: version 0.7.0 was actually faster :frowning_face: The next release should be faster still, thanks to @ragibson (in PR #40).

YubinXie commented 4 years ago

Looking forward!