vtraag / leidenalg

Implementation of the Leiden algorithm for various quality functions to be used with igraph in Python.
GNU General Public License v3.0

Is it possible to run the Leiden algorithm with a sparse matrix as input for the graph? #108

Closed ChongruiYang closed 1 year ago

ChongruiYang commented 1 year ago

Hello, thank you for your excellent Leiden package. I have a 10 million × 10 million matrix in which many entries are 0 or below some threshold. Is there a suitable way to generate a clusterable graph from it quickly, without using a huge amount of RAM (e.g. via a sparse matrix)? Also, for a graph of this size, would the Leiden algorithm produce a very large number of clusters? (I don't want to end up with as many as 1 million clusters.)

I'm not sure whether stopping early would be a good idea (the algorithm would not converge). In other words, if I want the number of clusters to end up roughly around a fixed target, how should I adjust the parameters, or what else should I do?

Thank you for your package and work!

SamGG commented 1 year ago

Hi, I guess that you have a dataset of 10 M elements and you want to relate them. To my knowledge, this is typically carried out by computing the k nearest neighbors, k being between 50 and 150, instead of computing the whole matrix, even if such a sparse matrix could be stored efficiently in RAM. Once you have the kNN matrix, you can build a graph and then apply the Leiden algorithm. See for example the Phenograph algorithm, or a faster version:
https://dpeerlab.github.io/dpeerlab-website/phenograph.html
https://github.com/sararselitsky/FastPG
Hope this helps.

ChongruiYang commented 1 year ago

> Hi, I guess that you have a dataset of 10 M elements and you want to relate them. To my knowledge, this is typically carried out by computing the k nearest neighbors, k being between 50 and 150, instead of computing the whole matrix, even if such a sparse matrix could be stored efficiently in RAM. Once you have the kNN matrix, you can build a graph and then apply the Leiden algorithm. See for example the Phenograph algorithm, or a faster version:
> https://dpeerlab.github.io/dpeerlab-website/phenograph.html
> https://github.com/sararselitsky/FastPG
> Hope this helps.

Thank you, this may indeed be a good approach. If weight information is available, does keeping a fixed number of neighbors per point have an adverse effect on points that have many very heavily weighted connections? Is it OK to keep a fixed number of neighbors for each point when the connected edges have different weights? (I'm a little unsure how much information would be lost with this approach.)

SamGG commented 1 year ago

I can't answer that in general. The approach works and is widely used in some fields, i.e. for certain dataset types, but you would have to try it on your specific case to be sure. It may also depend on the dimensionality of the original space. The point is that there are very fast algorithms for building a kNN graph, which makes these approaches practical and achievable. Warning: I am not a member of the development team, just a user.

ChongruiYang commented 1 year ago

Thank you for your answer, I have started to do so.

jaminmei commented 1 year ago

Hi @SamGG, I have a similar problem to this issue. I'm a finance researcher, and I need to construct a network from approximately 1 M elements. RAM is the key obstacle, so I am considering methods to sparsify my network. kNN seems a good way to solve my problem, but I'm afraid that a fixed parameter k may discard some important information from the network. Moreover, I cannot set the value of k directly without any criterion.

SamGG commented 1 year ago

Hi @jaminmei. If you already have some known information that could be used to check whether the applied method makes sense, try a dimensionality reduction such as UMAP (or t-SNE). These methods consider a fixed number of nearest neighbors. That nearest-neighbor information can then be used to build a graph on which Leiden is applied. Try it. If it works, great. If it doesn't, you can go another way, knowing that a fixed k for the nearest neighbors does not work. Have a nice day.

vtraag commented 1 year ago

Hi @jaminmei and @ChongruiYang, thanks for your questions, and @SamGG, thanks for replying! I just mentioned elsewhere that I'm the only person watching this issue tracker, but I'm happy to find out that's not the case!

> is there a suitable way for them to generate classifiable graph data faster without taking up a huge amount of RAM (like a sparse matrix)

Yes, you can create a new igraph.Graph from a scipy.sparse matrix. You of course first have to get your data into a sparse matrix format.

This might require other approaches, like the kNN that @SamGG proposed. I don't know offhand what the best way is to compute a kNN graph in Python, though. However, this is perhaps not always necessary: the most recent igraph release (0.10.x) supports as many edges as your RAM can hold (up to $2^{62}$), so you may be able to use the entire graph, depending on the size of your dataset.

At any rate, once you have an igraph.Graph object, you can run leidenalg on it without any problems. There is also a (somewhat less feature-rich) implementation of the Leiden algorithm in igraph itself, which should run faster than this leidenalg package.

> As well, for a graph of this size, would the Leiden algorithm get quite a lot of clusters? (I don't really want to get as many classifications as 1 million.)

This depends on how you cluster your graph. If you use ModularityVertexPartition, you are more likely to get fewer clusters, partly due to the resolution limit of modularity.

If you use CPMVertexPartition, you may get fewer or more clusters, depending on the resolution parameter you choose. Hence, this is a choice you need to make, picking whatever seems to serve your purpose best.

SamGG commented 1 year ago

Hi! @vtraag there are 15 persons watching this repo, a large audience IMHO. Why? Because you are providing valuable work. Thanks to you.

vtraag commented 1 year ago

> Hi! @vtraag there are 15 persons watching this repo, a large audience IMHO.

Great, I didn't actually realise this!

> Because you are providing valuable work. Thanks to you.

Thanks for the compliment, good to hear it's appreciated!

ChongruiYang commented 1 year ago

Hello, thank you @SamGG @vtraag so much for your patience. We are trying the sparse-matrix approach, and I'm also happy to see that we can use Leiden directly on the whole graph! I still have a question. My original data consists of 30,000,000 high-dimensional vectors W_1 to W_30000000 produced by a word2vec model. My current idea is to compute the cosine similarity between all pairs of vectors to form a graph, and then use the Leiden algorithm for community detection. However, some colleagues told me that a similar classification might be achieved by applying methods such as k-means directly to the vectors. I understand that the Leiden algorithm performs very well among graph clustering algorithms, so I would like to ask whether constructing a graph from the relationships between vectors and using Leiden would be better than applying methods such as k-means directly. I know this is a difficult question to answer, but perhaps you have some prior experience? Thank you so much for your kind reply!

SamGG commented 1 year ago

Hi! I think each dataset has its own specificities, so I can't answer in general; I can only give you feedback from my own experience. On my type of datasets, k-means works well when the classes are balanced (no class has fewer elements than 1/100 of the average class size). The main point is that k-means requires the number of clusters to be set before running; Leiden has parameters, but we don't have to set the number of clusters. Alternatively, there are other approaches such as SOM (GigaSOM.jl), but I believe you should ask your question on StackExchange, as it is starting to deviate from an issue about the Leiden algorithm. Best.

ChongruiYang commented 1 year ago

> Hi! I think each dataset has its own specificities, so I can't answer in general; I can only give you feedback from my own experience. On my type of datasets, k-means works well when the classes are balanced (no class has fewer elements than 1/100 of the average class size). The main point is that k-means requires the number of clusters to be set before running; Leiden has parameters, but we don't have to set the number of clusters. Alternatively, there are other approaches such as SOM (GigaSOM.jl), but I believe you should ask your question on StackExchange, as it is starting to deviate from an issue about the Leiden algorithm. Best.

Thank you very much for your reply. I intend to systematically try three approaches on a large sample:

1) normalised vectors → graph → Leiden community detection;
2) normalised vectors (cosine similarity) → k-means (after dimensionality reduction such as PCA);
3) normalised vectors (cosine similarity) → SOM (I'm not familiar with Julia, so I used the Python package "neupy" to switch to cosine distance in a Kohonen SOM).

Here is the feedback:

In our attempts on small samples, the Leiden method produced more clusters than k-means and SOM, which use Euclidean distances (this tendency is something I initially wanted to avoid, as it is one of the differences between Leiden and Louvain). However, when I inspected the classification results, I found that the Leiden algorithm indeed achieved better clustering than the other two. I believe that for text-vector clustering, the Leiden algorithm is well worth trying! Thanks for the suggestion and the excellent Leiden package! I will continue to expand the sample size and keep trying.

vtraag commented 1 year ago

Great, good to hear it gives some good results for your use case!

One suggestion: you might be interested in exploring CPM instead of modularity (which I assume you are using now). Especially with something like similarity matrices, I believe CPM has a much more straightforward interpretation than modularity and may give more reasonable results.