Crossposted from Stackoverflow:
If you want to define the number of clusters, you can try the following.
First create node embeddings using any of the available models like the FastRP or node2vec. Next, use the k-means algorithm to cluster nodes based on the embeddings.
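For instance, a minimal sketch of the embedding step (the graph name 'myGraph', the dimension, and the property name are illustrative):
// Compute FastRP embeddings and keep them on the in-memory graph for clustering
CALL gds.fastRP.mutate('myGraph', {
  embeddingDimension: 4,
  mutateProperty: 'embedding'
})
YIELD nodePropertiesWritten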
Thanks for your reply. I ran FastRP and got the embedding (a 4-dimensional vector, since I set embeddingDimension to 4) for each node. How can I use it in K-means? Could you please clarify more?
Hi @samasalam,
To run the k-means algorithm you only need to set it up with a nodeProperty.
For example, let's say you have mutated your graph with the fastRP result saved as "embedding".
You can now run K-means as follows:
CALL gds.alpha.kmeans.mutate('cities', {
  nodeProperty: 'embedding',
  k: 10,
  mutateProperty: 'cluster'
})
YIELD communityDistribution
Now this code will run K-means, partition the nodes into ten communities, add a new property 'cluster' to each node, and also return some statistics about these communities.
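If you then want to read the mutated values back, a minimal sketch (same graph and property names as above):
// Stream the mutated 'cluster' property from the in-memory graph
CALL gds.graph.nodeProperty.stream('cities', 'cluster')
YIELD nodeId, propertyValue AS cluster
RETURN gds.util.asNode(nodeId) AS node, cluster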
For more information, please have a look at https://neo4j.com/docs/graph-data-science/current/algorithms/alpha/kmeans/ or ask us here.
Best regards, Ioannis.
I applied FastRP to create node embeddings for a graph consisting of 6,301 nodes. The embeddings are stored as node properties to be used for clustering with the K-means algorithm. I noticed that although some nodes are close to each other, they are assigned to different clusters. Notes: for FastRP, I set embeddingDimension to 256; for K-means, I set k to 2. For the sub-graph below, the following are the results of clustering:
Many thanks.
Hi @samasalam ,
Thank you for bringing this to our attention. We will have to test the "fastRP+kmeans" combination on this small graph and see whether such behaviour is due to a bug or not. Did you specify any random seed for the fastRP algorithm or K-means?
I can only offer a small hunch at the moment: perhaps 256 dimensions is a little too much for K-means. Perhaps you can get better results with a smaller embedding.
At any rate, we will let you know as soon as possible!
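For instance, a sketch of the same mutate step with a smaller dimension, re-using the names from the 'cities' example above (the exact value is only illustrative):
// Recompute the embeddings with lower dimensionality before re-running k-means
CALL gds.fastRP.mutate('cities', {
  embeddingDimension: 16,
  mutateProperty: 'embedding'
})
YIELD nodePropertiesWritten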
Hello @IoannisPanagiotas
Hi @samasalam again,
I have tried to reproduce the small subgraph you have shared here, since with a small graph it's easier for us to comprehend what's going on internally. I've tried it several times (even with different parameters), but my output always puts "20" in the same community as "11, 13, 17, 21".
Would you mind sharing with me the exact scripts you used to get the output on that particular subgraph? I have had a look at the CSV file, and it doesn't appear to be from that one (additionally, the small subgraph has weights, which are missing in the CSV file).
I can also recommend two options for K-Means in case they lead to better results: numberOfRestarts and the kmeans++ initialSampler. You can modify the example I sent above as follows:
CALL gds.alpha.kmeans.mutate('cities', {
  nodeProperty: 'embedding',
  k: 10,
  mutateProperty: 'cluster',
  numberOfRestarts: 5,
  initialSampler: 'kmeans++'
})
YIELD communityDistribution
On another note, randomSeed is an optional configuration parameter available for gds algorithms that are based on random decisions. In our particular case, fastRP generates random vectors at the start, and K-means can choose different initial clusters.
For example, by setting randomSeed: 99 you make such algorithms take different decisions than they would with randomSeed: 100. This might lead to different output.
By default, randomSeed is set automatically, so the user does not really need to worry, but the option to set it manually is there.
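For example, a sketch of the earlier k-means call with an explicit seed (the value 99 is arbitrary):
// A fixed randomSeed makes the run reproducible; different seeds may give different clusters
CALL gds.alpha.kmeans.mutate('cities', {
  nodeProperty: 'embedding',
  k: 10,
  randomSeed: 99,
  mutateProperty: 'cluster'
})
YIELD communityDistribution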
Hello @IoannisPanagiotas Thanks for your help. I regenerated the graph without weights; the same results occur. For the small sub-graph (25 nodes out of 6,301), all the nodes should logically fall within the same cluster. However, some nodes are assigned to cluster 0 and the remaining nodes to cluster 1. Note: in my case, the graph is an abstraction of a network topology. I want to partition the graph into a number of clusters, and each cluster should be controlled by an SDN controller. So, nearby nodes (nodes that share the same neighbors) should be within the same cluster.
Here are the scripts:
//1. Graph Project
CALL gds.graph.project
(
'myGraph_1',
'node',
'to'
)
//FastRP Write
CALL gds.fastRP.write
(
'myGraph_1',
{
embeddingDimension: 8,
randomSeed:42,
writeProperty: 'fastrp-embedding'
}
)
YIELD nodePropertiesWritten
//FastRP Graph Project
CALL gds.graph.project
(
'myGraph',
{
node:
{
properties: 'fastrp-embedding'
}
},
'*'
)
//K-means Write
CALL gds.alpha.kmeans.write('myGraph',
{
nodeProperty: 'fastrp-embedding',
k: 2,
writeProperty: 'kmeans',
numberOfRestarts: 5,
initialSampler: 'kmeans++'
})
YIELD communityDistribution
As a side note, you needn't use the write mode and re-project a graph for the fastRP step. A much faster way is to use mutate mode and then use the same graph for kmeans. That's especially useful if you don't have any other need for the fastRP embeddings in your database.
Concretely, what I mean is that you can do this:
//1. Graph Project
CALL gds.graph.project
(
'myGraph_1',
'node',
'to'
)
//FastRP Mutate
CALL gds.fastRP.mutate // mutate mode
(
'myGraph_1',
{
embeddingDimension: 8,
randomSeed:42,
mutateProperty: 'fastrp-embedding'
}
)
YIELD nodePropertiesWritten
//K-means Write
CALL gds.alpha.kmeans.write('myGraph_1', // same graph name
{
nodeProperty: 'fastrp-embedding',
k: 2,
writeProperty: 'kmeans',
numberOfRestarts: 5,
initialSampler: 'kmeans++'
})
YIELD communityDistribution
Indeed, the same applies for kmeans; you can use mutate mode there as well if all you need is to inspect the community compositions later. For example:
//K-means Mutate
CALL gds.alpha.kmeans.mutate('myGraph_1',
{
nodeProperty: 'fastrp-embedding',
k: 2,
mutateProperty: 'kmeans',
numberOfRestarts: 5,
initialSampler: 'kmeans++'
})
YIELD communityDistribution
// Inspecting communities
CALL gds.graph.nodeProperty.stream('myGraph_1', 'kmeans')
YIELD nodeId, propertyValue AS kmeans
WITH gds.util.asNode(nodeId) AS node, kmeans
RETURN node, kmeans AS communityId
Hi again @samasalam ,
I have had a look at your graph, which seems to be composed of two weakly connected components. The optimal clustering should probably put these two disjoint components in distinct clusters.
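For reference, a quick way to check this on a projected graph is a WCC stats call:
// Count weakly connected components in the in-memory graph
CALL gds.wcc.stats('myGraph_1')
YIELD componentCount
RETURN componentCount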
I will explain why fastRP+kmeans perhaps did not behave as well as you might have wanted.
FastRP initially assigns random vectors to each node, and the values get updated based on their neighbours to produce the final result. If some nodes in different components get similar enough embeddings, they may end up clustered together, as you've noticed.
As an example, running FastRP on your graph gives some nodes in different components an embedding of [0,...0]. This in turn confuses the K-means algorithm, as it will always cluster such nodes together.
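For instance, a sketch to count such all-zero embeddings (assuming the 'fastrp-embedding' property mutated by your script):
// Stream embeddings and count vectors that are exactly all zeros
CALL gds.graph.nodeProperty.stream('myGraph_1', 'fastrp-embedding')
YIELD nodeId, propertyValue AS embedding
WHERE all(x IN embedding WHERE x = 0.0)
RETURN count(*) AS zeroEmbeddingCount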
We will try to consider some more robust alternatives for fixed-size partitioning based on graph topology, and hopefully we can provide something better suited in the future.
For the moment, I can provide some possible alternatives:
1. You could try an algorithm like Louvain, where a higher maxLevels generally means fewer communities. If you converge early, you might have to tweak the tolerance parameter too (see the tuning sketch after this list).
2. You could consider the Label Propagation algorithm. This again does not allow for a fixed number of communities, but generally speaking, the lower the maxIterations, the more communities the final output could have.
The above might require some fine-tuning on your part to find the ideal configuration.
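As a sketch of the kind of tuning I mean for the first option (parameter values are purely illustrative):
// Check how the community count reacts to maxLevels/tolerance before writing anything
CALL gds.louvain.stats('myGraph_1', {
  maxLevels: 3,
  tolerance: 0.001
})
YIELD communityCount
RETURN communityCount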
There are also some alternatives you could use to potentially get better embeddings:
1. Change the orientation of the to relationship type to UNDIRECTED when projecting the graph (a projection sketch follows the featureProperties example below).
2. Use the featureProperties option. I am thinking you can specify the componentId, as this might end up giving nodes in different components more differing embeddings. Note that to make featureProperties effective, you also need to set propertyRatio to a value greater than 0.
For the last option, you would have to do the following:
CALL gds.wcc.mutate('myGraph_1', { mutateProperty: 'componentId' })
YIELD componentCount
and then
CALL gds.fastRP.mutate
(
'myGraph_1',
{
embeddingDimension: 8,
randomSeed:42,
mutateProperty: 'fastrp-embedding',
featureProperties: ['componentId'],
propertyRatio: 0.5 // you may need to tune this
}
)
YIELD nodePropertiesWritten
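And for the first option (undirected orientation), a minimal projection sketch (the new graph name is illustrative):
// Project the 'to' relationships as UNDIRECTED before running fastRP
CALL gds.graph.project(
  'myGraph_undirected',
  'node',
  { to: { orientation: 'UNDIRECTED' } }
)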
Hope you find one of the above options useful :) Ioannis.
@IoannisPanagiotas Hi Ioannis, many thanks for your detailed reply. About this point: "You could consider the Label Propagation algorithm. This again does not allow for a fixed number of communities, but generally speaking, the lower the maxIterations, the more communities the final output could have."
I tried setting maxIterations to bigger values; the same results were obtained. For a graph consisting of 62,586 nodes, the number of communities is 210. I want about 6 communities for this graph. How can I control the number of communities other than with the maxIterations parameter? I attached the file of 62,586 nodes.
As a next step, can I use k-means after Label Propagation? I stored the community as a node property. However, an error occurred:
Failed to invoke procedure gds.alpha.kmeans.write: Caused by: java.lang.NullPointerException
I think that the community should be converted to float and stored as an array. How can I store it as an array?
Update:
I use this script:
//Label_Propagation
CALL gds.labelPropagation.write
('myGraph_1',
{ maxIterations:15,
writeProperty: 'community'
}
)
YIELD communityCount, ranIterations, didConverge
//Community Set to array of float
MATCH (n:node) SET n.community = [toFloat(n.community)]
//K-means Write
CALL gds.alpha.kmeans.write('myGraph_1', {
nodeProperty: 'community',
k: 2,
writeProperty: 'kmeans'
})
YIELD communityDistribution
Now the community property is stored as a list of floats (one value). However, when running k-means, the same error occurs. (Attached: 62,586.csv)
Hi @samasalam ,
I think the reason you got
Failed to invoke procedure gds.alpha.kmeans.write: Caused by: java.lang.NullPointerException
is because you used .write instead of .mutate for labelPropagation, so kmeans cannot find the property in the projected in-memory graph. If that is the case, we will try to make the algorithm fail more gracefully (i.e., with a more comprehensible error) in such cases in the future.
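To illustrate, a sketch of the mutate variant with your parameters (note that kmeans would additionally need a float-array property, as you observed):
// mutate keeps 'community' on the in-memory graph, where kmeans looks for it
CALL gds.labelPropagation.mutate('myGraph_1', {
  maxIterations: 15,
  mutateProperty: 'community'
})
YIELD communityCount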
For your other question: in general, using k-means on top of labelPropagation is probably not a good idea, as it will just arbitrarily group communities together only because their ids are close (for example, with five communities and k=2, it might produce {1,2} and {3,4,5}) and not for any reason related to the graph structure.
Since Label Propagation alone doesn't seem to work for you, I do suggest you try the alternatives I've suggested above for making fastRP a little more accurate. Perhaps adding the community you obtained from labelPropagation as one of the featureProperties might make sense too.
Best,
Hi @samasalam , I think I will close this issue for now due to inactivity (I hope you have figured out a way of doing your task after all :) ). But if you have any other questions about this or another topic, do feel free to contact us again, either via this or another issue :)
Best, Ioannis.
Well, so far I haven't found a solution for clustering the graph. If anyone has an idea, it would be much appreciated.
I have a large-scale network consisting of 62,578 nodes. It represents the network topology of an SDN network (Software Defined Networking).
I want to partition the graph into a number of clusters, each cluster should be controlled by an SDN controller.
I tried to use the k-means algorithm, but it doesn't take the relationships between the nodes into account; it relies only on the nodes and their properties.
Then I tried the similarity algorithm, but it calculates a similarity score between two nodes and creates a new relationship between them holding this value. As a result, I couldn't make use of this in the k-means algorithm.
Louvain and Leiden don't allow fixing the number of clusters in advance. Is there any way to do that with these algorithms?
Suggestions, please. Any idea may be of great help to me. Many thanks.