neo4j / graph-data-science

Source code for the Neo4j Graph Data Science library of graph algorithms.
https://neo4j.com/docs/graph-data-science/current/

Network graph clustering #229

Closed samasalam closed 1 year ago

samasalam commented 1 year ago

I have a large-scale network consisting of 62,578 nodes. It represents the network topology of an SDN network (Software Defined Networking).

Please suggest how I could cluster this network; any idea may be of great help to me. Many thanks.

tomasonjo commented 1 year ago

Crossposted from Stackoverflow:

If you want to define the number of clusters, you can try the following.

First create node embeddings using any of the available models like the FastRP or node2vec. Next, use the k-means algorithm to cluster nodes based on the embeddings.

samasalam commented 1 year ago

> Crossposted from Stackoverflow:
>
> If you want to define the number of clusters, you can try the following.
>
> First create node embeddings using any of the available models like the FastRP or node2vec. Next, use the k-means algorithm to cluster nodes based on the embeddings.

Thanks for your reply. I ran FastRP and got the embedding (a 4-dimensional vector, since I set embeddingDimension to 4) for each node. How can I use it in K-means? Could you please clarify?

IoannisPanagiotas commented 1 year ago

Hi @samasalam,

To run the k-means algorithm you only need to set it up with a nodeProperty.

For example, let's say you have mutated your graph with the fastRP result saved as "embedding".

You can now run K-means as follows:

CALL gds.alpha.kmeans.mutate('cities', {
  nodeProperty: 'embedding',
  k: 10,
  mutateProperty: 'cluster'
})
YIELD communityDistribution

Now this code will run K-means, partition the nodes into ten communities, add a new property 'cluster' to each node, and also return some statistics about these communities.

For more information, please have a look at https://neo4j.com/docs/graph-data-science/current/algorithms/alpha/kmeans/ or ask us here.

Best regards, Ioannis.

samasalam commented 1 year ago

I applied FastRP to create node embeddings for a graph consisting of 6,301 nodes. These embeddings are stored as node properties to be used in clustering using the K-means algorithm. I noticed that although the nodes are nearby to each other, they are assigned to different clusters. Notes: For FastRP, I set the embedding dimensions to 256. For K-means, I set K to 2. For the sub-graph below, the following are the results of clustering:

[screenshots: clustering results for the sub-graph]

samasalam commented 1 year ago

Many thanks.

On Wed, 23 Nov 2022 at 11:32 AM Ioannis Pan @.***> wrote:

Reopened #229 https://github.com/neo4j/graph-data-science/issues/229.


IoannisPanagiotas commented 1 year ago

Hi @samasalam ,

Thank you for bringing this to our attention. We will have to test the "fastRP+kmeans" combination on this small graph and see whether such behaviour is due to a bug. Did you specify any random seed for the fastRP algorithm or K-means?

I can only offer a small hunch at the moment: perhaps 256 dimensions is a little too many for K-means. You might get better results with a smaller embedding.

At any rate, we will let you know as soon as possible!

samasalam commented 1 year ago

> Hi @samasalam ,
>
> Thank you for bringing this to our attention. We will have to test the "fastRP+kmeans" combination on this small graph and see whether such behaviour is due to a bug. Did you specify any random seed for the fastRP algorithm?
>
> I can only offer a small hunch at the moment: perhaps 256 dimensions is a little too many for K-means. You might get better results with a smaller embedding.
>
> At any rate, we will let you know as soon as possible!

Hello @IoannisPanagiotas

IoannisPanagiotas commented 1 year ago

Hi @samasalam again,

I have tried to reproduce the small subgraph you shared here, since a small graph makes it easier for us to comprehend what's going on internally. I've tried it several times (even with different parameters), but my output always puts "20" in the same community as "11, 13, 17, 21".

Would you mind sharing the exact scripts you used to get the output on that particular subgraph? I have had a look at the CSV file, and it doesn't appear to be from that one (additionally, the small subgraph has weights that are missing from the CSV file).

I can also recommend two K-Means configuration options in case they lead to better results.

You can modify the example I sent above as follows:

CALL gds.alpha.kmeans.mutate('cities', {
  nodeProperty: 'embedding',
  k: 10,
  mutateProperty: 'cluster',
  numberOfRestarts: 5,
  initialSampler: 'kmeans++'
})
YIELD communityDistribution

On another note, randomSeed is an optional configuration parameter available for gds algorithms that are based on random decisions. In our particular case, fastRP generates random vectors at the start, and K-means can choose different initial clusters.

For example, by setting randomSeed: 99 you make such algorithms take different decisions than they would with randomSeed: 100. This might lead to different output.

By default, randomSeed is set automatically, so the user does not really need to worry, but the option to set it up manually is there.
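For instance, seeding both steps explicitly makes repeated runs reproducible (the call shapes follow the examples above; the seed value and the graph/property names are just placeholders):

```cypher
CALL gds.fastRP.mutate('myGraph_1', {
  embeddingDimension: 8,
  randomSeed: 42,          // pin the random vector generation
  mutateProperty: 'embedding'
})
YIELD nodePropertiesWritten;

CALL gds.alpha.kmeans.mutate('myGraph_1', {
  nodeProperty: 'embedding',
  k: 2,
  randomSeed: 42,          // pin the initial cluster choice
  mutateProperty: 'cluster'
})
YIELD communityDistribution;
```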

samasalam commented 1 year ago

Hello @IoannisPanagiotas Thanks for your help. I regenerated the graph without weights, and the same results occur. For the small sub-graph (25 nodes out of 6,301), all the nodes should logically fall within the same cluster. However, some nodes are assigned to cluster 0 and the remaining nodes to cluster 1. Note: In my case, the graph is an abstraction of a network topology. I want to partition the graph into a number of clusters, each controlled by an SDN controller. So nearby nodes (nodes that share the same neighbors) should be within the same cluster.

Here are the scripts:

//1. Graph Project

CALL gds.graph.project
(
    'myGraph_1',
    'node',
    'to'
)

//FastRP Write

CALL gds.fastRP.write
(
  'myGraph_1',
  {
    embeddingDimension: 8,
    randomSeed:42,
    writeProperty: 'fastrp-embedding' 
  }
)
YIELD nodePropertiesWritten

//FastRP Graph Project

CALL gds.graph.project
(
    'myGraph',
    {
      node: 
      {
        properties: 'fastrp-embedding'
      }
    },
    '*'
)

//K-means Write

CALL gds.alpha.kmeans.write('myGraph', 
{
  nodeProperty: 'fastrp-embedding',
  k: 2,
  writeProperty: 'kmeans',
  numberOfRestarts: 5,
  initialSampler: 'kmeans++'
})
YIELD communityDistribution

Mats-SX commented 1 year ago

As a side note, you needn't use write mode and re-project a graph for the fastRP step. A much faster way is to use mutate mode and then run kmeans on the same graph. That's especially useful if you don't have any other need for the fastRP embeddings in your database.

Concretely, what I mean is that you can do this:

//1. Graph Project

CALL gds.graph.project
(
    'myGraph_1',
    'node',
    'to'
)

//FastRP Mutate

CALL gds.fastRP.mutate // mutate mode
(
  'myGraph_1',
  {
    embeddingDimension: 8,
    randomSeed:42,
    mutateProperty: 'fastrp-embedding' 
  }
)
YIELD nodePropertiesWritten

//K-means Write

CALL gds.alpha.kmeans.write('myGraph_1', // same graph name
{
  nodeProperty: 'fastrp-embedding',
  k: 2,
  writeProperty: 'kmeans',
  numberOfRestarts: 5,
  initialSampler: 'kmeans++'
})
YIELD communityDistribution

Indeed, the same applies to kmeans; you can use mutate mode there as well if all you need is to inspect the community composition later. For example:

//K-means Mutate

CALL gds.alpha.kmeans.mutate('myGraph_1',
{
  nodeProperty: 'fastrp-embedding',
  k: 2,
  mutateProperty: 'kmeans',
  numberOfRestarts: 5,
  initialSampler: 'kmeans++'
})
YIELD communityDistribution

// Inspecting communities

CALL gds.graph.nodeProperty.stream('myGraph_1', 'kmeans')
YIELD nodeId, propertyValue AS kmeans
WITH gds.util.asNode(nodeId) AS node, kmeans
RETURN node, kmeans AS communityId

IoannisPanagiotas commented 1 year ago

Hi again @samasalam ,

I have had a look at your graph, which seems to be composed of two weakly connected components. The optimal clustering should probably put the two disjoint components in distinct clusters.

I will explain why perhaps fastrp+kmeans did not behave as well as you might have wanted.

FastRP initially assigns random vectors to each node, and the values get updated based on their neighbours to produce a final result. If some nodes in different components get similar enough embeddings, they may be clustered together, as you've noticed.

As an example, running FastRP on your graph gives some nodes in different components an embedding of [0, ..., 0]. This in turn confuses the K-means algorithm, as it will always cluster such nodes together.
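To illustrate outside of GDS why this confuses K-means, here is a toy pure-Python sketch (not the library's implementation): identical embedding vectors are equidistant from every centroid, so no choice of k, seed, or number of restarts can ever separate them.

```python
import random

def kmeans(points, k, iters=20, seed=42):
    """Plain Lloyd's algorithm on tuples of floats (illustrative only)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)

    def nearest(p):
        # index of the closest centroid by squared Euclidean distance
        return min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))

    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p)].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # keep the old centroid if a cluster went empty
                centroids[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return [nearest(p) for p in points]

# Two nodes from *different* components that both got the all-zero embedding:
embeddings = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.2), (0.9, 1.1), (5.0, 4.8), (5.1, 5.0)]
labels = kmeans(embeddings, k=3)
assert labels[0] == labels[1]  # the zero vectors always end up together
```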

We will try to consider some more robust alternatives for fixed-size partitioning based on graph topology, and hopefully we can provide something better suited in the future.

For the moment, I can provide some possible alternatives:

The above might require some fine-tuning on your part to find the ideal configuration.

There are also some alternatives you could use to potentially get better embeddings:

For the last option, you would have to do the following:

CALL gds.wcc.mutate('myGraph_1', { mutateProperty: 'componentId' }) 
YIELD componentCount

and then

CALL gds.fastRP.mutate 
(
  'myGraph_1',
  {
    embeddingDimension: 8,
    randomSeed:42,
    mutateProperty: 'fastrp-embedding',
    featureProperties: ['componentId'],
    propertyRatio: 0.5 // you may need to tune this 
  }
)
YIELD nodePropertiesWritten

Hope you find one of the above options useful :) Ioannis.

samasalam commented 1 year ago

@IoannisPanagiotas Hi Ioannis, many thanks for your detailed reply. About this point: "1. You could consider the Label Propagation algorithm. This again does not allow for a fixed number of communities, but generally speaking, the lower the maxIterations, likely the more communities the final output could have."

I tried setting maxIterations to bigger values; the same results were obtained. For a graph consisting of 62,586 nodes, the number of communities is 210. I want about 6 communities for this graph. How can I control the number of communities other than with the maxIterations parameter? I attached the file of the 62,586 nodes.

As a next step, can I use K-means after Label Propagation? I stored the community as a node property. However, an error occurred: Failed to invoke procedure gds.alpha.kmeans.write: Caused by: java.lang.NullPointerException

I think the community should be converted to float and stored as an array. How can I store it as an array?

Update: I used this script:

//Label_Propagation

CALL gds.labelPropagation.write
('myGraph_1',
 {   maxIterations:15,
     writeProperty: 'community' 
 }
)
YIELD communityCount, ranIterations, didConverge

//Community set to array of float
MATCH (n:node) SET n.community = [toFloat(n.community)]

//K-means Write

CALL gds.alpha.kmeans.write('myGraph_1', {
  nodeProperty: 'community',
  k: 2,
  writeProperty: 'kmeans'
})
YIELD communityDistribution

Now the community property is stored as a list of floats (1 value). However, when running K-means, the same error occurs.

62,586.csv

IoannisPanagiotas commented 1 year ago

Hi @samasalam ,

I think the reason you got Failed to invoke procedure gds.alpha.kmeans.write: Caused by: java.lang.NullPointerException

is that you used ".write" instead of ".mutate" for labelPropagation, so kmeans does not know which property to find. If that is the case, we will try to make the algorithm fail more gracefully (i.e., with a more comprehensible error) in such cases in the future.

For your other question: in general, using kmeans on top of labelPropagation is probably not a good idea, as it will just arbitrarily link communities together because their ids are numerically close (for example, with five communities and k=2, you could get {1,2} and {3,4,5}) and not for any reason related to the graph structure.
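A tiny sketch (plain Python, not GDS internals) of what happens when K-means runs on raw 1-D community ids: Lloyd's iterations group the ids purely by numeric proximity, regardless of what the communities look like in the graph.

```python
# Hypothetical community ids from Label Propagation for five communities, clustered with k=2.
ids = [1.0, 2.0, 3.0, 4.0, 5.0]
c = [min(ids), max(ids)]  # initial centroids at the extreme ids
for _ in range(10):
    # assign each id to the nearest centroid, then recompute the centroids
    groups = [[x for x in ids if abs(x - c[0]) <= abs(x - c[1])],
              [x for x in ids if abs(x - c[0]) > abs(x - c[1])]]
    c = [sum(g) / len(g) for g in groups]
print(groups)  # [[1.0, 2.0, 3.0], [4.0, 5.0]] -- grouped only because the ids are numerically close
```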

Since Label Propagation alone doesn't seem to work for you, I suggest you try the alternatives I've suggested above for making fastRP a little more accurate. Perhaps adding the community you obtained from labelPropagation as one of the featureProperties might make sense too.

Best,

IoannisPanagiotas commented 1 year ago

Hi @samasalam , I think I will close this issue for now due to inactivity (I hope you have figured out a way to do your task after all :) ). But if you have any other questions about this or another topic, feel free to contact us again via this or another issue :)

Best, Ioannis.

samasalam commented 1 year ago

> Hi @samasalam , I think I will close this issue for now due to inactivity (I hope you have figured out a way to do your task after all :) ). But if you have any other questions about this or another topic, feel free to contact us again via this or another issue :)
>
> Best, Ioannis.

Well, so far I haven't found a solution for clustering the graph. If anyone has an idea, it would be much appreciated.