neo4j / graph-data-science

Source code for the Neo4j Graph Data Science library of graph algorithms.
https://neo4j.com/docs/graph-data-science/current/

consecutiveIds improvement #170

Closed andyhegedus closed 2 years ago

andyhegedus commented 2 years ago

My use case is the exploration of different clustering patterns, and I am combining GDS with Bloom as a visualization tool.

I would like the consecutiveIds option to assign ids starting at zero for the largest non-null community, increasing as member size decreases. This would allow me to have a consistent color pattern with rule-based coloring within Bloom. Additionally, further analytics would not require a query to the database to get the largest community id; it would always be 0 (zero).
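
To illustrate, here is a rough Cypher sketch of the mapping I have in mind, assuming communities are written to a c property on :X nodes; the rankedC output property is just a placeholder name:

// order community ids by member count, largest first
MATCH (n:X)
  WHERE n.c IS NOT NULL
WITH n.c AS id, count(*) AS size
ORDER BY size DESC
WITH collect(id) AS orderedIds
// assign 0 to the largest community, 1 to the next largest, and so on
UNWIND range(0, size(orderedIds) - 1) AS rank
MATCH (n:X)
  WHERE n.c = orderedIds[rank]
SET n.rankedC = rank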

Mats-SX commented 2 years ago

Hello @andyhegedus and thank you for reaching out to us!

Your request is straightforward, but we are not sure that it generalises very well as a use case. We think it could be possible to design something along the lines of Scale Properties, which remaps properties according to some group metric, but it would merely replace a Cypher query that already solves the problem. From that perspective, we think the operation would not add a lot of user value for the effort required to implement it.

If you have appetite, we would be happy to review and evaluate a contribution, of course.

For reference, it is possible to do it using this Cypher query:

// imagine a graph with :X nodes and a community id written to the c property

// map ids to sizes
MATCH (n:X)
  WHERE n.c IS NOT NULL
WITH n.c AS id, count(*) AS size
WITH collect({id: id, size: size}) AS idSizes
// calculate maximum size
CALL {
  WITH idSizes
  UNWIND idSizes AS idSize
  RETURN max(idSize.size) AS maxSize
}
// re-map relative to the maximum size: the largest community gets 0
UNWIND idSizes AS idSize
MATCH (n:X) 
  WHERE n.c = idSize.id
SET n.newC = maxSize - idSize.size

It may be better to split this into two queries: one for computing the maximum size and one for the re-mapping.
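
As a sketch of that split (assuming the maxSize value returned by the first query is passed to the second as a $maxSize parameter):

// query 1: compute the maximum community size
MATCH (n:X)
  WHERE n.c IS NOT NULL
WITH n.c AS id, count(*) AS size
RETURN max(size) AS maxSize

// query 2: re-map each community relative to the maximum,
// with $maxSize supplied from the first query's result
MATCH (n:X)
  WHERE n.c IS NOT NULL
WITH n.c AS id, count(*) AS size
MATCH (m:X)
  WHERE m.c = id
SET m.newC = $maxSize - size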

A general design principle we try to maintain is not to reimplement functionality that is expressible and feasibly performant in Cypher, and I would argue that this request falls into that category. Please reach out again if you disagree, and we can have another look.

All the best!