neo4j / graph-data-science

Source code for the Neo4j Graph Data Science library of graph algorithms.
https://neo4j.com/docs/graph-data-science/current/

Allow adding weights on relationships in the Betweenness Centrality algorithm in the Neo4j Graph Data Science library. #171

Closed: Toolicat closed this issue 1 year ago

Toolicat commented 2 years ago

I am trying to have parallel relationships between nodes counted as weights in the Betweenness Centrality algorithm.

It would be great if it were possible to add weights both in the Cypher projection and in the native projection. The weight should be definable as the number of parallel relationships between two connected nodes, as the degree of a node, or as any other integer or float value.
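For example, if I understand the native projection syntax correctly, deriving the weight from the number of parallel relationships could look roughly like this (the graph, label, and relationship type names are just placeholders):

CALL gds.graph.project(
  'myGraph',                    // name for the in-memory graph (placeholder)
  'Node',                       // project nodes with the :Node label
  {
    REL: {
      properties: {
        weight: {
          property: '*',        // no underlying property; count relationships instead
          aggregation: 'COUNT'  // weight = number of parallel :REL relationships
        }
      }
    }
  }
)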

I considered the GDS influence maximization algorithms as an alternative for solving my task, but I still assume that betweenness centrality with weights will increase performance.

Mats-SX commented 2 years ago

Hello @Toolicat and thanks for reaching out to us!

The betweenness centrality metric for a node n is defined as the number of shortest paths between any two nodes that pass through n. Our implementation does not currently support any weight configuration. How would you like such a weighted algorithm to be defined?
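For reference, the standard definition can be written as

BC(n) = \sum_{s \ne n \ne t} \frac{\sigma_{st}(n)}{\sigma_{st}}

where \sigma_{st} is the number of shortest paths from s to t and \sigma_{st}(n) is the number of those paths that pass through n.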

It doesn't seem straightforward to me to just configure a relationship weight. If we do that, then the accumulated cost of a shortest path between a and b could be x. Let's say that this path passes through nodes n and m; by how much would we increase the betweenness centrality (BC) of n and m? In the unweighted case, we would increase the BC by 1 for both (and for any other node on the path). Would we now instead increase it by x? Or by x / length(path)? I'm not sure if that results in a good definition of the metric.

I am curious as to how you imagine defining a weighted BC, or if you have read papers or other academic definitions that you could share?

Regards Mats

Toolicat commented 2 years ago

Thanks for the quick response! I'm not sure either, but the x / length(path) option looks better.

This work seems closest to a universal solution that would be useful not only to me: https://arxiv.org/pdf/2006.08668.pdf

Mats-SX commented 2 years ago

By the way, I'd like to come back to your statement:

I assume that betweenness centrality with weights will increase performance.

I should note that Betweenness Centrality is a particularly expensive algorithm, running in O(n * m) time for n nodes and m relationships, meaning it does not scale well with increasing graph size. The GDS implementation offers a sampled variant to deal with this, but using sampling will not give exact results.

I'm not sure whether you mean performance as in runtime or as in quality of results, but I think weighted BC risks compromising both aspects: it will likely be either slow or inaccurate, unless your graph is relatively small.

Toolicat commented 2 years ago

I know about the approximation; I mean performance in terms of the quality of results.

In my case, I try to model each community as a node that is connected to other communities through nodes representing people, and then replace the multiple connections between communities (through people) with a single weighted connection. As output, I want to get the most influential communities.

I deliberately skipped this detail in my explanation of the problem, since I think weighted BC will be useful to many developers and there is a general weighted BC algorithm.

AliciaFrame commented 2 years ago

Hi @Toolicat - we've added your request to our backlog. We'll reach out if we need more information about your requirements, or to alert you when we've added support for weights in Betweenness Centrality 🙂

Toolicat commented 2 years ago

Thanks Alicia and Mats, I really appreciate that! I'm a big fan of Neo4j; you are making a great product 🥇

Toolicat commented 2 years ago

Mats, hi again!

In this comment https://github.com/neo4j/graph-data-science/issues/171#issuecomment-1057784002 you say "I'm not sure if that results in a good definition of the metric." Did you mean that if node “n” has weight 30 and node “m” has weight 20, and they lie on a path of length 2, then node “n” gives a score of 15 to “m” while “m” gives only a score of 10 to “n”?

That could ruin the results when local bridges have many direct relationships and accumulate more weight than a global bridge.

Maybe a good approach would be to calculate BC unweighted and afterwards multiply the unweighted BC score by a weight computed as x / length(path).

What would that look like in practice?

Say we have node “n” with weight 30 and 10 shortest paths going through “n”, and node “m” with weight 15 and 20 shortest paths going through “m”, out of a total of 100 shortest paths in the network. In the results we get a score of 3 for both nodes, but for different reasons: if the weight represents frequency, then node “n” is influential because of the high frequency of signal transmission, while node “m” is influential because of its better position in the transmission of the signal.

If I am missing something in the logic or lack knowledge about the low-level computation of the algorithm, I would be grateful for suggestions.

Regards Jack

Mats-SX commented 2 years ago

Hi @Toolicat

I was thinking that the weights were on the relationships. For example in this graph:

(a)-[{w: 1}]->(b),
(a)-[{w: 3}]->(c),
(b)-[{w: 1}]->(c),
(b)-[{w: 2}]->(d),
(c)-[{w: 2}]->(e),
(d)-[{w: 1}]->(e)
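As runnable Cypher, this example graph could be created like so (the label and relationship type names are arbitrary):

CREATE (a:Node {name: 'a'}), (b:Node {name: 'b'}), (c:Node {name: 'c'}),
       (d:Node {name: 'd'}), (e:Node {name: 'e'}),
       (a)-[:REL {w: 1}]->(b),
       (a)-[:REL {w: 3}]->(c),
       (b)-[:REL {w: 1}]->(c),
       (b)-[:REL {w: 2}]->(d),
       (c)-[:REL {w: 2}]->(e),
       (d)-[:REL {w: 1}]->(e)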

Then the unweighted shortest paths would be

a->b->d
a->c->e
b->d->e / b->c->e

resulting in BC scores of

a: 0
b: 1
c: 1.5
d: 0.5
e: 0

but if we use the relationship weights, we would get weighted shortest paths of

a->b->c (cost 2)
a->b->d (cost 3)
a->b->d->e / a->b->c->e (both cost 4)
b->d->e / b->c->e (both cost 3)

and one interpretation would be to just count these for the BC score as in the unweighted case, resulting in

a: 0
b: 3
c: 1
d: 1
e: 0

And I guess from this point of view the metric remains defined exactly the same, so definitely well defined.

I didn't think this through thoroughly when I wrote my last reply. What I was thinking was that the accumulated cost of the shortest path would be taken into account for the BC score. This comes from the observation that each unweighted shortest path contributes a mass of (length - 1) / p to the BC computation, where p is the number of shortest paths between the same endpoints and length - 1 is the number of intermediate nodes on the path. That is, the three source-target pairs I listed above for the unweighted case each contribute a total mass of 1, and indeed the sum of the BC scores is three (1 + 1.5 + 0.5).

As it turns out, that fact stays the same in the weighted case as well. I thought that perhaps the actual cost of the path, which in the weighted case is generally not equal to the length, would need to be taken into account. If that is not the idea, then my comment about a poor definition is not so helpful.


Anyway, as I've learned from my coworkers, the problem for the weighted BC is not so much the definition but the fact that computing weighted shortest paths between all pairs of nodes is very computationally expensive. So the resulting algorithm will not scale very well. But it still is in the cards for us to implement in the future, so stay tuned!

Toolicat commented 2 years ago

I'm still not sure that I understand the algorithm correctly; I need time to digest that info. Anyway, thanks for the explanation. You and Alicia are my personal rock stars :-)

I will try to keep in touch, but I am from Ukraine and may disappear from communication

pcofoche commented 2 years ago

Hi all,

I am new to Neo4j and would like to join Jack [Toolicat] in thanking Alicia, Mats, and the rest of the Neo4j team for the great product they are creating.

Of course, I found this thread because I had the same request in mind for a different use case. So, thanks to Jack for expressing the question in broad terms.

Please permit me to chime in. I am still a rookie, so please correct me if I misstate any fact. I am referring to Mats' last comment, and I quote: “… the problem for the weighted BC is not so much the definition but the fact that computing weighted shortest paths between all pairs of nodes is very computationally expensive. So the resulting algorithm will not scale very well”.

Perhaps the algorithm already exists and is just a few steps away from being implemented as Betweenness Centrality with weighted relationships. All that needs to be done (in my understanding) is to execute the Dijkstra Single-Source Shortest Path algorithm, which already exists for weighted relationships, and then, to obtain Betweenness Centrality on these weighted paths, count the number of times each node appears on the single-source shortest paths of all the other nodes. These counts are essentially a representation of Betweenness Centrality on the weighted paths.

The Betweenness Centrality could even be imagined as a histogram of how many times each node occurs on the Dijkstra single-source shortest paths of all the other nodes.

In other words, what distinguishes Betweenness Centrality in the weighted case from the unweighted case is only how the shortest paths are computed: in the unweighted case, the shortest path between a pair of nodes is the one with the fewest hops, whereas in the weighted case it is the one with the lowest sum of relationship weights along the path.
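As a rough sketch of what I mean, using procedures that already exist (I am assuming a projected graph called 'myGraph' with a 'weight' property and nodes labeled :Node with a 'name' property; also, this streams only one shortest path per node pair, so ties between equally short paths are ignored and the counts are only an approximation):

MATCH (s:Node)                             // run Dijkstra from every node in turn
CALL gds.allShortestPaths.dijkstra.stream('myGraph', {
  sourceNode: s,
  relationshipWeightProperty: 'weight'
})
YIELD nodeIds
UNWIND nodeIds[1..-1] AS intermediate      // drop the source and target endpoints
RETURN gds.util.asNode(intermediate).name AS node, count(*) AS occurrences
ORDER BY occurrences DESC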

I hope I was clear enough. Please correct me if I am wrong in my understanding.

Thank you, Paul

Ahmadshoh commented 1 year ago

Hi everyone, I am also interested in this feature. When will it be available? Regards, Ahmadshoh.

pcofoche commented 1 year ago

The feature exists in other environments; I was able to run weighted Betweenness Centrality in R. So I guess it should be achievable in Neo4j as well. Many thanks, Paul

Toolicat commented 1 year ago

Hi, Mats

You said

Anyway, as I've learned from my coworkers, the problem for the weighted BC is not so much the definition but the fact that computing weighted shortest paths between all pairs of nodes is very computationally expensive. So the resulting algorithm will not scale very well. But it still is in the cards for us to implement in the future, so stay tuned!

But if I collapse paths and count the parallel relationships between a pair of nodes as a weight before running BC, would that decrease the computational cost? Or does that make no sense, since a Cypher projection can do the same thing as collapsing paths?

jjaderberg commented 1 year ago

Hi all :wave: We have added support for relationship weights in Betweenness Centrality. You can specify a relationshipWeightProperty, which is expected to already exist in your projected graph. I.e., the algorithm will not collapse paths or aggregate parallel relationships for you; for that, specify an aggregation when projecting the graph, or use Collapse Path (which also received an upgrade with support for more complex "path templates").

The weighted BC implementation uses Dijkstra to compute the shortest paths, which means it is defined for positive relationship weights.

Weighted BC will be in the upcoming 2.2 release. You can have a look at the preview documentation and try it out. Thanks for your input, and let us know what you think!
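For example, a minimal call could look like this ('myGraph' and 'weight' are placeholders for your own graph name and relationship property, and the nodes are assumed to have a 'name' property):

CALL gds.betweenness.stream('myGraph', {
  relationshipWeightProperty: 'weight'  // must already exist on the projected relationships
})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC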

Toolicat commented 1 year ago

Hi, jjaderberg!

Thank you so much

As for this

The weighted BC implementation uses Dijkstra to compute the shortest paths, which means it is defined for positive relationship weights.

Does it mean that the higher the value of the weight property, the faster the path? Or does it mean that a path with a higher value is more expensive?

jjaderberg commented 1 year ago

Does it mean that the higher the value of the weight property, the faster the path? Or does it mean that a path with a higher value is more expensive?

Higher weight means the path is more expensive (longer).

p1=(a1)-[{weight: 1.0}]->(b)
p2=(a2)-[{weight: 0.5}]->(b)

Traversing the relationship a1->b, with weight 1.0, is more expensive than traversing a2->b, with weight 0.5. Path p2 is shorter than p1 because the total weight on the path is lower.

Toolicat commented 1 year ago

Thank you! The algorithm gives accurate results; so far I have only tested it on a small graph, and the speed was good.

I think this should be added to the documentation:

Higher weight means the path is more expensive (longer).