Closed tomasonjo closed 6 years ago
I added a unit test using the HugeGraph and its deduplication logic. These are the results:
n.id | centrality |
---|---|
Alice | 0,556 |
Michael | 1,000 |
Karin | 0,625 |
Chris | 0,714 |
Will | 0,714 |
Mark | 0,625 |
@tomasonjo it's even not displaying this behaviour if 'Alice's is in a different component:
CREATE (alice:Person{id:"Alice"}),
(michael:Person{id:"Michael"}),
(karin:Person{id:"Karin"}),
(chris:Person{id:"Chris"}),
(will:Person{id:"Will"}),
(mark:Person{id:"Mark"})
CREATE (michael)<-[:KNOWS]-(karin),
(michael)-[:KNOWS]->(chris),
(will)-[:KNOWS]->(michael),
(mark)<-[:KNOWS]-(michael),
(mark)-[:KNOWS]->(will),
(will)-[:KNOWS]->(chris),
(chris)-[:KNOWS]->(karin);
CALL algo.closeness.stream('Person', 'KNOWS') YIELD nodeId, centrality
match (n:Person) WHERE id(n) = nodeId
RETURN n.id,centrality
order by centrality desc limit 20;
╒═════════╤══════════════════╕
│"n.id" │"centrality" │
╞═════════╪══════════════════╡
│"Michael"│0.8333333333333334│
├─────────┼──────────────────┤
│"Chris" │0.8333333333333334│
├─────────┼──────────────────┤
│"Karin" │0.625 │
├─────────┼──────────────────┤
│"Mark" │0.625 │
├─────────┼──────────────────┤
│"Will" │0.5 │
├─────────┼──────────────────┤
│"Alice" │0 │
└─────────┴──────────────────┘
Just for fun I tried out this same graph using the Python networkx library. This considers all graphs to be undirected so to compare results we need to have each relationship going both ways.
If I do it where Alice knows Michael it gives these results:
G = nx.Graph()
relationships = [
("Karin", "Michael"),
("Michael", "Chris"),
("Will", "Michael"),
("Michael", "Mark"),
("Mark", "Will"),
("Will", "Chris"),
("Chris", "Karin"),
("Alice", "Michael")
]
G.add_edges_from(relationships)
nx.closeness_centrality(G)
{'Karin': 0.625, 'Michael': 1.0, 'Chris': 0.7142857142857143, 'Will': 0.7142857142857143, 'Mark': 0.625, 'Alice': 0.5555555555555556}
And our corresponding graph:
CREATE (alice:Person{id:"Alice"}),
(michael:Person{id:"Michael"}),
(karin:Person{id:"Karin"}),
(chris:Person{id:"Chris"}),
(will:Person{id:"Will"}),
(mark:Person{id:"Mark"})
CREATE
(michael)<-[:KNOWS]-(karin),
(michael)-[:KNOWS]->(chris),
(will)-[:KNOWS]->(michael),
(mark)<-[:KNOWS]-(michael),
(mark)-[:KNOWS]->(will),
(alice)-[:KNOWS]->(michael),
(will)-[:KNOWS]->(chris),
(chris)-[:KNOWS]->(karin),
(michael)-[:KNOWS]->(karin),
(michael)<-[:KNOWS]-(chris),
(will)<-[:KNOWS]-(michael),
(mark)-[:KNOWS]->(michael),
(mark)<-[:KNOWS]-(will),
(alice)<-[:KNOWS]-(michael),
(will)<-[:KNOWS]-(chris),
(chris)<-[:KNOWS]-(karin);
CALL algo.closeness.stream('Person', 'KNOWS') YIELD nodeId, centrality
match (n:Person) WHERE id(n) = nodeId
RETURN n.id,centrality
order by centrality desc limit 20;
╒═════════╤══════════════════╕
│"n.id" │"centrality" │
╞═════════╪══════════════════╡
│"Michael"│1 │
├─────────┼──────────────────┤
│"Chris" │0.7142857142857143│
├─────────┼──────────────────┤
│"Will" │0.7142857142857143│
├─────────┼──────────────────┤
│"Karin" │0.625 │
├─────────┼──────────────────┤
│"Mark" │0.625 │
├─────────┼──────────────────┤
│"Alice" │0.5555555555555556│
└─────────┴──────────────────┘
after some search i found this article talking about centrality in directed networks or how they call it prestige:
http://mrvar.fdv.uni-lj.si/sola/info4/uvod/part4.pdf -> page 6
talking about closeness centrality:
If network is not strongly connected, we take only reachable vertices into account, but we weight the result with number of reachable vertices.
@tomasonjo I put another node into the smaller connected component and now I'm really confused by the output!
CREATE (alice:Person{id:"Alice"}),
(michael:Person{id:"Michael"}),
(karin:Person{id:"Karin"}),
(chris:Person{id:"Chris"}),
(will:Person{id:"Will"}),
(mark:Person{id:"Mark"}),
(dave:Person{id:"Dave"})
CREATE
(michael)<-[:KNOWS]-(karin),
(michael)-[:KNOWS]->(chris),
(will)-[:KNOWS]->(michael),
(mark)<-[:KNOWS]-(michael),
(mark)-[:KNOWS]->(will),
(alice)-[:KNOWS]->(dave),
(will)-[:KNOWS]->(chris),
(chris)-[:KNOWS]->(karin),
(michael)-[:KNOWS]->(karin),
(michael)<-[:KNOWS]-(chris),
(will)<-[:KNOWS]-(michael),
(mark)-[:KNOWS]->(michael),
(mark)<-[:KNOWS]-(will),
(alice)<-[:KNOWS]-(dave),
(will)<-[:KNOWS]-(chris),
(chris)<-[:KNOWS]-(karin);
CALL algo.closeness.stream('Person', 'KNOWS') YIELD nodeId, centrality
match (n:Person) WHERE id(n) = nodeId
RETURN n.id,centrality
order by centrality desc limit 20;
╒═════════╤════════════╕
│"n.id" │"centrality"│
╞═════════╪════════════╡
│"Alice" │6 │
├─────────┼────────────┤
│"Dave" │6 │
├─────────┼────────────┤
│"Michael"│1.5 │
├─────────┼────────────┤
│"Chris" │1.2 │
├─────────┼────────────┤
│"Will" │1.2 │
├─────────┼────────────┤
│"Karin" │1 │
├─────────┼────────────┤
│"Mark" │1 │
└─────────┴────────────┘
How do Alice and Dave get scores of 6?!
I have noticed this behaviour in https://github.com/neo4j-contrib/neo4j-graph-algorithms/issues/546, I guess we have again two options how to solve this.
First one, could be frustrating for the user:
The Closeness Centrality algorithm is very sensitive to a single large distance or missing relationship - the closeness centrality for a node is 0 if there is just one other node it can't reach.
If there's one node that no other nodes can reach then every node will have a closeness centrality of 0.
Second option is to run algorithm treating each connected component as a separate graph, so Alice and David would have closeness centrality 1. This would mean following the logic from http://mrvar.fdv.uni-lj.si/sola/info4/uvod/part4.pdf
Above can solve undirected closeness. In a directed network we will almost always deal with unreachable nodes, so first option is not really an option, but the second option might also have its flaws in a less connected network.
This is what networkx returns for the undirected 2 connected component graph from above:
# 2 connected components
G = nx.Graph()
relationships = [
("Karin", "Michael"),
("Michael", "Chris"),
("Will", "Michael"),
("Michael", "Mark"),
("Mark", "Will"),
("Alice", "Dave"),
("Will", "Chris"),
("Chris", "Karin")
]
G.add_edges_from(relationships)
sorted(nx.closeness_centrality(G).items(), key=lambda x: x[1] *-1)
[('Michael', 0.6666666666666666), ('Chris', 0.5333333333333333), ('Will', 0.5333333333333333), ('Karin', 0.4444444444444444), ('Mark', 0.4444444444444444), ('Alice', 0.16666666666666666), ('Dave', 0.16666666666666666)]
Interestingly the score it gives to Dave and Alice is 1/6, which is score / (#nodes - 1). I'm gonna debug our code a bit to see how it came up with the score of 6.
Ok I see how we get the 6. The calculation is:
nodes -1 / farness
and it's considering the farness of 'Dave' and 'Alice' to be a total of 1 when you could argue that it's actually infinity since they can't reach 'Michael', 'Mark', 'Chris', 'Karin', or 'Will'
I wanted to try out @mneedham statement.
Create graph:
All nodes are reachable between each other except that other nodes cannot reach Alice
run closeness:
Results:
Our algorithm actually return 0 for a node that can reach all others but cannot be reached by others.