neo4j-contrib / neo4j-graph-algorithms

Efficient Graph Algorithms for Neo4j
https://github.com/neo4j/graph-data-science/
GNU General Public License v3.0
772 stars 194 forks source link

Directed closeness centrality #547

Closed tomasonjo closed 6 years ago

tomasonjo commented 6 years ago

I wanted to try out @mneedham statement.

The Closeness Centrality algorithm is very sensitive to a single large distance or missing relationship - the closeness centrality for a node is 0 if there is just one other node it can't reach. If there's one node that no other nodes can reach then every node will have a closeness centrality of 0.

Create graph:

CREATE (alice:Person{id:"Alice"}),
       (michael:Person{id:"Michael"}),
       (karin:Person{id:"Karin"}),
       (chris:Person{id:"Chris"}),
       (will:Person{id:"Will"}),
       (mark:Person{id:"Mark"})
CREATE (michael)<-[:KNOWS]-(karin),
       (michael)-[:KNOWS]->(chris),
       (will)-[:KNOWS]->(michael),
       (mark)<-[:KNOWS]-(michael),
       (mark)-[:KNOWS]->(will),
       (alice)-[:KNOWS]->(michael),
       (will)-[:KNOWS]->(chris),
       (chris)-[:KNOWS]->(karin);

directed

All nodes are reachable between each other except that other nodes cannot reach Alice

run closeness:

CALL algo.closeness.stream('Person', 'KNOWS') YIELD nodeId, centrality
RETURN nodeId,centrality order by centrality desc limit 20;

Results:

n.id centrality
"Michael" 0.7142857142857143
"Chris" 0.625
"Mark" 0.5
"Karin" 0.45454545454545453
"Will" 0.38461538461538464
"Alice" 0

Our algorithm actually return 0 for a node that can reach all others but cannot be reached by others.

mknblch commented 6 years ago

I added a unit test using the HugeGraph and its deduplication logic. These are the results:

n.id centrality
Alice 0,556
Michael 1,000
Karin 0,625
Chris 0,714
Will 0,714
Mark 0,625
mneedham commented 6 years ago

@tomasonjo it's even not displaying this behaviour if 'Alice's is in a different component:

CREATE (alice:Person{id:"Alice"}),
       (michael:Person{id:"Michael"}),
       (karin:Person{id:"Karin"}),
       (chris:Person{id:"Chris"}),
       (will:Person{id:"Will"}),
       (mark:Person{id:"Mark"})
CREATE (michael)<-[:KNOWS]-(karin),
       (michael)-[:KNOWS]->(chris),
       (will)-[:KNOWS]->(michael),
       (mark)<-[:KNOWS]-(michael),
       (mark)-[:KNOWS]->(will),
       (will)-[:KNOWS]->(chris),
       (chris)-[:KNOWS]->(karin);
CALL algo.closeness.stream('Person', 'KNOWS') YIELD nodeId, centrality
match (n:Person) WHERE id(n) = nodeId
RETURN n.id,centrality 
order by centrality desc limit 20;

╒═════════╤══════════════════╕
│"n.id"   │"centrality"      │
╞═════════╪══════════════════╡
│"Michael"│0.8333333333333334│
├─────────┼──────────────────┤
│"Chris"  │0.8333333333333334│
├─────────┼──────────────────┤
│"Karin"  │0.625             │
├─────────┼──────────────────┤
│"Mark"   │0.625             │
├─────────┼──────────────────┤
│"Will"   │0.5               │
├─────────┼──────────────────┤
│"Alice"  │0                 │
└─────────┴──────────────────┘
mneedham commented 6 years ago

Just for fun I tried out this same graph using the Python networkx library. This considers all graphs to be undirected so to compare results we need to have each relationship going both ways.

If I do it where Alice knows Michael it gives these results:

G = nx.Graph()

relationships = [
    ("Karin", "Michael"),
    ("Michael", "Chris"),
    ("Will", "Michael"),
    ("Michael", "Mark"),
    ("Mark", "Will"),
    ("Will", "Chris"),
    ("Chris", "Karin"),
    ("Alice", "Michael")
]

G.add_edges_from(relationships)

nx.closeness_centrality(G)
{'Karin': 0.625, 'Michael': 1.0, 'Chris': 0.7142857142857143, 'Will': 0.7142857142857143, 'Mark': 0.625, 'Alice': 0.5555555555555556}

And our corresponding graph:

CREATE (alice:Person{id:"Alice"}),
       (michael:Person{id:"Michael"}),
       (karin:Person{id:"Karin"}),
       (chris:Person{id:"Chris"}),
       (will:Person{id:"Will"}),
       (mark:Person{id:"Mark"})
CREATE 
       (michael)<-[:KNOWS]-(karin),
       (michael)-[:KNOWS]->(chris),
       (will)-[:KNOWS]->(michael),
       (mark)<-[:KNOWS]-(michael),
       (mark)-[:KNOWS]->(will),
       (alice)-[:KNOWS]->(michael),
       (will)-[:KNOWS]->(chris),
       (chris)-[:KNOWS]->(karin),

       (michael)-[:KNOWS]->(karin),
       (michael)<-[:KNOWS]-(chris),
       (will)<-[:KNOWS]-(michael),
       (mark)-[:KNOWS]->(michael),
       (mark)<-[:KNOWS]-(will),
       (alice)<-[:KNOWS]-(michael),
       (will)<-[:KNOWS]-(chris),
       (chris)<-[:KNOWS]-(karin);
CALL algo.closeness.stream('Person', 'KNOWS') YIELD nodeId, centrality
match (n:Person) WHERE id(n) = nodeId
RETURN n.id,centrality 
order by centrality desc limit 20;

╒═════════╤══════════════════╕
│"n.id"   │"centrality"      │
╞═════════╪══════════════════╡
│"Michael"│1                 │
├─────────┼──────────────────┤
│"Chris"  │0.7142857142857143│
├─────────┼──────────────────┤
│"Will"   │0.7142857142857143│
├─────────┼──────────────────┤
│"Karin"  │0.625             │
├─────────┼──────────────────┤
│"Mark"   │0.625             │
├─────────┼──────────────────┤
│"Alice"  │0.5555555555555556│
└─────────┴──────────────────┘
tomasonjo commented 6 years ago

after some search i found this article talking about centrality in directed networks or how they call it prestige:

http://mrvar.fdv.uni-lj.si/sola/info4/uvod/part4.pdf -> page 6

talking about closeness centrality:

If network is not strongly connected, we take only reachable vertices into account, but we weight the result with number of reachable vertices.

mneedham commented 6 years ago

@tomasonjo I put another node into the smaller connected component and now I'm really confused by the output!

CREATE (alice:Person{id:"Alice"}),
       (michael:Person{id:"Michael"}),
       (karin:Person{id:"Karin"}),
       (chris:Person{id:"Chris"}),
       (will:Person{id:"Will"}),
       (mark:Person{id:"Mark"}),
       (dave:Person{id:"Dave"})
CREATE 
       (michael)<-[:KNOWS]-(karin),
       (michael)-[:KNOWS]->(chris),
       (will)-[:KNOWS]->(michael),
       (mark)<-[:KNOWS]-(michael),
       (mark)-[:KNOWS]->(will),
       (alice)-[:KNOWS]->(dave),
       (will)-[:KNOWS]->(chris),
       (chris)-[:KNOWS]->(karin),

       (michael)-[:KNOWS]->(karin),
       (michael)<-[:KNOWS]-(chris),
       (will)<-[:KNOWS]-(michael),
       (mark)-[:KNOWS]->(michael),
       (mark)<-[:KNOWS]-(will),
       (alice)<-[:KNOWS]-(dave),
       (will)<-[:KNOWS]-(chris),
       (chris)<-[:KNOWS]-(karin);
CALL algo.closeness.stream('Person', 'KNOWS') YIELD nodeId, centrality
match (n:Person) WHERE id(n) = nodeId
RETURN n.id,centrality 
order by centrality desc limit 20;

╒═════════╤════════════╕
│"n.id"   │"centrality"│
╞═════════╪════════════╡
│"Alice"  │6           │
├─────────┼────────────┤
│"Dave"   │6           │
├─────────┼────────────┤
│"Michael"│1.5         │
├─────────┼────────────┤
│"Chris"  │1.2         │
├─────────┼────────────┤
│"Will"   │1.2         │
├─────────┼────────────┤
│"Karin"  │1           │
├─────────┼────────────┤
│"Mark"   │1           │
└─────────┴────────────┘

How do Alice and Dave get scores of 6?!

2018-01-30_20-56-29

tomasonjo commented 6 years ago

I have noticed this behaviour in https://github.com/neo4j-contrib/neo4j-graph-algorithms/issues/546, I guess we have again two options how to solve this.

First one, could be frustrating for the user:

The Closeness Centrality algorithm is very sensitive to a single large distance or missing relationship - the closeness centrality for a node is 0 if there is just one other node it can't reach.
If there's one node that no other nodes can reach then every node will have a closeness centrality of 0.

Second option is to run algorithm treating each connected component as a separate graph, so Alice and David would have closeness centrality 1. This would mean following the logic from http://mrvar.fdv.uni-lj.si/sola/info4/uvod/part4.pdf

Above can solve undirected closeness. In a directed network we will almost always deal with unreachable nodes, so first option is not really an option, but the second option might also have its flaws in a less connected network.

mneedham commented 6 years ago

This is what networkx returns for the undirected 2 connected component graph from above:

# 2 connected components

G = nx.Graph()

relationships = [
    ("Karin", "Michael"),
    ("Michael", "Chris"),
    ("Will", "Michael"),
    ("Michael", "Mark"),
    ("Mark", "Will"),
    ("Alice", "Dave"),
    ("Will", "Chris"),
    ("Chris", "Karin")
]

G.add_edges_from(relationships)

sorted(nx.closeness_centrality(G).items(), key=lambda x: x[1] *-1)

[('Michael', 0.6666666666666666), ('Chris', 0.5333333333333333), ('Will', 0.5333333333333333), ('Karin', 0.4444444444444444), ('Mark', 0.4444444444444444), ('Alice', 0.16666666666666666), ('Dave', 0.16666666666666666)]

Interestingly the score it gives to Dave and Alice is 1/6, which is score / (#nodes - 1). I'm gonna debug our code a bit to see how it came up with the score of 6.

mneedham commented 6 years ago

Ok I see how we get the 6. The calculation is:

nodes -1 / farness and it's considering the farness of 'Dave' and 'Alice' to be a total of 1 when you could argue that it's actually infinity since they can't reach 'Michael', 'Mark', 'Chris', 'Karin', or 'Will'

tomasonjo commented 6 years ago

closed with https://github.com/neo4j-contrib/neo4j-graph-algorithms/issues/554