memgraph / mage

MAGE - Memgraph Advanced Graph Extensions :crystal_ball:
Apache License 2.0
260 stars 27 forks source link

[BUG] Pagerank Online doesn't update correctly when the graph is changed #510

Open pippellia-btc opened 2 months ago

pippellia-btc commented 2 months ago

Describe the bug There are various problems with the implementation of pagerank_online (https://memgraph.com/docs/advanced-algorithms/available-algorithms/pagerank_online) :

  1. (small) the parameter epsilon isn't coherent with the non-online version of pagerank (which uses d, the dampening). d is supposed to be 1 - epsilon, but the default values don't match. My suggestion is to use either d or epsilon in both algos.

  2. (medium) The CalculatePagerank function (see here ) takes the vector of all the visits, then divides it by a constant and then normalizes it. This doesn't make any sense, since normalizing right away would be faster and give the same result.

  3. (urgent) The algorithm simply doesn't return the correct values when the graph gets updates, as shown in the next section

To Reproduce Steps to reproduce the behavior, starting from an empty database:

  1. Create the graph CREATE (a:Person {name: 'A'}), (b:Person {name: 'B'}), (c:Person {name: 'C'}), (a)-[:FOLLOWS]->(b), (b)-[:FOLLOWS]->(c), (c)-[:FOLLOWS]->(a);

  2. Set pagerank_online CALL pagerank_online.set(1000, 0.15) YIELD node, rank;

  3. create a trigger so pagerank_online is updated when the graph changes CREATE TRIGGER pagerank_trigger BEFORE COMMIT EXECUTE CALL pagerank_online.update(createdVertices, createdEdges, deletedVertices, deletedEdges) YIELD node, rank

  4. Update the graph MATCH (c:Person {name: 'C'}) CREATE (d:Person {name: 'D'}) CREATE (c)-[:FOLLOWS]->(d);

  5. Get new pagerank values: CALL pagerank_online.get() YIELD node, rank RETURN node, rank;

  6. Compare the values with other implementations of the pagerank algorithm:

    
    import networkx as nx

Create an empty directed graph

G = nx.DiGraph()

Add nodes to the graph

nodes = ['A', 'B', 'C', 'D'] G.add_nodes_from(nodes)

Add edges to the graph

edges = [('A','B'), ('B','C'), ('C','A'), ('C','D')] G.add_edges_from(edges)

alpha = 0.85 print(nx.pagerank(G, alpha))



or use this simple website tool: http://computerscience.chemeketa.edu/cs160Reader/_static/pageRankApp/index.html

**Expected behavior**
It should give the correct results

**Video**
Here is a link to a video where I comment these issues and provide additional context.
It's on your Discord server: https://discord.com/channels/842007348272169002/852201290880385044/1280112013921619999
katarinasupe commented 2 months ago

Hi @pippellia-btc, thank you for opening the issue šŸ™

For concerns 1 and 2, I'll let the engineers dive deeper once the issue becomes a priority.

Regarding 3, the fact that the values are not the same might be related to the fact that the online algorithm is only an approximation of PageRank to make it as fast as possible. Still, it carries the same informationā€”the likelihood of a random walk ending in a particular vertex. Does that make sense?

Does this block your further work with Memgraph?

pippellia-btc commented 2 months ago

Hey katarina, thanks for the fast reply.

Regarding 3, the fact that the values are not the same might be related to the fact that the online algorithm is only an approximation of PageRank to make it as fast as possible. Still, it carries the same informationā€”the likelihood of a random walk ending in a particular vertex. Does that make sense?

No, that's not the reason. As I show in the video, when one re-set it with CALL pagerank_online.set(1000, 0.15) YIELD node, rank; the error is small (<1%).

When I change the graph the error is order of magnitudes higher, like ~50%. In the paper it is shown that the error should always be quite small with that many random walks (1000 in the example), so it must be a bug in the implementation.

Does this block your further work with Memgraph?

Yes, because I'll have to rewrite most of the code myself for the use-case I have in mind.

katarinasupe commented 2 months ago

Thank you for the further explanation and a really great video, @pippellia-btc. Just one small question - what happens if you set epsilon to default 1.0 or 0.85 instead of 0.15? I tried that and it seems that the values are then close to what you expected?

EDIT: Oh sorry, that might be related to me resetting, like you mentioned at the end of the video.

katarinasupe commented 2 months ago

I agree with you that it would be better if it's consistent, having just d in both pagerank and pagerank online. Not sure why there was a choice to implement it differently, but probably a different dev worked on each. Still, not sure if it's good to change API and a priority is definitely to have it working correctly. I read on Discord that you tried changing half epsilon and that didn't work. In case you have some other idea on how to fix it, feel free to open a PR since MAGE is an open source project. I would be really happy to see outside contributions šŸ˜„