Open pippellia-btc opened 2 months ago
Hi @pippellia-btc, thank you for opening the issue š
For concerns 1 and 2, I'll let the engineers dive deeper once the issue becomes a priority.
Regarding 3, the fact that the values are not the same might be related to the fact that the online algorithm is only an approximation of PageRank to make it as fast as possible. Still, it carries the same informationāthe likelihood of a random walk ending in a particular vertex. Does that make sense?
Does this block your further work with Memgraph?
Hey katarina, thanks for the fast reply.
Regarding 3, the fact that the values are not the same might be related to the fact that the online algorithm is only an approximation of PageRank to make it as fast as possible. Still, it carries the same informationāthe likelihood of a random walk ending in a particular vertex. Does that make sense?
No, that's not the reason. As I show in the video, when one re-set it with
CALL pagerank_online.set(1000, 0.15) YIELD node, rank;
the error is small (<1%).
When I change the graph the error is order of magnitudes higher, like ~50%. In the paper it is shown that the error should always be quite small with that many random walks (1000 in the example), so it must be a bug in the implementation.
Does this block your further work with Memgraph?
Yes, because I'll have to rewrite most of the code myself for the use-case I have in mind.
Thank you for the further explanation and a really great video, @pippellia-btc. Just one small question - what happens if you set epsilon to default 1.0 or 0.85 instead of 0.15? I tried that and it seems that the values are then close to what you expected?
EDIT: Oh sorry, that might be related to me resetting, like you mentioned at the end of the video.
I agree with you that it would be better if it's consistent, having just d
in both pagerank and pagerank online. Not sure why there was a choice to implement it differently, but probably a different dev worked on each. Still, not sure if it's good to change API and a priority is definitely to have it working correctly. I read on Discord that you tried changing half epsilon and that didn't work. In case you have some other idea on how to fix it, feel free to open a PR since MAGE is an open source project. I would be really happy to see outside contributions š
Describe the bug There are various problems with the implementation of
pagerank_online
(https://memgraph.com/docs/advanced-algorithms/available-algorithms/pagerank_online) :(small) the parameter
epsilon
isn't coherent with the non-online version of pagerank (which usesd
, the dampening).d
is supposed to be1 - epsilon
, but the default values don't match. My suggestion is to use eitherd
orepsilon
in both algos.(medium) The CalculatePagerank function (see here ) takes the vector of all the visits, then divides it by a constant and then normalizes it. This doesn't make any sense, since normalizing right away would be faster and give the same result.
(urgent) The algorithm simply doesn't return the correct values when the graph gets updates, as shown in the next section
To Reproduce Steps to reproduce the behavior, starting from an empty database:
Create the graph
CREATE (a:Person {name: 'A'}), (b:Person {name: 'B'}), (c:Person {name: 'C'}), (a)-[:FOLLOWS]->(b), (b)-[:FOLLOWS]->(c), (c)-[:FOLLOWS]->(a);
Set pagerank_online
CALL pagerank_online.set(1000, 0.15) YIELD node, rank;
create a trigger so pagerank_online is updated when the graph changes
CREATE TRIGGER pagerank_trigger BEFORE COMMIT EXECUTE CALL pagerank_online.update(createdVertices, createdEdges, deletedVertices, deletedEdges) YIELD node, rank
Update the graph
MATCH (c:Person {name: 'C'}) CREATE (d:Person {name: 'D'}) CREATE (c)-[:FOLLOWS]->(d);
Get new pagerank values:
CALL pagerank_online.get() YIELD node, rank RETURN node, rank;
Compare the values with other implementations of the pagerank algorithm:
Create an empty directed graph
G = nx.DiGraph()
Add nodes to the graph
nodes = ['A', 'B', 'C', 'D'] G.add_nodes_from(nodes)
Add edges to the graph
edges = [('A','B'), ('B','C'), ('C','A'), ('C','D')] G.add_edges_from(edges)
alpha = 0.85 print(nx.pagerank(G, alpha))