neo4j / graph-data-science

Source code for the Neo4j Graph Data Science library of graph algorithms.
https://neo4j.com/docs/graph-data-science/current/
Other
614 stars 161 forks source link

gds.randomWalk.mutate #319

Open Mintactus opened 1 month ago

Mintactus commented 1 month ago

The title says it all, a mutate mode for random walk would be perfect! Each node could would get a property with the number of times it has been visited of all the walks.

IoannisPanagiotas commented 1 month ago

Hi @Mintactus,

Thank you for your request. This is not something that we had considered before. I cannot promise we will implement it, but we will look into it. Might I ask what is the use-case for?

Are you trying to somehow use these statistics to extract some centrality statistic? I am asking because I have your recent question about betweeness-centrality for longest paths in mind.

Are you trying to use random walks as a means to stochastically find heavy paths i.e., more probable to be visited in a random walk, and order the nodes by frequency of visits. If so, it could make sense looking into the pagerank algorithm.

Best regards, Ioannis.

Mintactus commented 1 month ago

Thanks for the amazing response

I really appreciate the touch about my use case and previous attempt to qualify the value of a channel in an aggregation of converted customer journeys as a simple markov chain. Indeed, I'm trying to figure out different ways to create an attribution model, as we call it in marketing, with graph. There are so many options when you look at GDS.

And often understanding the exact behavior of them to make it fits with your need or the question you are trying to answer is not easy. In this case, a core principal of my analysis is to make sure we stick the the data without making any assumption.

I will have to compare the results on both of them, but I'm 100% sure pageRank and random walk behaves completly differently after I read carefully about both of them in books. PageRank will focus on the node itself and be given an importance based on it's neighbour, randomwalk can start from a point and simulate the behavior of a customer going from one channel to another an it stops when it reaches then end, in this case the conversion point.

For instance A -> B -> C being a cusomter journey, for pagerank b and c are important over time but not A that much since it doesn't have inbound relationships. Which is not right in marketing, a random walk in this case will give equal credit to all of them as they all contribute to reach C if we simply quantify the number of time a cusomter had to cross over A or B to reach C. When adding the weights to the relationships random walk just perflectly model the this behavior and stops at the right moment. Where page rank has this random dumping factor that i'm not interresting in and focus the importance of a node based on its inbound relationships instead of the overall behavior of a customer starting from a precise point. ( even if both algo can have sourceNodes )

So I think having the number of time a node as been part of a walk across would be awsome, it's just the perfect piece of the puzzle missing and it fits the algo.

Thanks

IoannisPanagiotas commented 1 month ago

Hi again,

Thank you for providing some more input.

I think I will stick with my suggestion for pagerank for the moment. Pagerank is very much tied with the random walk concept, if it helps, you can set the set the damping factor to 0 which would eliminate the jumping-around effect. For example, one way of approximating pagerank is by conducting a simulating a fixed number of random walks per node, and setting the pagerank score of each node to be the number of its visits divided by the total amount of random walks. This was what I was thinking around. There is a paper with more details around this (see section 2.1).

As for your example, A will only be relevant in its own random walks (so it will have a low number of visits), so this would be reflected by a low pagerank score I think.

Anyway, we will discuss within the team if we can implement this based on the use-case, but I really cannot you promise anything for the moment! I can also ask around if someone has any suggestions for attribution models, as I myself am not too familiar with such concepts.

Last, if you have some Java experience, you can also try to implement this yourself using our pregel framework. and create some separate new procedures. I can give you some pointers if you would like to try this. I think it could be easier than actually trying to change the existing randomwalk implementation which is very stream-based.

Best regards, Ioannis.

Mintactus commented 1 month ago

Thanks Loannis

I really appreciate the really detailed or pro answer, I will look into it a bit more this month, but If I understand correctly, in certain conditions and for the exact same graph, if we were about to aggregate or count the ids of random walk, pagerank will give the same or really similar results over its iterations. I'm not sure yet, but I will investigate deeper, thanks for the paper and pregel suggestions.

I read I could use eigenvector instead of pagerank as page pagerank dump factor cannot be set to 0, but really close to. Is eigenvector simply put pagerank without damping factor ? But this question is not priority

Thanks for your support

On Tue, Jul 23, 2024 at 4:27 AM Ioannis Pan @.***> wrote:

Hi again,

Thank you for providing some more input.

I think I will stick with my suggestion for pagerank for the moment. Pagerank is very much tied with the random walk concept, if it helps, you can set the set the damping factor to 0 which would eliminate the jumping-around effect. For example, one way of approximating pagerank is by conducting a simulating a fixed number of random walks per node, and setting the pagerank score of each node to be the number of its visits divided by the total amount of random walks. This was what I was thnking around. There is a paper https://www.vldb.org/pvldb/vol4/p173-bahmani.pdf with more details around this (see section 2.1).

As for your example, A will only be relevant in its own random walks (so it will have a low number of visits), so this would be reflected by a low pagerank score I think.

Anyway, we will discuss within the team if we can implement this based on the use-case, but I really cannot you promise anything for the moment! I can also ask around if someone has any suggestions for attribution models, as I myself am not too familiar with such concepts.

Last, if you have some Java experience, you can also try to implement this yourself using our pregel https://neo4j.com/docs/graph-data-science/current/algorithms/pregel-api/ framework. and create some separate new procedures. I can give you some pointers if you would like to try this. I think it could be easier than actually trying to change the existing randomwalk implementation which is very stream-based.

Best regards, Ioannis.

— Reply to this email directly, view it on GitHub https://github.com/neo4j/graph-data-science/issues/319#issuecomment-2244575209, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHIBVDPKDBTYWWUEBRZTBU3ZNYHWHAVCNFSM6AAAAABLBIINKKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENBUGU3TKMRQHE . You are receiving this because you were mentioned.Message ID: @.***>

IoannisPanagiotas commented 1 month ago

Yes, I believe the results should be close to the actual pagerank through such a scheme. Not 100% identical and possibly not for each node, but in expectation they shouldn't be far off. In fact, in the paper I have sent you, this technique is used to approximate page rank for dynamic purposes (where the graph gets updated with new edges etc).

As far as I can see from our code, you can actually set up the jump probability to be 0 in our pagerank implementation, but it does look like eigenvector algorithm should behave the same (but that's from skimming a bit in the documentation and code).

Best and good luck! Ioannis.

Mintactus commented 3 weeks ago

Hi Ioannis

I did some deep investigation about pageRank and randomWalk to know if I could use pageRank instead of randomWalk for marketing attribution, and indeed, in some conditions related to the size of the graph if

In these conditions, randomWalk and pageRank will gives the same rounded and max scaled scores for both algo.

But you cannot set the DampingFactor to 0 to simply eliminate the teleport behavior of pageRank, when looking at the formula, this will simply makes all nodes and sources nodes stick to their initial values ( Iteration 1 ) forever regardless the number of iterations. ( I tested it in practice ) so maybe the current implementation should not allow it either since there is no point.

You cannot either set the dampingFactor to 1 since this will initialy set all source nodes to 0 ( And it's not possible to do so with the current implementation anyway wich make sens)

Here are my test results Also the data & algo settings I used

For now, briefly the solution is to set dampingFactor to 0.999. And even greater solution would be to have an option to remove the dampingFactor completly, so the initial value of a source node will not be substracted by the dampingFactor and the second part of the formula will not be multiply by the dampingFactor at all. This, would give me exactly what I'm looking for.

Basically, pageRank without DP, ( Which translate into the exact behavior of randomWalk with the scores inside each node instead of the walks as outputs ). These is a slight difference with randomWalk since it's random, but the mean of many random walk execution is the same as pageRank without DP.

So in short, once initial values are set, being 1/source nodes for each source node and a not source nodes being 0 The formula without DP should simply be

formula

The main reason why I'm not interrested into using the dampingFactor is that in the attribution model I'm building, we do have a lot of data with the exact behavior of the customer and we want to stick to it to credit a channel impact. If a customer was "jumping or teleporting" we would know in the data, we don't want to guess a gross pourcentage based on global internet behavior, we know their behavior and we must credit each channel according to this.

The closest solution I had first, was to use eigenVector centrality, because it says "The PageRank algorithm is a variant of Eigenvector Centrality with an additional jump probability.", but in this one, a node with no incoming relationships is deflating over time which doesn't fit the behavior of randomWalks. So the dampingFactor is far from being the only difference between pageRank and eigenVector and eigenVector doesn't represent a simple pageRank without DP.

I hope these explanations where not too long, but indeed, definitily, an intuitive mode like mutate for randomWalk would be awsome and necessary for this use case. The main reason I can give you it's everything in this comment, intuitiveness in business is really critical, it makes things move way faster.

So to support my request of adding a mutate mode to randomWalk or an option to pageRank:

Understanding that pageRank with a dampingFactor of 0.999 with a high iteration values will gives same scores as a random walk is extrêmely time consuming for a business trying to devellop a tool that credit channels based only on actual behavior. But, randomWalk is a REALLY STRONGLY INTUITIVE algo to use for this use case, in this by itself worth a shitload of money and time.

I got the point that pageRank without DP are both alike from a logic and tech point of view, but intuitiveness is much more important when you are about to use it for your use case. Make pageRank act like randomWalk is not inuitive at all ( even you made a mitstake with the 0 ) and that makes it a bad choice from a business perspective.

So yeah, I still think that adding to mutate option for randomWalk will be absolutly brillant, I would double the paycheck of the tech who does it if I could, and if this tech could also add a scaler option of the mutate variant I would promote him tech chief lol.

As there are no pageRank without dampingFactor option existing, and because this idea will not be intuitive anyway, I think gds.randomWalk.mutate would be a must.

Thanks a lot for reading up to this point if you are still alive, I appreciate it. You can reach me using videoconference anytime here if needed.

IoannisPanagiotas commented 3 weeks ago

Hi again @Mintactus ,

Thank your the very detailed post. Sorry about the damping factor issue, I had in mind the formula that jumps with d insted of 1-d. I understand your motivation a bit better now, but will think about it some more.

One think I am not so clear is : The closest solution I had first, was to use eigenVector centrality, because it says "The PageRank algorithm is a variant of Eigenvector Centrality with an additional jump probability.", but in this one, a node with no incoming relationships is deflating over time which doesn't fit the behavior of randomWalks. So the dampingFactor is far from being the only difference between pageRank and eigenVector and eigenVector doesn't represent a simple pageRank without DP.

Shouldn't such nodes be deflated even in random walks? I think I already wrote something of a sort above too. If you do 10 walks per starting node, a node with no incoming relationships, will have about 10 visits. Any other node will have at least 10 visits, but probably more as in most cases t will be visited by other nodes.

At any rate, the request has been up for consideration using our usual team practices. But I cannot guarantee when/if it is picked up.

Best, Ioannis.