WriteProperty in Pregel framework

frank-zsy commented 2 years ago

Is your feature request related to a problem? Please describe.

I am using Pregel framework to implement a PageRank like algorithm and basically all works fine right now.

But I got some minor problems and I looked through the document several times without any clue to solve them.

In the PageRank example by Pregel, I do not find any code about node voting to halt, so the algorithm will run to maxIterations to stop and not guaranteed to converge, am I right?
So when I come to implement my own algorithm, I need to flag nodes as converged on themselves and check the flag in masterCompute to find out if the whole algorithm is converged. Which leads to the third problem.
If I add a new property in PregelSchema for every node to indicate the convergence state, then the property will be written back into the original database node too and I can not eliminate it from the process. And also in some complex situation we may need to add more temporary properties for computation use and all the properties will be written back too. So how can I prevent that.

Here is my code, and the properties using in computation like retention_factor, init_value and converged will also be written back to the database when the computation is done.

Describe the solution you would like

I would like to control which properties will be written back to database in Pregel write mode.

Or access to nodeProperties in ComputeContext just like in InitContext, in that way I won't need to save the values on nodes for compute.

s1ck commented 2 years ago

Hi Frank. Thanks for the problem description.

In the PageRank example by Pregel, I do not find any code about node voting to halt, so the algorithm will run to maxIterations to stop and not guaranteed to converge, am I right?

In a Pregel computation you can use the ComputeContext to vote to halt. You can see how we apply this in the product version PageRank. In the example PageRank, we could vote to halt if the new rank does not differ from the old rank. In the current case, it runs for maxIterations.

So when I come to implement my own algorithm, I need to flag nodes as converged on themselves and check the flag in masterCompute to find out if the whole algorithm is converged. Which leads to the third problem.

In your code, you could do:

if (Math.abs(newRank - oldRank) < this.tolerance) {
    context.voteToHalt();
} else {
    context.sendToNeighbors(context.doubleNodeValue(openRank));    
}

And also make sure to send the messages in the first superstep, i.e. in an else for the outer if statement. This way you can avoid the additional property and the master compute step.

If I add a new property in PregelSchema for every node to indicate the convergence state, then the property will be written back into the original database node too and I can not eliminate it from the process. And also in some complex situation we may need to add more temporary properties for computation use and all the properties will be written back too. So how can I prevent that.

When you define a PregelSchema for your computation, you can set the visibility for each individual node value. You can see an example in the Hits implementation. In your code, you can do:

return new PregelSchema.Builder()
    .add(openRank, ValueType.DOUBLE)
    .add(initValue, ValueType.DOUBLE, Visiblity.PRIVATE)
    .add(rententionFactor, ValueType.DOUBLE, Visiblity.PRIVATE)
    .add(converged, ValueType.LONG, Visiblity.PRIVATE)
    .build();

I think this is not part of the documentation, so I'll update it accordingly. Thanks for pointing that out!

I hope this helps. Let me know if you have more questions.

Really cool, that you're using the Pregel framework.

frank-zsy commented 2 years ago

Thanks for the quick reply, I will enjoy to develop my own algorithm with GDS.

Firstly, the Visibility.PRIVATE is really helpful, it solved the temporary property problem.

Secondly, I also read about the production level PageRank implementation and I also use voteToHalt in the first place. But here is a tricky difference between that implementation and the demo one.

In the production level PageRank, the value passed between nodes is delta, which means the value will reduce to 0 eventually. But in the demo PageRank algorithm and in my implementation the value passed between nodes is the PageRank value of each node normalized by relationship weight. So the value will converge to a certain value rather than 0, this difference is very important here.

Since if a node voteToHalt, it will not send message anymore and will be activated again if it receives any message in the future. So in my algorithm, if a node votes to halt, the neighbors will not receive the value from it, so the neighbor's PageRank in the next superstep will change since it relies on the values sent to it. Then the neighbor node will send the new value back to the halt node in the next superstep and activate it again.

So in the real world, if the value passed between nodes will not converge to 0, then the algorithm will not converge unless all the nodes call voteToHalt in the same superstep, I am not sure do I explain this clearly?

Thirdly, I also want add to some temporary nodes and relationships into the in-memory graph created by gds.graph.create.cypher(just like a background node in LeaderRank), I found that I can use Cypher for GDS graph but it only enabled for enterprise edition, so is there any way to add temporary nodes and relationships in create process, or I need to add the nodes into database and project into the in-memory graph?

BTW, the OpenRank is quite a complex version of PageRank which supports initial value and dumping factor(retention factor) for different nodes. And I will give a mathematical convergence prove in my doctoral thesis that OpenRank will converge. And I really like to contribute the algorithm to GDS library in the future since heterogeneous information network becomes really hot these days and more suitable for the real world problems.

UPDATE: If there is a built-in or injected logger for Procedure, it will be really good.

s1ck commented 2 years ago

What you're describing is a toggling behaviour, i.e., nodes activating / de-activating themselves all the time. I can think of two way that might solve this:

1) you could specify a delta constant which defines the change you need to make between current and new rank to consider it updated. I assume that if you get less messages because neighbours halt, you would at some point fall below the delta threshold and not update a node. I actually think that's what you're already doing with your threshold, right? Should this not lead to smaller changes which eventually lead to a global halt? Maybe I misunderstood your explanation.

2) iirc, I saw a label propagation implementation (I think in Giraph?) which counted the halts/activations for each node and immediately returned from compute method if the threshold is hit. Maybe that could work for you, too?

3) We could maybe expose computeContext.hasVotedToHalt() which would allow you to skip any computation and just remain halted which would converge to the state where all nodes are on halt faster? You could emulate this by checking your converged flag at the beginning of the compute method and just return and do nothing. Using the voteToHalt is just using less memory as it's a BitSet.

gds.graph.create.cypher is not an enterprise feature. You can write any Cypher query you want as long as it produces nodes / relationships. E.g. you could load parts of your data from the database and write a union query to add additional (virtual) nodes. Would that help? If you are referring to "Cypher on GDS", that indeed requires enterprise edition.

Exposing a logger in the init and compute context is a good idea. I will create dedicated issues in our internal tracking system.

frank-zsy commented 2 years ago

you could specify a delta constant which defines the change you need to make between current and new rank to consider it updated. I assume that if you get less messages because neighbours halt, you would at some point fall below the delta threshold and not update a node. I actually think that's what you're already doing with your threshold, right? Should this not lead to smaller changes which eventually lead to a global halt? Maybe I misunderstood your explanation.

Yes, I checked the threshold(tolerance) to determine if the node already converged.

iirc, I saw a label propagation implementation (I think in Giraph?) which counted the halts/activations for each node and immediately returned from compute method if the threshold is hit. Maybe that could work for you, too?

Thanks, this is really a good optimization, check the converged flag first and avoid redundant calculation to speed up the process.

We could maybe expose computeContext.hasVotedToHalt() which would allow you to skip any computation and just remain halted which would converge to the state where all nodes are on halt faster? You could emulate this by checking your converged flag at the beginning of the compute method and just return and do nothing. Using the voteToHalt is just using less memory as it's a BitSet.

Here is the thing, as far as I understand, if a node calls voteToHalt then it won't enter compute function in next superstep so hasVotedToHalt can not be used in computeContext since halted node will not enter the function. So a flag on node is just fine. And if I was wrong and halted node will call the function, it will be helpful to expose the flag.

gds.graph.create.cypher is not an enterprise feature. You can write any Cypher query you want as long as it produces nodes / relationships. E.g. you could load parts of your data from the database and write a union query to add additional (virtual) nodes. Would that help? If you are referring to "Cypher on GDS", that indeed requires enterprise edition.

Yes, I currently use gds.graph.create.cypher to create the graph. But I am not sure how to add virtual nodes in the query for two reasons:

To create a virtual node in an union query may involve sub query like:
```
MATCH (u:User) RETURN id(u)
UNION ALL
CALL {
CREATE (n) RETURN n
}
RETURN id(n)
```
I am not quite familiar with Cypher so I am not sure how to do this.
With a created virtual node in nodeQuery, how can I use it in relationshipQuery? Since the query get data from database rather than the in-memory graph, so even I create a node in the nodeQuery, I won't find it in relationshipQuery because the virtual node is not in the database. Am I right?

So if I can use "Cypher on GDS", it will be quite easy to do this.

Exposing a logger in the init and compute context is a good idea. I will create dedicated issues in our internal tracking system.

Great, thank you and could the logger is for all procedures rather than Pregel?

s1ck commented 2 years ago

Here is the thing, as far as I understand, if a node calls voteToHalt then it won't enter compute function in next superstep so hasVotedToHalt can not be used in computeContext since halted node will not enter the function. So a flag on node is just fine. And if I was wrong and halted node will call the function, it will be helpful to expose the flag.

You're right, I missed that. In that case, you'd need the converged value to abort at the beginning of compute.

So if I can use "Cypher on GDS", it will be quite easy to do this.

If I understand you correctly, you want to add relationships to the in-memory graph using a CREATE/MERGE query. Unfortunately, Cypher on GDS does not support updates to the in-memory graph, only read-only queries. And to your other question wrt cypher projections: you'd need to create the same virtual nodes in the relationship query... a bit annoying .. the alternative would be to create them in the database and project them via native (which is also the fastest projection).

and could the logger is for all procedures rather than Pregel?

What do you mean by that? The procedures are auto-generated, no need to log anything there. I thought you wanted to log inside init, compute and masterCompute?

frank-zsy commented 2 years ago

Thanks you very much, you solved all my problems now.

By all procedures, I mean custom procedures written by ourselves. Since I am not only using Pregel, I also developed a custom procedure to do some normal stat works like counting Motifs in a graph.

But I looked into the code, in a normal procedure I can inject a Log instance like this:

public class MyProcedure {
    @Context
    public Log logger;
}

And I tried for a Pregel procedure but the test failed with a NullPointerException if I use the instance.

So either a logger instance in init, compute and masterCompute context or support the normal inject will be fine.

s1ck commented 2 years ago

Thanks you very much, you solved all my problems now.

Great! Happy to help.

You're correct, for custom procedures you can inject a log. If you use auto-generated Pregel Procedures, they get overridden on each compile, which means you cannot modify them.

I'm closing this for now, feel free to open new issues if you run into new problems.

frank-zsy commented 2 years ago

OK, thanks again for the help.

s1ck commented 2 years ago

@frank-zsy just wanted to note that we updated the documentation on Pregel schema: https://neo4j.com/docs/graph-data-science/current/algorithms/pregel-api/#algorithms-pregel-api-schema

frank-zsy commented 2 years ago

Great to know, thanks~

neo4j / graph-data-science

WriteProperty in Pregel framework #175