neo4j / graph-data-science

Source code for the Neo4j Graph Data Science library of graph algorithms.
https://neo4j.com/docs/graph-data-science/current/

Link Prediction Pipeline not working with GraphSage #214

Closed gurugecl closed 1 year ago

gurugecl commented 1 year ago

Hi,

I'm trying to utilize the GDS link prediction pipeline with the GraphSage embedding algorithm

https://neo4j.com/docs/graph-data-science/current/machine-learning/linkprediction-pipelines/config/

In the Adding Node Properties step, it says that the procedureName parameter takes the name of the procedure, but it doesn't specify which procedures can be used. I can get it to work with 'fastRP', but when I try 'graphSage' or any variant of that name, I get the error:

ClientError: [Procedure.ProcedureCallFailed] Failed to invoke procedure gds.beta.pipeline.linkPrediction.addNodeProperty: Caused by: java.lang.IllegalArgumentException: Could not find a procedure called gds.graphsage.mutate

This is how I am making that call

CALL gds.beta.pipeline.linkPrediction.addNodeProperty('lp-pipeline2', 'graphSage', { mutateProperty: 'embedding', randomSeed: 42 })

I see in the graphSage documentation that there is a procedure called gds.beta.graphSage.mutate. So is it not possible to use graphSage within the link prediction pipeline?

I know you can utilize graphSage separately without the pipeline but I would really like to utilize it within the pipeline because it makes it easier to split the dataset to train the embedding algorithm on separate data using the feature-input set

Thank you

GDS version: 2.1.6
Neo4j version: 4.4.8
Operating system: Mac

vnickolov commented 1 year ago

Hi @gurugecl, thank you for raising this. You can run GraphSage; you just need to specify beta.graphSage instead of graphSage as the procedure name in addNodeProperty. Also make sure to set the required configuration parameters. Please give it a try and let us know.

gurugecl commented 1 year ago

Hi @vnickolov, thank you for the quick response. Great, that did work, as long as I only include the required properties like the model name and mutate property.

[{name: 'gds.beta.graphSage.mutate', config: {modelName: 'gs_model', mutateProperty: 'embedding'}}]

However, as soon as I include the other graphSage properties that I need, as described in the documentation, like below:

CALL gds.beta.pipeline.linkPrediction.addNodeProperty('lp-pipeline2', 'beta.graphSage', { modelName: 'gs_model', mutateProperty: 'embedding', embeddingDimension: 64, featureProperties: ['x','b','c'], projectedFeatureDimension: 16, activationFunction: 'relu', negativeSampleWeight: 20, searchDepth: 5, epochs: 1, learningRate: 0.1, batchSize: 100, aggregator: 'mean' })

I get the following error

ClientError: [Procedure.ProcedureCallFailed] Failed to invoke procedure gds.beta.pipeline.linkPrediction.addNodeProperty: Caused by: java.lang.IllegalArgumentException: Unexpected configuration keys: activationFunction, aggregator, embeddingDimension, epochs, featureProperties, learningRate, negativeSampleWeight, projectedFeatureDimension, searchDepth

However, the configuration map works fine when using fastRP with the parameters associated with it.

vnickolov commented 1 year ago

Thank you @gurugecl, we will have a look at this, it is definitely unexpected behaviour.

adamnsch commented 1 year ago

Hi @gurugecl,

Only the GraphSAGE predict mutate mode is supported as a pipeline node property step. That is, you first have to train the GraphSAGE model outside of the pipeline with a direct call to gds.beta.graphSage.train. You can then add GraphSAGE predict (beta.graphSage) as a node property step to your pipeline, passing the name of the model you just trained in the modelName configuration key. When you add the GraphSAGE node property step, you may only provide parameters relevant to the predict mutate mode. Training parameters such as epochs should be passed directly to gds.beta.graphSage.train, not to the node property step.

Does this make sense? We probably have to document this better.

gurugecl commented 1 year ago

Oh I see. Yes, overall that does make sense. But could you then clarify how applying the prediction in mutate mode would work on the various sets in terms of the relationship splits functionality?

https://neo4j.com/docs/graph-data-science/current/machine-learning/linkprediction-pipelines/config/

Based on my understanding, it splits the data into three sets: test, train, and feature-input. The feature-input set is where the embedding algorithm training happens, right, to avoid data leakage? So would I set the feature-input fraction to zero then, since I will have to train on a separate subgraph (to mirror the feature-input set) outside of the pipeline?

Also, it sounds like the predictions in mutate mode happen on the test and train sets, right, in order to then use those embedding properties in the link feature steps, where the (L2, Hadamard, cosine) feature vectors are created for downstream classification. "The node property steps that are added to the pipeline will be executed both when training a pipeline and when the trained model is applied for prediction". But won't the embedding then incorporate the link we are interested in predicting, which wouldn't be available in new data we want to predict links for in the future?

But later on, the documentation also states that the train and test sets "are not used by the algorithms run in the node property steps. The reason for this is that otherwise the model would use the prediction target (existence of a relationship) as a feature." So on the other hand, if that's the case, how are the link features calculated for train/test data without first running GraphSAGE predictions on that data to provide the predicted embeddings as properties?

It does say "The feature-input graph is used, both in training and testing, for computing node properties and therefore also features which depend on node properties". But I'm not understanding how those node properties in the feature-input set are utilized in the train/test data, which would have different nodes. Can you please clarify? Thank you

gurugecl commented 1 year ago

Also @adamnsch, I just tried the steps you mentioned and I get the following error stating that the graphSage model doesn't exist

ClientError: [Procedure.ProcedureCallFailed] Failed to invoke procedure gds.beta.pipeline.linkPrediction.train: Caused by: java.util.NoSuchElementException: Model with name multiLabelModel does not exist.

However when I run

CALL gds.beta.model.list() YIELD modelInfo, loaded, shared, stored RETURN modelInfo.modelName AS modelName, loaded, shared, stored

It does show that such a model exists and is loaded

modelName | loaded | shared | stored
multiLabelModel | true | false | false

This is my code for training the graphSage model separately

CALL gds.beta.graphSage.train( 'gene_disease_train', { modelName: 'multiLabelModel', featureProperties: ['a','b','c'], embeddingDimension: 64, projectedFeatureDimension: 4, activationFunction: 'relu', negativeSampleWeight: 20, searchDepth: 5, epochs: 1, learningRate: 0.1, batchSize: 100, aggregator: 'mean' } ) YIELD modelInfo as info RETURN info.modelName as modelName, info.metrics.didConverge as didConverge, info.metrics.ranEpochs as ranEpochs, info.metrics.epochLosses as epochLosses

and here is my code for utilizing the trained model in the pipeline

CALL gds.beta.pipeline.linkPrediction.addNodeProperty('lp-pipeline2', 'beta.graphSage', { modelName: 'multiLabelModel', mutateProperty: 'embedding' })

However, it is when I run the following link prediction train step that it says multiLabelModel doesn't exist:

CALL gds.beta.pipeline.linkPrediction.train('gene_disease', {pipeline: 'lp-pipeline2', modelName: 'lp-model2', randomSeed: 42}) YIELD modelInfo RETURN modelInfo.bestParameters AS winningModel, modelInfo.metrics.AUCPR.outerTrain AS trainGraphScore, modelInfo.metrics.AUCPR.test AS testGraphScore;

adamnsch commented 1 year ago

> Oh I see. Yes, overall that does make sense. But could you then clarify how applying the prediction in mutate mode would work on the various sets in terms of the relationship splits functionality?
>
> https://neo4j.com/docs/graph-data-science/current/machine-learning/linkprediction-pipelines/config/
>
> Based on my understanding, it splits the data into three sets: test, train, and feature-input. The feature-input set is where the embedding algorithm training happens, right, to avoid data leakage? So would I set the feature-input fraction to zero then, since I will have to train on a separate subgraph (to mirror the feature-input set) outside of the pipeline?

You should not set the feature-input fraction to zero even if the GraphSAGE model is trained outside of the pipeline, since the trained GraphSAGE model also uses relationships when inferring node embeddings.

When training the model before adding it to the pipeline, it would be preferable to exclude the relationships that will later form the test and train sets of the pipeline where the model is applied. Unfortunately, we don't yet support doing the dataset split outside of the pipeline and then providing it to the pipeline at training time. In the upcoming GDS 2.2 release, there will be support for something called contextRelationshipTypes, which will allow you to avoid data leakage with pretrained GraphSAGE models when combined with the Split relationships procedure.
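As a rough illustration of why there is no separate feature-input fraction to set: assuming, as the configureSplit documentation describes it, that testFraction is taken from the whole graph and trainFraction from what remains (the test-complement), the feature-input share is simply whatever is left over. The query below is plain Cypher arithmetic rather than a GDS call, and the 0.2/0.2 values are only example inputs.

```cypher
// Plain arithmetic, no GDS procedures involved.
// Assumption: trainFraction applies to the test-complement set,
// and the feature-input set receives the remaining relationships.
WITH 0.2 AS testFraction, 0.2 AS trainFraction
RETURN
  testFraction                             AS testShare,
  (1 - testFraction) * trainFraction       AS trainShare,
  (1 - testFraction) * (1 - trainFraction) AS featureInputShare
```

With these inputs the shares come out at roughly 0.2, 0.16 and 0.64 of the relationships being split, so the feature-input set only shrinks toward zero if the other two fractions are pushed very high.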

> Also, it sounds like the predictions in mutate mode happen on the test and train sets, right, in order to then use those embedding properties in the link feature steps, where the (L2, Hadamard, cosine) feature vectors are created for downstream classification. "The node property steps that are added to the pipeline will be executed both when training a pipeline and when the trained model is applied for prediction". But won't the embedding then incorporate the link we are interested in predicting, which wouldn't be available in new data we want to predict links for in the future?
>
> But later on, the documentation also states that the train and test sets "are not used by the algorithms run in the node property steps. The reason for this is that otherwise the model would use the prediction target (existence of a relationship) as a feature." So on the other hand, if that's the case, how are the link features calculated for train/test data without first running GraphSAGE predictions on that data to provide the predicted embeddings as properties?
>
> It does say "The feature-input graph is used, both in training and testing, for computing node properties and therefore also features which depend on node properties". But I'm not understanding how those node properties in the feature-input set are utilized in the train/test data, which would have different nodes. Can you please clarify? Thank you

So I think you're confusing nodes and relationships here. It's true that the relationships of the train and test sets are excluded when running the node property steps, to avoid data leakage, but the nodes adjacent to those relationships are still included. That means that even if certain relationships are missing from the feature-input graph, all nodes of the input graph are included and will get embeddings (and whatever other node properties the node property steps compute). For this reason, the link feature steps can derive link features for any node pair in the input graph (though in practice they don't look at every node pair), and in particular for pairs connected by train or test set relationships. I hope that makes sense :)

adamnsch commented 1 year ago

> Also @adamnsch, I just tried the steps you mentioned and I get the following error stating that the graphSage model doesn't exist [...]

It does look like you're doing the correct steps. That you get this error is very surprising to me, since we have tests that pass that essentially do exactly this. We will discuss internally what the issue could be.

Also, are you sure that the GraphSAGE model is available in the model catalog at the time you train the pipeline? Models are dropped from the catalog automatically when the database restarts (unless you have stored them with the GDS EE store/load functionality).
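One way to check this right before training the pipeline is to ask the model catalog directly. This is a minimal sketch, assuming the model name multiLabelModel from the queries in this thread and that gds.beta.model.exists is available in the installed GDS version:

```cypher
// Yields exists = true if a model with this name is currently
// visible in the model catalog for the user running the query.
CALL gds.beta.model.exists('multiLabelModel')
YIELD exists
RETURN exists
```

Running it as the same user, immediately before the call to gds.beta.pipeline.linkPrediction.train, shows whether the model is visible at that moment.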

gurugecl commented 1 year ago

Hi @adamnsch, yes I'm aware that restarting the DB drops the existing models so I'm running all the steps without restarts/stops

If it's helpful for understanding the problem, here are all my steps (some are simplified for convenience, e.g. a single projection instead of separate subgraphs). I do have multiple DBs and am running from a Jupyter notebook, but based on my testing, model candidates show up across DBs, and since I don't specify a DB, I believe I'm using the default one here.

graph projection

CALL gds.graph.project( 'gene_disease', { `Label A`: { label: 'label a', properties: ['a','b','c','d'] }, `Label B`: { label: 'label b' }, `Label C`: { label: 'label c', properties: ['e'] } }, { relationship_a_b: { type: 'relationship a b', orientation: 'UNDIRECTED' }, relationship_b_c: { type: 'relationship b c', orientation: 'UNDIRECTED' }, relationship_c_a: { type: 'relationship c a', orientation: 'UNDIRECTED' } })

Training the graphsage model outside of the pipeline

CALL gds.beta.graphSage.train( 'gene_disease', { modelName: 'multiLabelModel', featureProperties: ['a','b','c','d'], embeddingDimension: 64, projectedFeatureDimension: 16 } ) YIELD modelInfo as info RETURN info.modelName as modelName, info.metrics.didConverge as didConverge, info.metrics.ranEpochs as ranEpochs, info.metrics.epochLosses as epochLosses

verify the model exists - shows loaded as true

CALL gds.beta.model.list() YIELD modelInfo, loaded, shared, stored RETURN modelInfo.modelName AS modelName, loaded, shared, stored

initiate the pipeline

CALL gds.beta.pipeline.linkPrediction.create('lp-pipeline2')

utilize the trained graphSage model in the pipeline

CALL gds.beta.pipeline.linkPrediction.addNodeProperty('lp-pipeline2', 'beta.graphSage', { modelName: 'multiLabelModel', mutateProperty: 'embedding' })

add a link feature step

CALL gds.beta.pipeline.linkPrediction.addFeature('lp-pipeline2', 'cosine', { nodeProperties: ['embedding'] }) YIELD featureSteps;

configure the split

CALL gds.beta.pipeline.linkPrediction.configureSplit( 'lp-pipeline2', {
testFraction: 0.2, trainFraction: 0.2 }) YIELD splitConfig

add a model candidate

CALL gds.alpha.pipeline.linkPrediction.addRandomForest('lp-pipeline2', { numberOfDecisionTrees: {range: [10, 70]}, numberOfSamplesRatio: {range: [0.0, 1.0]} }) YIELD parameterSpace

configure autotune

CALL gds.alpha.pipeline.linkPrediction.configureAutoTuning('lp-pipeline2', { maxTrials: 10 }) YIELD autoTuningConfig

use pipeline to train the classification model - this is where the error happens

CALL gds.beta.pipeline.linkPrediction.train('gene_disease', {pipeline: 'lp-pipeline2', modelName: 'lp-model2', randomSeed: 42}) YIELD modelInfo RETURN modelInfo.bestParameters AS winningModel, modelInfo.metrics.AUCPR.outerTrain AS trainGraphScore, modelInfo.metrics.AUCPR.test AS testGraphScore;

gurugecl commented 1 year ago

Ok, thanks a lot for the explanation regarding the feature-input graph/set! Sorry, I missed that those were two separate things. So basically those various sets just consist of relationships, and the feature-input graph is used to compute the surrounding topological information (neighboring nodes etc.) for a node, not including the links we want to predict. This information is embedded as node properties to calculate the link features, which are used to predict relationships. The feature-input set consists of the remaining positive relationships, which are included in the embeddings, while the train/test positive/negative relationships are what is predicted and are therefore left out of the embeddings to avoid data leakage. Hopefully I got it :)

gurugecl commented 1 year ago

Hi @adamnsch, just wanted to check in and see if anyone has had any luck determining what the issue may be with the trained graphSage model not being found when trying to utilize it in the pipeline?

adamnsch commented 1 year ago

> Hi @adamnsch, just wanted to check in and see if anyone has had any luck determining what the issue may be with the trained graphSage model not being found when trying to utilize it in the pipeline?

Hi @gurugecl,

Sorry for the delay. Indeed there was a bug causing the issue you observed. It was fixed in our latest patch release, which is GDS version 2.1.9. Are you able to upgrade to this version?
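To confirm which version is actually loaded after an upgrade, the gds.version() function can be queried. This is only a small sanity check, not part of the fix itself:

```cypher
// Returns the version string of the installed GDS library.
RETURN gds.version() AS gdsVersion
```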

Thanks for reporting the issue :) Adam

gurugecl commented 1 year ago

No problem and yes I was able to upgrade to that version and validate that it works. Thanks a lot for resolving it!