Closed gurugecl closed 1 year ago
Hi @gurugecl, thank you for raising this. You can run GraphSAGE in the pipeline; you just need to specify beta.graphSage instead of just graphSage as the procedure name in addNodeProperty. Also make sure to set the required configuration parameters.
Please give it a try and let us know.
Hi @vnickolov, thank you for the quick response. That did work, as long as I only include the required properties like the model name and mutate property.
[{name: 'gds.beta.graphSage.mutate', config: {modelName: 'gs_model', mutateProperty: 'embedding'}}]
However, as soon as I include the other graphSage properties that I need, which are described in the documentation, like below
CALL gds.beta.pipeline.linkPrediction.addNodeProperty('lp-pipeline2', 'beta.graphSage', { modelName: 'gs_model', mutateProperty: 'embedding', embeddingDimension: 64, featureProperties: ['x','b','c'], projectedFeatureDimension: 16, activationFunction: 'relu', negativeSampleWeight: 20, searchDepth: 5, epochs: 1, learningRate: 0.1, batchSize: 100, aggregator: 'mean' })
I get the following error
ClientError: [Procedure.ProcedureCallFailed] Failed to invoke procedure gds.beta.pipeline.linkPrediction.addNodeProperty: Caused by: java.lang.IllegalArgumentException: Unexpected configuration keys: activationFunction, aggregator, embeddingDimension, epochs, featureProperties, learningRate, negativeSampleWeight, projectedFeatureDimension, searchDepth
However, the configuration map works fine when using fastRP with its associated parameters.
Thank you @gurugecl, we will have a look at this, it is definitely unexpected behaviour.
Hi @gurugecl,
Only the GraphSAGE predict mutate mode is supported as a pipeline node property step. That is, you will first have to train the GraphSAGE model outside of the pipeline by calling gds.beta.graphSage.train directly, and then you can add GraphSAGE predict (beta.graphSage) as a node property step to your pipeline, providing the modelName configuration key of the model you have just trained.
When you add the GraphSAGE node property step you may only provide parameters relevant to the predict mutate mode. So parameters like epochs and so on should be passed directly to gds.beta.graphSage.train, not when adding the node property step.
Does this make sense? We probably have to document this better.
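To make the split between the two calls concrete, here is a minimal Python sketch that separates one combined configuration map into the part for gds.beta.graphSage.train and the part for the node property step. The train-only key list is taken from the "Unexpected configuration keys" error above; the helper name split_graphsage_config and the rule that all remaining keys (such as modelName and batchSize) go to both calls are assumptions for illustration, not part of GDS.

```python
# Keys the pipeline rejected in the error message above; they belong to training only.
TRAIN_ONLY_KEYS = {
    "activationFunction", "aggregator", "embeddingDimension", "epochs",
    "featureProperties", "learningRate", "negativeSampleWeight",
    "projectedFeatureDimension", "searchDepth",
}
# Key that only makes sense for the mutate-mode node property step.
STEP_ONLY_KEYS = {"mutateProperty"}


def split_graphsage_config(config):
    """Hypothetical helper: return (train_config, step_config) from one combined map."""
    train_config = {k: v for k, v in config.items() if k not in STEP_ONLY_KEYS}
    step_config = {k: v for k, v in config.items() if k not in TRAIN_ONLY_KEYS}
    return train_config, step_config


combined = {
    "modelName": "gs_model", "mutateProperty": "embedding",
    "embeddingDimension": 64, "featureProperties": ["x", "b", "c"],
    "projectedFeatureDimension": 16, "activationFunction": "relu",
    "negativeSampleWeight": 20, "searchDepth": 5, "epochs": 1,
    "learningRate": 0.1, "batchSize": 100, "aggregator": "mean",
}
train_config, step_config = split_graphsage_config(combined)
```

The train_config map would then be passed to gds.beta.graphSage.train and step_config to addNodeProperty with 'beta.graphSage'.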
Oh I see. Yes overall that does make sense. But could you then clarify how applying the prediction in mutate would work on the various sets in terms of the relationship splits functionality
https://neo4j.com/docs/graph-data-science/current/machine-learning/linkprediction-pipelines/config/
Because based on my understanding it splits the data into 3 sets for testing, training, and feature-input. feature-input is where the embedding algorithm training happens right to avoid data leakage. So would I set the feature-input fraction to zero then since I will have to train on a separate sub graph set (to mirror the feature-input set) outside of the pipeline?
Also it sounds like the predictions in mutate mode happen on the test and train sets right in order to then use those embedding properties in the link feature steps where the (L2, Hadamard, cosine) feature vectors are created for downstream classification. "The node property steps that are added to the pipeline will be executed both when training a pipeline and when the trained model is applied for prediction". But won't the embedding then incorporate the link we are interested in predicting on into the embedding which wouldn't be available in new data we want to predict links for in the future?
But later on the documentation it does also state that train and test sets...."are not used by the algorithms run in the node property steps. The reason for this is that otherwise the model would use the prediction target (existence of a relationship) as a feature." So on the other hand, if that's the case, how are the link features calculated for train/test data without first running graphsage predictions on that data to provide the predicted embeddings as properties?
It does say "The feature-input graph is used, both in training and testing, for computing node properties and therefore also features which depend on node properties". But I'm not understanding how those node properties in the feature-input set are utilized in the train/test data, which would have different nodes. Can you please clarify? Thank you
Also @adamnsch, I just tried the steps you mentioned and I get the following error stating that the graphSage model doesn't exist
ClientError: [Procedure.ProcedureCallFailed] Failed to invoke procedure gds.beta.pipeline.linkPrediction.train: Caused by: java.util.NoSuchElementException: Model with name multiLabelModel does not exist.
However when I run
CALL gds.beta.model.list() YIELD modelInfo, loaded, shared, stored RETURN modelInfo.modelName AS modelName, loaded, shared, stored
It does show that such a model exists and is loaded
modelName | loaded | shared | stored
multiLabelModel | true | false | false
This is my code for training the graphSage model separately
CALL gds.beta.graphSage.train( 'gene_disease_train', { modelName: 'multiLabelModel', featureProperties: ['a','b','c'], embeddingDimension: 64, projectedFeatureDimension: 4, activationFunction: 'relu', negativeSampleWeight: 20, searchDepth: 5, epochs: 1, learningRate: 0.1, batchSize: 100, aggregator: 'mean' } ) YIELD modelInfo as info RETURN info.modelName as modelName, info.metrics.didConverge as didConverge, info.metrics.ranEpochs as ranEpochs, info.metrics.epochLosses as epochLosses
and here is my code for utilizing the trained model in the pipeline
CALL gds.beta.pipeline.linkPrediction.addNodeProperty('lp-pipeline2', 'beta.graphSage', { modelName: 'multiLabelModel', mutateProperty: 'embedding' })
However, when I run the following link prediction train step, it says that multiLabelModel doesn't exist
CALL gds.beta.pipeline.linkPrediction.train('gene_disease', {pipeline: 'lp-pipeline2', modelName: 'lp-model2', randomSeed: 42}) YIELD modelInfo RETURN modelInfo.bestParameters AS winningModel, modelInfo.metrics.AUCPR.outerTrain AS trainGraphScore, modelInfo.metrics.AUCPR.test AS testGraphScore;
Oh I see. Yes overall that does make sense. But could you then clarify how applying the prediction in mutate would work on the various sets in terms of the relationship splits functionality
https://neo4j.com/docs/graph-data-science/current/machine-learning/linkprediction-pipelines/config/
Because based on my understanding it splits the data into 3 sets for testing, training, and feature-input. feature-input is where the embedding algorithm training happens right to avoid data leakage. So would I set the feature-input fraction to zero then since I will have to train on a separate sub graph set (to mirror the feature-input set) outside of the pipeline?
You should not set the feature-input fraction to zero even if the GraphSAGE model is trained outside of the pipeline, since the trained GraphSAGE model uses relationships also when inferring node embeddings.
For training the model before adding it to the pipeline, it would be nice to not include what will later be the test and train set in the subsequent pipeline where the model will be applied. Unfortunately we don't yet support doing the dataset split outside of the pipeline and then providing it to the pipeline at training time. In the upcoming GDS 2.2 release, there will be support for something called contextRelationshipTypes
which will allow you to avoid data leakage with pretrained GraphSAGE models when combined with the Split relationships procedure.
Also it sounds like the predictions in mutate mode happen on the test and train sets right in order to then use those embedding properties in the link feature steps where the (L2, Hadamard, cosine) feature vectors are created for downstream classification. "The node property steps that are added to the pipeline will be executed both when training a pipeline and when the trained model is applied for prediction". But won't the embedding then incorporate the link we are interested in predicting on into the embedding which wouldn't be available in new data we want to predict links for in the future?
But later on the documentation it does also state that train and test sets...."are not used by the algorithms run in the node property steps. The reason for this is that otherwise the model would use the prediction target (existence of a relationship) as a feature." So on the other hand, if that's the case, how are the link features calculated for train/test data without first running graphsage predictions on that data to provide the predicted embeddings as properties?
It does say "The feature-input graph is used, both in training and testing, for computing node properties and therefore also features which depend on node properties". But I'm not understanding how those node properties in the feature-input set are utilized in the train/test data, which would have different nodes. Can you please clarify? Thank you
So I think you're confusing nodes and relationships here. It's true that relationships of the train and test sets are not included when running the node property steps to avoid data leakage, but the nodes that are adjacent via train or test set relationships are included. So that means that even if certain relationships are missing in the feature-input graph, all nodes of the input graph are included and will get embeddings (and whatever other node properties are computed by node property steps). For this reason it is also possible for the link feature steps to derive link features for every possible node pair in the input graph (though it doesn't actually look at every node pair in practice), and in particular those that are adjacent via train or test set relationships. I hope that makes sense :)
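As a toy illustration of the point above, the following Python sketch computes a node property for every node using only the feature-input relationships, then derives link features for node pairs that are connected only through held-out train or test relationships. Everything here is an assumption for illustration: the four-node graph, the degree-based vector standing in for a GraphSAGE embedding, and the element-wise definitions used for the Hadamard and L2 features. It is not how GDS implements the split internally.

```python
import math

nodes = ["a", "b", "c", "d"]
# Relationship sets as produced by a split: only feature-input is visible to node property steps.
feature_input_rels = [("a", "b"), ("b", "c")]
train_rels = [("c", "d")]
test_rels = [("a", "d")]


def node_properties(all_nodes, rels):
    """Stand-in for an embedding step: [degree, 1.0] for every node, even isolated ones."""
    degree = {n: 0 for n in all_nodes}
    for source, target in rels:
        degree[source] += 1
        degree[target] += 1
    return {n: [float(degree[n]), 1.0] for n in all_nodes}


def hadamard(u, v):
    # Element-wise product of the two node vectors.
    return [x * y for x, y in zip(u, v)]


def l2(u, v):
    # Element-wise squared difference of the two node vectors.
    return [(x - y) ** 2 for x, y in zip(u, v)]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


# Held-out relationships are NOT passed in, yet every node still receives a property.
embedding = node_properties(nodes, feature_input_rels)

# Link features for a held-out train pair, computed from node properties alone.
source, target = train_rels[0]
train_features = {
    "hadamard": hadamard(embedding[source], embedding[target]),
    "l2": l2(embedding[source], embedding[target]),
    "cosine": cosine(embedding[source], embedding[target]),
}
```

Node d has no feature-input relationships at all, but it still gets a vector, so features for the pairs (c, d) and (a, d) can be computed without the held-out relationships ever influencing the node properties.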
It does look like you're doing the correct steps. That you get this error is very surprising to me, since we have tests that pass that essentially do exactly this. We will discuss internally what the issue could be.
Also, are you sure that the GraphSAGE model is available in the model catalog at the time when you train the pipeline? Models are dropped from the catalog automatically if you restart the database (the exception being models persisted with the GDS EE store/load functionality).
Hi @adamnsch, yes, I'm aware that restarting the DB drops the existing models, so I'm running all the steps without restarts or stops.
If it's helpful for understanding the problem, here are all my steps (some are simplified here for convenience, e.g. a single projection instead of separate subgraphs). I do have multiple DBs and am running from my Jupyter notebook, but based on my testing, model candidates show up across DBs, and since I don't specify a DB I believe I'm using the default one here.
CALL gds.graph.project( 'gene_disease', { `Label A`: { label: 'label a', properties: ['a','b','c','d'] }, `Label B`: { label: 'label b' }, `Label C`: { label: 'label c', properties: ['e'] } }, { relationship_a_b: { type: 'relationship a b', orientation: 'UNDIRECTED' }, relationship_b_c: { type: 'relationship b c', orientation: 'UNDIRECTED' }, relationship_c_a: { type: 'relationship c a', orientation: 'UNDIRECTED' } })
CALL gds.beta.graphSage.train( 'gene_disease', { modelName: 'multiLabelModel', featureProperties: ['a','b','c','d'], embeddingDimension: 64, projectedFeatureDimension: 16 } ) YIELD modelInfo as info RETURN info.modelName as modelName, info.metrics.didConverge as didConverge, info.metrics.ranEpochs as ranEpochs, info.metrics.epochLosses as epochLosses
CALL gds.beta.model.list() YIELD modelInfo, loaded, shared, stored RETURN modelInfo.modelName AS modelName, loaded, shared, stored
CALL gds.beta.pipeline.linkPrediction.create('lp-pipeline2')
CALL gds.beta.pipeline.linkPrediction.addNodeProperty('lp-pipeline2', 'beta.graphSage', { modelName: 'multiLabelModel', mutateProperty: 'embedding' })
CALL gds.beta.pipeline.linkPrediction.addFeature('lp-pipeline2', 'cosine', { nodeProperties: ['embedding'] }) YIELD featureSteps;
CALL gds.beta.pipeline.linkPrediction.configureSplit(
'lp-pipeline2', {
testFraction: 0.2,
trainFraction: 0.2
})
YIELD splitConfig
CALL gds.alpha.pipeline.linkPrediction.addRandomForest('lp-pipeline2', { numberOfDecisionTrees: {range: [10, 70]}, numberOfSamplesRatio: {range: [0.0, 1.0]} }) YIELD parameterSpace
CALL gds.alpha.pipeline.linkPrediction.configureAutoTuning('lp-pipeline2', { maxTrials: 10 }) YIELD autoTuningConfig
CALL gds.beta.pipeline.linkPrediction.train('gene_disease', {pipeline: 'lp-pipeline2', modelName: 'lp-model2', randomSeed: 42}) YIELD modelInfo RETURN modelInfo.bestParameters AS winningModel, modelInfo.metrics.AUCPR.outerTrain AS trainGraphScore, modelInfo.metrics.AUCPR.test AS testGraphScore;
Ok, thanks a lot for the explanation regarding the feature-input graph/set! Sorry, I missed that those were two separate things. So basically those various sets just consist of relationships, and the feature-input graph is used to compute the surrounding topological information (neighboring nodes etc.) for a node, not including the links we want to predict. This information is embedded as node properties to calculate the link features, which are used to predict relationships. The feature-input set consists of the remaining positive relationships, which are included in the embeddings, while the train/test positive/negative relationships are what is predicted and are therefore left out of the embeddings to avoid data leakage. Hopefully I got it :)
Hi @adamnsch, just wanted to check in and see if anyone has had any luck determining what the issue may be with the trained graphSage model not being found when trying to utilize it in the pipeline?
Hi @gurugecl,
Sorry for the delay. Indeed there was a bug causing the issue you observed. It was fixed in our latest patch release, which is GDS version 2.1.9. Are you able to upgrade to this version?
Thanks for reporting the issue :) Adam
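For anyone checking whether their installation already contains the fix, here is a small Python sketch that compares version numbers numerically. The helper names are hypothetical; the installed version string is assumed to come from something like RETURN gds.version() and to be a plain dotted number such as 2.1.6.

```python
def parse_version(text):
    """Turn '2.1.9' into (2, 1, 9) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


# Patch release named in this thread as containing the fix.
FIXED_IN = parse_version("2.1.9")


def has_model_lookup_fix(installed):
    """Hypothetical check: True if the installed GDS version is 2.1.9 or newer."""
    return parse_version(installed) >= FIXED_IN


needs_upgrade = not has_model_lookup_fix("2.1.6")  # version from the original report
```

Comparing the raw strings instead would be misleading, since "2.1.10" sorts before "2.1.9" as text.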
No problem and yes I was able to upgrade to that version and validate that it works. Thanks a lot for resolving it!
Hi,
I'm trying to utilize the GDS link prediction pipeline with the GraphSage embedding algorithm
https://neo4j.com/docs/graph-data-science/current/machine-learning/linkprediction-pipelines/config/
In the Adding Node Properties step it says that the procedureName parameter takes the name of the procedure but doesn't specify which procedures can be used. I can get it to work with 'fastRP' but when I try 'graphSage' or any version of the word graphsage I get the error
ClientError: [Procedure.ProcedureCallFailed] Failed to invoke procedure gds.beta.pipeline.linkPrediction.addNodeProperty: Caused by: java.lang.IllegalArgumentException: Could not find a procedure called gds.graphsage.mutate
This is how I am making that call
CALL gds.beta.pipeline.linkPrediction.addNodeProperty('lp-pipeline2', 'graphSage', { mutateProperty: 'embedding', randomSeed: 42 })
I see in the graphSage documentation that there is a procedure called gds.beta.graphSage.mutate. So is it not possible to use graphSage within the link prediction pipeline?
I know you can utilize graphSage separately without the pipeline but I would really like to utilize it within the pipeline because it makes it easier to split the dataset to train the embedding algorithm on separate data using the feature-input set
Thank you
GDS version: 2.1.6
Neo4j version: 4.4.8
Operating system: Mac