Closed devineyfajr closed 1 year ago
Hello @devineyfajr, thank you for reporting the bug! Your reported AUC-PR values are indeed unexpected. Could you help me reproduce your result?
What does your pipeline look like? Ideally, could you also provide a data set?
No, I can't provide the dataset that I used. Here are the relevant bits of code, though:
CALL gds.beta.pipeline.linkPrediction.create(
  "LinkPipeline"
)
YIELD
  name, nodePropertySteps, featureSteps, splitConfig, parameterSpace;
CALL gds.beta.pipeline.linkPrediction.addFeature(
  'LinkPipeline',
  'L2',
  {
    nodeProperties: ['embedding']
  }
) YIELD featureSteps;
match ()-[r:IS_RELATED_TO_IDENTITY]->() where r.relationship = "DAUGHTER"
with count(r) as subN
match (i:Identity) with subN, count(i) as N
with (N-subN)/toFloat(subN) as nSR
CALL gds.beta.pipeline.linkPrediction.configureSplit('LinkPipeline', {
  testFraction: 0.3,
  trainFraction: 0.3,
  negativeSamplingRatio: nSR,
  validationFolds: 3
})
YIELD splitConfig
return splitConfig;
CALL gds.beta.pipeline.linkPrediction.addLogisticRegression('LinkPipeline',
  {penalty: 0.0, batchSize: 1000, minEpochs: 1, maxEpochs: 100, patience: 1, tolerance: 0.001 }) YIELD parameterSpace;
CALL gds.beta.pipeline.linkPrediction.addLogisticRegression('LinkPipeline',
  {penalty: 0.0, batchSize: 1000, minEpochs: 1, maxEpochs: 100, patience: 3, tolerance: 0.001 }) YIELD parameterSpace;
CALL gds.alpha.pipeline.linkPrediction.addRandomForest('LinkPipeline',
  {maxDepth: 10, minSplitSize: 5, numberOfDecisionTrees: 50, numberOfSamplesRatio: 0.1}) YIELD parameterSpace;
CALL gds.alpha.pipeline.linkPrediction.addRandomForest('LinkPipeline',
  {maxDepth: 10, minSplitSize: 5, numberOfDecisionTrees: 100, numberOfSamplesRatio: 0.1}) YIELD parameterSpace;
call gds.graph.project.cypher(
  "forLinkPredFRP",
  "match (i:Identity) return id(i) as id, i.embedding as embedding",
  "match (i1:Identity)-[r:IS_RELATED_TO_IDENTITY]->(i2:Identity)
   where r.relationship = 'DAUGHTER'
   return id(i1) as source, id(i2) as target, type(r) as type"
)
YIELD
  graphName AS graph, nodeQuery, nodeCount AS nodes, relationshipQuery, relationshipCount AS rels;
match ()-[r]->()
where type(r) = "IS_RELATED_TO_IDENTITY" and r.relationship = "DAUGHTER"
with count(r) as subN
match (i:Identity) with subN, count(i) as N
with (N-subN)/toFloat(subN) as nSR
CALL gds.beta.pipeline.linkPrediction.train('forLinkPredFRP', {
  negativeClassWeight: 1.0 / nSR,
  pipeline: 'LinkPipeline',
  modelName: 'lp-pipeline-FRP'
}) YIELD modelInfo, configuration, modelSelectionStats
RETURN
  configuration,
  modelInfo.bestParameters AS winningModel,
  round(modelInfo.metrics.AUCPR.outerTrain,4) AS trainGraphScore,
  round(modelInfo.metrics.AUCPR.test,4) AS testGraphScore,
  round(modelInfo.metrics.AUCPR.train.avg,4) AS trainAvg,
  round(modelInfo.metrics.AUCPR.validation.avg,4) AS validationAvg,
  modelSelectionStats;
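A side note on the ratio used above: nSR is derived from the node count N, not from the number of node pairs. For comparison, the sketch below computes the ratio of unconnected node pairs to existing relationships, which is how the GDS documentation appears to describe the true class ratio. It assumes relationships are treated as undirected pairs and reuses the label and property filter from the queries above. The names possiblePairs and trueClassRatio are illustrative only.

```cypher
// Count the positive examples (existing DAUGHTER relationships).
MATCH ()-[r:IS_RELATED_TO_IDENTITY]->()
WHERE r.relationship = 'DAUGHTER'
WITH count(r) AS rels
// Count the candidate nodes.
MATCH (i:Identity)
WITH rels, count(i) AS n
// Number of possible undirected node pairs: n * (n - 1) / 2.
WITH rels, n * (n - 1) / 2 AS possiblePairs
// Unconnected pairs per connected pair (assumes rels > 0).
RETURN (possiblePairs - rels) / toFloat(rels) AS trueClassRatio;
```

On a sparse graph this value is typically far larger than (N - subN) / subN, so sampling at the full ratio may be impractical. This note is independent of the out-of-range AUC-PR values reported in this issue.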
@devineyfajr Thanks a lot for this report. We believe we have found the issue and will deliver a patched version as soon as we can.
@devineyfajr we just released a patched version that should fix this bug: https://github.com/neo4j/graph-data-science/releases/tag/2.0.3
Let us know if we've fixed your problem!
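To confirm that the patched library is the one actually loaded before re-running the pipeline, the installed version can be queried. A minimal check, assuming the GDS plugin is installed on the target database:

```cypher
// Report the installed GDS library version; the fix shipped in 2.0.3.
RETURN gds.version() AS gdsVersion;
```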
That was fast! We have sprint planning tomorrow, so I will test it out early next week.
Closing since this has been fixed.
Describe the bug
The link prediction pipeline reports AUC-PR values below 0 and above 1 for random forest (RF) models.
GDS version: 2.0.1
Neo4j version: 4.4.4
Operating system: Linux
Expected behavior
AUC-PR values between 0 and 1.
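The expected range can be checked against the stored model after training. This is a minimal sketch, not a definitive check. It assumes the model name lp-pipeline-FRP and the metric layout used in the training query in this thread (modelInfo.metrics.AUCPR with outerTrain, test, train.avg and validation.avg), and it assumes gds.beta.model.list is available in the installed GDS version:

```cypher
// Read the stored AUC-PR scores back from the model catalog
// and flag any value outside the valid range [0, 1].
CALL gds.beta.model.list('lp-pipeline-FRP')
YIELD modelInfo
WITH [
  {metric: 'outerTrain',    value: modelInfo.metrics.AUCPR.outerTrain},
  {metric: 'test',          value: modelInfo.metrics.AUCPR.test},
  {metric: 'trainAvg',      value: modelInfo.metrics.AUCPR.train.avg},
  {metric: 'validationAvg', value: modelInfo.metrics.AUCPR.validation.avg}
] AS scores
UNWIND scores AS score
RETURN score.metric AS metric,
       score.value AS value,
       score.value >= 0.0 AND score.value <= 1.0 AS inRange;
```

Any row with inRange = false reproduces the reported behavior. A null in that column means the corresponding metric entry was not present in the model info.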