neo4j / graph-data-science

Source code for the Neo4j Graph Data Science library of graph algorithms.
https://neo4j.com/docs/graph-data-science/current/

AUC-PR reported for Random Forest seems wrong #187

Closed devineyfajr closed 1 year ago

devineyfajr commented 2 years ago

Describe the bug

The link prediction pipeline reports AUC-PR values below 0 and above 1 for Random Forest (RF) models:

{validationStats:
    {AUCPR:[
        {min: 0.9217002575722579, avg: 0.9332871778913939, params: {maxEpochs: 100, minEpochs: 20, penalty: 0.0, patience: 1, methodName: "LogisticRegression", batchSize: 1000, tolerance: 0.001}, max: 0.9442416534407375},
        {min: 0.9217002575722579, avg: 0.9332871778913939, params: {maxEpochs: 100, minEpochs: 20, penalty: 0.0, patience: 3, methodName: "LogisticRegression", batchSize: 1000, tolerance: 0.001}, max: 0.9442416534407375},
        {min: -10.79811743206902, avg: -8.580734678671034, params: {maxDepth: 10, methodName: "RandomForest", minSplitSize: 5, numberOfDecisionTrees: 50, numberOfSamplesRatio: 0.1}, max: -5.704912187554019},
        {min: 3.3837056148636755, avg: 4.467121895656157, params: {maxDepth: 10, methodName: "RandomForest", minSplitSize: 5, numberOfDecisionTrees: 100, numberOfSamplesRatio: 0.1}, max: 6.521771270895979}
    ]},
    trainStats: {AUCPR: [
            {min: 0.9261038073636204, avg: 0.9333034811150953, params: {maxEpochs: 100, minEpochs: 20, penalty: 0.0, patience: 1, methodName: "LogisticRegression", batchSize: 1000, tolerance: 0.001}, max: 0.9411860117389778},
            {min: 0.9261038073636204, avg: 0.9333034811150953, params: {maxEpochs: 100, minEpochs: 20, penalty: 0.0, patience: 3, methodName: "LogisticRegression", batchSize: 1000, tolerance: 0.001}, max: 0.9411860117389778},
            {min: -8.712799907905683, avg: -6.789797243883313, params: {maxDepth: 10, methodName: "RandomForest", minSplitSize: 5, numberOfDecisionTrees: 50, numberOfSamplesRatio: 0.1}, max: -5.525856987374333},
            {min: 2.107312956024476, avg: 2.4308584048446877, params: {maxDepth: 10, methodName: "RandomForest", minSplitSize: 5, numberOfDecisionTrees: 100, numberOfSamplesRatio: 0.1}, max: 2.804754216901845}
    ]},
    bestParameters: {maxDepth: 10, methodName: "RandomForest", minSplitSize: 5, numberOfDecisionTrees: 100, numberOfSamplesRatio: 0.1}
}

GDS version: 2.0.1
Neo4j version: 4.4.4
Operating system: linux

Expected behavior

AUC-PR between 0 and 1
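To illustrate why values outside that range cannot be valid, here is a minimal Python sketch of one common AUC-PR estimator, the step-wise average precision. It is not the GDS implementation (the thread does not show how GDS computes the metric); the function name `average_precision` and the toy labels and scores are made up for illustration, and ties between scores are not treated specially.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve (average precision).

    Illustrative sketch only; assumes binary labels (1 = positive, 0 = negative).
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")

    # Rank examples from highest to lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    true_pos = 0
    area = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_pos += 1
            # Each positive raises recall by 1/total_pos; the height of that
            # step is the precision at this rank, which is at most 1.
            area += (true_pos / rank) / total_pos
    return area


# Perfect ranking: both positives come first, so the area is 1.0.
print(average_precision([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))

# Worst ranking for this toy data: positives come last, area is 5/12.
print(average_precision([0, 0, 1, 1], [0.9, 0.8, 0.2, 0.1]))
```

The sum has exactly `total_pos` terms, each a precision in (0, 1] divided by `total_pos`, so the result always lies in (0, 1]. Values such as -8.58 or 4.47 therefore point to a computation error rather than a poor model.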

FlorentinD commented 2 years ago

Hello @devineyfajr, thank you for reporting the bug! Your reported AUC-PR values are indeed unexpected. Could you help me reproduce your result?

What does your pipeline look like? Ideally, could you also provide a data set?

devineyfajr commented 2 years ago

No, I can't provide the dataset that I used. Here are the relevant bits of code, though:

CALL gds.beta.pipeline.linkPrediction.create(
  "LinkPipeline"
)
YIELD
  name, nodePropertySteps, featureSteps, splitConfig, parameterSpace;
CALL gds.beta.pipeline.linkPrediction.addFeature(
  'LinkPipeline',
  'L2',
  {
    nodeProperties: ['embedding']
  }
) YIELD featureSteps;
match ()-[r:IS_RELATED_TO_IDENTITY]->() where r.relationship = "DAUGHTER"
with count(r) as subN
match (i:Identity) with subN, count(i) as N
with (N-subN)/toFloat(subN) as nSR
CALL gds.beta.pipeline.linkPrediction.configureSplit('LinkPipeline', {
  testFraction: 0.3,
  trainFraction: 0.3,
  negativeSamplingRatio: nSR,
  validationFolds: 3
})
YIELD splitConfig
return splitConfig
;
CALL gds.beta.pipeline.linkPrediction.addLogisticRegression('LinkPipeline',
     {penalty: 0.0, batchSize: 1000, minEpochs: 1, maxEpochs: 100, patience: 1, tolerance: 0.001 }) YIELD parameterSpace;
CALL gds.beta.pipeline.linkPrediction.addLogisticRegression('LinkPipeline',
     {penalty: 0.0, batchSize: 1000, minEpochs: 1, maxEpochs: 100, patience: 3, tolerance: 0.001 }) YIELD parameterSpace;
CALL gds.alpha.pipeline.linkPrediction.addRandomForest('LinkPipeline',
     {maxDepth: 10, minSplitSize: 5, numberOfDecisionTrees: 50, numberOfSamplesRatio: 0.1}) YIELD parameterSpace;
CALL gds.alpha.pipeline.linkPrediction.addRandomForest('LinkPipeline',
     {maxDepth: 10, minSplitSize: 5, numberOfDecisionTrees: 100, numberOfSamplesRatio: 0.1}) YIELD parameterSpace;
call gds.graph.project.cypher(
     "forLinkPredFRP",
     "match (i:Identity) return id(i) as id, i.embedding as embedding",
     "match (i1:Identity)-[r:IS_RELATED_TO_IDENTITY]->(i2:Identity) 
            where r.relationship = 'DAUGHTER'
            return id(i1) as source, id(i2) as target, type(r) as type"
     )
YIELD
  graphName AS graph, nodeQuery, nodeCount AS nodes, relationshipQuery, relationshipCount AS rels;
match ()-[r]->()
where type(r) = "IS_RELATED_TO_IDENTITY" and r.relationship = "DAUGHTER"
with count(r) as subN
match (i:Identity) with subN, count(i) as N
with (N-subN)/toFloat(subN) as nSR
CALL gds.beta.pipeline.linkPrediction.train('forLinkPredFRP', {
  negativeClassWeight: 1.0 / nSR,
  pipeline: 'LinkPipeline',
  modelName: 'lp-pipeline-FRP'
}) YIELD modelInfo, configuration, modelSelectionStats
RETURN
  configuration,
  modelInfo.bestParameters AS winningModel,
  round(modelInfo.metrics.AUCPR.outerTrain,4) AS trainGraphScore,
  round(modelInfo.metrics.AUCPR.test,4) AS testGraphScore,
  round(modelInfo.metrics.AUCPR.train.avg,4) AS trainAvg,
  round(modelInfo.metrics.AUCPR.validation.avg,4) AS validationAvg,
  modelSelectionStats
;

Mats-SX commented 2 years ago

@devineyfajr Thanks a lot for this report. We believe we have found the issue and will deliver a patched version as soon as we can.

AliciaFrame commented 2 years ago

@devineyfajr we just released a patched version that should fix this bug: https://github.com/neo4j/graph-data-science/releases/tag/2.0.3

Let us know if we've fixed your problem!

devineyfajr commented 2 years ago

That was fast! Sprint planning is tomorrow, so I will test it out early next week.

adamnsch commented 1 year ago

Closing, since this has been fixed.