neo4j / graph-data-science

Source code for the Neo4j Graph Data Science library of graph algorithms.
https://neo4j.com/docs/graph-data-science/current/

Report "test" and "outerTrain" metrics for all modelCandidates in pipeline #258

Closed: devineyfajr closed this issue 1 year ago

devineyfajr commented 1 year ago

"test" and "outerTrain" metrics are only reported for the winning model. The "validation" and "train" components reported for modelCandidates don't shed much light on relative performance, as shown below. And don't make much sense in this example.

{pipeline: {featureProperties: [{feature: "degreeCentrality"}, {feature: "embedding"}], nodePropertySteps: []},
modelName: "nc-LR-model",
nodePropertySteps: [],
classes: [11870232, 11870233, 11870234, 11870236, 11870237, 11870238, 11870239, 11870240, 11870241, 11870242, 11870244, 11870246, 11870247],
modelType: "NodeClassification",
metrics: {ACCURACY: {test: 0.36777584, outerTrain: 0.37088389, validation: {avg: 1.0, min: 1.0, max: 1.0}, train: {avg: 1.0, min: 1.0, max: 1.0}}},
featureProperties: ["degreeCentrality", "embedding"],
bestParameters: {maxEpochs: 100, minEpochs: 1, classWeights: [], penalty: 0.0, patience: 1, methodName: "LogisticRegression", focusWeight: 0.5172789803526464, batchSize: 100, tolerance: 0.001, learningRate: 0.001}}
{modelCandidates: [
{parameters: {maxEpochs: 100, minEpochs: 1, classWeights: [], penalty: 0.0, patience: 1, methodName: "LogisticRegression", focusWeight: 0.5172789803526464, batchSize: 100, tolerance: 0.001, learningRate: 0.001}, metrics: {ACCURACY: {validation: {avg: 1.0, min: 1.0, max: 1.0}, train: {avg: 1.0, min: 1.0, max: 1.0}}}},
{parameters: {maxEpochs: 100, minEpochs: 1, classWeights: [], penalty: 0.0, patience: 1, methodName: "LogisticRegression", focusWeight: 0.9741508763694268, batchSize: 100, tolerance: 0.001, learningRate: 0.001}, metrics: {ACCURACY: {validation: {avg: 1.0, min: 1.0, max: 1.0}, train: {avg: 1.0, min: 1.0, max: 1.0}}}},
{parameters: {maxEpochs: 100, minEpochs: 1, classWeights: [], penalty: 0.0, patience: 1, methodName: "LogisticRegression", focusWeight: 0.5220077554378219, batchSize: 100, tolerance: 0.001, learningRate: 0.001}, metrics: {ACCURACY: {validation: {avg: 1.0, min: 1.0, max: 1.0}, train: {avg: 1.0, min: 1.0, max: 1.0}}}},
{parameters: {maxEpochs: 100, minEpochs: 1, classWeights: [], penalty: 0.0, patience: 1, methodName: "LogisticRegression", focusWeight: 0.49693986601495244, batchSize: 100, tolerance: 0.001, learningRate: 0.001}, metrics: {ACCURACY: {validation: {avg: 1.0, min: 1.0, max: 1.0}, train: {avg: 1.0, min: 1.0, max: 1.0}}}},
{parameters: {maxEpochs: 100, minEpochs: 1, classWeights: [], penalty: 0.0, patience: 1, methodName: "LogisticRegression", focusWeight: 0.21773699553667014, batchSize: 100, tolerance: 0.001, learningRate: 0.001}, metrics: {ACCURACY: {validation: {avg: 1.0, min: 1.0, max: 1.0}, train: {avg: 1.0, min: 1.0, max: 1.0}}}},
{parameters: {maxEpochs: 100, minEpochs: 1, classWeights: [], penalty: 0.0, patience: 1, methodName: "LogisticRegression", focusWeight: 0.7247162317145882, batchSize: 100, tolerance: 0.001, learningRate: 0.001}, metrics: {ACCURACY: {validation: {avg: 1.0, min: 1.0, max: 1.0}, train: {avg: 1.0, min: 1.0, max: 1.0}}}},
{parameters: {maxEpochs: 100, minEpochs: 1, classWeights: [], penalty: 0.0, patience: 1, methodName: "LogisticRegression", focusWeight: 0.02909844579931331, batchSize: 100, tolerance: 0.001, learningRate: 0.001}, metrics: {ACCURACY: {validation: {avg: 1.0, min: 1.0, max: 1.0}, train: {avg: 1.0, min: 1.0, max: 1.0}}}},
{parameters: {maxEpochs: 100, minEpochs: 1, classWeights: [], penalty: 0.0, patience: 1, methodName: "LogisticRegression", focusWeight: 0.9284448402106017, batchSize: 100, tolerance: 0.001, learningRate: 0.001}, metrics: {ACCURACY: {validation: {avg: 1.0, min: 1.0, max: 1.0}, train: {avg: 1.0, min: 1.0, max: 1.0}}}}],
bestParameters: {maxEpochs: 100, minEpochs: 1, classWeights: [], penalty: 0.0, patience: 1, methodName: "LogisticRegression", focusWeight: 0.5172789803526464, batchSize: 100, tolerance: 0.001, learningRate: 0.001}, bestTrial: 1}
Mats-SX commented 1 year ago

Hello @devineyfajr and thanks for contacting us.

The primary reason we don't report the outerTrain and test metrics for the losing candidates is the time it would take. We let all candidates learn from the train set and evaluate them on the validation sets, but once we've established which candidate is the winner, we discard the others and retain only the winning candidate. We then retrain it on the full train set (including all validation folds), evaluate it there, and finally evaluate it on the test set.

If we also evaluated the losing candidates on the outerTrain and test sets, we would gain very little from the validation-fold phase. The whole point of that phase is to make candidates compete, and the assumption is that the candidate that performs best on the validation folds will also perform best on the test set; there is little value in spending time and resources on the other candidates after having determined that they are not the best.

Another way to do this could be to retain all candidates after training, retrain them on the full train set including the validation folds, and let their eventual test scores determine the winner. That would be more resource-consuming, and in the general case, when the above assumption holds, it would be wasteful. Of course, the assumption may not hold in all cases, and indeed it is possible that a candidate that lost during the candidate selection phase (perhaps by a small margin) would perform better on the test set. However, we believe this is a rare enough case not to warrant optimising for it.

It is possible to achieve what you're after with some additional work. If you think one of your losing candidates is the one that would actually perform best on the test set, you can simply rerun your pipeline with only that candidate as the training method. If you set the randomSeed parameter to the same value, you should get the same train/validation/test splits for the same train graph, and the eventual test score will be evaluated on your single candidate and be comparable to your previous run.
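For illustration, such a single-candidate rerun could look roughly like the sketch below. It assumes the GDS 2.x pipeline procedures; the graph name 'myGraph', target property 'class', model name, and seed value 42 are placeholders that you would replace with the values from your original run, while the feature names and candidate parameters are copied from your output above.

```cypher
// Create a fresh pipeline containing only the candidate of interest.
CALL gds.beta.pipeline.nodeClassification.create('nc-single-candidate');

// Reuse the same features as in the original pipeline.
CALL gds.beta.pipeline.nodeClassification.selectFeatures(
  'nc-single-candidate', ['degreeCentrality', 'embedding']);

// Add exactly one model candidate, e.g. the losing candidate you want to re-evaluate.
CALL gds.beta.pipeline.nodeClassification.addLogisticRegression('nc-single-candidate', {
  maxEpochs: 100, minEpochs: 1, penalty: 0.0, patience: 1,
  focusWeight: 0.9741508763694268, batchSize: 100,
  tolerance: 0.001, learningRate: 0.001
});

// Train with the same randomSeed as the original run so that the
// train/validation/test splits are identical, then read the metrics.
CALL gds.beta.pipeline.nodeClassification.train('myGraph', {
  pipeline: 'nc-single-candidate',
  modelName: 'nc-LR-model-single',   // placeholder model name
  targetProperty: 'class',           // placeholder target property
  metrics: ['ACCURACY'],
  randomSeed: 42                     // same seed as the original run
})
YIELD modelInfo
RETURN modelInfo.metrics.ACCURACY.test AS test,
       modelInfo.metrics.ACCURACY.outerTrain AS outerTrain;
```

The test and outerTrain values from that run can then be compared directly with the winning model's metrics from your original training.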

In your example above, it seems that all candidates scored equally well on the validation sets, so the choice of winner is effectively arbitrary. That does not seem like a healthy configuration for the training pipeline; either there is not enough data, the features are not discriminative enough, the model candidates are too similar, or something else is causing all model candidates to make exactly the same choices when learning. From what I can see in your output, only the focusWeight parameter differs between candidates. Perhaps your dataset is not imbalanced? That parameter is only effective for imbalanced datasets (see here for a discussion of this parameter).
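As a quick sanity check, you could inspect the class distribution of the target property directly, e.g. with something along these lines (here 'class' is a placeholder for your actual target property):

```cypher
// Count nodes per class to see how skewed the target distribution is
// ('class' is a placeholder for your actual target property).
MATCH (n)
RETURN n.class AS class, count(*) AS nodes
ORDER BY nodes DESC;
```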

I hope this makes sense! All the best, Mats

devineyfajr commented 1 year ago

Thanks Mats. I sorta figured that out afterwards. It's a large, unbalanced dataset, so I'm confused why all the models are getting such good validation scores. More research is needed.
