materialsproject / matbench

Matbench: Benchmarks for materials science property prediction
https://matbench.materialsproject.org
MIT License

score_array computes roc-auc values on discretized predictions #181

Open pbenner opened 2 years ago

pbenner commented 2 years ago

I am not sure if the following behavior is intended:

> cat matbench-scores-bug.py

from matbench.data_ops import score_array, CLF_KEY
from sklearn.metrics import roc_auc_score

true_array = 8*[True]+2*[False]
pred_array = 8*[ 0.4]+2*[0.2  ]

scores = score_array(true_array, pred_array, CLF_KEY)

print('matbench roc-auc:', scores['rocauc'])
print('    true roc-auc:', roc_auc_score(true_array, pred_array))

> python matbench-scores-bug.py 
matbench roc-auc: 0.5
    true roc-auc: 1.0
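
The two numbers above can be reproduced without matbench or scikit-learn. The following is only a minimal sketch: `pairwise_auc` is a hypothetical helper (not part of either library) that computes ROC-AUC as the fraction of positive/negative pairs ranked correctly, counting ties as one half, and the 0.5 threshold used to turn probabilities into labels is an assumption about how the discretization works.

```python
def pairwise_auc(y_true, y_score):
    """ROC-AUC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for t, s in zip(y_true, y_score) if t]
    neg = [s for t, s in zip(y_true, y_score) if not t]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Same data as in the bug report
true_array = 8 * [True] + 2 * [False]
pred_array = 8 * [0.4] + 2 * [0.2]

# Assumed discretization: probability >= 0.5 becomes True
THRESHOLD = 0.5
pred_labels = [p >= THRESHOLD for p in pred_array]

auc_probs = pairwise_auc(true_array, pred_array)    # scores keep their ranking
auc_labels = pairwise_auc(true_array, pred_labels)  # every prediction is False

print("roc-auc on probabilities:", auc_probs)
print("roc-auc on labels:       ", auc_labels)
```

Under that assumption every prediction falls below the threshold, so all labels are identical and the ranking information that ROC-AUC depends on is lost.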

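The mismatch arises because the values in pred_array are discretized into class labels before roc_auc_score is called: the loop over metrics overwrites pred_array when it converts predictions to labels for the other classification metrics, and the rocauc computation then receives those labels. ROC-AUC is normally evaluated on class probabilities or other continuous scores. The following patch fixes the problem by leaving the input arrays untouched and using per-metric copies: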

> cat matbench-scores-bug.patch 
--- data_ops.py 2022-08-29 09:51:34.565746826 +0200
+++ data_ops.py.new     2022-09-08 08:52:52.994181877 +0200
@@ -108,18 +108,22 @@
     for metric in metrics:
         mfunc = METRIC_MAP[metric]

+        true_array_ = true_array
+        pred_array_ = pred_array
+
         if metric == "rocauc":
             # Both arrays must be in probability form
             # if pred. array is given in probabilities
             if isinstance(pred_array[0], float):
-                true_array = homogenize_clf_array(true_array, to_probs=True)
+                true_array_ = homogenize_clf_array(true_array, to_probs=True)

         # Other clf metrics always be converted to labels
         elif metric in CLF_METRICS:
             if isinstance(pred_array[0], float):
-                pred_array = homogenize_clf_array(pred_array, to_labels=True)
+                pred_array_ = homogenize_clf_array(pred_array, to_labels=True)
+
+        computed[metric] = mfunc(true_array_, pred_array_)

-        computed[metric] = mfunc(true_array, pred_array)
     return computed

Matbench version: 70c79fbbd12130c4166e8e0bd3a6c18822e40398
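
The pattern the patch addresses can be illustrated in isolation. The sketch below is not matbench code: `to_labels`, `describe`, `score_leaky`, and `score_isolated` are hypothetical stand-ins, the metric names are made up, and both the 0.5 threshold and the order of the metrics are assumptions. It shows how reassigning an input inside a loop over metrics leaks a conversion made for one metric into later ones, and how per-iteration names avoid that.

```python
def to_labels(values, threshold=0.5):
    """Hypothetical stand-in for a probability-to-label conversion."""
    return [v >= threshold for v in values]


def describe(values):
    """Stand-in 'metric': report the element type it was handed."""
    return type(values[0]).__name__


def score_leaky(pred, metrics):
    computed = {}
    for metric in metrics:
        if metric == "needs_labels":
            # Rebinding pred affects every later iteration as well
            pred = to_labels(pred)
        computed[metric] = describe(pred)
    return computed


def score_isolated(pred, metrics):
    computed = {}
    for metric in metrics:
        # Start each iteration from the untouched input
        pred_ = pred
        if metric == "needs_labels":
            pred_ = to_labels(pred)
        computed[metric] = describe(pred_)
    return computed


metrics = ["needs_labels", "needs_probs"]  # assumed order: label metric first
pred = [0.4, 0.2]

leaky = score_leaky(pred, metrics)
isolated = score_isolated(pred, metrics)

print("leaky:   ", leaky)
print("isolated:", isolated)
```

In the leaky variant the second metric receives booleans even though it expects probabilities; in the isolated variant it receives the original floats.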

ml-evs commented 2 years ago

Related to https://github.com/materialsproject/matbench/issues/40 (where probabilities were introduced) and https://github.com/materialsproject/matbench/issues/137 (which I think reports the same underlying issue as this one).