kosukeimai / fastLink

R package fastLink: Fast Probabilistic Record Linkage

New feature request: Matthews correlation coefficient #43

Open aalexandersson opened 4 years ago

aalexandersson commented 4 years ago

Please add Matthews correlation coefficient (MCC) as an additional statistic for the confusion table:

      TP * TN - FP * FN
MCC = -----------------------------------------------------
      [(TP + FP) * (FN + TN) * (FP + TN) * (TP + FN)]^(1/2)

The MCC is useful as an overall measure of linkage quality. It is more informative than accuracy and the F1 score on imbalanced data because it accounts for the sizes of all four confusion-table categories (TP, TN, FP, and FN). In practice, I find that most linkage data are imbalanced, consisting mostly of TN.
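For illustration, here is a minimal base-R sketch of the proposed statistic. The `mcc` helper and the counts are hypothetical, not part of fastLink; in practice the TP/TN/FP/FN would come from fastLink's confusion table.

```r
# Hypothetical helper (not part of fastLink): MCC from confusion-table counts.
# Counts are coerced to double so the product of the four marginal sums
# cannot overflow R's 32-bit integers on large linkages.
mcc <- function(tp, tn, fp, fn) {
  tp <- as.numeric(tp); tn <- as.numeric(tn)
  fp <- as.numeric(fp); fn <- as.numeric(fn)
  num <- tp * tn - fp * fn
  den <- sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
  if (den == 0) return(0)  # common convention when a marginal sum is zero
  num / den
}

mcc(10, 10, 0, 0)   # perfect agreement: 1
mcc(0, 0, 10, 10)   # total disagreement: -1
mcc(tp = 90, tn = 9800, fp = 10, fn = 100)  # imbalanced table, mostly TN
```

The imbalanced example mirrors the typical linkage situation described above: a high TN count alone would inflate accuracy, while MCC stays sensitive to the FP and FN cells.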

Wikipedia: https://en.wikipedia.org/wiki/Matthews_correlation_coefficient
Matthews' article (1975): https://doi.org/10.1016/0005-2795(75)90109-9

Matthews, page 445:

"A correlation of:
   C =  1 indicates perfect agreement,
   C =  0 is expected for a prediction no better than random, and
   C = -1 indicates total disagreement between prediction and observation".
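These anchor values follow from the fact that the MCC is exactly the Pearson correlation (phi coefficient) between the predicted and observed 0/1 match labels, which is how Matthews defined it. A quick base-R check with made-up labels (illustrative only, not fastLink output):

```r
# Illustrative 0/1 vectors: observed true-match status and predicted links.
obs  <- c(1, 1, 1, 0, 0, 0, 0, 0)
pred <- c(1, 1, 0, 1, 0, 0, 0, 0)

tp <- sum(pred == 1 & obs == 1)
tn <- sum(pred == 0 & obs == 0)
fp <- sum(pred == 1 & obs == 0)
fn <- sum(pred == 0 & obs == 1)

mcc <- (tp * tn - fp * fn) /
  sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

all.equal(mcc, cor(pred, obs))  # TRUE: MCC is the correlation of the labels
```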

Mentioned in Tharwat's article (2018): https://doi.org/10.1016/j.aci.2018.08.003
Recommended by Luque et al. (2019): https://doi.org/10.1016/j.patcog.2019.02.023

Anders

aalexandersson commented 2 years ago

Recommended by Canbek et al. (2021): https://rdcu.be/cvT7d

Conclusion:

In conclusion, this study proposes a new comprehensive benchmarking method to analyze the robustness of performance metrics and ranks 15 performance metrics in the literature. Researchers can use MCC as the most robust metric for general objective purposes to be on the safe side.

Full reference: Canbek, G., Taskaya Temizel, T. & Sagiroglu, S. BenchMetrics: a systematic benchmarking method for binary classification performance metrics. Neural Computing and Applications 33, 14623–14650 (2021). https://doi.org/10.1007/s00521-021-06103-6