scikit-learn-contrib / scikit-learn-extra

scikit-learn contrib estimators
https://scikit-learn-extra.readthedocs.io
BSD 3-Clause "New" or "Revised" License
187 stars 42 forks

Add robust metric #122

Open TimotheeMathieu opened 3 years ago

TimotheeMathieu commented 3 years ago

This PR uses the Huber robust mean estimator to build a robust metric.

Description: one of the big challenges of robust machine learning is that the usual scoring scheme (cross-validation with MSE, for instance) is not itself robust. If the dataset has some outliers, then the test sets in cross-validation may contain outliers, and the cross-validation MSE will report a huge error for our robust algorithm on any corrupted data. This is why, for example, robust methods cannot be competitive in Kaggle regression challenges: the error computation itself is not robust. This PR proposes a robust metric that allows us to compute, for instance, a robust cross-validation MSE.

Example:

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn_extra.robust import make_huber_metric

robust_mse = make_huber_metric(mean_squared_error, c=9)
# c = 9 -> more than 99% of a standard normal lies within [-3, 3],
# hence more than 99% of its square lies within [0, 9].

y_true = np.random.normal(size=100)
y_true_cor = y_true.copy()
y_true_cor[42] = 20 # this is an outlier in the test set
y_pred = np.random.normal(size=100)

print('MSE on uncorrupted : %.3f' % mean_squared_error(y_true, y_pred))
print('Robust MSE on uncorrupted : %.3f' % robust_mse(y_true, y_pred))
print('MSE on corrupted : %.3f' % mean_squared_error(y_true_cor, y_pred))
print('Robust MSE on corrupted : %.3f' % robust_mse(y_true_cor, y_pred))

This returns

MSE on uncorrupted : 2.152 
Robust MSE on uncorrupted : 2.072 
MSE on corrupted : 7.202 
Robust MSE on corrupted : 2.072 
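For readers curious how such a metric can behave, the core idea can be sketched independently of the PR's code: estimate the mean of the per-sample losses with a Huber M-estimator (an iteratively reweighted mean) instead of the empirical mean. Note this sketch takes a per-sample loss function, unlike the PR's make_huber_metric, which wraps an aggregate sklearn metric; the names huber_mean and make_robust_metric below are illustrative, not the PR's API.

```python
import numpy as np

def huber_mean(x, c=9.0, n_iter=50):
    # Iteratively reweighted estimate of the mean: samples further than c
    # from the current estimate are down-weighted instead of dropped.
    mu = np.median(x)  # robust starting point
    for _ in range(n_iter):
        dist = np.abs(x - mu)
        w = np.where(dist <= c, 1.0, c / np.maximum(dist, 1e-12))
        mu = np.average(x, weights=w)
    return mu

def make_robust_metric(per_sample_loss, c=9.0):
    # Replace the plain average of the losses by the Huber mean.
    def metric(y_true, y_pred):
        return huber_mean(per_sample_loss(y_true, y_pred), c=c)
    return metric

# Robust analogue of the MSE: Huber mean of the squared errors.
robust_mse = make_robust_metric(lambda yt, yp: (yt - yp) ** 2, c=9.0)
```

Because an outlier's squared error gets weight roughly c / error instead of 1, a single corrupted test point barely moves the estimate, which is the behavior shown in the output above.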
lorentzenchr commented 3 years ago

Having a Huber loss available as a metric makes sense for models fitted with the Huber loss.

Be aware that the Huber loss elicits something in between the median and the expectation, so it is not entirely clear what you get/estimate. The omnipresent point about the MSE not being robust comes with at least 2 important caveats:

Last but not least, my all time favorite reference: https://arxiv.org/abs/0912.0902
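The first point above can be checked numerically: on a skewed sample, the minimizer of the average Huber loss lands strictly between the median and the mean. A small sketch (the threshold delta=1 and the exponential sample are arbitrary choices for illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def huber_loss(r, delta=1.0):
    # Quadratic near zero, linear in the tails.
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * a**2, delta * (a - 0.5 * delta))

rng = np.random.default_rng(0)
x = rng.exponential(size=100_000)  # skewed: median ~0.69, mean ~1.0

# Location estimate that minimizes the average Huber loss.
res = minimize_scalar(lambda m: huber_loss(x - m).mean(),
                      bounds=(0.0, 3.0), method="bounded")
m_hat = res.x  # sits between the sample median and the sample mean
```

As delta grows the minimizer approaches the mean; as it shrinks, the median.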

TimotheeMathieu commented 3 years ago

Thanks for the comments.

@lorentzenchr What I did is not the Huber loss. It is a robust estimator of the mean applied to the squared errors. I used the MSE only as an example; I can also make a robust version of the mean absolute error with make_huber_metric(mean_absolute_error, c=9). This is very different, because the aim is still to estimate the MSE or the mean absolute error, just while ignoring the outliers. I do not use a different loss function; I use a different way to estimate the mean in MEAN squared error and MEAN absolute error, because the empirical mean is not robust while the Huber estimator is. This can be confusing for people used to the Huber loss, but it really is a different thing, and since it is also due to Huber I can't really change the name.

If you want references, there is for instance "Robust estimation of a location parameter" by Huber or, more recently, "Challenging the empirical mean and empirical variance: a deviation study" by Catoni.

EDIT : I added an explanation in the user guide that gives some equations to explain this.

lorentzenchr commented 3 years ago

@TimotheeMathieu Thanks for the explanation, now I get it. Something that could be mentioned in the example is the trimmed mean, as a simpler entry point to robust estimation.
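As an illustration of that suggestion (not part of the PR), a trimmed-mean variant of the same idea can be written with scipy.stats.trim_mean; the name trimmed_mse below is hypothetical:

```python
import numpy as np
from scipy.stats import trim_mean

def trimmed_mse(y_true, y_pred, proportiontocut=0.05):
    # Average the squared errors after discarding the most extreme 5%
    # at each end of their distribution (symmetric trimming).
    errors = (np.asarray(y_true) - np.asarray(y_pred)) ** 2
    return trim_mean(errors, proportiontocut)
```

One caveat: squared errors are contaminated only on the high end, so symmetric trimming also discards the smallest errors; a one-sided trim or the Huber mean avoids that.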