Open SB6988 opened 4 years ago
This is not a bug. It is caused by a bad assumption about the prior distribution.
BayesianRidge makes two assumptions: it places a Gamma prior over the noise precision (alpha) and a Gamma prior over the weight precision (lambda).
Each of these priors has two hyperparameters (alpha_1 and alpha_2 for the noise precision, lambda_1 and lambda_2 for the weight precision). They are set before training, and the default value for all of them is 1e-6.
These defaults mean that the weights and the noise are expected to be small. In other words, before training, the prior probability of small weights is high and the probability of large weights is low, but your data requires large weights.
Of course, the model can learn from the data and fit the posterior distribution, but the posterior is still affected by the prior (which prefers small weights), so the model cannot estimate good weights.
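For reference, a minimal sketch (not part of the original comment) that just prints those hyperparameters on a default BayesianRidge instance:

from sklearn.linear_model import BayesianRidge

# The arguments of the two Gamma priors are stored on the estimator;
# all four default to 1e-6.
clf = BayesianRidge()
print(clf.alpha_1, clf.alpha_2, clf.lambda_1, clf.lambda_2)  # 1e-06 1e-06 1e-06 1e-06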
To fix it, you can do the following. An alternative is to use fit_intercept=True.
from sklearn.linear_model import LinearRegression, BayesianRidge
import numpy as np

# Ordinary least squares as a reference: the data follow y = 1e6 + 1e6 * x
ols = LinearRegression()
ols.fit(np.reshape([1, 2], (-1, 1)), np.array([2000000, 3000000]).ravel())

# BayesianRidge with a much larger lambda_2: the prior on the weight precision
# now concentrates on small lambda, i.e. it allows large weights.
clf = BayesianRidge(compute_score=True, fit_intercept=False,
                    lambda_2=10000)
clf.fit(np.array([[1, 1], [1, 2]]), np.array([2000000, 3000000]).ravel())

print(str(ols.intercept_) + " " + str(ols.coef_[0]))
print(str(clf.coef_[0]) + " " + str(clf.coef_[1]))
The output is below:
scikit-learn ➤ python -u "/Users/Naoki/Projects/python/scikit-learn/17624-2.py"
1000000.0000000005 999999.9999999997
999999.9999979986 1000000.0000009974
Thanks for the explanation @Kiona0405. @SB6988 let us know if this solves your issue.
Thanks to @Kiona0405 for the answer.
The explanation sounds correct; however, I don't think it's a good choice to have this silent behaviour with this algorithm. Basically, what we are saying here is that the model should learn the parameters on its own, but if the prior is too far off it might not converge. This is OK, but then why does the score plot show convergence after a few iterations while the parameters are clearly wrong? To me, this doesn't seem like reliable behaviour. I am somewhat confused about why it fails, given that no max_iteration has been reached here. Some possibilities to increase transparency could be:
1) issue a warning whenever there is some combination of input or output whose magnitude is comparable to the inverse of the several precisions...
2) check why the iteration shows convergence of the score where it clearly should not
3) remove the implicit assumption, so that the end user has to be totally aware of what's going on under the hood by setting reasonable parameters from the very beginning; this way one would not rely on the convergence algorithm
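Regarding point 2, the score trajectory can be inspected directly. A minimal sketch (not from the thread), assuming the same toy data and fit_intercept=False as in the failing example:

import numpy as np
from sklearn.linear_model import BayesianRidge

X = np.array([[1, 1], [1, 2]])
y = np.array([2000000, 3000000])

# compute_score=True records the log marginal likelihood at every iteration,
# so one can see the objective flatten out even though the fit is poor.
clf = BayesianRidge(compute_score=True, fit_intercept=False)
clf.fit(X, y)
print(clf.scores_)   # objective values per iteration
print(clf.n_iter_)   # number of iterations actually run
print(clf.coef_)     # still close to [0, 0] with the default priors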
I understand your feelings. But technically and theoretically, your suggestions are difficult.
- issue a warning whenever there is some combination of input or output whose magnitude is comparable to the inverse of the several precisions...
Real datasets contain outliers, and in that case I don't think such a warning is necessary.
- check why the iteration shows convergence of the score where it clearly should not
Convergence just means that the loss function has saturated; it doesn't say any more than that. So we have to check the model with relevant metrics (see the sketch at the end of this comment).
- remove the implicit assumption, so that the end user has to be totally aware of what's going on under the hood by setting reasonable parameters from the very beginning; this way one would not rely on the convergence algorithm
A prior assumption is a basic concept of Bayesian models, and every machine learning model makes some kind of assumption.
But we can do hyper-parameter tuning. If you don't want to set the prior hyper-parameters manually, that may help you.
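As a minimal sketch of checking the fit with a relevant metric (not part of the original comment; it reuses the toy data from the code above):

import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.metrics import r2_score

X = np.array([[1, 1], [1, 2]])
y = np.array([2000000, 3000000])

# With the default priors (all hyperparameters 1e-6) the optimizer converges,
# but the coefficients stay near zero, which the R^2 score makes obvious.
clf = BayesianRidge(fit_intercept=False)
clf.fit(X, y)
print(clf.coef_)                    # close to [0, 0]
print(r2_score(y, clf.predict(X)))  # far below 1.0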
@Kiona0405 thank you for your answer. Yes, of course I know that the prior assumption is the very basic idea behind Bayesian frameworks. What I meant is the implicit assumption on the hyperparameters. I appreciate that you want to make the Gammas look "almost" like improper uninformative priors by setting the hyperparameters "almost" equal to 0; however, "almost" in numerics can be dangerous, as this case showed, because the prior is not really a theoretical, uninformative one. Thus my suggestion was just to remove the default 1e-6, so that the user has to choose appropriate hyperparameters.
What would your hyper-parameter tuning idea consist of, more specifically? That could also be a choice. I definitely think that the current situation is just dangerous, so any improvement would help a lot.
I understand that an implicit default value is not user-friendly, but this could be a matter of personal preference, so I can't say which option is better. If you want to get rid of it, please make a pull request.
My idea is just to use parameter tuning (e.g. random search) outside of this estimator, as sketched below.
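A minimal sketch of that random-search idea (not code from the thread; the synthetic data and search ranges below are made up for illustration):

import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data with large targets, similar in spirit to the failing example.
rng = np.random.RandomState(0)
X = rng.rand(50, 2)
y = X @ np.array([1e6, 1e6]) + rng.normal(scale=1e3, size=50)

# Randomly sample the four prior hyperparameters on a log scale.
param_distributions = {
    "alpha_1": loguniform(1e-8, 1e2),
    "alpha_2": loguniform(1e-8, 1e2),
    "lambda_1": loguniform(1e-8, 1e2),
    "lambda_2": loguniform(1e-8, 1e6),
}
search = RandomizedSearchCV(
    BayesianRidge(fit_intercept=False),
    param_distributions,
    n_iter=50,
    cv=5,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
print(search.best_estimator_.coef_)  # coefficients of the selected model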
@SB6988 I think the best we can offer is to improve the docstring of the class. Feel free to open a PR so that the next person is less likely to hit the same difficulty as you did.
Example 1 (working):
Expected Results
1,1
Results for OLS and BayesianRidge:
1.0000000000000004 0.9999999999999998 0.9988917252390923 1.0005536752418909
Example 2 (not working):
Expected Results
1000000, 1000000
Results for OLS and BayesianRidge:
1000000.0000000005 999999.9999999997 7.692319353001738e-07 1.2307710964802638e-06
Please notice that the only difference between the two examples is the order of magnitude of the endogenous variable! However, although OLS works pretty well, in the second case we get 0, 0 as the coefficients from the Bayes regression.
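For readers who want to reproduce Example 2, a minimal sketch (reconstructed from the reproduction code earlier in the thread; this is an assumption, not the reporter's original script):

import numpy as np
from sklearn.linear_model import LinearRegression, BayesianRidge

# Example 2: same design as Example 1, but targets a million times larger.
X_ols = np.array([[1], [2]])
X_bayes = np.array([[1, 1], [1, 2]])  # explicit column of ones, so fit_intercept=False
y = np.array([2000000, 3000000])

ols = LinearRegression().fit(X_ols, y)
clf = BayesianRidge(compute_score=True, fit_intercept=False).fit(X_bayes, y)

print(ols.intercept_, ols.coef_[0])  # roughly 1e6 and 1e6
print(clf.coef_)                     # close to [0, 0] with the default priors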