
OLS fails with large values #9258

Open Gabriel-Kissin opened 3 months ago

Gabriel-Kissin commented 3 months ago

Describe the bug

When the regressor takes very large values, statsmodels fails to model the data correctly. It effectively forces the fitted line to pass through the origin when it shouldn't (despite add_constant being used).

Here is the plot produced by the code below: [image: scatter points for the six datasets, with the statsmodels and sklearn fitted lines overlaid]

For the first three lines, statsmodels works as expected. For larger values, the fitted lines don't pass through the corresponding points; instead they bend toward the origin. This is because the coefficient learned for const is always very small.

Note that the black dotted lines produced by sklearn always fit the points as expected. The intercept and coefficient sklearn produces match the closed-form solution, which statsmodels should be capable of reproducing. These numbers aren't large enough to cause an overflow even at 32-bit precision.

Note: a simple workaround is to scale the data before fitting. But since linear regression is generally solved in closed form rather than by gradient descent, scaling shouldn't be necessary.
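For reference, a minimal sketch of that workaround (illustrative data; the coefficients are mapped back to the original units afterwards):

import numpy as np
import statsmodels.api as sm

# illustrative data with a large-valued regressor
x = np.linspace(1.9e14, 2.5e14, 10)
y = np.linspace(1, 0, 10)

# standardize x before fitting
x_mean, x_std = x.mean(), x.std()
x_scaled = (x - x_mean) / x_std
res = sm.OLS(y, sm.add_constant(x_scaled)).fit()

# undo the standardization: y = b0 + b1*(x - m)/s  =>  slope = b1/s
slope = res.params[1] / x_std
intercept = res.params[0] - slope * x_mean
print(intercept, slope)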

Code Sample

import numpy as np
import statsmodels.api
import sklearn.linear_model
import matplotlib.pyplot as plt

x_base = np.linspace(4e13, 10e13, 10)
y = np.linspace(1, 0, 10)

for i in range(6):
    x = x_base + (i*3e13)

    # find closed form solution using inv and pinv
    xTx_inv = np.linalg.inv(statsmodels.api.add_constant(x).T @ statsmodels.api.add_constant(x)) 
    xTy = statsmodels.api.add_constant(x).T @ y
    closed_form_ols = xTx_inv @ xTy

    xTx_pinv = np.linalg.pinv(statsmodels.api.add_constant(x).T @ statsmodels.api.add_constant(x)) 
    closed_form_ols_pinv = xTx_pinv @ xTy

    # solve using statsmodels
    stats_ols = statsmodels.regression.linear_model.OLS(
        endog=y, exog=statsmodels.api.add_constant(x))
    stats_ols_fitted = stats_ols.fit()                      # uses method = "pinv" by default
    # stats_ols_fitted = stats_ols.fit(method = "qr")         # fits correctly

    # when the model doesn't fit well, extend x below 0 to show that the lines pass near the origin
    r2 = stats_ols_fitted.rsquared
    if r2<1:
        x_ = np.linspace(-.1*x.max(), x.max())
    else:        
        x_ = np.linspace(x.min(), x.max())

    # predict using statsmodels
    y_ = stats_ols_fitted.predict(statsmodels.api.add_constant(x_))

    # solve & predict using sklearn
    x_sklearn = np.linspace(x.min(), x.max())
    sklearn_ols = sklearn.linear_model.LinearRegression()
    sklearn_ols.fit(x.reshape((-1,1)), y)
    y_sklearn = sklearn_ols.predict(x_sklearn.reshape((-1,1)))

    # plot the data 
    plt.scatter(x, y)

    # compose informative legend label for each set of data/LR model
    label = 'statsmodels OLS: $r^2=' + str(np.round(r2, 3)) + '$'
    label += '\nStatsmodels params: ' + ', '.join(['{:0.3}'.format(param) for param in stats_ols_fitted.params])
    label += '\nSklearn params: ' + ', '.join(['{:0.3}'.format(param) for param in [sklearn_ols.intercept_] + list(sklearn_ols.coef_)]) 
    label += '\nManually calculated solution: ' + ', '.join(['{:0.3}'.format(param) for param in closed_form_ols])
    label += '\nUsing pseudo-inverse: ' + ', '.join(['{:0.3}'.format(param) for param in closed_form_ols_pinv])

    # plot the LR fits (statsmodels)
    plt.plot(x_, y_, label=label, lw=3)

    # plot the LR fits (sklearn)
    plt.plot(x_sklearn, y_sklearn,
             label=('sklearn LinearRegression' if i in [2,5] else ''),
             ls=':', lw=1.5, c='k')

plt.legend(fontsize='small', loc='center left', bbox_to_anchor=(1, 0.5), ncols=2)
plt.show()


Expected Output

Correctly fit model!

Output of import statsmodels.api as sm; sm.show_versions()

```
INSTALLED VERSIONS
------------------
Python: 3.12.3.final.0
OS: Darwin 23.4.0 Darwin Kernel Version 23.4.0: Fri Mar 15 00:12:25 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T6030 arm64
byteorder: little
LC_ALL: None
LANG: None

statsmodels
===========
Installed: 0.14.2 (/Users/gabriel.kissin/Library/Python/3.12/lib/python/site-packages/statsmodels)

Required Dependencies
=====================
cython: Not installed
numpy: 1.26.4 (/Users/gabriel.kissin/Library/Python/3.12/lib/python/site-packages/numpy)
scipy: 1.13.0 (/Users/gabriel.kissin/Library/Python/3.12/lib/python/site-packages/scipy)
pandas: 2.2.2 (/Users/gabriel.kissin/Library/Python/3.12/lib/python/site-packages/pandas)
dateutil: 2.9.0.post0 (/Users/gabriel.kissin/Library/Python/3.12/lib/python/site-packages/dateutil)
patsy: 0.5.6 (/Users/gabriel.kissin/Library/Python/3.12/lib/python/site-packages/patsy)

Optional Dependencies
=====================
matplotlib: 3.8.4 (/Users/gabriel.kissin/Library/Python/3.12/lib/python/site-packages/matplotlib)
backend: module://matplotlib_inline.backend_inline
cvxopt: Not installed
joblib: 1.4.2 (/Users/gabriel.kissin/Library/Python/3.12/lib/python/site-packages/joblib)

Developer Tools
================
IPython: 8.24.0 (/Users/gabriel.kissin/Library/Python/3.12/lib/python/site-packages/IPython)
jinja2: 3.1.4 (/Users/gabriel.kissin/Library/Python/3.12/lib/python/site-packages/jinja2)
sphinx: Not installed
pygments: 2.18.0 (/Users/gabriel.kissin/Library/Python/3.12/lib/python/site-packages/pygments)
pytest: Not installed
virtualenv: Not installed
```
Gabriel-Kissin commented 3 months ago

The core of the issue appears to be the algorithm used to solve the least squares problem. By default, statsmodels uses the Moore-Penrose pseudoinverse (link). When told to use the QR algorithm instead (method="qr" rather than the default method="pinv"), it fits correctly. See the legend in the plot above for a comparison of the betas learnt by the different methods.
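A minimal way to see the difference, using the same x values as the last iteration of the example above:

import numpy as np
import statsmodels.api as sm

x = np.linspace(1.9e14, 2.5e14, 10)
y = np.linspace(1, 0, 10)
X = sm.add_constant(x)

# default pinv-based solve vs QR decomposition
print(sm.OLS(y, X).fit(method="pinv").params)   # const coefficient collapses toward 0
print(sm.OLS(y, X).fit(method="qr").params)     # matches the closed-form solution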

My suggestion, then, would be for the default behaviour to select an algorithm intelligently, based on properties of the data that make one method more suitable than the other, so that OLS can be used out of the box as with sklearn.

josef-pkt commented 3 months ago

It's very unlikely that we ever change this. QR does not provide regularization in the case of numerically singular exog. (Which rcond threshold for pinv would be better is still open: there are problems when the condition number is around 1e15 (or 1e-15 if the inverse is used), but it can go either way, regularizing too soon or too late.)
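To illustrate the mechanism (a sketch of the truncation, not statsmodels' exact code path):

import numpy as np
import statsmodels.api as sm

x = np.linspace(1.9e14, 2.5e14, 10)
y = np.linspace(1, 0, 10)
X = sm.add_constant(x)

# the smallest singular value of exog sits just below
# rcond * sigma_max, numpy's default pinv cutoff
s = np.linalg.svd(X, compute_uv=False)
print(s[-1] / s[0])            # roughly 4e-16, below the default rcond of 1e-15

# pinv therefore discards that direction, which is what
# drives the const coefficient toward zero
print(np.linalg.pinv(X) @ y)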

#2628,

#8721 with a list of related issues

Also still open: adding some internal transformation, but that was more in the context of nonlinear models like exp in Poisson or Logit. #1131

Your last OLS has this warning in its summary:

Notes:
[1] ...
[2] The condition number is large, 2.55e+15. This might indicate that there are
strong multicollinearity or other numerical problems.
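
That number can be reproduced directly from the design matrix of the last loop iteration (a quick check using numpy):

import numpy as np
import statsmodels.api as sm

x = np.linspace(1.9e14, 2.5e14, 10)   # x from the last iteration
X = sm.add_constant(x)

# ratio of the largest to the smallest singular value of exog
print(np.linalg.cond(X))   # about 2.55e15, as in the summary warning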

In general, linear algebra with a condition number this high loses almost all numerical precision. It might work in some cases, but would produce very low precision in others. If the issue is a scaling issue, then the user needs to provide reasonably scaled variables. If the issue is inherent multicollinearity, then the user would be better off treating it properly, e.g. choosing a penalized estimator or dropping variables. statsmodels currently does nothing about this except for the pinv in linear regression.

The problem in this case could be partially alleviated by treating the constant differently (see the sketch after the exog printout below), but that would not help if the high condition number comes from strong multicollinearity among the (slope) explanatory variables. We would just be shifting the problem around to other cases. (Another popular case that fails is power polynomials when the underlying x has large values.)

stats_ols_fitted.model.exog[:5]
array([[1.00000000e+00, 1.90000000e+14],
       [1.00000000e+00, 1.96666667e+14],
       [1.00000000e+00, 2.03333333e+14],
       [1.00000000e+00, 2.10000000e+14],
       [1.00000000e+00, 2.16666667e+14]])
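
For illustration only, a sketch of what treating the constant differently could look like: centering the slope column makes it orthogonal to the constant, so the pinv cutoff no longer bites (this is not current or planned statsmodels behaviour):

import numpy as np
import statsmodels.api as sm

x = np.linspace(1.9e14, 2.5e14, 10)
y = np.linspace(1, 0, 10)

# center the slope column so it is orthogonal to the constant;
# the singular-value ratio is then well above pinv's cutoff
x_c = x - x.mean()
res = sm.OLS(y, sm.add_constant(x_c)).fit()

# recover the intercept in the original units
intercept = res.params[0] - res.params[1] * x.mean()
print(intercept, res.params[1])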
Gabriel-Kissin commented 3 months ago

Thank you @josef-pkt for your thorough response. I understand the issues that you describe with making code changes to fix this.

The only other thought I have, then, would be to add a note to the OLS documentation pointing out this behaviour with poorly scaled data, along with the scaling workaround.

While the example I posted is contrived, I came across this issue on a real-life dataset, and it took me a while to boil it down to what I posted above. A note about this issue in the documentation could be very useful to others.

Thank you for all your hard work on statsmodels (and scipy.stats)!

josef-pkt commented 3 months ago

We have related issues:

#1908 sanity check

#9068 basic comprehensive diagnostics

and more on tools to diagnose multicollinearity

Quote of Welch cited in #9068: "Even with a vast arsenal of diagnostics, it is very hard to write down rules that can be used to guide a data analysis. So much is really subjective and subtle."

It's work and it's not clear how much the software can take over from the user.

We try to provide more, both for numerical problems and for cases where the model does not look appropriate for the data, but it's slow going because there are so many possibilities that can go "wrong".

For the specific scaling issue here, this might help with a good warning message: https://github.com/statsmodels/statsmodels/issues/2378#issuecomment-257031644