statsmodels / statsmodels

Statsmodels: statistical modeling and econometrics in Python
http://www.statsmodels.org/devel/
BSD 3-Clause "New" or "Revised" License
10.1k stars 2.88k forks source link

REF/Review computational shortcuts for OLS outlier influence regression diagnostics #6325

Open josef-pkt opened 4 years ago

josef-pkt commented 4 years ago

outlier influence and similar regression diagnostics might still have some inefficient computations. (I improved those as I semi-randomly found computational shortcuts)

article that summarizes several computational shortcuts

Velleman, Paul F., and Roy E. Welsch. 1981. “Efficient Computing of Regression Diagnostics.” The American Statistician 35 (4): 234–42. https://doi.org/10.2307/2683296.

one that I haven't seen before: section 3.1 catcher matrix which is just transposed pinv(x) can be used to construct partial residual plot without explicit regression to partial out other variables. e.g. statsmodels.graphics.regressionplots plot_partregress uses OLS to partial out other exog

i.e. we should be able to get partial regression residuals directly from pinv(x), but I haven't tried yet.

related PRs #6265, #6268 and #2380 work with moment (cov(xy)) or product matrices (x'x) and not with pinv(x) and are focused on effects/influence of variables and not with effects of observations.

josef-pkt commented 4 years ago

I have a preliminary notebook using pinv(exog). Looks good and works well.

Using different linalg computations accumulates different floating point errors, e.g. in a small example, I get agreement between pinv and OLS version at rtol around 1e-13, atol around 1e-15.

josef-pkt commented 4 years ago

similar things should apply when we start with a QR decomposition. Belsley computes SVD after QR in his code.

brief google search for "compute svd after qr decomposition" https://math.stackexchange.com/questions/538107/svd-by-qr-and-choleski-decomposition-what-is-going-on https://www.researchgate.net/post/Question_about_Svd_using_QR A=Q*R= Q*U*S*V, where [U,S,V]= qr(R).

and maybe https://www.sciencedirect.com/science/article/pii/0024379593904798 (not looked at)

We should be able to get all the things we need from an OLS/WLS regression model or IRLS-WLS model in GLM and similar.

josef-pkt commented 4 years ago
u, s, v = np.linalg.svd(x, full_matrices=False)
[ii.shape for ii in [u, s, v]]
[(100, 5), (5,), (5, 5)]

q, r = np.linalg.qr(x)
u_, ss, vv = np.linalg.svd(r)
uu = q.dot(u_)
[ii.shape for ii in [u_, ss, vv, uu]]
[(5, 5), (5,), (5, 5), (100, 5)]

np.max(np.abs(uu - u)), np.max(np.abs(ss - s)), np.max(np.abs(vv - v))
(2.7755575615628914e-17, 0.0, 0.0)

pinv for non-singular


pinv = np.linalg.pinv(x)
pinv_1 = v.T.dot((u / s).T)
pinv_2 = vv.T.dot((uu / ss).T)

np.max(np.abs(pinv_1 - pinv)), np.max(np.abs(pinv_2 - pinv))
(6.938893903907228e-18, 6.938893903907228e-18)
​```
josef-pkt commented 4 years ago

A long, long time ago (but not as long ago as an American Pie)

statsmodels.sandbox.archive.linalg_decomp_1 PlainMatrixArray, SvdArray statsmodels.sandbox.archive.linalg_covmat AffineTransform

josef-pkt commented 4 years ago

One related idea was to use a class to separate out the linalg in OLS/WLS, currently pinv and qr based results are directly attached to model. Another related issue is to start OLS directly from the product matrices or their decomposition. Currently I have some experimental code with separate models based on those summary statistics.

current application: streamline postestimation results and diagnostics using appropriate precomputed results. Refactoring linear regression models requires a good plan/design, otherwise we just get more complex code without much benefit. However, there are things that could be computed in a more efficient way. (e.g. normalized_cov_params based on svd)

josef-pkt commented 4 years ago

another detail computing (x'x)^{-1} after QR OLS uses self.normalized_cov_params = np.linalg.inv(np.dot(R.T, R))

using potri https://github.com/scipy/scipy/blob/master/scipy/stats/_multivariate.py#L2297 https://stat.ethz.ch/R-manual/R-devel/library/base/html/chol2inv.html

scipy _multivariate.py also has some potentially interesting helper functions

josef-pkt commented 4 years ago

other linalg formulas we need relate to det and trace

e.g. statsmodels.sandbox.archive.linalg_decomp_1 contains

    def mdet(self):
        return self.meigh[0].prod()

    def mlogdet(self):
        return np.log(self.meigh[0]).sum()

regression models (like OLS) don't have hat_matrix_diag attached get_hat_matrix_diag is in GLMResults (in Results because it depends on estimate of weights)

OLSInfluence has hat_matrix_diag (self.exog * self.results.model.pinv_wexog.T).sum(1)

BUG/unit tests?: this is most likely not be available if OLS was estimated using QR see above for how to get pinv_wexog from qr decomposition

extension: OLSInfluence hat_matrix_diag should use wexog so it applies to WLS (hat matrix version in terms of wendog/wexog, not in terms of original endog changes)