kingjr / b2b


Proof: Ê positively biased if second regression is regularized #10

Open kingjr opened 5 years ago

kingjr commented 5 years ago

This is an old problem, but I am re-writing it here for reference:

The problem is as usual: Y = (XE + N)F

from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.preprocessing import scale
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ttest_1samp
%matplotlib inline

# make data
np.random.seed(1)
n = 5000  # number of samples
dim_x = 50  # dimensionality of X
dim_y = 51

Cx = np.random.randn(dim_x, dim_x)
Cx = Cx.dot(Cx.T) / dim_x  # symmetric positive semi-definite

X = np.random.multivariate_normal(np.zeros(dim_x), Cx, n)
N = np.random.randn(n, dim_x)
F = np.random.randn(dim_y, dim_x)
E = np.diag(np.random.rand(dim_x)<0.05)

Y = (X @ E + N) @ F.T
X, Y, E = scale(X), scale(Y), np.diag(E)

The JRR consists of two successive regressions: G = ridge(Y, X), then H = ridge(X, YG).

alphas = np.logspace(-3, 3, 10)
ridge = RidgeCV(alphas=alphas, fit_intercept=False)
ols = LinearRegression(fit_intercept=False)

# Standard JRR
G = ridge.fit(Y[::2], X[::2]).coef_.T
H = ridge.fit(X[1::2], Y[1::2] @ G).coef_.T
E_hat = np.diag(H)

plt.fill_between(range(dim_x), E, label='E')
plt.fill_between(range(dim_x), E_hat, label='E_hat')
plt.ylim(None, E_hat.max())
plt.legend()

[figure: E vs E_hat, standard JRR]

Problem: the regularization of the second regression generates a positive bias. The more information there is in X_hat, the higher the bias:

_, p_value = ttest_1samp(E_hat[E == 0], 0)
print('not causal > 0: p_value=%.4f' % p_value)

not causal > 0: p_value=0.005

This does not occur when the second regression is not regularized. However, in this case, the recovered causal factors are much noisier:

G = ridge.fit(Y[::2], X[::2]).coef_.T
H = ols.fit(X[1::2], Y[1::2] @ G).coef_.T
E_hat = np.diag(H)

plt.fill_between(range(dim_x), E, label='E')
plt.fill_between(range(dim_x), E_hat, label='Ê')
plt.ylim(None, E_hat.max())
plt.legend()
_, p_value = ttest_1samp(E_hat[E == 0], 0)
print('not causal > 0: p_value=%.4f' % p_value)

not causal > 0: p_value=0.6413

[figure: E vs Ê, unregularized second regression]

f-charton commented 5 years ago

I believe this happens at both stages... For the second step, here is the explanation and a possible solution.

Problem
To get the least-squares estimate of X_hat from X, we need to decorrelate the features of X, which we do by inverting X'X. Through inversion, the low eigenvalues of X'X get scaled up, while the high ones get scaled down. In the presence of noise, from N or just from the sampling of X, every eigenvalue carries some noise, so the small eigenvalues consist mostly of noise whereas the large ones contain the signal. As a result, inverting X'X tends to magnify noise, and the effect grows with the condition number of X'X (the ratio of its largest to its smallest eigenvalue).
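
A tiny numpy illustration of this point (my own toy example, not from the thread): the same OLS fit degrades sharply once X'X is ill-conditioned.

import numpy as np

rng = np.random.RandomState(0)
n, d = 1000, 20
beta = rng.randn(d)

def ols_error(design, noise_std=1.0):
    # OLS solves (X'X)^{-1} X'y; the error on beta grows with cond(X'X)
    y = design @ beta + noise_std * rng.randn(n)
    beta_hat = np.linalg.solve(design.T @ design, design.T @ y)
    return np.linalg.norm(beta_hat - beta)

X_good = rng.randn(n, d)                                   # well-conditioned design
X_bad = rng.randn(n, d) @ np.diag(np.logspace(0, -3, d))   # ill-conditioned design

for name, Xd in [('well-conditioned', X_good), ('ill-conditioned', X_bad)]:
    print('%s: cond(X.T X)=%.1e, |beta_hat - beta|=%.2f'
          % (name, np.linalg.cond(Xd.T @ Xd), ols_error(Xd)))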

Regularisation
Regularisation improves the condition number of X'X by adding a constant to all its eigenvalues. This barely changes the large eigenvalues, but has a dramatic effect on the small ones. It is known to work, but adding a diagonal to the matrix amounts to solving a different problem: instead of looking for the H minimising the distance between XH and X_hat (as in least squares), we now minimise that distance under the constraint ||H|| < Cst, with Cst depending on the regularisation parameter. This means we find a different, but more stable, solution to the problem.
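
A quick numeric sketch (arbitrary values, mine) of how the +alpha shift barely touches the large eigenvalues but tames the inverse of the small ones:

# adding alpha to the spectrum of X'X improves the condition number,
# at the price of some shrinkage (the ||H|| < Cst constraint)
import numpy as np

eigvals = np.array([10.0, 1.0, 1e-4])      # spectrum of some X'X
for alpha in [0.0, 1e-2, 1.0]:
    shifted = eigvals + alpha
    print('alpha=%-5g cond=%.1e  1/lambda_min=%.1e'
          % (alpha, shifted.max() / shifted.min(), 1.0 / shifted.min()))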

Two suggestions
The most direct way to reduce the bias while keeping the regression stable would be to use a very low alpha parameter. This could be done by forcing a single alpha with a negligible value into the ridge regression (1e-7, 1e-4; experience shows the actual value does not really matter). Such a minimal regularisation would not be large enough to cause a bias, yet it would cure some of the instabilities. I think we would get away with it for small noise, but it would be much less resilient than the classical JRR (and noise resilience strikes me as one of its major assets). This one is very easy to test: in RidgeCV, alphas=[1e-5] and we're good to go (see the sketch below).
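
A possible sketch of this first suggestion on the simulation from the top of the issue (reusing X, Y, E, alphas and the imports defined there; my adaptation, not code from the thread):

# keep RidgeCV for G, but force a single negligible alpha for the second regression
ridge = RidgeCV(alphas=alphas, fit_intercept=False)
ridge_tiny = RidgeCV(alphas=[1e-5], fit_intercept=False)

G = ridge.fit(Y[::2], X[::2]).coef_.T
H = ridge_tiny.fit(X[1::2], Y[1::2] @ G).coef_.T
E_hat = np.diag(H)

_, p_value = ttest_1samp(E_hat[E == 0], 0)
print('not causal > 0: p_value=%.4f' % p_value)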

Maybe a better way would be to abandon regularisation (and ridge regression) altogether, as follows (this certainly has a name...):

- compute the SVD of X'X
- zero the small eigenvalues
- rescale the rest so that the trace is unchanged (not sure this is necessary, though; the change should be small)
- take the pseudo-inverse of the result
- do OLS with this new inverse of the correlation matrix

Doing this, we eliminate the regularisation bias and recover the signal. Not sure how it would perform with jrr2 (it might cut some weak signal off), but it is worth a try.

Implementation is not very difficult, starting from the RidgeCV code:

1- U, D = svd(X'X) (D diagonal, X'X = U D U', U'U = I)
2- zero all elements of D below ||D||/K (K a condition factor; we could cross-validate it), yielding D'
3- invert D' and use U D'^{-1} U' as (X'X)^{-1} in the OLS regression: U D'^{-1} U' X' X_hat
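
A rough numpy sketch of steps 1-3 (my reading of the recipe; truncated_ols and K are hypothetical names, and X_hat stands for the first-stage prediction Y @ G):

import numpy as np

def truncated_ols(X, X_hat, K=1e6):
    C = X.T @ X
    U, D, _ = np.linalg.svd(C)           # 1- C symmetric: C = U diag(D) U'
    keep = D > D.max() / K               # 2- zero eigenvalues below ||D|| / K
    D_inv = np.zeros_like(D)
    D_inv[keep] = 1.0 / D[keep]          # 3- invert the kept part only
    C_inv = U @ np.diag(D_inv) @ U.T     # acts as (X'X)^{-1}
    return C_inv @ X.T @ X_hat           # OLS with the truncated inverse

# e.g. H = truncated_ols(X[1::2], Y[1::2] @ G); E_hat = np.diag(H)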

kingjr commented 5 years ago

> This one is very easy to test: in RidgeCV, alphas=[1e-5] and we're good to go

Right, but there is still a bias, and it has a dramatic effect for jrr2.

> zero small eigenvalues

Interesting: here is what I tried (I also varied the number of PCs; it did not change much):

from numpy.linalg import pinv
from sklearn.decomposition import PCA

# project X onto its retained principal components
pca = PCA('mle').fit(X)
Xt = pca.inverse_transform(pca.transform(X))
# X_hat: the first-stage prediction of X from Y (Y @ G in the JRR above)
E_hat = pinv(Xt.T @ Xt) @ Xt.T @ X_hat

plt.fill_between(range(X.shape[1]), np.diag(E_hat))
not_causal = np.diag(E) == 0
_, p_value = ttest_1samp(np.diag(E_hat)[not_causal], 0)
print('not causal > 0: p_value=%.4f' % p_value)

It helps sometimes, but fails completely for the jrr2 described in #1.

kingjr commented 5 years ago

New example with strong bias

from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.preprocessing import scale
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ttest_1samp
%matplotlib inline

# make data
np.random.seed(1)
n = 5000  # number of samples
dim_x = 200  # dimensionality of X
dim_y = 201

Cx = np.random.randn(dim_x, dim_x)
Cx = Cx.dot(Cx.T) / dim_x  # symmetric positive semi-definite

X = np.random.multivariate_normal(np.zeros(dim_x), Cx, n)
N = np.random.randn(n, dim_x)
F = np.random.randn(dim_y, dim_x)
E = np.diag(np.random.rand(dim_x)<0.20)

Y = (X @ E + N) @ F.T
X, Y, E = scale(X), scale(Y), np.diag(E)

alphas = np.logspace(-3, 3, 10)
ridge = RidgeCV(alphas=alphas, fit_intercept=False)
ols = LinearRegression(fit_intercept=False)

# Standard JRR
G = ridge.fit(Y[::2], X[::2]).coef_.T
H = ridge.fit(X[1::2], Y[1::2] @ G).coef_.T
E_hat = np.diag(H)
plt.figure()
plt.fill_between(range(dim_x), E, label='E')
plt.plot(range(dim_x), E_hat, label='E_hat', color='C1')
plt.axhline(0, color='k', ls=':')
plt.ylim(None, E_hat[E==0].max()*5)
plt.legend()

_, p_value = ttest_1samp(E_hat[E == 0], 0)
print('not causal > 0: p_value=%.10f' % p_value)

not causal > 0: p_value=0.0000000000

[figure: E vs E_hat, standard JRR on the 200-dimensional example]

kingjr commented 5 years ago

This seems to fix it while remaining stable:

import numpy as np
from numpy.linalg import svd, pinv
from sklearn.decomposition import PCA

def ridge_cv(X, Y, alphas, independent_alphas=False, Uv=None):
    """
    Similar to sklearn RidgeCV but
    (1) can optimize a different alpha for each column of Y
    (2) return leave-one-out Y_hat
    """
    if isinstance(alphas, (float, int)):
        alphas = np.array([alphas, ], np.float64)
    n, n_x = X.shape
    n, n_y = Y.shape
    # Decompose X
    if Uv is None:
        U, s, _ = svd(X, full_matrices=0)
        v = s**2
    else:
        U, v = Uv
    UY = U.T @ Y

    # For each alpha, solve leave-one-out error coefs
    cv_duals = np.zeros((len(alphas), n, n_y))
    cv_errors = np.zeros((len(alphas), n, n_y))
    for alpha_idx, alpha in enumerate(alphas):
        # Solve
        w = ((v + alpha) ** -1) - alpha ** -1
        c = U @ np.diag(w) @ UY + alpha ** -1 * Y
        cv_duals[alpha_idx] = c

        # compute diagonal of the matrix: dot(Q, dot(diag(v_prime), Q^T))
        G_diag = (w * U ** 2).sum(axis=-1) + alpha ** -1
        error = c / G_diag[:, np.newaxis]
        cv_errors[alpha_idx] = error

    # identify best alpha for each column of Y independently
    if independent_alphas:
        best_alphas = (cv_errors ** 2).mean(axis=1).argmin(axis=0)
        duals = np.transpose([cv_duals[b, :, i]
                              for i, b in enumerate(best_alphas)])
        cv_errors = np.transpose([cv_errors[b, :, i]
                                  for i, b in enumerate(best_alphas)])
    else:
        _cv_errors = cv_errors.reshape(len(alphas), -1)
        best_alphas = (_cv_errors ** 2).mean(axis=1).argmin(axis=0)
        duals = cv_duals[best_alphas]
        cv_errors = cv_errors[best_alphas]

    coefs = duals.T @ X
    Y_hat = Y - cv_errors
    return coefs, best_alphas, Y_hat

def jrr(X, Y, alphas=np.logspace(-5, 5, 20)):
    # Prepare G
    U, s, V = svd(Y, full_matrices=0)
    Vs_inv_y = V.T @ np.diag(s**-1)
    v_y = s**2

    # Prepare H
    pca = PCA('mle').fit(X)
    Xt = pca.inverse_transform(pca.transform(X))
    Cx_inv = pinv(X.T @ X)

    # CV loop
    n_splits = 10
    Hs = list()
    for split in range(n_splits):
        p = np.random.permutation(range(len(X)))
        train, test = p[::2], p[1::2]

        G, _, _ = ridge_cv(Y[train], X[train], alphas,
                           independent_alphas=False, 
                           Uv=(Y[train] @ Vs_inv_y, v_y))

        X_hat = Y[test] @ G.T
        H = Cx_inv @ Xt[test].T @ X_hat
        Hs.append(H)

    H = np.mean(Hs, 0)
    return H
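
For reference, a usage sketch on the simulated data from the strong-bias example above (assuming X, Y, E and the earlier imports are still in scope; this is my illustration, not part of the original comment):

# run the cross-validated JRR variant and test the bias as before
H = jrr(X, Y)
E_hat = np.diag(H)

plt.fill_between(range(X.shape[1]), E, label='E')
plt.plot(range(X.shape[1]), E_hat, label='E_hat', color='C1')
plt.legend()

_, p_value = ttest_1samp(E_hat[E == 0], 0)
print('not causal > 0: p_value=%.4f' % p_value)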