mne-tools / mne-python

MNE: Magnetoencephalography (MEG) and Electroencephalography (EEG) in Python
https://mne.tools
BSD 3-Clause "New" or "Revised" License

LIMO example: scipy.linalg.lstsq() versus scipy.linalg.pinv() #9853

Open KensingtonSka opened 3 years ago

KensingtonSka commented 3 years ago

Hello,

My colleagues and I were trying to recreate the results of the LIMO dataset using MNE. There is currently a tutorial on this on the MNE website. However, the tutorial doesn't generate separate beta parameters for face a and face b as LIMO does in its example; instead, the MNE tutorial uses the beta reflecting face a - face b. This is because the linear_regression() function (here) used in the MNE implementation of LIMO computes the solution to the least-squares equation y = XB + e via scipy's least-squares routine linalg.lstsq(), which inverts X in order to compute B. This differs from LIMO's implementation, which uses a Moore-Penrose pseudoinverse to compute the beta parameters in order to avoid the singular-matrix issues that linalg.lstsq() cannot handle. At least, this is our understanding.
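
For reference, the pinv-based estimate we mean is essentially this (a toy sketch with made-up data, not MNE or LIMO code):

import numpy as np
from scipy import linalg

# toy design: columns are face a, face b, intercept (rank deficient: col 3 = col 1 + col 2)
X = np.array([[1, 0, 1],
              [1, 0, 1],
              [1, 0, 1],
              [0, 1, 1],
              [0, 1, 1],
              [0, 1, 1]], dtype=float)
rng = np.random.default_rng(0)
y = rng.normal(size=(6, 4))          # 6 trials x 4 "sensors"

# LIMO-style estimate: betas = pinv(X) @ y (the minimum-norm least-squares solution)
betas = linalg.pinv(X) @ y
print(betas.shape)                   # (3, 4): one row of betas per regressor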

You can see this if you try to produce beta parameters for face a and face b individually via (this part of the tutorial):

predictor_vars = ['face a', 'face b', 'phase-coherence', 'intercept']

design = limo_epochs.metadata[['phase-coherence', 'face']].copy()
design['face a'] = np.where(design['face'] == 'A', 1, 0)
design['face b'] = np.where(design['face'] == 'B', 1, 0)
design['intercept'] = 1
design = design[predictor_vars]

reg = linear_regression(limo_epochs,
                        design_matrix=design,
                        names=predictor_vars)

resulting in

Traceback (most recent call last):

  File "<ipython-input-124-7db96c3c5024>", line 1, in <module>
    reg = _fit_lm(limo_epochs.get_data(), design, predictor_vars)

  File "...\MNElimo_example.py", line 41, in _fit_lm
    sqrt_noise_var = np.sqrt(resid_sum_squares / df).reshape(data.shape[1:])

  File "C:\...\anaconda3\envs\mne\lib\site-packages\scipy\linalg\basic.py", line 968, in inv
    raise LinAlgError("singular matrix")

LinAlgError: singular matrix

To skirt this issue we created our own versions of the MNE functions linear_regression() and _fit_lm() (again, here) shown below:

"""
Created on Fri Oct 15 15:29:38 2021

@author: Rhys Hobbs (rhys.hobbs@otago.ac.nz)
"""
from inspect import isgenerator
from collections import namedtuple

import numpy as np

from mne.source_estimate import SourceEstimate
from mne.epochs import BaseEpochs
from mne.evoked import Evoked, EvokedArray
from mne.utils import logger, warn
from mne.io.pick import pick_types

def _fit_lm_pinv(data, design_matrix, names):
    """Aux function."""
    from scipy import stats, linalg
    n_samples = len(data)
    n_features = np.prod(data.shape[1:])
    if design_matrix.ndim != 2:
        raise ValueError('Design matrix must be a 2d array')
    n_rows, n_predictors = design_matrix.shape

    if n_samples != n_rows:
        raise ValueError('Number of rows in design matrix must be equal '
                         'to number of observations')
    if n_predictors != len(names):
        raise ValueError('Number of regressor names must be equal to '
                         'number of columns in design matrix')

    y = np.reshape(data, (n_samples, n_features))

    ################ CHANGED ############################
    # Moore-Penrose pseudoinverse instead of lstsq, so that rank-deficient
    # design matrices yield the minimum-norm least-squares solution.
    betas = linalg.pinv(design_matrix).dot(y)
    yhat = np.asarray(design_matrix).dot(betas)  # works for DataFrame or ndarray input
    resid_sum_squares = np.sum((yhat - y)**2, axis=0)
    #####################################################

    df = n_rows - n_predictors
    sqrt_noise_var = np.sqrt(resid_sum_squares / df).reshape(data.shape[1:])

    ################ CHANGED ############################
    # pinv instead of inv: X.T @ X is singular when the design is rank deficient
    design_invcov = linalg.pinv(np.dot(design_matrix.T, design_matrix))
    #####################################################

    unscaled_stderrs = np.sqrt(np.diag(design_invcov))
    tiny = np.finfo(np.float64).tiny
    beta, stderr, t_val, p_val, mlog10_p_val = (dict() for _ in range(5))
    for x, unscaled_stderr, predictor in zip(betas, unscaled_stderrs, names):
        beta[predictor] = x.reshape(data.shape[1:])
        stderr[predictor] = sqrt_noise_var * unscaled_stderr
        p_val[predictor] = np.empty_like(stderr[predictor])
        t_val[predictor] = np.empty_like(stderr[predictor])

        stderr_pos = (stderr[predictor] > 0)
        beta_pos = (beta[predictor] > 0)
        t_val[predictor][stderr_pos] = (beta[predictor][stderr_pos] /
                                        stderr[predictor][stderr_pos])
        cdf = stats.t.cdf(np.abs(t_val[predictor][stderr_pos]), df)
        p_val[predictor][stderr_pos] = np.clip((1. - cdf) * 2., tiny, 1.)
        # degenerate cases
        mask = (~stderr_pos & beta_pos)
        t_val[predictor][mask] = np.inf * np.sign(beta[predictor][mask])
        p_val[predictor][mask] = tiny
        # could do NaN here, but hopefully this is safe enough
        mask = (~stderr_pos & ~beta_pos)
        t_val[predictor][mask] = 0
        p_val[predictor][mask] = 1.
        mlog10_p_val[predictor] = -np.log10(p_val[predictor])

    return beta, stderr, t_val, p_val, mlog10_p_val

def plinear_regression(inst, design_matrix, names=None):
    """Fit Ordinary Least Squares (OLS) regression.
    Parameters
    ----------
    inst : instance of Epochs | iterable of SourceEstimate
        The data to be regressed. Contains all the trials, sensors, and time
        points for the regression. For Source Estimates, accepts either a list
        or a generator object.
    design_matrix : ndarray, shape (n_observations, n_regressors)
        The regressors to be used. Must be a 2d array with as many rows as
        the first dimension of the data. The first column of this matrix will
        typically consist of ones (intercept column).
    names : array-like | None
        Optional parameter to name the regressors (i.e., the columns in the
        design matrix). If provided, the length must correspond to the number
        of columns present in design matrix (including the intercept, if
        present). Otherwise, the default names are ``'x0'``, ``'x1'``,
        ``'x2', …, 'x(n-1)'`` for ``n`` regressors.
    Returns
    -------
    results : dict of namedtuple
        For each regressor (key), a namedtuple is provided with the
        following attributes:
            - ``beta`` : regression coefficients
            - ``stderr`` : standard error of regression coefficients
            - ``t_val`` : t statistics (``beta`` / ``stderr``)
            - ``p_val`` : two-sided p-value of t statistic under the t
              distribution
            - ``mlog10_p_val`` : -log₁₀-transformed p-value.
        The tuple members are numpy arrays. The shape of each numpy array is
        the shape of the data minus the first dimension; e.g., if the shape of
        the original data was ``(n_observations, n_channels, n_timepoints)``,
        then the shape of each of the arrays will be
        ``(n_channels, n_timepoints)``.
    """
    if names is None:
        names = ['x%i' % i for i in range(design_matrix.shape[1])]

    if isinstance(inst, BaseEpochs):
        picks = pick_types(inst.info, meg=True, eeg=True, ref_meg=True,
                           stim=False, eog=False, ecg=False,
                           emg=False, exclude=['bads'])
        if [inst.ch_names[p] for p in picks] != inst.ch_names:
            warn('Fitting linear model to non-data or bad channels. '
                 'Check picking')
        msg = 'Fitting linear model to epochs'
        data = inst.get_data()
        out = EvokedArray(np.zeros(data.shape[1:]), inst.info, inst.tmin)
    elif isgenerator(inst):
        msg = 'Fitting linear model to source estimates (generator input)'
        out = next(inst)
        data = np.array([out.data] + [i.data for i in inst])
    elif isinstance(inst, list) and isinstance(inst[0], SourceEstimate):
        msg = 'Fitting linear model to source estimates (list input)'
        out = inst[0]
        data = np.array([i.data for i in inst])
    else:
        raise ValueError('Input must be epochs or iterable of source '
                         'estimates')
    logger.info(msg + ', (%s targets, %s regressors)' %
                (np.prod(data.shape[1:]), len(names)))

    ################ CHANGED ############################
    lm_params = _fit_lm_pinv(data, design_matrix, names)
    #####################################################

    lm = namedtuple('lm', 'beta stderr t_val p_val mlog10_p_val')
    lm_fits = {}
    for name in names:
        parameters = [p[name] for p in lm_params]
        for ii, value in enumerate(parameters):
            out_ = out.copy()
            if not isinstance(out_, (SourceEstimate, Evoked)):
                raise RuntimeError('Invalid container.')
            out_._data[:] = value
            parameters[ii] = out_
        lm_fits[name] = lm(*parameters)
    logger.info('Done')
    return lm_fits
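
Usage is the same as for linear_regression(), e.g. with the design matrix and predictor names built earlier in this post:

reg = plinear_regression(limo_epochs,
                         design_matrix=design,
                         names=predictor_vars)

beta_face_a = reg['face a'].beta     # beta map for 'face a', returned as an Evoked object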

Our question is this: is there a reason for using linalg.lstsq() over the Moore-Penrose pseudoinverse?

If not, we're happy to make a pull request, but I'm not sure what the best implementation would be. I'd imagine something like adding an optional argument to linear_regression() that chooses one method over the other.
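
Something like this, purely hypothetical (the method keyword does not exist in MNE today):

reg = linear_regression(limo_epochs,
                        design_matrix=design,
                        names=predictor_vars,
                        method='pinv')   # hypothetical keyword; 'lstsq' would keep the current behaviour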

welcome[bot] commented 3 years ago

Hello! 👋 Thanks for opening your first issue here! ❤️ We will try to get back to you soon. 🚴🏽‍♂️

drammock commented 3 years ago

Does it work with our existing code if you omit the intercept column from the design matrix? IIUC, when including separate predictors for both levels of a binary categorical variable (here, "face" A vs B), it is customary (mandatory?) to remove the intercept term to compensate.
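
A sketch of that, reusing the design-matrix code from the original post:

predictor_vars = ['face a', 'face b', 'phase-coherence']   # no 'intercept'
design_no_intercept = design[predictor_vars]

reg = linear_regression(limo_epochs,
                        design_matrix=design_no_intercept,
                        names=predictor_vars)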

KensingtonSka commented 3 years ago

That does work, but it leaves all the parameters referenced to the intercept, which is really just moving the issue around. It's fine if you've standardised your data, because then your intercept is 0 anyway. But if, for whatever reason, you can't, or don't want to, standardise it, then you will lose information when you run the regression. Using the pseudoinverse skirts this issue. To quote the LIMO paper:

Although the design matrices made up by LIMO EEG are almost always rank deficient (each condition is coded in one column of X), F or T tests are exact, that is, they give identical results to that obtained by applying a standard inverse to a full rank matrix.

So, should I take it that the reason for using linalg.lstsq() over the Moore-Penrose pseudoinverse is that it's a matter of preference?

agramfort commented 3 years ago

I am inclined to switch to a pinv call instead of lstsq, following what nilearn does:

https://github.com/nilearn/nilearn/blob/main/nilearn/glm/regression.py#L117

agramfort commented 3 years ago

it's centering the data that makes the intercept 0. I am not sure I follow. You don't want to center the data?

pinv allows you to work with rank-deficient design matrices, and apparently it's the default in standard GLM code, so I think we should switch to pinv.
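
A quick illustration with a toy rank-deficient design (just to show the behaviour, not MNE code):

import numpy as np
from scipy import linalg

X = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 1],
              [0, 1, 1]], dtype=float)   # rank 2: intercept column = face_a + face_b

try:
    linalg.inv(X.T @ X)                  # a plain inverse fails here
except linalg.LinAlgError as err:
    print(err)                           # "singular matrix"

print(linalg.pinv(X.T @ X))              # pinv returns a well-defined pseudoinverse instead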

drammock commented 3 years ago

That does work, but it leaves all the parameters referenced to the intercept, which is really just moving the issue around. It's fine if you've standardised your data, because then your intercept is 0 anyway. But if, for whatever reason, you can't, or don't want to, standardise it, then you will lose information when you run the regression.

I don't think this is correct. If you have this design matrix:

face_a    face_b    intercept
   0         1          1
   0         1          1
   0         1          1
   1         0          1
   1         0          1
   1         0          1

then you can only estimate 2 parameters, because the third will be a linear combination of the other two. If you discard the intercept and estimate face_a and face_b, then what you get is not "referenced to the intercept" as you say (that happens when you include the intercept but drop either face_a or face_b, in which case the face_* term is an offset for that condition relative to the intercept). I don't think this has anything to do with whether the data are standardised, and I don't think any information is lost.
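
A quick numerical check of this (made-up numbers):

import numpy as np
from scipy import linalg

# 3 trials of face B followed by 3 of face A, toy responses
y = np.array([2.0, 2.1, 1.9, 5.0, 5.2, 4.8])
X = np.array([[0, 1],      # columns: face_a, face_b (no intercept)
              [0, 1],
              [0, 1],
              [1, 0],
              [1, 0],
              [1, 0]], dtype=float)

betas, *_ = linalg.lstsq(X, y)
print(betas)               # ~[5.0, 2.0]: the per-condition means, not offsets from an intercept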

KensingtonSka commented 3 years ago

@agramfort

You don't want to center the data?

I'm using the linear regression as part of a pipeline of sorts that I've been asked to make. It's not that I don't want to center the data, it's more that I don't want to make any assumptions about the data, opting to leave the centering to the user.

@drammock That's a very good point. I think I'm doing a bad job explaining 😕, so I'll try again more formally. Also, apologies for the PNGs: I'm not super savvy with markdown yet, so I opted to use LaTeX and just screenshot my equations.

So, we have a linear regression model y = Xβ + ε, where X is the design matrix with columns x_{i,1} (face a), x_{i,2} (face b), and x_{i,3} (intercept, all ones).

I want to know the values of b_1, b_2, and b_3 such that

y_i = b_1 x_{i,1} + b_2 x_{i,2} + b_3 x_{i,3} + ε_i.   (2)

However, this can't be solved by linalg.lstsq() because X has no inverse. This is a consequence of x_{i,3} = x_{i,1} + x_{i,2} (i.e. the intercept column is a linear combination of the other two). To avoid this, one can take the route of removing a column from X, giving y = X'β' + ε', where X' keeps only the two columns x'_{i,1} = x_{i,1} and x'_{i,2} = x_{i,2}.   (3)

Here, X' has full column rank and can be used to obtain values for b'_1 and b'_2 via

y_i = b'_1 x'_{i,1} + b'_2 x'_{i,2} + ε'_i.   (4)

But I want b_1, b_2, and b_3, so I'll map equation 2 to our new space using x_{i,3} = x'_{i,1} + x'_{i,2}, x_{i,1} = x'_{i,1}, and x_{i,2} = x'_{i,2}. This yields:

y_i = (b_1 + b_3) x'_{i,1} + (b_2 + b_3) x'_{i,2} + ε_i.   (5)

Comparing this with equation 4, and making the assumption that both models fit the data equally well such that ε = ε', we can infer that b'_1 = b_1 + b_3 and b'_2 = b_2 + b_3. This addition of b_3 to our fitting parameters is what I mean by "referenced to the intercept." To my understanding, it's not possible to retrieve b_1, b_2, and b_3 from b'_1 and b'_2 alone, which is what I mean by information being lost. I think I was being too loose with my jargon 😀.
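
As a quick sanity check on the algebra with toy numbers (not the LIMO data), pinv on the full rank-deficient X returns the minimum-norm b, and adding b_3 to b_1 and b_2 reproduces the dropped-intercept fit:

import numpy as np
from scipy import linalg

X = np.array([[0, 1, 1],   # columns: face a, face b, intercept
              [0, 1, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)
y = np.array([2.0, 2.1, 1.9, 5.0, 5.2, 4.8])

b = linalg.pinv(X) @ y                # rank-deficient fit: minimum-norm b_1, b_2, b_3
b_prime = linalg.pinv(X[:, :2]) @ y   # intercept column dropped: b'_1, b'_2

print(b)                              # ~[ 2.67, -0.33,  2.33]
print(b_prime)                        # ~[ 5.0,  2.0]
print(b[0] + b[2], b[1] + b[2])       # equals b_prime, i.e. b'_1 = b_1 + b_3, b'_2 = b_2 + b_3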

drammock commented 3 years ago

ok. well, no objection to using pinv anyway, if nilearn does it and @agramfort is happy.

jona-sassenhagen commented 2 years ago

@JoseAlanis please see