scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Inconsistency between C-contiguous and F-contiguous arrays #26493

Open JeanLescutMuller opened 1 year ago

JeanLescutMuller commented 1 year ago

No consistency between C-contiguous and F-contiguous arrays for LinearRegression()

At least for LinearRegression(): in some edge cases (when X is almost singular), there is a huge difference between the predictions obtained from C-contiguous and F-contiguous arrays.

Steps/Code to Reproduce

import numpy as np; print(np.__version__) # 1.23.5
import scipy; print(scipy.__version__) # 1.10.0
import sklearn as sk; print(sk.__version__) # 1.2.1

from sklearn.linear_model import LinearRegression
import pandas as pd

# Parameters 
seed, N_obs, N_feat, mu_x, sigma_x, mu_y, sigma_y = 0, 100, 1000, 100, 0.1, 100, 1

# 1) Creating a weird edge-case X, y :
np.random.seed(seed)
s = pd.Series(np.random.normal(mu_x, sigma_x, N_obs))
X = np.stack([s.ewm(com=com).mean() for com in np.arange(N_feat)]).T
y = np.random.normal(mu_y, sigma_y, N_obs)

# 2) Showing that the results differ for C-contiguous vs F-contiguous arrays:
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)
y_pred_c = model.predict(np.ascontiguousarray(X))

# Either just plot it and see:
import matplotlib.pyplot as plt
plt.scatter(y_pred, y_pred_c)
plt.show()

# Or look at the numbers:
np.var(y_pred)                       # variance of the predictions from the F-contiguous X
np.var(y_pred - y_pred_c)            # variance of the difference between the two predictions
np.corrcoef(y_pred, y_pred_c)[0,1]   # == 0.40295584536349216
# --> y_pred is EXTREMELY different from y_pred_c
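
As a side check (not part of the original report), assuming the X built above, the memory layout and the conditioning of the design matrix can be inspected directly, which makes the "almost singular" claim concrete:

# Side check, assuming the X built in the reproduction code above:
X.flags['F_CONTIGUOUS'], X.flags['C_CONTIGUOUS']   # (True, False): np.stack(...).T is F-contiguous
np.ascontiguousarray(X).flags['C_CONTIGUOUS']      # True: predict() receives a different memory layout
np.linalg.cond(X)                                  # expected to be very large for this construction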

Expected Results

We expect y_pred to be equal to y_pred_c, or at least np.corrcoef(y_pred, y_pred_c)[0,1] > 0.99.

Actual Results

np.corrcoef(y_pred, y_pred_c)[0,1] # == 0.40295584536349216

Versions

System:
    python: 3.10.9 (main, Mar  1 2023, 18:23:06) [GCC 11.2.0]
executable: /opt/anaconda3/bin/python
   machine: Linux-5.10.0-23-cloud-amd64-x86_64-with-glibc2.31

Python dependencies:
      sklearn: 1.2.1
          pip: 22.3.1
   setuptools: 65.6.3
        numpy: 1.23.5
        scipy: 1.10.0
       Cython: None
       pandas: 1.5.3
   matplotlib: 3.7.0
       joblib: 1.1.1
threadpoolctl: 2.2.0

Built with OpenMP: True

threadpoolctl info:
       filepath: /opt/anaconda3/lib/libmkl_rt.so.1
         prefix: libmkl_rt
       user_api: blas
   internal_api: mkl
        version: 2021.4-Product
    num_threads: 64
threading_layer: intel

       filepath: /opt/anaconda3/lib/libiomp5.so
         prefix: libiomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 128

       filepath: /opt/anaconda3/lib/libgomp.so.1.0.0
         prefix: libgomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 128
ogrisel commented 1 year ago

This problem goes away if we use a more stable numerical solver, such as "lsqr", for instance via Ridge.
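
For example (a minimal sketch, not verbatim from this thread; alpha=1e-10 below is just an illustrative near-zero regularization, and X, y come from the reproduction code above):

from sklearn.linear_model import Ridge

# Sketch of the suggested workaround: a more numerically stable solver via Ridge.
ridge = Ridge(alpha=1e-10, solver="lsqr").fit(X, y)
y_pred_f = ridge.predict(X)                        # F-contiguous input
y_pred_c = ridge.predict(np.ascontiguousarray(X))  # C-contiguous copy
np.corrcoef(y_pred_f, y_pred_c)[0,1]               # expected to be very close to 1.0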

Since there is already a plan to allow LinearRegression to accept different solvers and to make its default solver consistent with Ridge, I think this is the best way forward (unless you find cases where "lsqr" also fails).

See https://github.com/scikit-learn/scikit-learn/issues/22855#issuecomment-1514368256 for more details on the context.