rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] Mismatch between cuml and sklearn LinearRegression for specific inputs #3465

Open · wphicks opened this issue 3 years ago

wphicks commented 3 years ago

Describe the bug
For specific training and test inputs to LinearRegression, cuml and sklearn produce significantly different outputs.

Steps/Code to reproduce bug

import cupy as cp
import numpy as np
from cuml import LinearRegression as cuLinearRegression
from sklearn.linear_model import LinearRegression as skLinearRegression

# One of several examples found with Hypothesis
X_train = np.array([
    [25000.750007327206, 25000.750007327206, 25000.750007327206,
     25000.750007327206, 25000.750007327206],
    [1.0000000000222042, 37525.13455882354, 25000.750007327206,
     25000.750007327206, 25000.750007327206],
    [25000.750007327206, 25000.750007327206, 25000.750007327206,
     25000.750007327206, 25000.750007327206],
    [25000.750007327206, 25000.750007327206, 25000.750007327206,
     25000.750007327206, 25000.750007327206],
], dtype=np.float64)
y_train = np.array([1.0, 2.003848073721735, 2.003848073721735, 2.003848073721735], dtype=np.float64)
X_test = np.array([[25000.750007327206, 25000.750007327206, 25000.750007327206, 25000.750007327206, 25000.750007327206]], dtype=np.float64)
y_test = np.array([2.003848073721735], dtype=np.float64)

cuols = cuLinearRegression()
cuols.fit(X_train, y_train)
cu_predict = cuols.predict(X_test)

skols = skLinearRegression()
skols.fit(X_train, y_train)
sk_predict = skols.predict(X_test)

print(cu_predict)  # [1.66923205]
print(sk_predict)  # [1.5625]

Expected behavior
While this is a somewhat unlikely and artificial example, I would still expect cuml and sklearn output to be a closer match. This was discovered accidentally while working toward a more sophisticated Hypothesis testing setup for #1739, and I suspect it is a red herring for that issue, but it is probably still worth investigating further.

Environment details (please complete the following information):

dantegd commented 3 years ago

@wphicks I just repro'd and saw the difference, but it seems there is a typo in the comment of the last line:

>>> print(sk_predict)  # [1.5625]
[1.625]

Seems like the comment accidentally added a 5 in there. There is still a significant difference, but I just wanted to confirm that I'm seeing the expected result from scikit.

wphicks commented 3 years ago

Nope, I'm afraid that 1.5625 is genuinely what I get on my machine. Interesting that there's a discrepancy between our setups, though...

dantegd commented 3 years ago

Particularly how they look almost the same except for one digit! I'm also using scikit 0.23.1. Very interesting.

viclafargue commented 3 years ago

I could reproduce the issue. On a DGX, the predictions are still different but look a bit more similar, as Dante observed. However, there are large differences in the coefficients and intercept. I'll look into this.

cu coef_: [-1.91074691e-05 -1.14229906e-05  0.00000000e+00  0.00000000e+00 0.00000000e+00]
sk coef_: [6.63600728e+09 1.32460419e+10 0.00000000e+00 0.00000000e+00 0.00000000e+00]
cu intercept_: 2.432516440558203
sk intercept_: -497066142073824.6

viclafargue commented 3 years ago

The difference seems to be explained by the different methods used in cuML and sklearn. cuML uses ordinary least squares (with eigendecomposition or SVD), whereas sklearn uses non-negative least squares from Scipy. This example seems to be a corner case: it produces poor results with eigendecomposition and outright crashes with SVD, as cuSolver seems unable to find a solution for it. Datasets generated with make_regression seem to produce reasonably similar coefficients, intercepts, and predictions. I don't think this is really a big concern, since the problem seems to come from a difference in the methods used and is only observable in corner cases.
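
For reference, the degeneracy is easy to see directly: the reproducer's X_train has only two distinct rows, and its last three columns are identical. A minimal NumPy check (a sketch; c stands in for the repeated value):

import numpy as np

# Sketch: the design matrix from the reproducer has three duplicated rows
# and three duplicated columns, so it is rank-deficient and the
# least-squares problem is badly ill-conditioned.
c = 25000.750007327206
X_train = np.array([
    [c, c, c, c, c],
    [1.0000000000222042, 37525.13455882354, c, c, c],
    [c, c, c, c, c],
    [c, c, c, c, c],
], dtype=np.float64)

print(np.linalg.matrix_rank(X_train))  # 2, for a 4x5 matrix
print(np.linalg.cond(X_train))         # enormous (effectively infinite)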

wphicks commented 3 years ago

Awesome! Thanks for digging into it, @viclafargue. For my part, I'd like to understand why we're using OLS where sklearn uses NNLS and whether that difference is justified, but that's probably just something I need to work through, not necessarily something that indicates a genuine problem.

In terms of making sure that this would not come up in more realistic scenarios, I'm working on hooking Hypothesis into our dataset generation methods anyway, so I'll add make_regression into that, and then we can test a range of gnarlier but still somewhat realistic inputs. Even if that comes up empty, I'd say it's still probably worth leaving this issue open and trying to fix it at some point, since surprising inputs show up in real usage, but I 100% agree that this should not be a major priority at the moment.
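
To make that concrete, here is a rough sketch of the kind of property-based test I have in mind (assuming make_regression, with illustrative strategies and tolerances rather than the actual setup for #1739):

import numpy as np
from hypothesis import given, settings, strategies as st
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression as skLinearRegression
from cuml import LinearRegression as cuLinearRegression

# Sketch only: the strategies and tolerances below are illustrative.
@given(
    n_samples=st.integers(min_value=20, max_value=200),
    n_features=st.integers(min_value=2, max_value=10),
    noise=st.floats(min_value=0.0, max_value=10.0),
    seed=st.integers(min_value=0, max_value=2**32 - 1),
)
@settings(deadline=None, max_examples=25)
def test_linear_regression_matches_sklearn(n_samples, n_features, noise, seed):
    # Overdetermined systems only (n_samples > n_features), so OLS is unique.
    X, y = make_regression(n_samples=n_samples, n_features=n_features,
                           noise=noise, random_state=seed)
    cu_pred = cuLinearRegression().fit(X, y).predict(X)
    sk_pred = skLinearRegression().fit(X, y).predict(X)
    np.testing.assert_allclose(cu_pred, sk_pred, rtol=1e-3, atol=1e-3)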

wphicks commented 3 years ago

@beckernick is having some trouble posting right now, but he was kind enough to point out a couple of interesting things on this. First, if sklearn were using NNLS, wouldn't it force all coefficients to be >= 0? Second, if we force sklearn to use NNLS by passing positive=True, we get a very interesting result:

import cupy as cp
import numpy as np
from cuml import LinearRegression as cuLinearRegression
from sklearn.linear_model import LinearRegression as skLinearRegression
# One of several examples found with Hypothesis
X_train = np.array([
    [25000.750007327206, 25000.750007327206, 25000.750007327206,
     25000.750007327206, 25000.750007327206],
    [1.0000000000222042, 37525.13455882354, 25000.750007327206,
     25000.750007327206, 25000.750007327206],
    [25000.750007327206, 25000.750007327206, 25000.750007327206,
     25000.750007327206, 25000.750007327206],
    [25000.750007327206, 25000.750007327206, 25000.750007327206,
     25000.750007327206, 25000.750007327206],
], dtype=np.float64)
y_train = np.array([1.0, 2.003848073721735, 2.003848073721735, 2.003848073721735], dtype=np.float64)
X_test = np.array([[25000.750007327206, 25000.750007327206, 25000.750007327206, 25000.750007327206, 25000.750007327206]], dtype=np.float64)
y_test = np.array([2.003848073721735], dtype=np.float64)
cuols = cuLinearRegression()
cuols.fit(X_train, y_train)
cu_predict = cuols.predict(X_test)
skols = skLinearRegression(positive=True)
skols.fit(X_train, y_train)
sk_predict = skols.predict(X_test)
print(cu_predict)  # [1.66923205]
print(sk_predict)  # [1.66923205] (identical to cuml once positive=True)

I'd like to make sure we're confident that we understand what's going on under the hood there, even if this is a weird degenerate case. @viclafargue I'm happy to pick up where you left off, or you can dive back into it if you'd like to keep picking at it.

viclafargue commented 3 years ago

My bad, you're right: it uses scipy.linalg.lstsq when the positive parameter is unset. NNLS seems to produce similar predictions, which is interesting. It would also be interesting to check the coefficients and intercepts.
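
For instance, something along these lines (a sketch continuing the session above) would show the fitted parameters side by side:

# Sketch, reusing cuols and skols from the snippet above (skols was fitted
# with positive=True). NNLS constrains every entry of coef_ to be >= 0.
print("cu coef_:", cuols.coef_, "intercept_:", cuols.intercept_)
print("sk coef_:", skols.coef_, "intercept_:", skols.intercept_)
assert (skols.coef_ >= 0).all()  # the NNLS non-negativity constraint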

Scipy's least-squares function can call multiple LAPACK drivers (gelsd, gelsy, and gelss); it uses gelsd by default.
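
Since scipy.linalg.lstsq exposes the driver choice directly, all three can be run on the reproducer's data side by side (a sketch, assuming X_train and y_train from above):

import numpy as np
from scipy import linalg

# Sketch: solve the reproducer's system with each LAPACK driver supported by
# scipy.linalg.lstsq (X_train and y_train as defined in the reproducer above).
# A column of ones is appended to model the intercept.
A = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
for driver in ("gelsd", "gelsy", "gelss"):
    coef, residues, rank, sv = linalg.lstsq(A, y_train, lapack_driver=driver)
    print(driver, "coef:", coef[:-1], "intercept:", coef[-1], "rank:", rank)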

viclafargue commented 3 years ago

The three LAPACK drivers seem to produce somewhat similar results, distinct from cuML's: lapack_methods_vs_cuml.py.txt

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

wphicks commented 3 years ago

Still relevant and still worth investigating, I think. Not a super high-priority issue at the moment, though, so no recent updates.

github-actions[bot] commented 3 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.