[BUG] CUSOLVER_STATUS_INVALID_VALUE in ridge regression on CUDA 10.1

dantegd commented 5 years ago

Describe the bug When the dataset is larger than a certain size, fitting ridge regression fails with the following error:

Got CUSOLVER error 3 at /home/dante/rapids/cuml10-mst/mst-1015/cpp/src_prims/linalg/svd.h:67
CUSOLVER_STATUS_INVALID_VALUE

CUDA 10.0 and 9.2 do not have this issue.

Steps/Code to reproduce bug Run the following code with latest branch-0.10 cuML (nightly conda package or from source):

from sklearn.datasets import make_regression
from cuml.linear_model import Ridge as cuRidge
from sklearn.model_selection import train_test_split

n_samples = 2**16
n_features = 399

X, y = make_regression(n_samples=n_samples, n_features=n_features, 
                       random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=0)

ridge_cuml = cuRidge(fit_intercept=False,
                     normalize=True,
                     solver='svd',
                     alpha=0.1)

ridge_cuml.fit(X_train, y_train)

# predict_cuml = ridge_cuml.predict(X_test).to_array()
# error_cuml = mean_squared_error(y_test, predict_cuml)

NOTE: if n_samples is 2**15 or less, the code runs fine.

Expected behavior No crash

Environment details (please complete the following information):

Environment location: Bare metal and container
Linux Distro/Architecture: 18.04
GPU Model/Driver: V100 32GB
CUDA: 10.1 ONLY
Method of cuDF & cuML install:
- cuDF nightly from rapidsai-nightly
- cuML branch-0.10 built from source

Additional context Changing the dataset from float64 to float32 didn't seem to affect the behavior

oyilmaz-nvidia commented 5 years ago

I checked this bug and confirmed that the error comes from cuSOLVER in CUDA 10.1. They added an int overflow check to the following calls:

cusolverStatus_t 
cusolverDnSgesvd_bufferSize(
    cusolverDnHandle_t handle,
    int m,
    int n,
    int *lwork );

cusolverStatus_t 
cusolverDnDgesvd_bufferSize(
    cusolverDnHandle_t handle,
    int m,
    int n,
    int *lwork );

So they are basically checking whether m * m overflows. But we usually don't need all of the left singular vectors (the columns of matrix U). To get only the first n left singular vectors, we pass jobu = 'S' in the following calls:

cusolverStatus_t 
cusolverDnSgesvd (
    cusolverDnHandle_t handle,
    signed char jobu,
    signed char jobvt,
    int m,
    int n,
    float *A,
    int lda,
    float *S,
    float *U,
    int ldu,
    float *VT,
    int ldvt,
    float *work,
    int lwork,
    float *rwork,
    int *devInfo);

cusolverStatus_t 
cusolverDnDgesvd (
    cusolverDnHandle_t handle,
    signed char jobu,
    signed char jobvt,
    int m,
    int n,
    double *A,
    int lda,
    double *S,
    double *U,
    int ldu,
    double *VT,
    int ldvt,
    double *work,
    int lwork,
    double *rwork,
    int *devInfo);
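To put numbers on this for the reproducer above, here is a small arithmetic sketch. It assumes that train_test_split with test_size=0.2 rounds the test set up and gives the remainder to the training set, and that the check compares against the 32-bit signed int maximum; the helper name u_sizes is made up for illustration.

```python
import math

INT32_MAX = 2**31 - 1  # largest value of a 32-bit signed int


def u_sizes(n_samples, n_features, test_size=0.2):
    """Return (m, elements of full U, elements of thin U) for the training matrix.

    Assumption: the test split is rounded up and the training set gets the rest.
    """
    m = n_samples - math.ceil(test_size * n_samples)  # training rows
    n = n_features
    full_u = m * m           # U is m x m when all left singular vectors are requested
    thin_u = m * min(m, n)   # U is m x min(m, n) with jobu = 'S'
    return m, full_u, thin_u


for n_samples in (2**15, 2**16):
    m, full_u, thin_u = u_sizes(n_samples, 399)
    print(f"n_samples={n_samples}: m={m}, "
          f"m*m overflows int32: {full_u > INT32_MAX}, "
          f"m*n overflows int32: {thin_u > INT32_MAX}")
```

Under these assumptions only m * m crosses the int limit for n_samples = 2**16, which matches the observation that 2**15 works and 2**16 fails.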

They should have checked whether n * m overflows when jobu = 'S'. I have emailed the math libs team. We should emit a warning if anyone uses the svd solver on CUDA 10.1.
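A rough sketch of what such a warning could look like on the Python side. This is not cuML code: the function name warn_if_svd_may_fail and its parameters are hypothetical, the CUDA version is passed in by the caller as a (major, minor) tuple rather than detected, and the m * m condition is taken from the analysis above.

```python
import warnings

INT32_MAX = 2**31 - 1  # largest value of a 32-bit signed int


def warn_if_svd_may_fail(n_rows, cuda_version, solver="svd"):
    """Hypothetical guard: warn when the svd solver is likely to hit the
    m * m overflow check described above. Returns True if a warning was issued.

    cuda_version is a (major, minor) tuple supplied by the caller.
    """
    affected = (
        solver == "svd"
        and tuple(cuda_version) == (10, 1)
        and n_rows * n_rows > INT32_MAX
    )
    if affected:
        warnings.warn(
            "solver='svd' on CUDA 10.1 may fail with "
            "CUSOLVER_STATUS_INVALID_VALUE for this number of rows",
            RuntimeWarning,
        )
    return affected


# Example with the training row count from the reproducer (an assumption: 52428).
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flagged = warn_if_svd_may_fail(52428, (10, 1))
print(flagged, len(caught))
```

The guard only looks at the row count because, per the comment above, the failing check depends on m * m and not on the number of features.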

JohnZed commented 5 years ago

Blocked awaiting math libs bugfix.

rapidsai / cuml

[BUG] CUSOLVER_STATUS_INVALID_VALUE in ridge regression on CUDA 10.1 #1269