Open dantegd opened 5 years ago
Checked this bug and confirmed that the error is coming from cuSolver in CUDA 10.1. They put an int overflow check in the following calls:
cusolverStatus_t
cusolverDnSgesvd_bufferSize(
    cusolverDnHandle_t handle,
    int m,
    int n,
    int *lwork );

cusolverStatus_t
cusolverDnDgesvd_bufferSize(
    cusolverDnHandle_t handle,
    int m,
    int n,
    int *lwork );
So they basically check whether m * m overflows. But we usually don't need all m of the left singular vectors (the columns of the matrix U). To get only the first n of them, we pass jobu = 'S' in the following calls:
cusolverStatus_t
cusolverDnSgesvd (
    cusolverDnHandle_t handle,
    signed char jobu,
    signed char jobvt,
    int m,
    int n,
    float *A,
    int lda,
    float *S,
    float *U,
    int ldu,
    float *VT,
    int ldvt,
    float *work,
    int lwork,
    float *rwork,
    int *devInfo);

cusolverStatus_t
cusolverDnDgesvd (
    cusolverDnHandle_t handle,
    signed char jobu,
    signed char jobvt,
    int m,
    int n,
    double *A,
    int lda,
    double *S,
    double *U,
    int ldu,
    double *VT,
    int ldvt,
    double *work,
    int lwork,
    double *rwork,
    int *devInfo);
They should have checked whether n * m overflows when jobu = 'S'. I have emailed the math libs team. We should emit a warning if anyone uses the SVD solver on CUDA 10.1.
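To make the arithmetic concrete, here is a minimal sketch in plain Python (no GPU needed) of the two element counts being compared. It is not cuSolver's actual implementation; the helper names `u_elements` and `overflows_int32` and the example shape (2**16 rows, 64 columns) are assumptions chosen for illustration only.

```python
INT32_MAX = 2**31 - 1  # largest value representable in a 32-bit signed C int


def u_elements(m: int, n: int, jobu: str) -> int:
    """Element count of U for an m x n input, for the two jobu modes discussed here."""
    if jobu == "A":
        # All m left singular vectors: U is m x m.
        return m * m
    if jobu == "S":
        # Only the first min(m, n) left singular vectors: U is m x min(m, n).
        return m * min(m, n)
    raise ValueError("this sketch only covers jobu 'A' and 'S'")


def overflows_int32(count: int) -> bool:
    """True if the count cannot be stored in a 32-bit signed int."""
    return count > INT32_MAX


# Hypothetical tall-and-skinny dataset: 2**16 rows, 64 features.
m, n = 2**16, 64
print(overflows_int32(u_elements(m, n, "A")))  # m * m = 2**32
print(overflows_int32(u_elements(m, n, "S")))  # m * n = 2**22
```

For this shape the m * m count exceeds the int range while the m * n count does not, which is why an unconditional m * m check rejects a problem that jobu = 'S' could otherwise handle.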
Blocked awaiting math libs bugfix.
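Until a fix lands, the warning could look roughly like the sketch below. This is only an illustration: `warn_if_svd_may_fail` is a hypothetical helper, not an existing cuML function, and it assumes the caller already knows the CUDA version as a `(major, minor)` tuple. It flags only CUDA 10.1, since that is the only version confirmed affected in this thread.

```python
import warnings

INT32_MAX = 2**31 - 1  # largest value representable in a 32-bit signed C int


def warn_if_svd_may_fail(cuda_version, n_rows):
    """Hypothetical guard: warn when the CUDA 10.1 m * m check would trip.

    cuda_version is a (major, minor) tuple supplied by the caller.
    Returns True if a warning was issued, False otherwise.
    """
    affected = tuple(cuda_version[:2]) == (10, 1) and n_rows * n_rows > INT32_MAX
    if affected:
        warnings.warn(
            "SVD solver on CUDA 10.1 may fail for n_rows=%d because "
            "n_rows * n_rows overflows a 32-bit int in cuSolver's "
            "buffer-size check." % n_rows,
            RuntimeWarning,
        )
    return affected
```

Returning a boolean as well as warning lets a caller decide whether to fall back to a different solver, if one is available.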
Describe the bug
When datasets are bigger than a certain size, ridge regression returns
CUDA 10.0 and 9.2 do not have this issue.
Steps/Code to reproduce bug
Run the following code with the latest branch-0.10 cuML (nightly conda package or from source):
NOTE: if n_rows is 2**15 or less, then the code runs fine.
Expected behavior
No crash
Environment details (please complete the following information):
Additional context
Changing the dataset from float64 to float32 didn't seem to affect the behavior.
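Both observations (2**15 rows work, and the dtype makes no difference) are consistent with a limit on an element count stored in a 32-bit int rather than on a byte size. The sketch below assumes the limiting quantity is n_rows * n_rows, as described in the maintainer comment, and computes the largest row count that would pass under that assumption.

```python
import math

INT32_MAX = 2**31 - 1  # largest value representable in a 32-bit signed C int

# Largest n_rows for which n_rows * n_rows still fits in an int32,
# assuming that product is the quantity being checked.
largest_safe_rows = math.isqrt(INT32_MAX)

print(largest_safe_rows)                 # 46340
print(2**15 <= largest_safe_rows)        # 2**15 rows would pass the check
print(2**16 <= largest_safe_rows)        # 2**16 rows would not

# The count is in elements, not bytes, so float32 vs float64 does not change it.
```

Under this assumption the real cutoff would sit between 2**15 and 2**16 rows rather than exactly at 2**15.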