numba / pyculib

Pyculib - Python bindings for CUDA libraries
BSD 2-Clause "Simplified" License
97 stars 11 forks source link

Solving a sparse system with "csrsv_solve" gives wrong result #12

Closed tryabin closed 7 years ago

tryabin commented 7 years ago

I'm trying to solve a simple sparse system using pyculib and am getting the wrong result. Here is the program I have, which also solves the system on the CPU to get the correct result.

import numpy as np
import scipy.sparse.linalg
import pyculib

handle = pyculib.sparse.Sparse()
dtype = np.float32
m = n = 3
trans = 'N'

# Initialize the CSR matrix on the host and GPU.
row = np.array([0, 0, 0, 1, 1, 2])
col = np.array([0, 1, 2, 0, 1, 2])
data = np.array([0.431663, 0.955176, 0.925239, 0.0283651, 0.569277, 0.48015], dtype=dtype)

csrMatrixCpu = scipy.sparse.csr_matrix((data, (row, col)), shape=(m, n))
csrMatrixGpu = pyculib.sparse.csr_matrix((data, (row, col)), shape=(m, n))

print(csrMatrixCpu)

# Perform the analysis step on the GPU.
nnz = csrMatrixGpu.nnz
csrVal = csrMatrixGpu.data
csrRowPtr = csrMatrixGpu.indptr
csrColInd = csrMatrixGpu.indices

descr = handle.matdescr()
info = handle.csrsv_analysis(trans, m, nnz, descr, csrVal, csrRowPtr, csrColInd)

# Initialize the right-hand side of the system.
alpha = 1.0
rightHandSide = np.array([0.48200423, 0.39379725, 0.75963706], dtype=dtype)
gpuResult = np.zeros(m, dtype=dtype)

# Solve the system on the GPU and on the CPU.
handle.csrsv_solve(trans, m, alpha, descr, csrVal, csrRowPtr, csrColInd, info, rightHandSide, gpuResult)
cpuResult = scipy.sparse.linalg.dsolve.spsolve(csrMatrixCpu, rightHandSide, use_umfpack=False)

print('gpu result = ' + str(gpuResult))
print('cpu result = ' + str(cpuResult))

The expected result vector is: [-4.27667809 0.9048419 1.58208275]

but solving the system on the GPU gives the result vector: [ 1.11662161 0.63611245 1.58208275]

stuartarchibald commented 7 years ago

Thanks for the report. I think there's a few things going on here:

  1. The matrix supplied is not triangular but is being treated as such in the CUDA solvers (csrsv_solve deals with solution to triangular matrices) but treated as-is on the CPU side.
  2. The matrix presented is also almost upper triangular, assuming the intent was that it is upper triangular, this needs declaring in the matdescr http://pyculib.readthedocs.io/en/latest/cusparse.html for example descr = handle.matdescr(0, 'N', 'U', 'G'), the 'U' overrides the default 'L' (lower) storage. In the code presented only the lower triangle is considered by CUDA whilst doing the solve.

This demonstrates what's going on:

import numpy as np
import scipy.sparse.linalg
import pyculib

handle = pyculib.sparse.Sparse()
dtype = np.float32
m = n = 3
trans = 'N'

# Initialize the CSR matrix on the host and GPU.
row = np.array([0, 0, 0, 1, 1, 2])
col = np.array([0, 1, 2, 0, 1, 2])
data = np.array([0.431663, 0.955176, 0.925239, 0.0283651, 0.569277, 0.48015], dtype=dtype)

csrMatrixCpu = scipy.sparse.csr_matrix((data, (row, col)), shape=(m, n))
csrMatrixGpu = pyculib.sparse.csr_matrix((data, (row, col)), shape=(m, n))

print(csrMatrixCpu)
print(csrMatrixCpu.todense())

# Perform the analysis step on the GPU.
nnz = csrMatrixGpu.nnz
csrVal = csrMatrixGpu.data
csrRowPtr = csrMatrixGpu.indptr
csrColInd = csrMatrixGpu.indices

descr = handle.matdescr()
info = handle.csrsv_analysis(trans, m, nnz, descr, csrVal, csrRowPtr, csrColInd)

# Initialize the right-hand side of the system.
alpha = 1.0
rightHandSide = np.array([0.48200423, 0.39379725, 0.75963706], dtype=dtype)
gpuResult = np.zeros(m, dtype=dtype)

# Solve the system on the GPU and on the CPU.
handle.csrsv_solve(trans, m, alpha, descr, csrVal, csrRowPtr, csrColInd, info, rightHandSide, gpuResult)

# The system solved on the GPU is not that on the CPU...
# it was declared as lower triangular, so mutate the CPU version to also be lower triangular
csrMatrixCpu = scipy.sparse.csr_matrix(np.tril(csrMatrixCpu.todense()))
print(csrMatrixCpu.todense())
cpuResult = scipy.sparse.linalg.dsolve.spsolve(csrMatrixCpu, rightHandSide, use_umfpack=False)

print('gpu result = ' + str(gpuResult))
print('cpu result = ' + str(cpuResult))

which gives:

$ python issue12b.py 
  (0, 0)        0.431663
  (0, 1)        0.955176
  (0, 2)        0.925239
  (1, 0)        0.0283651
  (1, 1)        0.569277
  (2, 2)        0.48015
[[ 0.43166301  0.955176    0.92523903]
 [ 0.0283651   0.56927699  0.        ]
 [ 0.          0.          0.48015001]]
[[ 0.43166301  0.          0.        ]
 [ 0.0283651   0.56927699  0.        ]
 [ 0.          0.          0.48015001]]
gpu result = [ 1.11662161  0.63611245  1.58208275]
cpu result = [ 1.11662161  0.63611245  1.58208275]

Further, this shows how to handle an upper triangular system a bit like the one above, using the matdescr constructor.

import numpy as np
import scipy.sparse.linalg
import pyculib

handle = pyculib.sparse.Sparse()
dtype = np.float32
m = n = 3
trans = 'N'

# Initialize the CSR matrix on the host and GPU.
row = np.array([0, 0, 0, 1, 2])
col = np.array([0, 1, 2, 1, 2])
data = np.array([0.431663, 0.955176, 0.925239, 0.569277, 0.48015], dtype=dtype)

csrMatrixCpu = scipy.sparse.csr_matrix((data, (row, col)), shape=(m, n))
csrMatrixGpu = pyculib.sparse.csr_matrix((data, (row, col)), shape=(m, n))

print(csrMatrixCpu)
print(csrMatrixCpu.todense())

# Perform the analysis step on the GPU.
nnz = csrMatrixGpu.nnz
csrVal = csrMatrixGpu.data
csrRowPtr = csrMatrixGpu.indptr
csrColInd = csrMatrixGpu.indices

descr = handle.matdescr(0, 'N', 'U', 'G')
info = handle.csrsv_analysis(trans, m, nnz, descr, csrVal, csrRowPtr, csrColInd)

# Initialize the right-hand side of the system.
alpha = 1.0
rightHandSide = np.array([0.48200423, 0.39379725, 0.75963706], dtype=dtype)
gpuResult = np.zeros(m, dtype=dtype)

# Solve the system on the GPU and on the CPU.
handle.csrsv_solve(trans, m, alpha, descr, csrVal, csrRowPtr, csrColInd, info, rightHandSide, gpuResult)
cpuResult = scipy.sparse.linalg.dsolve.spsolve(csrMatrixCpu, rightHandSide, use_umfpack=False)

print('gpu result = ' + str(gpuResult))
print('cpu result = ' + str(cpuResult))

which gives:

$ python issue12b.py 
  (0, 0)        0.431663
  (0, 1)        0.955176
  (0, 2)        0.925239
  (1, 1)        0.569277
  (2, 2)        0.48015
[[ 0.43166301  0.955176    0.92523903]
 [ 0.          0.56927699  0.        ]
 [ 0.          0.          0.48015001]]
gpu result = [-3.80515194  0.69174981  1.58208275]
cpu result = [-3.80515194  0.69174981  1.58208275]

Hope this helps.

tryabin commented 7 years ago

The matrix was not intended to be triangular. I overlooked the fact that csrsv_solve only works on triangular matrices.

Is it possible to solve a sparse non-triangular system using Pyculib? I see there is a function for performing Cholesky factorization, but it requires a call to csrsv_analysis first, which in turn only works for triangular matrices.

seibert commented 7 years ago

It looks like this could be improved with some of the other functions in cuSPARSE, which is what pyculib is delegating to.

stuartarchibald commented 7 years ago

IIRC There's no non-triangular sparse system solver supported at present (exception being tridiagonal systems) as pyculib binds solely to the Sparse BLAS which at L2/3 only has triangular solvers (and similar) as it is fundamentally a BLAS.

There's a few things to do for solver support:

  1. Bind to the cuSOLVER library.
  2. Expose the API for Dense LAPACK like routines from cuSolverDN.
  3. Expose the API for Sparse LAPACK like routines from cuSolverSP.

The cudatoolkit packages in the Numba channel have the library present. Pull requests/contributions are welcome, https://github.com/numba/pyculib/issues/3 is tracking/for discussion.

tryabin commented 7 years ago

Thanks, that's what I suspected. However in my use case of real-time simulation a workaround I found was to first factor the matrix on the CPU just once at the start, and then solve the system for different right-hand sides on the GPU by using the pre-computed lower and upper triangular matrices.