haampie / IncompleteLU.jl

Crout ILU for Julia
MIT License

Using with CUDA Solver? #18

Open zjwegert opened 3 years ago

zjwegert commented 3 years ago

Hi, I'm interested in using this package with a GPU CG method from IterativeSolvers. I think only the forward_substitution and backward_substitution methods need to be extended for this to work with CUDA. I've tried using CUDA's sv2! with little success. Any thoughts?

using CUDA, CUDA.CUSPARSE, CUDA.CUSOLVER
using LinearAlgebra
using SparseArrays
using IncompleteLU
using IterativeSolvers
CUDA.allowscalar(false)

val = sprand(200,200,0.05);
A_cpu = val*val'
b_cpu = rand(200)

A_gpu = CuSparseMatrixCSC(A_cpu)
b_gpu = CuArray(b_cpu)

Precilu = ilu(A_cpu, τ = 3.0)

struct LUgpu
    L
    Ut  # transpose of U in LU decomposition
end

import Base: ldiv!
function LinearAlgebra.ldiv!(y::CuArray, P::LUgpu, x::CuArray)
  copyto!(y, x)                       
  sv2!('N', 'L', 1.0, P, y, 'O') 
  sv2!('N', 'U', 1.0, P, y, 'O')  
  return y
end

function LinearAlgebra.ldiv!(P::LUgpu, x::CuArray)                        
  sv2!('N', 'L', 1.0, P, x, 'O') 
  sv2!('N', 'U', 1.0, P, x, 'O')  
  return x
end

# function LinearAlgebra.ldiv!(_lu::LUperso, rhs::CUDA.CuArray)
#   _x = UpperTriangular(_lu.Ut) \ (LowerTriangular(_lu.L) \ rhs)
#   rhs .= vec(_x)
#   # CUDA.unsafe_free!(_x)
#   rhs
# end
#
# function LinearAlgebra.ldiv!(yrhs::CuArray,_lu::LUperso, rhs::CuArray)
#   copyto!(yrhs,rhs)
#   _x = UpperTriangular(_lu.Ut) \ (LowerTriangular(_lu.L) \ rhs)
#   rhs .= vec(_x)
#   # CUDA.unsafe_free!(_x)
#   rhs
# end


P = LUgpu(LowerTriangular(CuSparseMatrixCSR(I+Precilu.L)), UpperTriangular(CuSparseMatrixCSR(sparse(Precilu.U'))));
val = cg(A_gpu,b_gpu,verbose=true,Pl=P,tol=1e-7,maxiter=1000)

P_cpu = ilu(A_cpu,τ=3.0)
val = cg(A_cpu,b_cpu;verbose=true,Pl=P_cpu,tol=1e-7,maxiter=1000)
A_cpu\b_cpu
amontoison commented 3 years ago

You need to convert Precilu.L and Precilu.U first to make it work on the GPU.

L_gpu = CuSparseMatrixCSC(Precilu.L)
U_gpu = CuSparseMatrixCSC(Precilu.U)

After that you can use sv2! for the backward and forward sweeps. Be careful, because the storage of the U factor is a little bit specific (@haampie, am I right?). For information, with Tim we updated some functions for triangular types in CUSPARSE to avoid the use of sv2! (https://github.com/JuliaGPU/CUDA.jl/pull/575).

About your example: CG requires a symmetric positive definite preconditioner. ilu is great for non-symmetric square systems together with Krylov solvers like bicgstab, gmres, etc. An incomplete Cholesky or LDL' factorization is more appropriate for your problem.
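
For reference, a minimal CPU sketch of that pairing (the matrix here is a random nonsymmetric test matrix I made up, not your actual problem):

using SparseArrays, LinearAlgebra, IncompleteLU, IterativeSolvers

# Hypothetical nonsymmetric, diagonally dominant test problem
A = sprand(1000, 1000, 5 / 1000) + 10I
b = rand(1000)

LU = ilu(A, τ = 0.1)                                    # Crout ILU with a drop tolerance
x = bicgstabl(A, b, 2, Pl = LU, max_mv_products = 2000)
norm(b - A * x)                                         # small if the solver converged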

zjwegert commented 3 years ago

Hi @amontoison, I gave that a go but had no success. Do you mean something like the following?

...
struct LUgpu
    L
    Ut  # transpose of U in LU decomposition
end

import Base: ldiv!
function LinearAlgebra.ldiv!(y::CuArray, P::LUgpu, x::CuArray)
  copyto!(y, x)
  sv2!('T', 'L', 1.0, P, y, 'O')
  sv2!('N', 'U', 1.0, P, y, 'O',unit_diag=true)
  return y
end

function LinearAlgebra.ldiv!(P::LUgpu, x::CuArray)
  sv2!('T', 'L', 1.0, P, x, 'O')
  sv2!('N', 'U', 1.0, P, x, 'O',unit_diag=true)
  return x
end


L_gpu = CuSparseMatrixCSC(sparse(Precilu.L+I))
U_gpu = CuSparseMatrixCSC(Precilu.U)

P = LUgpu(LowerTriangular(L_gpu), UpperTriangular(U_gpu));
val = cg(A_gpu,b_gpu,verbose=true,Pl=P,tol=1e-7,maxiter=1000)

Note the change to unit_diag, which you recommended in the other issue. Let me know if you have any ideas. Also, thank you for the heads-up regarding ILU vs Cholesky. Should I not be using sv2!?

haampie commented 3 years ago

I think what @amontoison is trying to say is that if you use CUDA#master (or is there a tagged version already?) you can do this:


using SparseArrays, LinearAlgebra, CUDA, IncompleteLU
using IncompleteLU: ILUFactorization
import LinearAlgebra: ldiv!

struct CudaILUFactorization{TL, TU}
    L::TL
    U::TU
end

function CudaILUFactorization(f::ILUFactorization)
    L = UnitLowerTriangular(cu(f.L))
    U = UpperTriangular(transpose(cu(f.U)))

    return CudaILUFactorization(L, U)
end

LinearAlgebra.ldiv!(f::CudaILUFactorization, x) = ldiv!(f.U, ldiv!(f.L, x))

function example()
    A = sprand(Float32, 1000, 1000, 10 / 1000) + 100I
    fact = ilu(A, τ = 0.001f0);
    cuda_fact = CudaILUFactorization(fact);

    x = rand(Float32, 1000);

    return norm(Array(ldiv!(cuda_fact, cu(x))) - fact \ x)
end

s.t.

julia> example()
2.401256f-8
haampie commented 3 years ago

Now, I'm not 100% sure anymore why I wasn't using the UnitLowerTriangular + Transpose types for the CPU. IIRC there was some performance or dispatch issue back in the day. Maybe that's resolved by now; I should check.
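
Roughly, the CPU analogue of the snippet above would look like the sketch below (untested; cpu_ldiv! is just an illustrative name, and the lazy transpose of the U factor probably falls back to a generic, slower triangular solve, which may be the performance issue I remember):

using SparseArrays, LinearAlgebra, IncompleteLU
using IncompleteLU: ILUFactorization

# Wrap the stored factors in triangular types instead of calling
# forward/backward substitution directly.
cpu_ldiv!(f::ILUFactorization, x) =
    ldiv!(UpperTriangular(transpose(f.U)), ldiv!(UnitLowerTriangular(f.L), x))

A = sprand(1000, 1000, 10 / 1000) + 100I
fact = ilu(A, τ = 0.001)
x = rand(1000)
y = cpu_ldiv!(fact, copy(x))
norm(y - fact \ x)   # should be around machine precision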

amontoison commented 3 years ago

@Omega-xyZac you forgot the factors P.L and P.Ut.

function LinearAlgebra.ldiv!(y::CuArray, P::LUgpu, x::CuArray)
    copyto!(y, x)
    sv2!('T', 'L', 1.0, P.L, y, 'O')
    sv2!('N', 'U', 1.0, P.Ut, y, 'O',unit_diag=true)
    return y
end

function LinearAlgebra.ldiv!(P::LUgpu, x::CuArray)
    sv2!('T', 'L', 1.0, P.L, x, 'O')
    sv2!('N', 'U', 1.0, P.Ut, x, 'O',unit_diag=true)
    return x
end

@haampie The code for the backward and forward sweeps was slow, if I remember correctly. I don't know whether it has been improved, but I know that the dispatch issues were solved.

The last release, 2.4.0, should include these modifications, Harmen, but I don't know why I still get the old behavior of sv2!.

zjwegert commented 3 years ago

Thank you both for all the help, I really appreciate it. I'm still having issues with the GPU version vs the CPU one: the CPU converges in 40 iterations whilst the GPU never converges and hits the iteration limit. I have attached an example A and b along with the script. I would appreciate any ideas.

A&b.zip

using CUDA, CUDA.CUSPARSE, CUDA.CUSOLVER
using LinearAlgebra
using SparseArrays
using IncompleteLU
using IterativeSolvers
using DelimitedFiles
CUDA.allowscalar(false)

A = sparse(readdlm("./A.mat",' '))
b = readdlm("./b.mat",' ')

A_cpu = A
b_cpu = b/norm(b)

A_gpu = CuSparseMatrixCSC(A_cpu)
b_gpu = CuArray(b_cpu)

Precilu = ilu(A_cpu, τ = 3.0)

struct LUgpu{TL, TU}
    L::TL
    U::TU   # transpose of U in LU decomposition
end

import Base: ldiv!
function LinearAlgebra.ldiv!(y::CuArray, P::LUgpu, x::CuArray)
  copyto!(y, x)
  sv2!('T', 'L', 1.0, P.L, y, 'O')
  sv2!('N', 'U', 1.0, P.U, y, 'O',unit_diag=true)
  return y
end

function LinearAlgebra.ldiv!(P::LUgpu, x::CuArray)
  sv2!('T', 'L', 1.0, P.L, x, 'O')
  sv2!('N', 'U', 1.0, P.U, x, 'O',unit_diag=true)
  return x
end

P = LUgpu(CuSparseMatrixCSC(I+Precilu.L), CuSparseMatrixCSC(Precilu.U));
val = cg(A_gpu,b_gpu,verbose=true,Pl=P,tol=1e-7)

P_cpu = ilu(A_cpu,τ=3.0)
val = cg(A_cpu,b_cpu;verbose=true,Pl=P_cpu,tol=1e-7) # <--- Working as expected

Probably still doing something silly here...

amontoison commented 3 years ago

I found the mistakes:

using CUDA, CUDA.CUSPARSE, CUDA.CUSOLVER
using LinearAlgebra
using SparseArrays
using IncompleteLU
CUDA.allowscalar(false)
using DelimitedFiles
using IterativeSolvers

A = sparse(readdlm("./A.mat",' '))
b = readdlm("./b.mat",' ')[:]
n, m = size(A)

A_cpu = A
b_cpu = b/norm(b)

A_gpu = CuSparseMatrixCSC(A_cpu)
b_gpu = CuArray(b_cpu)

Precilu = ilu(A_cpu, τ = 3.0)

struct LUgpu{TL, TU}
  L::TL
  U::TU # transpose of U in LU decomposition
end

import Base: ldiv!
function LinearAlgebra.ldiv!(y::CuArray, P::LUgpu, x::CuArray)
  copyto!(y, x)
  sv2!('N', 'L', 1.0, P.L, y, 'O')
  sv2!('N', 'U', 1.0, P.U, y, 'O')
  return y
end

function LinearAlgebra.ldiv!(P::LUgpu, x::CuArray)
  sv2!('N', 'L', 1.0, P.L, x, 'O')
  sv2!('N', 'U', 1.0, P.U, x, 'O')
  return x
end

P = LUgpu(CuSparseMatrixCSC(I+Precilu.L), CuSparseMatrixCSC(sparse(Precilu.U')));
x1 = cg(A_gpu,b_gpu,verbose=true,Pl=P)
norm(b_gpu - A_gpu * x1)

P_cpu = ilu(A_cpu,τ=3.0)
x2 = cg(A_cpu,b_cpu;verbose=true,Pl=P_cpu)
norm(b_cpu - A_cpu * x2)

But as I explained before, it's not relevant to use an ILU factorization for symmetric positive definite linear systems... If you also want to solve linear systems on the GPU, you should compute the preconditioner on the GPU, for example an incomplete Cholesky factorization.

From my point of view, the best solution for your problem is:

using LinearOperators
import Krylov

F = ic02(A_gpu, 'O')

# Solve Fy = x
function ldiv2!(y, F, x)
  copyto!(y, x)
  sv2!('T', 'U', 1.0, F, y, 'O')
  sv2!('N', 'U', 1.0, F, y, 'O')
  return y
end

# Operator that models F⁻¹
y = similar(b_gpu); n = length(b_gpu); T = eltype(b_gpu)
opM = LinearOperator(T, n, n, true, true, x -> ldiv2!(y, F, x))

# Solve a symmetric positive definite system with an incomplete Cholesky preconditioner on GPU
(x, stats) = Krylov.cg(A_gpu, b_gpu, M=opM)
haampie commented 3 years ago

Also, at least on the CPU, computing the incomplete Cholesky factorization is much more efficient than computing the incomplete LU decomposition, because it barely requires bookkeeping for indices.

zjwegert commented 3 years ago

Thank you both for the info. The main issue that I was having with ic02 and even ilu02 was the lack of a drop tolerance. The example matrix I gave you is for a tiny system (6x6x6 FE cells); in reality, I am usually dealing with 500000x500000 matrices or larger. As such, using the full IC/ILU decompositions is very slow. Is there a way to set the drop tolerance for these?

amontoison commented 3 years ago

ic02 and ilu02 are IC(0) and ILU(0) factorizations: they keep the sparsity pattern of A, so there is no drop tolerance to set. The factors L and U computed by IncompleteLU can have more nnz.
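
To get a feel for the difference you can compare the fill-in on the CPU; a rough sketch, with a small random SPD matrix standing in for your FE matrix (the lower triangle of A is the pattern an IC(0)/ILU(0) factor keeps, while τ controls what ilu drops):

using SparseArrays, LinearAlgebra, IncompleteLU

val = sprand(200, 200, 0.05)
A = val * val' + I                 # small SPD stand-in for the FE matrix

fact = ilu(A, τ = 3.0)             # thresholded (Crout) ILU: small entries are dropped
nnz(tril(A))                       # pattern an IC(0)/ILU(0) factor would keep
nnz(fact.L) + nnz(fact.U)          # entries actually kept by the thresholded ILU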

zjwegert commented 3 years ago

Using one of the larger systems I have:

nnz(Precilu.U),nnz(Precilu.L) = (569704,69708)
nnz(F) = 53999152

where F comes from ic02 and Precilu from ilu with a drop tolerance of 3.0.

cuihantao commented 2 years ago
    A = sprand(Float32, 1000, 1000, 10 / 1000) + 100I

Hello @haampie ,

If the matrix is ill-conditioned, say A = sprand(Float32, 1000, 1000, 10 / 1000) + 0.1I, the norm of the difference Array(ldiv!(cuda_fact, cu(x))) - fact \ x becomes significant. What would be the possible cause? Thanks!

cond number: 7681.252426717046
norm: 3.2563258f14