Open zjwegert opened 3 years ago
You need to convert Precilu.L
and Precilu.U
before to make it working on GPU.
L_gpu = CuSparseMatrixCSC(Precilu.L)
U_gpu = CuSparseMatrixCSC(Precilu.U)
After that you can use sv2!
for the backward and forward sweeps.
Be careful because the storage is a little bit specific for the U factor (@haampie I am right ?)
With Tim, we updated some functions for triangular types in CUSPARSE
to avoid the use of sv2!
for information (https://github.com/JuliaGPU/CUDA.jl/pull/575).
About your example, CG
requires a symmetric and positive definite preconditioner. ilu
is great for non-symmetric square systems and Krylov solvers bicgstab
, gmres
, etc....
An incomplete Cholesky or LDL' factorization is more appropriate for your problem.
Hi @amontoison, I gave that a go but no success! Do you mean something like
...
import Base: ldiv!
function LinearAlgebra.ldiv!(y::CuArray, P::LUgpu, x::CuArray)
copyto!(y, x)
sv2!('T', 'L', 1.0, P, y, 'O')
sv2!('N', 'U', 1.0, P, y, 'O',unit_diag=true)
return y
end
function LinearAlgebra.ldiv!(P::LUgpu, x::CuArray)
sv2!('T', 'L', 1.0, P, x, 'O')
sv2!('N', 'U', 1.0, P, x, 'O',unit_diag=true)
return x
end
struct LUgpu
L
Ut # transpose of U in LU decomposition
end
L_gpu = CuSparseMatrixCSC(sparse(Precilu.L+I))
U_gpu = CuSparseMatrixCSC(Precilu.U)
P = LUperso(LowerTriangular(L_gpu), UpperTriangular(U_gpu));
val = cg(A_gpu,b_gpu,verbose=true,Pl=P,tol=10^-7,maxiter=1000)
Note the change to unit_diag
which you recommended in the other issue. Let me know if you have any ideas. Also, thank you for the heads up in-regard to ILU vs Cholesky. Should I not be using sv2!
?
I think what @amontoison tries to say is if you use CUDA#master
(or is there a tagged version already?) you can do this:
using SparseArrays, LinearAlgebra, CUDA, IncompleteLU
using IncompleteLU: ILUFactorization
import LinearAlgebra: ldiv!
struct CudaILUFactorization{TL, TU}
L::TL
U::TU
end
function CudaILUFactorization(f::ILUFactorization)
L = UnitLowerTriangular(cu(f.L))
U = UpperTriangular(transpose(cu(f.U)))
return CudaILUFactorization(L, U)
end
LinearAlgebra.ldiv!(f::CudaILUFactorization, x) = ldiv!(f.U, ldiv!(f.L, x))
function example()
A = sprand(Float32, 1000, 1000, 10 / 1000) + 100I
fact = ilu(A, τ = 0.001f0);
cuda_fact = CudaILUFactorization(fact);
x = rand(Float32, 1000);
return norm(Array(ldiv!(cuda_fact, cu(x))) - fact \ x)
end
s.t.
julia> example()
2.401256f-8
Now, I'm not 100% sure anymore why I wasn't using the UnitLowerTriangular + Transpose types for the CPU. IIRC there was some performance issue or dispatch issue back in the days. Maybe that's resolved by now. I should check.
@Omega-xyZac you forgot the factors P.L
and P.Ut
.
function LinearAlgebra.ldiv!(y::CuArray, P::LUgpu, x::CuArray)
copyto!(y, x)
sv2!('T', 'L', 1.0, P.L, y, 'O')
sv2!('N', 'U', 1.0, P.Ut, y, 'O',unit_diag=true)
return y
end
function LinearAlgebra.ldiv!(P::LUgpu, x::CuArray)
sv2!('T', 'L', 1.0, P.L, x, 'O')
sv2!('N', 'U', 1.0, P.Ut, x, 'O',unit_diag=true)
return x
end
@haampie The code for the backward and forward sweeps was slow If I remember well. I don't know if it was improved but I know that the dispatch issues were solved.
The last release 2.4.0 should include these modifications Harmen but I don't know why I have the old behavior of sv2!
.
Thank you both for all the help. I really appreciate it. I'm still having issues with the GPU version VS CPU. CPU converges in 40 iterations whilst GPU never converges and hits iteration limit. I have attached an example A and b along with the script. Would appreciate any ideas.
A&b.zip &
using CUDA, CUDA.CUSPARSE, CUDA.CUSOLVER
using LinearAlgebra
using SparseArrays
using IncompleteLU
using IterativeSolvers
CUDA.allowscalar(false)
A = sparse(readdlm("./A.mat",' '))
b = readdlm("./b.mat",' ')
A_cpu = A
b_cpu = b/norm(b)
A_gpu = CuSparseMatrixCSC(A_cpu)
b_gpu = CuArray(b_cpu)
Precilu = ilu(A_cpu, τ = 3.0)
struct LUgpu{TL, TU}
L::TL
U::TU # transpose of U in LU decomposition
end
import Base: ldiv!
function LinearAlgebra.ldiv!(y::CuArray, P::LUgpu, x::CuArray)
copyto!(y, x)
sv2!('T', 'L', 1.0, P.L, y, 'O')
sv2!('N', 'U', 1.0, P.U, y, 'O',unit_diag=true)
return y
end
function LinearAlgebra.ldiv!(P::LUgpu, x::CuArray)
sv2!('T', 'L', 1.0, P.L, x, 'O')
sv2!('N', 'U', 1.0, P.U, x, 'O',unit_diag=true)
return x
end
P = LUgpu(CuSparseMatrixCSC(I+Precilu.L), CuSparseMatrixCSC(Precilu.U));
val = cg(A_gpu,b_gpu,verbose=true,Pl=P,tol=10^-7)
P_cpu = ilu(A_cpu,τ=3.0)
val = cg(A_cpu,b_cpu;verbose=true,Pl=P_cpu,tol=10^-7) # <--- Working as expected
Probably still doing something silly here...
I found the mistakes :
using CUDA, CUDA.CUSPARSE, CUDA.CUSOLVER
using LinearAlgebra
using SparseArrays
using IncompleteLU
CUDA.allowscalar(false)
using DelimitedFiles
using IterativeSolvers
A = sparse(readdlm("./A.mat",' '))
b = readdlm("./b.mat",' ')[:]
n, m = size(A)
A_cpu = A
b_cpu = b/norm(b)
A_gpu = CuSparseMatrixCSC(A_cpu)
b_gpu = CuArray(b_cpu)
Precilu = ilu(A_cpu, τ = 3.0)
struct LUgpu{TL, TU}
L::TL
U::TU # transpose of U in LU decomposition
end
import Base: ldiv!
function LinearAlgebra.ldiv!(y::CuArray, P::LUgpu, x::CuArray)
copyto!(y, x)
sv2!('N', 'L', 1.0, P.L, y, 'O')
sv2!('N', 'U', 1.0, P.U, y, 'O')
return y
end
function LinearAlgebra.ldiv!(P::LUgpu, x::CuArray)
sv2!('N', 'L', 1.0, P.L, x, 'O')
sv2!('N', 'U', 1.0, P.U, x, 'O')
return x
end
P = LUgpu(CuSparseMatrixCSC(I+Precilu.L), CuSparseMatrixCSC(sparse(Precilu.U')));
x1 = cg(A_gpu,b_gpu,verbose=true,Pl=P)
norm(b_gpu - A_gpu * x1)
P_cpu = ilu(A_cpu,τ=3.0)
x2 = cg(A_cpu,b_cpu;verbose=true,Pl=P_cpu)
norm(b_cpu - A_cpu * x2)
But as I explained before, It's not relevant to use ILU
factorization for symmetric positive definite linear systems...
If you also want to solve linear systems on GPU, you should compute the preconditioner on GPU, for example an Incomplete Cholesky factorization.
The best solution for your problem is from my point of view:
using LinearOperators
import Krylov
F = ic02(A_gpu, 'O')
# Solve Fy = x
function ldiv2!(y, F, x)
copyto!(y, x)
sv2!('T', 'U', 1.0, F, y, 'O')
sv2!('N', 'U', 1.0, F, y, 'O')
return y
end
# Operator that model F⁻¹
y = similar(b_gpu); n = length(b_gpu); T = eltype(b_gpu)
opM = LinearOperator(T, n, n, true, true, x -> ldiv2!(y, F, x))
# Solve a symmetric positive definite system with an incomplete Cholesky preconditioner on GPU
(x, stats) = Krylov.cg(A_gpu, b_gpu, M=opM)
Also, at least on the CPU, computing the incomplete Cholesky factorization is much more efficient than computing the incomplete LU decomposition, cause it barely requires bookkeeping for indices.
Thank you both for the info. The main issue that I was having with ic02
and even ilu02
was the lack of drop tolerance.. The example matrix I gave you is for a tiny system (6x6x6 FE cells). In reality, I am usually dealing with 500000x500000 matrices or larger. As such, using the full IC/ILU decompositions is very slow. Is there a way to set the drop tolerance for these?
ic02
and ilu02
are IC(0) and ILU(0) factorizations. The factors L and U computed by IncompleteLU have more nnz
.
Using one of the larger systems I have
nnz(Precilu.U),nnz(Precilu.L) = (569704,69708)
nnz(F) = 53999152
where F is ic02 and Precilu is ilu with a drop tolerance of 3.0.
A = sprand(Float32, 1000, 1000, 10 / 1000) + 100I
Hello @haampie ,
If the matrix is ill-conditioned, say A = sprand(Float32, 1000, 1000, 10 / 1000) + 0.1I
, the norm of the difference Array(ldiv!(cuda_fact, cu(x))) - fact \ x
becomes significant. What would be the possible cause? Thanks!
cond number: 7681.252426717046
norm: 3.2563258f14
Hi, I'm interested in using this package with a GPU CG method from IterativeMethods. I think only
forward_substitution
andbackward_substitution
methods are required to extend this to CUDA. I've tried using CUDAssv2!
with little success. Any thoughts?