pghysels / STRUMPACK

Structured Matrix Package (LBNL)
http://portal.nersc.gov/project/sparse/strumpack/

Strumpack vs Cholmod #75

Open BehroozZare opened 1 year ago

BehroozZare commented 1 year ago

I know that Strumpack uses LU and Cholmod uses LLt, so it is a somewhat unfair comparison. However, I would like to find an example where using Strumpack is much more beneficial than using Cholmod on a single-node shared-memory system. Is there any example where low-rank approximation on a single node shows a significant performance benefit compared to Cholmod? I tried the 3D Poisson example in the Strumpack code base, and it didn't give me the performance benefit I was looking for.

pghysels commented 1 year ago

Yes, we can see a significant benefit with the testPoisson3d example. For instance, compare the exact sparse direct LU solver (in single precision):

OMP_NUM_THREADS=8 ./testPoisson3d 100 --sp_disable_gpu --sp_compression none
...
#   - factor time = 27.9501
...
#   - factor memory = 8238.88 MB
...
REFINEMENT it. 0    res =      6482.84  rel.res =            1  bw.error =            1
REFINEMENT it. 1    res =    0.0019654  rel.res =   3.0317e-07  bw.error =  2.76639e-06
...
#   - solve time = 0.299797

to the solver with block low-rank (BLR) compression enabled:

OMP_NUM_THREADS=8 ./testPoisson3d 100 --sp_disable_gpu --sp_compression blr
...
#   - factor time = 10.2616
...
#   - factor memory = 2544.16 MB
...
#   - factor memory/nonzeros = 30.8799 % of multifrontal
...
GMRES it. 0 res =      1000.88  rel.res =            1   restart!
GMRES it. 1 res =      6.63635  rel.res =   0.00663049
GMRES it. 2 res =      1.68189  rel.res =   0.00168041
GMRES it. 3 res =     0.566236  rel.res =  0.000565736
GMRES it. 4 res =     0.194374  rel.res =  0.000194202
GMRES it. 5 res =    0.0576096  rel.res =  5.75586e-05
...
#   - solve time = 0.794273

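The factorization time was reduced from about 28 seconds to about 10 seconds. The solve time has gone up, from about 0.3 to about 0.8 seconds, because the compressed factorization is used as a preconditioner for GMRES instead of as an exact solver. Overall, though, the compression-enabled preconditioner is faster than the direct solver, and it requires only ~31% of the memory of the exact solver.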
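As a quick check of those numbers, the sketch below recomputes the speedups and the memory fraction from the values printed in the two runs above. The dictionary keys and the `total_time` helper are illustrative names, not part of STRUMPACK, and the totals only add factor and solve time for a single right-hand side (reordering and other setup phases are not shown in the output above, so they are left out).

```python
# Timings (seconds) and factor memory (MB) copied from the two runs above.
exact = {"factor_s": 27.9501, "solve_s": 0.299797, "factor_mb": 8238.88}
blr = {"factor_s": 10.2616, "solve_s": 0.794273, "factor_mb": 2544.16}


def total_time(run):
    """Factor plus solve time for one right-hand side (hypothetical helper)."""
    return run["factor_s"] + run["solve_s"]


factor_speedup = exact["factor_s"] / blr["factor_s"]
total_speedup = total_time(exact) / total_time(blr)
memory_fraction = blr["factor_mb"] / exact["factor_mb"]

print(f"factor speedup:  {factor_speedup:.2f}x")
print(f"factor + solve:  {total_speedup:.2f}x")
print(f"memory fraction: {100 * memory_fraction:.1f} % of the exact factors")
```

The memory fraction computed this way agrees with the "factor memory/nonzeros" line reported by the BLR run. With many right-hand sides the higher per-solve cost would weigh more heavily, so the overall gain depends on how many solves follow each factorization.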

For LLt you would expect a speedup of at most 2x over LU. For BLR the speedups can be bigger for larger problems.
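One way to see where the 2x bound comes from is the standard leading-order flop count for dense factorizations, sketched here for an n-by-n matrix. This assumes the cost is dominated by the dense frontal matrices, so it is a rough guide for the sparse case, not an exact statement:

```latex
\text{flops}_{LU} \approx \tfrac{2}{3} n^{3},
\qquad
\text{flops}_{LL^{T}} \approx \tfrac{1}{3} n^{3},
\qquad
\frac{\text{flops}_{LU}}{\text{flops}_{LL^{T}}} \approx 2 .
```

By that count, the 2.7x factorization speedup measured above with BLR is already beyond what switching from LU to LLt alone could give.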