Open BehroozZare opened 1 year ago
Yes, we can see a significant benefit with the testPoisson3d example. For instance, compare the exact sparse direct LU solver (in single precision):
OMP_NUM_THREADS=8 ./testPoisson3d 100 --sp_disable_gpu --sp_compression none
...
# - factor time = 27.9501
...
# - factor memory = 8238.88 MB
...
REFINEMENT it. 0 res = 6482.84 rel.res = 1 bw.error = 1
REFINEMENT it. 1 res = 0.0019654 rel.res = 3.0317e-07 bw.error = 2.76639e-06
...
# - solve time = 0.299797
to the solver with block low rank compression enabled:
OMP_NUM_THREADS=8 ./testPoisson3d 100 --sp_disable_gpu --sp_compression blr
...
# - factor time = 10.2616
...
# - factor memory = 2544.16 MB
...
# - factor memory/nonzeros = 30.8799 % of multifrontal
...
GMRES it. 0 res = 1000.88 rel.res = 1 restart!
GMRES it. 1 res = 6.63635 rel.res = 0.00663049
GMRES it. 2 res = 1.68189 rel.res = 0.00168041
GMRES it. 3 res = 0.566236 rel.res = 0.000565736
GMRES it. 4 res = 0.194374 rel.res = 0.000194202
GMRES it. 5 res = 0.0576096 rel.res = 5.75586e-05
...
# - solve time = 0.794273
The factorization time was reduced from about 28 seconds to about 10 seconds. The solve time has gone up (from about 0.3 to about 0.8 seconds), but overall the compression-enabled preconditioner is faster than the direct solver, and it requires only ~31% of the factor memory of the exact solver.
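To make the comparison concrete, here is a small Python sketch that recomputes the ratios from the figures in the two logs above. The dictionary keys and the `summarize` helper are names made up for this illustration. The break-even estimate assumes that each additional solve costs the same as the single solve reported in the logs, which may not hold in practice.

```python
# Figures copied from the two testPoisson3d logs above (seconds and MB).
exact = {"factor_s": 27.9501, "solve_s": 0.299797, "memory_mb": 8238.88}
blr = {"factor_s": 10.2616, "solve_s": 0.794273, "memory_mb": 2544.16}


def summarize(exact, blr):
    """Return speedup and memory ratios of the BLR run relative to the exact run."""
    total_exact = exact["factor_s"] + exact["solve_s"]
    total_blr = blr["factor_s"] + blr["solve_s"]
    # Assumption: every extra solve costs the same as the one in the log.
    # The BLR factorization saves time once; each BLR solve costs extra.
    factor_saving = exact["factor_s"] - blr["factor_s"]
    extra_per_solve = blr["solve_s"] - exact["solve_s"]
    return {
        "factor_speedup": exact["factor_s"] / blr["factor_s"],
        "total_speedup": total_exact / total_blr,
        "solve_slowdown": blr["solve_s"] / exact["solve_s"],
        "memory_fraction": blr["memory_mb"] / exact["memory_mb"],
        "break_even_solves": factor_saving / extra_per_solve,
    }


summary = summarize(exact, blr)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Because the BLR solve is slower per right-hand side, the overall advantage narrows when one factorization is reused for many solves. The `break_even_solves` value is a rough estimate of where that happens under the assumption stated above.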
Switching from LU to LLt, you would expect a speedup of at most 2x. With BLR compression, the speedups can be bigger for larger problems.
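One way to read the 2x bound is through the standard leading-order operation counts for dense factorizations: roughly 2n^3/3 flops for LU and n^3/3 for Cholesky (LLt). The sketch below only illustrates that textbook ratio. The function names are made up for this example, and it is an assumption that the dense counts carry over to the sparse multifrontal setting.

```python
def lu_flops(n):
    """Leading-order flop count for dense LU of an n x n matrix."""
    return 2 * n**3 / 3


def cholesky_flops(n):
    """Leading-order flop count for dense Cholesky (LLt) of an n x n matrix."""
    return n**3 / 3


def llt_speedup_bound(n):
    """Ratio of LU to Cholesky work, ignoring lower-order terms."""
    return lu_flops(n) / cholesky_flops(n)


for n in (1_000, 10_000, 100_000):
    print(f"n = {n}: LU/LLt flop ratio = {llt_speedup_bound(n):.1f}")
```

The ratio is independent of n, so exploiting symmetry alone gives a constant-factor gain. A compression scheme that reduces the work on large fronts is not limited in the same way.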
I know that Strumpack uses LU and Cholmod uses LLt, so the comparison is a bit unfair. However, I would like to find an example where using Strumpack is much more beneficial than using Cholmod on a single-node shared-memory system. Is there any example where low-rank approximation on a single node shows a significant performance benefit compared to Cholmod? I tried the 3D Poisson example in the Strumpack code base and it didn't give me the performance benefits that I was looking for.