hypre-space / hypre

Parallel solvers for sparse linear systems featuring multigrid methods.
https://www.llnl.gov/casc/hypre/

BoomerAMG GPU performance & configuration guidance #120

Closed zjibben closed 4 years ago

zjibben commented 4 years ago

I've been building Hypre with GPU support and running BoomerAMG, with disappointing results. I wanted to check if this is expected for now or if I'm doing something incorrectly.

I'm building master @ bd1f981, using this configuration:

./configure CC=mpicc CXX=mpicxx HYPRE_CUDA_SM=70 --prefix=$HOME/opt/hypre/master-gpu --enable-shared --enable-unified-memory --with-cuda --with-MPI --disable-fortran

I'm configuring BoomerAMG as a preconditioner using:

HYPRE_BoomerAMGSetCoarsenType(solver, 6); // Falgout coarsening
HYPRE_BoomerAMGSetRelaxType(solver, 3); // hybrid Gauss-Seidel smoothing
HYPRE_BoomerAMGSetNumSweeps(solver, 1);
HYPRE_BoomerAMGSetMaxLevels(solver, 25);
HYPRE_BoomerAMGSetMaxIter(solver, 2);
HYPRE_BoomerAMGSetTol(solver, 0.0);
HYPRE_BoomerAMGSetStrongThreshold(solver, 0.5);

I'm calling HYPRE_BoomerAMGSolve roughly 4000 times, comparing the GPU-enabled version on a Titan V against the CPU version on a single core of a Xeon E5-2683 v4. I'm comparing HYPRE_BoomerAMGSolve timings only, not initialization or SetValues time. For three different problem sizes, the GPU code ran 2x slower than the CPU in serial. The vector lengths were 166762, 497175, and 1288252 elements. Is this consistent with your findings, or is something wrong with my configuration?

Aside: I tried a fourth problem size with a vector length of 3876408 elements, which should fit in the 12 GB of GPU memory several times over. But I get the error CUDA ERROR (code = 2, out of memory) at hypre_memory.c:175. With some print statements I found hypre attempted to allocate ~16 exabytes on the GPU. The problem runs just fine with CPU-only hypre, so I believe there's an error somewhere.

ulrikeyang commented 4 years ago

Hi Zach,

To get better performance you should use a Jacobi smoother; Gauss-Seidel will run on the CPU. You can use relax_type 18 or relax_type 7 with a weight. You should also set HYPRE_BoomerAMGSetKeepTranspose(solver, 1); this stores the transpose of the interpolation operator so that only regular matvecs are used and transpose matvecs are avoided. Let me know if this works better. Also, for better complexities I suggest using a different coarsening scheme; the current default is 10, or you can try 8.

Ulrike
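Collected into one place, the suggestions above amount to something like the following sketch. The relaxation weight for relax_type 7 is an assumption (the comment only says "with a weight"), and the choice between coarsening 8 and 10 is the reader's:

```c
/* Sketch of the suggested settings; the 0.7 weight is an assumption. */
HYPRE_BoomerAMGSetRelaxType(solver, 18);     /* l1-scaled Jacobi, runs on the GPU */
/* alternatively:
   HYPRE_BoomerAMGSetRelaxType(solver, 7);      weighted Jacobi
   HYPRE_BoomerAMGSetRelaxWt(solver, 0.7);      hypothetical weight          */
HYPRE_BoomerAMGSetKeepTranspose(solver, 1);  /* keep P^T, avoid transpose matvecs */
HYPRE_BoomerAMGSetCoarsenType(solver, 8);    /* PMIS; 10 (HMIS) is the default */
```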


liruipeng commented 4 years ago

Zach,

After HYPRE_Init(), add the following line to enable AMG setup on GPUs (the default is CPU):

   hypre_HandleDefaultExecPolicy(hypre_handle()) = HYPRE_EXEC_DEVICE;

I am also curious about the memory issue. Can you track down where this allocation comes from in hypre, using gdb or something? Thanks!

-Ruipeng

zjibben commented 4 years ago

Thank you for the suggestions, Ulrike! Those improved things drastically: the solve time is now 20-40x faster than the CPU in serial, with more favorable comparisons for larger runs. Neglecting data transfer outside the solve call, is this closer to what I should expect? I'm not familiar enough with the BoomerAMG algorithm to know how it should fare on a GPU architecture.

Ruipeng, in fact I already have that line, and another:

hypre_HandleMemoryLocation(hypre_handle()) = HYPRE_MEMORY_DEVICE;
hypre_HandleDefaultExecPolicy(hypre_handle()) = HYPRE_EXEC_DEVICE;

I'm calling _v2 versions of vector & matrix initialize functions, and I'm using HYPRE_Malloc and HYPRE_Memcpy before/after SetValues and GetValues. The out-of-memory error still occurs on my larger run. I'll see if I can make a small reproducer for you, or in the process perhaps I'll catch an error with my own use of HYPRE_Malloc.
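For readers following along, the device-resident assembly pattern described here looks roughly like this. This is a sketch of the _v2 workflow only; the buffer names d_rows and d_vals are hypothetical device arrays assumed to have been filled via HYPRE_Memcpy beforehand:

```c
/* _v2 initializers take an explicit memory location (sketch, not a full program) */
HYPRE_IJVectorInitialize_v2(b, HYPRE_MEMORY_DEVICE);
HYPRE_IJMatrixInitialize_v2(A, HYPRE_MEMORY_DEVICE);

/* SetValues then expects device pointers; d_rows/d_vals are hypothetical
   buffers staged on the GPU with HYPRE_Malloc/HYPRE_Memcpy */
HYPRE_IJVectorSetValues(b, nvals, d_rows, d_vals);
HYPRE_IJVectorAssemble(b);
```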

liruipeng commented 4 years ago


Thanks for the updates, Zach. I am curious which higher-level function (matrix assembly, AMG interpolation, coarsening, matrix-matrix product, etc.) calls the hypre_TAlloc that gives the OOM issue. 16 exabytes is enormous; something is wrong. If you can provide the call stack from before it goes OOM, that would be helpful. A reproducer is also good. Thanks! -Ruipeng

zjibben commented 4 years ago

My mistake, it looks like the CUDA memory error is on my end. I had a Fortran precision mismatch that caused my array-size calculation to wrap around to a negative number before being cast to a long integer. Sorry for the confusion!

zjibben commented 4 years ago

Back to the performance question: the solve step runs 20-40x faster on the GPU than on the CPU in serial. I've seen code run several hundred times faster on these Volta cards; is this a limitation of the algorithm? I'm also spending a significant amount of time in setup, which runs only 2x faster than the CPU in serial and now takes the majority of my runtime. Do you have any suggestions to improve performance?

ulrikeyang commented 4 years ago

Hi Zach,

First of all, the codes you have seen running at high speed are most likely codes with very high arithmetic intensity, e.g. certain dense matrix codes, and very high parallelism using very large problem sizes. With AMG you are dealing with sparse matrices as well as coarser levels with increasingly smaller problems, none of which is ideal for GPUs. It is still important to use the right settings, though. To get the complete setup running on GPUs you need, in addition to the settings we told you about before:

HYPRE_BoomerAMGSetRAP2(amg_solver, 1);
HYPRE_BoomerAMGSetModuleRAP2(amg_solver, 1);
hypre_HandleSpgemmUseCusparse(hypre_handle()) = 0; (this avoids the cuSPARSE matrix-matrix multiplication and uses our own)

I am assuming you are using the most current hypre version or what's currently in the repository; we just had a new release yesterday.
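Pulling the whole thread together, a full GPU configuration would look roughly like the sketch below. The only value not taken from this thread is the choice of relax type 18 over weighted Jacobi; treat the exact combination as a starting point, not a definitive recipe:

```c
/* Consolidated GPU settings from this thread (sketch). */
HYPRE_Init();
hypre_HandleMemoryLocation(hypre_handle())    = HYPRE_MEMORY_DEVICE;
hypre_HandleDefaultExecPolicy(hypre_handle()) = HYPRE_EXEC_DEVICE;
hypre_HandleSpgemmUseCusparse(hypre_handle()) = 0;  /* hypre's own SpGEMM */

HYPRE_BoomerAMGSetRelaxType(solver, 18);      /* l1-Jacobi smoother (GPU-capable) */
HYPRE_BoomerAMGSetKeepTranspose(solver, 1);   /* avoid transpose matvecs */
HYPRE_BoomerAMGSetCoarsenType(solver, 8);     /* PMIS coarsening */
HYPRE_BoomerAMGSetRAP2(solver, 1);            /* two-step Galerkin product */
HYPRE_BoomerAMGSetModuleRAP2(solver, 1);      /* keep the RAP product on device */
```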

rfalgout commented 4 years ago

Has this issue been resolved? Thanks! -Rob

ulrikeyang commented 4 years ago

yes