hypre-space / hypre

Parallel solvers for sparse linear systems featuring multigrid methods.
https://www.llnl.gov/casc/hypre/

Segfault using hypre_BoomerAMGCoarsenRuge #1120

Closed. jrood-nrel closed this issue 3 months ago.

jrood-nrel commented 3 months ago

Hello, I've been working on a segfault in Exawind when running on GPUs with CUDA. I'm using the latest hypre. I can't tell whether we're calling hypre incorrectly or whether something is wrong in hypre itself. I have this stack trace in case anyone has any insight into what could be going wrong. It sure seems to be running out of bounds on this array (num_variables=39732 here):

#0  hypre_BoomerAMGCoarsenRuge (S=0x6d03c9c0, A=0x51242c30, measure_type=0, coarsen_type=6, cut_factor=0, debug_flag=0,
    CF_marker_ptr=0x512c14d0) at par_coarsen.c:1056
#1  0x00007fff9528a58f in hypre_BoomerAMGCoarsenFalgout (S=0x6d03c9c0, A=0x51242c30, measure_type=0, cut_factor=0, debug_flag=0,
    CF_marker_ptr=0x512c14d0) at par_coarsen.c:2075
#2  0x00007fff953cd3bc in hypre_BoomerAMGSetup (amg_vdata=0x512c2260, A=0x51242c30, f=0x74c72f10, u=0x5128eab0)
    at par_amg_setup.c:1225
#3  0x00007fff9523eed2 in HYPRE_BoomerAMGSetup (solver=0x512c2260, A=0x51242c30, b=0x74c72f10, x=0x5128eab0)
    at HYPRE_parcsr_amg.c:53
#4  0x00007fff952276ea in hypre_GMRESSetup (gmres_vdata=0x512f9480, A=0x51242c30, b=0x74c72f10, x=0x5128eab0) at gmres.c:241
#5  0x00007fff9521d147 in HYPRE_GMRESSetup (solver=0x512f9480, A=0x51242c30, b=0x74c72f10, x=0x5128eab0) at HYPRE_gmres.c:37
#6  0x00007fff952427f9 in HYPRE_ParCSRGMRESSetup (solver=0x512f9480, A=0x51242c30, b=0x74c72f10, x=0x5128eab0)
    at HYPRE_parcsr_gmres.c:69
#7  0x0000000001652861 in sierra::nalu::HypreDirectSolver::setupSolver (this=0x51876640)
    at /mnt/vdb/home/user/exawind/exawind-manager/environment/exawind/nalu-wind/src/HypreDirectSolver.C:210
#8  0x00000000016526de in sierra::nalu::HypreDirectSolver::initSolver (this=0x51876640)
    at /mnt/vdb/home/user/exawind/exawind-manager/environment/exawind/nalu-wind/src/HypreDirectSolver.C:200
#9  0x0000000001651ff3 in sierra::nalu::HypreDirectSolver::solve (this=0x51876640, numIterations=@0x7fffffff780c: 0,
    finalResidualNorm=@0x7fffffff7800: 0, isFinalOuterIter=true)
    at /mnt/vdb/home/user/exawind/exawind-manager/environment/exawind/nalu-wind/src/HypreDirectSolver.C:104
#10 0x00000000016819ae in sierra::nalu::HypreLinearSystem::solve (this=0x22e8cd80, linearSolutionField=0x51361640)
    at /mnt/vdb/home/user/exawind/exawind-manager/environment/exawind/nalu-wind/src/HypreLinearSystem.C:2498
#11 0x0000000001369287 in sierra::nalu::EquationSystem::assemble_and_solve (this=0x5135d350, deltaSolution=0x51361640)
...
...

https://github.com/hypre-space/hypre/blob/3caa81955eb8d1b4e35d9b450e27cf6d07b50f6e/src/parcsr_ls/par_coarsen.c#L1056

victorapm commented 3 months ago

Hi Jon,

The function hypre_BoomerAMGCoarsenFalgout has not been ported to CUDA and therefore requires unified memory support (--enable-unified-memory) when running hypre with the execution policy set to device. Could you double-check whether hypre was configured with UVM support in your build (look for #define HYPRE_USING_UNIFIED_MEMORY in ${HYPRE_PATH}/src/HYPRE_Config.h)?
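
For illustration only (not an official hypre utility), a tiny probe compiled against the same install can confirm this; the header name HYPRE_config.h and its location follow the usual hypre layout and may differ in your build:

```c
/* Sketch: report whether this hypre build has UVM support.
 * HYPRE_config.h is the header generated by hypre's configure step; the
 * macro below is only defined when --enable-unified-memory was used. */
#include <stdio.h>
#include "HYPRE_config.h"

int main(void)
{
#ifdef HYPRE_USING_UNIFIED_MEMORY
   puts("hypre was configured WITH unified memory (UVM) support");
#else
   puts("hypre was configured WITHOUT unified memory support");
#endif
   return 0;
}
```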

For a GPU-enabled coarsening technique that doesn't require UVM, I recommend PMIS. I am also working on getting HMIS to run without UVM (for now, some parts of HMIS still execute on the CPU).
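
For reference, a minimal sketch of selecting PMIS through the BoomerAMG API (this is not Exawind's actual setup; the matrix/vector assembly and the GMRES wrapper seen in the stack trace above are omitted, and the relax/interp choices are just common GPU-friendly defaults):

```c
/* Sketch: GPU-friendly BoomerAMG options. Coarsen-type codes in hypre:
 * 6 = Falgout (the un-ported path discussed above), 8 = PMIS, 10 = HMIS. */
#include <mpi.h>
#include "HYPRE.h"
#include "HYPRE_parcsr_ls.h"

int main(int argc, char *argv[])
{
   HYPRE_Solver amg;

   MPI_Init(&argc, &argv);
   HYPRE_Initialize();                          /* HYPRE_Init() in older releases */

   /* Run hypre on the GPU (assumes a CUDA-enabled build). */
   HYPRE_SetMemoryLocation(HYPRE_MEMORY_DEVICE);
   HYPRE_SetExecutionPolicy(HYPRE_EXEC_DEVICE);

   HYPRE_BoomerAMGCreate(&amg);
   HYPRE_BoomerAMGSetCoarsenType(amg, 8);       /* 8 = PMIS: GPU-enabled, no UVM required */
   HYPRE_BoomerAMGSetRelaxType(amg, 18);        /* l1-Jacobi smoother (assumed choice) */
   HYPRE_BoomerAMGSetInterpType(amg, 6);        /* extended+i interpolation (assumed choice) */

   /* ... assemble the ParCSR system and call HYPRE_BoomerAMGSetup/Solve,
      or pass amg to HYPRE_ParCSRGMRESSetPrecond as nalu-wind does ... */

   HYPRE_BoomerAMGDestroy(amg);
   HYPRE_Finalize();
   MPI_Finalize();
   return 0;
}
```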

Hope this helps!

jrood-nrel commented 3 months ago

Well, that makes sense then. I have tried --enable-unified-memory and it fails in a different place, so I will try another coarsening algorithm. Thanks!

jrood-nrel commented 3 months ago

This was helpful. I took some settings from another case of ours that seems to be working: https://github.com/Exawind/exawind-cases/blob/main/single-turbine/nrel5mw_nalu.yaml

victorapm commented 3 months ago

Glad it worked! Please reach out if you run into any problems in the future.