hypre-space / hypre

Parallel solvers for sparse linear systems featuring multigrid methods.
https://www.llnl.gov/casc/hypre/

example code for AMG-DD solver? #1115

Open BenWibking opened 4 weeks ago

BenWibking commented 4 weeks ago

Are there any examples of AMG-DD usage?

I don't see it used anywhere in https://github.com/hypre-space/hypre/blob/master/src/test/test_ij.c.

I would like to replicate the tests in https://arxiv.org/abs/1906.10575, but there does not appear to be enough information in the paper to choose values for all of the parameters that are exposed by the API.

waynemitchell commented 4 weeks ago

@BenWibking, you can use the ij driver to test AMG-DD by passing -solver 90 or -solver 91: https://github.com/hypre-space/hypre/blob/master/src/test/ij.c#L2392

Which parameters are you unsure about? The paper should discuss the most important parameters (I hope).

BenWibking commented 4 weeks ago

Thanks for the quick response!

I see I was looking at the wrong source file.

I am unsure about the parameter setting the number of ghost layers:

         HYPRE_BoomerAMGDDSetNumGhostLayers(amgdd_solver, amgdd_num_ghost_layers);

Perhaps I missed something in the paper. How many ghost layers were used for the tests shown?

waynemitchell commented 4 weeks ago

If I'm remembering this correctly, everything should work with a single ghost layer. The main algorithmic development in the paper was driven by trying to minimize the number of ghost layers required. A single ghost layer is the default set in the ij driver, so you shouldn't need to set anything: https://github.com/hypre-space/hypre/blob/master/src/test/ij.c#L499

I'm not sure why the number of ghost layers is still exposed as a parameter that the user can set... maybe there are cases that I'm just not thinking of right now where you need to set it higher. But I think it's likely a relic from a previous version of the algorithm.

BenWibking commented 3 weeks ago

Ok, I've looked at that code and tried to modify the AMG2023 code to use AMGDD based on ij.c.

My code is here: https://github.com/BenWibking/AMG2023/commit/c847960f1bb70df13d004ca42868455b8525313c#diff-ee753cc8c3a9fe01da6eeade8f8b9aee1d4c7485f3f52f2ae2add0a12222111d

However, I get an MPI_ABORT with no other error message that would enable me to debug it, even when running in a debugger:

Running with these driver parameters:
  Problem ID    = 1

=============================================
Hypre init times:
=============================================
Hypre init:
  wall clock time = 0.000000 seconds
  Laplacian_27pt:
    (Nx, Ny, Nz) = (10, 10, 10)
    (Px, Py, Pz) = (1, 1, 1)

=============================================
Generate Matrix:
=============================================
Spatial Operator:
  wall clock time = 0.000295 seconds
  RHS vector has unit components
  Initial guess is 0
=============================================
IJ Vector Setup:
=============================================
RHS and Initial Guess:
  wall clock time = 0.000012 seconds
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
  Proc: [[6075,0],0]
  Errorcode: -1

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

Is there some way to see why Hypre called MPI_ABORT?

BenWibking commented 3 weeks ago

I recompiled without MPI and was able to see this. It looks like a bug:

(lldb) r
Process 30117 launched: '/Users/benwibking/AMG2023/amg' (arm64)
Running with these driver parameters:
  Problem ID    = 1

=============================================
Hypre init times:
=============================================
Hypre init:
  wall clock time = 0.000000 seconds
  Laplacian_27pt:
    (Nx, Ny, Nz) = (10, 10, 10)
    (Px, Py, Pz) = (1, 1, 1)

=============================================
Generate Matrix:
=============================================
Spatial Operator:
  wall clock time = 0.000000 seconds
  RHS vector has unit components
  Initial guess is 0
=============================================
IJ Vector Setup:
=============================================
RHS and Initial Guess:
  wall clock time = 0.000000 seconds
[memory.c, 66] hypre_assert failed: 0
Assertion failed: (0), function hypre_OutOfMemory, file memory.c, line 66.
Process 30117 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = hit program assert
    frame #4: 0x00000001001f3ae4 amg`hypre_OutOfMemory(size=18446744064213649472) at memory.c:66:4
   63
   64      hypre_sprintf(msg, "Out of memory trying to allocate %zu bytes\n", size);
   65      hypre_error_w_msg(HYPRE_ERROR_MEMORY, msg);
-> 66      hypre_assert(0);
   67      fflush(stdout);
   68   }
   69
Target 0: (amg) stopped.

BenWibking commented 3 weeks ago

The full backtrace is:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = hit program assert
    frame #0: 0x00000001856d15f0 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x0000000185709c20 libsystem_pthread.dylib`pthread_kill + 288
    frame #2: 0x0000000185616a30 libsystem_c.dylib`abort + 180
    frame #3: 0x0000000185615d20 libsystem_c.dylib`__assert_rtn + 284
  * frame #4: 0x00000001001f3ae4 amg`hypre_OutOfMemory(size=18446744064213649472) at memory.c:66:4
    frame #5: 0x00000001001f3000 amg`hypre_MAlloc_core(size=18446744064213649472, zeroinit=1, location=hypre_MEMORY_HOST) at memory.c:437:7
    frame #6: 0x00000001001f34f8 amg`hypre_CAlloc(count=18446744072522563848, elt_size=8, location=HYPRE_MEMORY_HOST) at memory.c:948:11
    frame #7: 0x00000001000ab0cc amg`hypre_BoomerAMGDDSetup(amgdd_vdata=0x000000011f815200, A=0x0000600001b20000, b=0x0000600003528080, x=0x0000600003528100) at par_amgdd_setup.c:96:15
    frame #8: 0x000000010007b004 amg`HYPRE_BoomerAMGDDSetup(solver=0x000000011f815200, A=0x0000600001b20000, b=0x0000600003528080, x=0x0000600003528100) at HYPRE_parcsr_amgdd.c:47:13
    frame #9: 0x000000010007121c amg`hypre_GMRESSetup(gmres_vdata=0x0000600001128000, A=0x0000600001b20000, b=0x0000600003528080, x=0x0000600003528100) at gmres.c:241:4
    frame #10: 0x000000010006ff6c amg`HYPRE_GMRESSetup(solver=0x0000600001128000, A=0x0000600001b20000, b=0x0000600003528080, x=0x0000600003528100) at HYPRE_gmres.c:37:13
    frame #11: 0x0000000100005344 amg`main(argc=1, argv=0x000000016fdfeaa8) at amg.c:756:7
    frame #12: 0x000000018537f154 dyld`start + 2476

waynemitchell commented 3 weeks ago

Hm... after just a few quick ij driver runs, I'm not able to reproduce this on my side. From your backtrace, it looks like the size passed to the memory allocation is bad (looks like some uninitialized garbage or something). The code should just be allocating a small amount of memory here, basically just a data structure for each level of the AMG hierarchy (the size is num_levels here): https://github.com/hypre-space/hypre/blob/master/src/parcsr_ls/par_amgdd_setup.c#L96

Maybe the regular AMG setup isn't happening as it should? The AMG-DD setup should perform an underlying AMG setup automatically here: https://github.com/hypre-space/hypre/blob/master/src/parcsr_ls/par_amgdd_setup.c#L76

But maybe the check in this if statement is not robust for some reason? Can you check whether num_levels at par_amgdd_setup.c:96 has a reasonable value, and if not, check whether the AMG setup call at line 76 is happening?
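The numbers in the backtrace are consistent with the "garbage count" reading. As a minimal sketch (plain C, no hypre involved; the helper names as_signed64, wrapped_bytes and check_backtrace_values are made up for illustration), the count reported in frame #6 is what a negative 32-bit value looks like after conversion to a 64-bit unsigned size, and multiplying it by the element size of 8 wraps modulo 2^64 to the byte count reported in frames #4 and #5:

```c
#include <assert.h>
#include <stdint.h>

/* Interpret a 64-bit unsigned value as two's-complement signed,
 * without relying on implementation-defined conversions. */
int64_t as_signed64(uint64_t u)
{
    if (u <= (uint64_t) INT64_MAX)
    {
        return (int64_t) u;
    }
    uint64_t magnitude = 0 - u;            /* wraps modulo 2^64 */
    if (magnitude > (uint64_t) INT64_MAX)  /* only for u == 2^63 */
    {
        return INT64_MIN;
    }
    return -(int64_t) magnitude;
}

/* Byte count an allocator sees for count * elt_size when the
 * product is computed in 64-bit unsigned arithmetic (wraps silently). */
uint64_t wrapped_bytes(uint64_t count, uint64_t elt_size)
{
    return count * elt_size;
}

/* Cross-check against the values printed in the backtrace. */
void check_backtrace_values(void)
{
    const uint64_t count = UINT64_C(18446744072522563848); /* frame #6: count */

    /* The count is really a negative number of levels. */
    assert(as_signed64(count) == INT64_C(-1186987768));

    /* count * 8 wraps to the size seen in frames #4 and #5. */
    assert(wrapped_bytes(count, 8) == UINT64_C(18446744064213649472));
}
```

Calling check_backtrace_values() passes silently, which suggests the allocation request itself is not random: it follows directly from a negative level count being used as an array length.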

BenWibking commented 3 weeks ago

num_levels is bad:

Process 46550 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x00000001000ab0bc amg`hypre_BoomerAMGDDSetup(amgdd_vdata=0x0000000131008200, A=0x00006000024a0000, b=0x0000600000aa4080, x=0x0000600000aa4100) at par_amgdd_setup.c:96:15
   93      }
   94
   95      // Allocate pointer for the composite grids
-> 96      compGrid = hypre_CTAlloc(hypre_AMGDDCompGrid *, num_levels, HYPRE_MEMORY_HOST);
   97      hypre_ParAMGDDDataCompGrid(amgdd_data) = compGrid;
   98
   99      // In the 1 processor case, just need to initialize the comp grids
Target 0: (amg) stopped.
(lldb) p num_levels
(HYPRE_Int) -1186987768

BenWibking commented 3 weeks ago

If I set a breakpoint on line 76, it doesn't trigger, so I assume that means it's not getting executed:

(lldb) breakpoint set --file par_amgdd_setup.c --line 76
Breakpoint 1: where = amg`hypre_BoomerAMGDDSetup + 184 at par_amgdd_setup.c:76:36, address = 0x00000001000ab018
(lldb) r
Process 46833 launched: '/Users/benwibking/AMG2023/amg' (arm64)
Running with these driver parameters:
  Problem ID    = 1

=============================================
Hypre init times:
=============================================
Hypre init:
  wall clock time = 0.000000 seconds
  Laplacian_27pt:
    (Nx, Ny, Nz) = (10, 10, 10)
    (Px, Py, Pz) = (1, 1, 1)

=============================================
Generate Matrix:
=============================================
Spatial Operator:
  wall clock time = 0.000000 seconds
  RHS vector has unit components
  Initial guess is 0
=============================================
IJ Vector Setup:
=============================================
RHS and Initial Guess:
  wall clock time = 0.000000 seconds
[memory.c, 66] hypre_assert failed: 0
Assertion failed: (0), function hypre_OutOfMemory, file memory.c, line 66.
Process 46833 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = hit program assert
    frame #4: 0x00000001001f3ae4 amg`hypre_OutOfMemory(size=18446744064213649472) at memory.c:66:4
   63
   64      hypre_sprintf(msg, "Out of memory trying to allocate %zu bytes\n", size);
   65      hypre_error_w_msg(HYPRE_ERROR_MEMORY, msg);
-> 66      hypre_assert(0);
   67      fflush(stdout);
   68   }
   69
Target 0: (amg) stopped.

BenWibking commented 3 weeks ago

Ok, it's not setting up BoomerAMG:

(lldb) breakpoint set --file par_amgdd_setup.c --line 74
Breakpoint 1: where = amg`hypre_BoomerAMGDDSetup + 160 at par_amgdd_setup.c:74:9, address = 0x00000001000ab000
(lldb) r
Process 47411 launched: '/Users/benwibking/AMG2023/amg' (arm64)
Running with these driver parameters:
  Problem ID    = 1

=============================================
Hypre init times:
=============================================
Hypre init:
  wall clock time = 0.000000 seconds
  Laplacian_27pt:
    (Nx, Ny, Nz) = (10, 10, 10)
    (Px, Py, Pz) = (1, 1, 1)

=============================================
Generate Matrix:
=============================================
Spatial Operator:
  wall clock time = 0.000000 seconds
  RHS vector has unit components
  Initial guess is 0
=============================================
IJ Vector Setup:
=============================================
RHS and Initial Guess:
  wall clock time = 0.000000 seconds
Process 47411 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
    frame #0: 0x00000001000ab000 amg`hypre_BoomerAMGDDSetup(amgdd_vdata=0x0000000122015e00, A=0x0000600001744000, b=0x00006000039440c0, x=0x0000600003944180) at par_amgdd_setup.c:74:9
   71      }
   72
   73      // If the underlying AMG data structure has not yet been set up, call BoomerAMGSetup()
-> 74      if (!hypre_ParAMGDataAArray(amg_data))
   75      {
   76         hypre_BoomerAMGSetup((void*) amg_data, A, b, x);
   77      }
Target 0: (amg) stopped.
(lldb) step
Process 47411 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = step in
    frame #0: 0x00000001000ab030 amg`hypre_BoomerAMGDDSetup(amgdd_vdata=0x0000000122015e00, A=0x0000600001744000, b=0x00006000039440c0, x=0x0000600003944180) at par_amgdd_setup.c:80:11
   77      }
   78
   79      // Get number of processes
-> 80      comm = hypre_ParCSRMatrixComm(A);
   81      hypre_MPI_Comm_size(comm, &num_procs);
   82
   83      // get info from amg about how to setup amgdd
Target 0: (amg) stopped.

waynemitchell commented 3 weeks ago

Are you calling HYPRE_BoomerAMGDDCreate() before the setup?

BenWibking commented 3 weeks ago

Yes: https://github.com/BenWibking/AMG2023/commit/c847960f1bb70df13d004ca42868455b8525313c#diff-ee753cc8c3a9fe01da6eeade8f8b9aee1d4c7485f3f52f2ae2add0a12222111dR560

waynemitchell commented 3 weeks ago

OK, I think I see your issue. The AMG-DD solver object is called amgdd_solver: https://github.com/BenWibking/AMG2023/commit/c847960f1bb70df13d004ca42868455b8525313c#diff-ee753cc8c3a9fe01da6eeade8f8b9aee1d4c7485f3f52f2ae2add0a12222111dR560

but you pass pcg_precond as the preconditioner to CG: https://github.com/BenWibking/AMG2023/commit/c847960f1bb70df13d004ca42868455b8525313c#diff-ee753cc8c3a9fe01da6eeade8f8b9aee1d4c7485f3f52f2ae2add0a12222111dR592

So you aren't passing the correct preconditioner object.
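This would also explain the earlier debugger observations: a setup routine that is handed the wrong object reads whatever bytes happen to be there, so a "not yet set up" guard can see a non-NULL pointer and skip the AMG setup, and num_levels comes out as garbage. The following is only a toy model of that failure mode, assuming a simplified layout. MockAMGData, MOCK_MAX_LEVELS and all mock_* functions are hypothetical and are not hypre's real structures or API:

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_MAX_LEVELS 25   /* hypothetical sanity bound for this sketch */

/* Hypothetical stand-in for the AMG data that AMG-DD wraps. */
typedef struct
{
    const double *A_array;    /* non-NULL once the hierarchy exists */
    int           num_levels;
} MockAMGData;

/* Stand-in for the underlying AMG setup: builds a 3-level "hierarchy". */
static void mock_amg_setup(MockAMGData *amg)
{
    static const double dummy_level = 0.0;
    amg->A_array    = &dummy_level;
    amg->num_levels = 3;
}

/* Same shape as the guard at par_amgdd_setup.c:74: run the AMG setup
 * only if A_array is NULL. Unlike the real code, this sketch then
 * rejects an implausible level count instead of allocating with it. */
int mock_amgdd_setup(MockAMGData *amg)
{
    if (!amg->A_array)
    {
        mock_amg_setup(amg);
    }
    if (amg->num_levels < 1 || amg->num_levels > MOCK_MAX_LEVELS)
    {
        return -1;   /* would have become a huge allocation request */
    }
    return amg->num_levels;
}

/* A freshly created object: zero-initialised, so the guard fires. */
MockAMGData mock_created(void)
{
    MockAMGData d = { NULL, 0 };
    return d;
}

/* Models "some other object": the pointer slot is non-NULL and the
 * level slot holds the value lldb printed for num_levels. */
MockAMGData mock_wrong_object(void)
{
    static const double stale = 0.0;
    MockAMGData d = { &stale, -1186987768 };
    return d;
}

void mock_demo(void)
{
    MockAMGData created = mock_created();
    MockAMGData wrong   = mock_wrong_object();

    assert(mock_amgdd_setup(&created) == 3);  /* guard fires, setup runs */
    assert(mock_amgdd_setup(&wrong) == -1);   /* guard skipped, count rejected */
}
```

The range check is a design choice of the sketch, not something the thread says hypre does. It only illustrates why the guard alone cannot detect a mismatched handle.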

BenWibking commented 3 weeks ago

Ah, ok. I was confused about that. Everything works now.

BenWibking commented 3 weeks ago

Does the AMGDD solver work on HIP?

I tried to run it on a single node on Frontier, but I get a segmentation fault (whereas the unmodified AMG2023 runs fine with this build):

Running with these driver parameters:
  Problem ID    = 1

=============================================
Hypre init times:
=============================================
Hypre init:
  wall clock time = 0.000005 seconds
  Laplacian_27pt:
    (Nx, Ny, Nz) = (256, 256, 256)
    (Px, Py, Pz) = (2, 2, 2)

=============================================
Generate Matrix:
=============================================
Spatial Operator:
  wall clock time = 0.977577 seconds
  RHS vector has unit components
  Initial guess is 0
=============================================
IJ Vector Setup:
=============================================
RHS and Initial Guess:
  wall clock time = 0.006108 seconds
srun: error: frontier10190: tasks 1,3,5,7: Segmentation fault
srun: Terminating StepId=2239275.0
slurmstepd: error: *** STEP 2239275.0 ON frontier10190 CANCELLED AT 2024-08-17T15:47:01 ***
srun: error: frontier10190: tasks 2,4,6: Terminated
srun: error: frontier10190: task 0: Segmentation fault (core dumped)
srun: Force Terminated StepId=2239275.0

I built Hypre with:

    ./configure --with-hip --with-gpu-arch=gfx90a --with-MPI-lib-dirs="${MPICH_DIR}/lib" --with-MPI-libs="mpi" --with-MPI-include="${MPICH_DIR}/include" --enable-mixedint

BenWibking commented 3 weeks ago

You can see the full set of changes and the job scripts I used here: https://github.com/LLNL/AMG2023/compare/main...BenWibking:AMG2023:amgdd

BenWibking commented 3 weeks ago

I recompiled Hypre with --enable-unified-memory added:

    ./configure --with-hip --with-gpu-arch=gfx90a --with-MPI-lib-dirs="${MPICH_DIR}/lib" --with-MPI-libs="mpi" --with-MPI-include="${MPICH_DIR}/include" --enable-mixedint --enable-unified-memory

and now it hangs:

Running with these driver parameters:
  Problem ID    = 1

=============================================
Hypre init times:
=============================================
Hypre init:
  wall clock time = 0.000004 seconds
  Laplacian_27pt:
    (Nx, Ny, Nz) = (256, 256, 256)
    (Px, Py, Pz) = (2, 2, 2)

srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 2239289.0 ON frontier10151 CANCELLED AT 2024-08-17T16:06:09 ***
slurmstepd: error: *** JOB 2239289 ON frontier10151 CANCELLED AT 2024-08-17T16:06:09 ***

waynemitchell commented 3 weeks ago

It should work fine with HIP (I just tried AMG-DD via the ij driver on an AMD machine with no issues). Your build looks OK to me, and nothing is jumping out at me in your changes that would screw up a GPU run, so I'm not sure where the issue is. Can you try running with valgrind? That might be the easiest way to at least diagnose where the segfault is occurring.