Dear `hypre` developers!

The deal.II community is currently making an effort to implement parallel hp-adaptive finite element methods. We were preparing a particular benchmark to verify our parallel implementation, based on the deal.II tutorial step-27, in which sequential hp-adaptive FEM was introduced.

However, one very specific corner configuration of our scenario leads to errors triggered by `hypre`, or more specifically, by `BoomerAMG`. We use PETSc as the backend for our parallel linear algebra, which provides `BoomerAMG` as our algebraic multigrid preconditioner.

We could only reproduce this issue with exactly 20 MPI processes assigned to a total of 49152 cells with 198144 degrees of freedom. The error occurs in the very first cycle of consecutive refinements, in which all cells were assigned a Lagrangian finite element with quadratic basis functions.

Using a different preconditioner, or none at all, does not trigger the error. Neither does a different number of MPI processes, nor a different number of global mesh refinements (which yields this particular number of cells). However, I haven't tried different finite elements yet.

With parallel debugging, I encountered the following issues.

Five processes reported the following (with petsc-3.9.1):

Four processes reported the following (with petsc-3.9.1):

Additionally, when using versions of PETSc prior to 3.8.4, error messages like the following appear:

```
bash-4.2$ mpirun -np 20 step-27new
Running with PETSc on 20 MPI rank(s)...
Cycle 0:
Number of active cells : 49152
Number of degrees of freedom: 198144
Number of constraints : 3272
Out of memory trying to allocate 1451474080 bytes
Out of memory trying to allocate 1451474080 bytes
Out of memory trying to allocate -1938871136 bytes
Out of memory trying to allocate -1938871136 bytes
Out of memory trying to allocate 1073649184 bytes
Out of memory trying to allocate 1073649184 bytes
Out of memory trying to allocate 595867104 bytes
Out of memory trying to allocate 595867104 bytes
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 36483 on node down exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
```
We have been discussing this topic in dealii/dealii#8565. We tried to locate the error using several `git bisect` sessions, and the problem seems to have been introduced in one of the commits f4e70c8...a08bfa5. The test case completes with f4e70c8 and fails with a08bfa5 for the first time. We were not able to narrow the selection of commits down any further, since we were not able to build `hypre` successfully for any of them.

I know that, since this is a very special corner case, it will most probably not be easy to find the source of the error, but maybe you have a first idea of where this error may come from, given the gdb output above. I hope that we can make a joint effort to tackle this problem together.

Please let me know if you need further information on our scenario, or if you would like to work with the aforementioned deal.II testcase and need assistance with it. You'll find the source code on this branch. Pardon me if you find it untidy - it is in a work-in-progress state. If desired, I will try to boil it down to a minimal working example.

Hi, @marcfehling,

Trying to understand something here: f4e70c8 is newer than a08bfa5 (01/2016 vs. 12/2015). Does the newer one f4e70c8 work while a08bfa5 does not?
Hi @liruipeng!

> Trying to understand something here: f4e70c8 is newer than a08bfa5 (01/2016 vs. 12/2015). Does the newer one f4e70c8 work while a08bfa5 does not?

f4e70c8 is indeed more recent than a08bfa5, but it had previously been merged into the master branch. In fact, I assume that all the commits in f4e70c8...a08bfa5 were issued on a separate branch (starting sometime in 2005) and were merged later via bc6dc33 on January 14, 2016, which is the same day a08bfa5 was issued.
Hi @marcfehling,

It seems that those commits were all pretty big and old. Have you tried more recent releases of hypre? Can you remind me again what is the last release (or commit) you know works for you? The information from gdb by itself does not make much sense to me, so it would be good if we could reproduce this problem. How easy is it for us to build deal.II with PETSc and hypre? Thanks!
> It seems that those commits were all pretty big and old.

These commits indeed changed a lot of lines of code, but I assume most of them were altered by a merge commit that joined the old commits (2005) to the then current branch (2015).

> Have you tried more recent releases of hypre?

Yes, I cloned your git repository to localize the issue. In fact, I first encountered this issue using PETSc 3.9.1, which builds hypre 2.14.0. To find the cause of the problem, I continued working with both the PETSc and hypre repositories, and I think I checked the most recent commits as well. But just to make sure, I'll give the most recent versions another try.

> Can you remind me again what is the last release (or commit) you know works for you?

The latest official release that worked for me is hypre v2.10.1. I summarized my findings here: https://github.com/dealii/dealii/issues/8565#issuecomment-528836858

> The information from gdb by itself does not make much sense to me, so it would be good if we could reproduce this problem. How easy is it for us to build deal.II with PETSc and hypre? Thanks!

My workflow was to first build hypre, then build PETSc with hypre, and then deal.II with PETSc. Since this is tedious, I wrote a shell script to do it for me. In fact, I used it together with the handy tool `git bisect` to find the faulty commit automatically.

deal.II needs to be configured with p4est. Please follow this documentation to install a compatible version.

If you'd like to reproduce the problem, clone hypre, PETSc, and this particular deal.II branch, and run this script to automatically build everything and run the failing testcase. You just have to update the paths to make it work for you. If you encounter any trouble, let me know :)
I can confirm that the problem occurs with the latest commits https://github.com/hypre-space/hypre/commit/e4b5c2e0de51e3810888500fa150e1cb0b064c32 and https://github.com/petsc/petsc/commit/81e64d771dc312caa27665cfdcc57d112da34629.
Hi, @marcfehling. I built dealii and ran the example. The output was different from what you showed above. See below:
```
li50@tux383:~/workspace/dealii/install/examples/step-27new$ mpirun -np 20 ./step-27new
Running with PETSc on 20 MPI rank(s)...
Cycle 0:
Number of active cells : 768
Number of degrees of freedom: 3264
Number of constraints : 584
Solved in 11 iterations.
+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start | 32s | |
| | | |
| Section | no. calls | wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assembly | 1 | 0.657s | 2.1% |
| flag h | 1 | 1.78s | 5.6% |
| flag p | 1 | 1.92s | 6% |
| refine | 1 | 1.05s | 3.3% |
| setup | 1 | 4.19s | 13% |
| solve | 1 | 18.6s | 58% |
+---------------------------------+-----------+------------+
```
Did I miss something?
Thanks
-Ruipeng
Hi @liruipeng!
> I built dealii and ran the example. The output was different from what you showed above. [...]
>
> Did I miss something?
Thanks for verifying this so quickly! You indeed managed to produce the correct output that we would expect if the error does not occur. I suspect that either our library or compiler configurations do not match, which may be a hint towards the solution of the problem.

- Which compilers and MPI libraries did you use? I used gcc 9.2.0 in combination with openmpi 4.0.1 on my workstation, as well as Intel 19.0.3.199 with IntelMPI 19.0.3.199 on our supercomputer. Both configurations result in segmentation faults on the aforementioned test.
- Did you build p4est 2.2 following these instructions, or the stock p4est library?

> ```
> li50@tux383:~/workspace/dealii/install/examples/step-27new$ mpirun -np 20 ./step-27new
> Running with PETSc on 20 MPI rank(s)...
> Cycle 0:
> Number of active cells : 768
> Number of degrees of freedom: 3264
> Number of constraints : 584
> Solved in 11 iterations.
> ```
Wait! I just noticed that your number of active cells does not correspond to the failing test case. Are you sure you've compiled the `pd-27-tutorial-bug` branch? If it is the correct branch, your output should look like this:
```
Cycle 0:
Number of active cells : 49152
Number of degrees of freedom: 198144
Number of constraints : 3272
```
Sorry, I was on the wrong branch: pd-27-tutorial instead of pd-27-tutorial-bug. I just changed branches and retried, and I had the same problem.
Hi, @marcfehling, I think I fixed this issue. The fixes are in the `amg-setup` branch of hypre. Can you please check it out and try it on your side? I just ran `mpirun -n 20 ./step-27new` and it finished its 6 cycles. See below for the output of Cycle 0 and Cycle 5:
```
Cycle 0:
Number of active cells : 49152
Number of degrees of freedom: 198144
Number of constraints : 3272
Num MPI tasks = 20
Num OpenMP threads = 1
BoomerAMG SETUP PARAMETERS:
Max levels = 25
Num levels = 9
Strength Threshold = 0.250000
Interpolation Truncation Factor = 0.000000
Maximum Row Sum Threshold for Dependency Weakening = 0.900000
Coarsening Type = Falgout-CLJP
measures are determined locally
No global partition option chosen.
Interpolation = modified classical interpolation
Operator Matrix Information:
nonzero entries/row row sums
lev rows entries sparse min max avg min max
======================================================================
0 198144 3093536 0.000 1 25 15.6 -2.444e-01 3.200e+00
1 48668 445462 0.000 4 19 9.2 -7.168e-02 2.746e+00
2 12385 120927 0.001 4 24 9.8 -1.483e-02 2.602e+00
3 3307 38103 0.003 4 22 11.5 -1.877e-02 2.868e+00
4 953 13725 0.015 4 24 14.4 -2.359e-03 3.078e+00
5 303 5093 0.055 6 30 16.8 -3.862e-05 3.188e+00
6 101 1481 0.145 5 22 14.7 8.770e-02 2.669e+00
7 35 367 0.300 8 13 10.5 9.777e-01 2.635e+00
8 10 52 0.520 4 6 5.2 1.776e+00 3.322e+00
Interpolation Matrix Information:
entries/row min max row sums
lev rows x cols min max avgW weight weight min max
==================================================================================
0 198144 x 48668 0 5 2.6 7.000e-02 8.066e-01 0.000e+00 1.054e+00
1 48668 x 12385 1 7 2.7 4.183e-02 1.000e+00 3.367e-01 1.027e+00
2 12385 x 3307 1 6 2.7 4.134e-02 1.001e+00 2.350e-01 1.005e+00
3 3307 x 953 1 6 2.8 4.701e-02 1.000e+00 1.412e-01 1.007e+00
4 953 x 303 0 7 2.8 2.601e-02 1.000e+00 0.000e+00 1.001e+00
5 303 x 101 0 6 2.6 1.297e-02 9.980e-01 0.000e+00 1.000e+00
6 101 x 35 0 5 2.2 2.437e-02 7.441e-01 0.000e+00 1.000e+00
7 35 x 10 1 3 1.6 2.383e-02 4.329e-01 9.363e-02 1.000e+00
Complexity: grid = 1.331890
operator = 1.202102
memory = 1.391263
BoomerAMG SOLVER PARAMETERS:
Maximum number of cycles: 1
Stopping Tolerance: 0.000000e+00
Cycle type (1 = V, 2 = W, etc.): 1
Relaxation Parameters:
Visiting Grid: down up coarse
Number of sweeps: 1 1 1
Type 0=Jac, 3=hGS, 6=hSGS, 9=GE: 6 6 6
Point types, partial sweeps (1=C, -1=F):
Pre-CG relaxation (down): 1 -1
Post-CG relaxation (up): -1 1
Coarsest grid: 0
Solved in 8 iterations.
+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start | 70.9s | |
| | | |
| Section | no. calls | wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assembly | 1 | 3.18s | 4.5% |
| flag h | 1 | 3.55s | 5% |
| flag p | 1 | 6.4s | 9% |
| refine | 1 | 5.14s | 7.2% |
| setup | 1 | 10.9s | 15% |
| solve | 1 | 31.3s | 44% |
+---------------------------------+-----------+------------+------------+
...
Cycle 5:
Number of active cells : 78069
Number of degrees of freedom: 896554
Number of constraints : 126247
Num MPI tasks = 20
Num OpenMP threads = 1
BoomerAMG SETUP PARAMETERS:
Max levels = 25
Num levels = 13
Strength Threshold = 0.250000
Interpolation Truncation Factor = 0.000000
Maximum Row Sum Threshold for Dependency Weakening = 0.900000
Coarsening Type = Falgout-CLJP
measures are determined locally
No global partition option chosen.
Interpolation = modified classical interpolation
Operator Matrix Information:
nonzero entries/row row sums
lev rows entries sparse min max avg min max
======================================================================
0 896554 24475078 0.000 1 230 27.3 -2.208e+01 2.992e+01
1 327241 13219677 0.000 6 192 40.4 -1.594e+01 2.565e+01
2 138979 5069749 0.000 6 141 36.5 -1.240e+00 4.455e+00
3 58696 2200456 0.001 6 113 37.5 -7.509e-02 4.847e+00
4 24341 934649 0.002 8 97 38.4 -5.175e-03 4.954e+00
5 9697 379863 0.004 8 102 39.2 -5.635e-04 5.043e+00
6 3627 141989 0.011 8 95 39.1 -3.188e-05 4.477e+00
7 1329 50191 0.028 8 75 37.8 -7.379e-10 3.543e+00
8 468 14898 0.068 11 63 31.8 5.507e-13 4.385e+00
9 158 3832 0.154 6 43 24.3 5.603e-05 3.833e+00
10 51 757 0.291 10 21 14.8 2.442e-01 5.620e+00
11 16 120 0.469 5 10 7.5 1.166e+00 5.753e+00
12 5 19 0.760 3 5 3.8 2.561e+00 5.190e+00
Interpolation Matrix Information:
entries/row min max row sums
lev rows x cols min max avgW weight weight min max
==================================================================================
0 896554 x 327241 0 14 2.4 3.377e-02 1.000e+00 0.000e+00 1.200e+00
1 327241 x 138979 1 8 2.8 3.247e-02 1.058e+00 3.058e-01 1.137e+00
2 138979 x 58696 1 10 3.0 2.672e-02 1.122e+00 1.852e-01 1.122e+00
3 58696 x 24341 0 9 3.0 2.138e-02 1.007e+00 0.000e+00 1.039e+00
4 24341 x 9697 0 10 3.0 1.439e-02 1.001e+00 0.000e+00 1.002e+00
5 9697 x 3627 0 11 2.9 1.115e-02 1.000e+00 0.000e+00 1.000e+00
6 3627 x 1329 0 7 2.8 3.287e-02 1.000e+00 0.000e+00 1.000e+00
7 1329 x 468 0 7 2.7 4.340e-02 1.000e+00 0.000e+00 1.000e+00
8 468 x 158 0 7 2.5 2.449e-02 1.000e+00 0.000e+00 1.000e+00
9 158 x 51 1 5 2.1 1.364e-02 1.000e+00 1.016e-01 1.000e+00
10 51 x 16 0 4 1.7 3.224e-02 9.185e-01 0.000e+00 1.000e+00
11 16 x 5 0 2 1.1 4.437e-02 6.080e-01 0.000e+00 1.000e+00
Complexity: grid = 1.629753
operator = 1.899535
memory = 2.017587
BoomerAMG SOLVER PARAMETERS:
Maximum number of cycles: 1
Stopping Tolerance: 0.000000e+00
Cycle type (1 = V, 2 = W, etc.): 1
Relaxation Parameters:
Visiting Grid: down up coarse
Number of sweeps: 1 1 1
Type 0=Jac, 3=hGS, 6=hSGS, 9=GE: 6 6 9
Point types, partial sweeps (1=C, -1=F):
Pre-CG relaxation (down): 1 -1
Post-CG relaxation (up): -1 1
Coarsest grid: 0
Solved in 62 iterations.
+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start | 278s | |
| | | |
| Section | no. calls | wall time | % of total |
+---------------------------------+-----------+------------+------------+
| assembly | 1 | 59s | 21% |
| flag h | 1 | 6.42s | 2.3% |
| flag p | 1 | 26.3s | 9.5% |
| refine | 1 | 6.46s | 2.3% |
| setup | 1 | 20.7s | 7.5% |
| solve | 1 | 157s | 57% |
+---------------------------------+-----------+------------+------------+
+---------------------------------------------+------------+------------+
| Total wallclock time elapsed since start | 0.000121s | |
| | | |
| Section | no. calls | wall time | % of total |
+---------------------------------+-----------+------------+------------+
+---------------------------------+-----------+------------+
```
> Hi, @marcfehling, I think I fixed this issue. The fixes are in the `amg-setup` branch of hypre. Can you please check it out and try it on your side? I just ran `mpirun -n 20 ./step-27new` and it finished its 6 cycles. See below for the output of Cycle 0 and Cycle 5:
Hi @liruipeng! Thank you for taking care of this so quickly! I can indeed confirm that building your `amg-setup` branch fixes our problem. Hooray! :)

Just out of curiosity: How exactly did you solve this problem? Is the following line responsible for it? https://github.com/hypre-space/hypre/blob/f69824574fae7bcbd7cd44c415cf1c658e823dd9/src/parcsr_ls/par_gauss_elim.c#L211

Maybe we are doing something wrong on our end and could learn from this fix, e.g. by setting up our preconditioners and solvers differently.
Hi, @marcfehling,
Yes, the problem was in the Gaussian elimination at the coarsest level. Before the AMG setup, the relaxation types were set via PETSc to (6,6,9), which told hypre to use type 9 (i.e., Gaussian elimination) for the coarsest solve. During the setup of this case, the coarsening loop exited prematurely without reaching the default minimum size of the coarsest grid, since hypre found there was no need to keep coarsening (for instance, after encountering a very diagonally dominant matrix). At this point, hypre internally set the coarsest grid solve to be just the relaxation (type 6) and did not set up the Gaussian elimination. This makes sense, because if the coarsest grid is still quite big, we don't want to do a direct solve. The problem was that we never expected users to set the relaxation type back to 9 again after the setup, which is exactly what happened via PETSc, where the BoomerAMG parameters were set once again before the AMG solve. So AMG chose to do Gaussian elimination and crashed, apparently because it had not been set up.

I wouldn't say this is a wrong use of hypre. Users should be able to set parameters again after the setup and before the solve. So we just added a check for whether the Gaussian elimination is ready before trying to solve, and that fixed the problem.
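In code terms, the safeguard has roughly the following shape. This is a minimal sketch with hypothetical names, not hypre's actual source; the real change lives in `src/parcsr_ls/par_gauss_elim.c`.

```cpp
// Sketch of the safeguard (hypothetical names): if the user switched the
// coarsest-level relaxation to Gaussian elimination *after* BoomerAMG's
// setup phase, the factorization does not exist yet, so build it lazily
// instead of touching uninitialized data.
struct GaussElimData
{
  bool setup_done = false; // was the dense factorization computed?
  // ... gathered coarse matrix, factorization, pivots, etc.
};

void coarsest_level_solve(GaussElimData &ge /*, A_c, b_c, x_c */)
{
  if (!ge.setup_done)
  {
    // gauss_elim_setup(ge, A_c);   // build the factorization on demand
    ge.setup_done = true;
  }
  // gauss_elim_solve(ge, b_c, x_c); // now safe to run the direct solve
}
```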
Hope this helps
-Ruipeng
> Yes, the problem was in the Gaussian elimination at the coarsest level. Before the AMG setup, the relaxation types were set via PETSc to (6,6,9), which told hypre to use type 9 (i.e., Gaussian elimination) for the coarsest solve. [...] The problem was that we never expected users to set the relaxation type back to 9 again after the setup, which is exactly what happened via PETSc, where the BoomerAMG parameters were set once again before the AMG solve.
I think we at deal.II decided to use Gaussian elimination on the coarsest level at some point; it is not a PETSc default (see these lines).

Just out of curiosity: What is your default solver on the coarsest level? Maybe we can get better results and performance by changing our coarsest solver to a different one. And we probably would have avoided this issue right from the beginning.
> > Yes, the problem was in the Gaussian elimination at the coarsest level. [...]
>
> I think we at deal.II decided to use Gaussian elimination on the coarsest level at some point; it is not a PETSc default (see these lines).
Those lines actually do set the default of the coarse grid solver to Gaussian elimination, see:

```
set_option_value("-pc_hypre_boomeramg_relax_type_coarse",
                 "Gaussian-elimination");
```
> Just out of curiosity: What is your default solver on the coarsest level? Maybe we can get better results and performance by changing our coarsest solver to a different one. And we probably would have avoided this issue right from the beginning.
The default is 9 (Gaussian-elimination). I think the problem probably came from here: https://github.com/marcfehling/dealii/blob/406a071c79c0f0a92441288580f6f76149753b44/source/lac/petsc_solver.cc#L128 where the parameters used in the AMG solve were reset...
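Until the fix lands in a release, one way to sidestep the crash would be to select a non-direct coarse-grid relaxation through PETSc's options database before the preconditioner is set up. The sketch below uses the option name from this thread; `symmetric-SOR/Jacobi` is just one of the values PETSc accepts for it, and the surrounding KSP setup is assumed to exist.

```cpp
#include <petscksp.h>

// Sketch: choose a relaxation sweep instead of Gaussian elimination for
// BoomerAMG's coarsest grid, via PETSc's options database. This avoids
// the Gaussian-elimination setup path entirely. The KSP object is assumed
// to be created and associated with the system matrix already.
void configure_amg(KSP ksp)
{
  PC pc;
  KSPGetPC(ksp, &pc);
  PCSetType(pc, PCHYPRE);
  PCHYPRESetType(pc, "boomeramg");

  // Replace the "Gaussian-elimination" coarse solve with a relaxation
  // sweep (value name as listed by PETSc's -help output).
  PetscOptionsSetValue(NULL,
                       "-pc_hypre_boomeramg_relax_type_coarse",
                       "symmetric-SOR/Jacobi");
  PCSetFromOptions(pc);
}
```

Whether that is worthwhile is a trade-off: a relaxation sweep on a large coarsest grid may increase iteration counts compared to a direct solve.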
Nice collaboration between fellow xSDK projects -- thanks to all involved for making this work!
> The default is 9 (Gaussian-elimination). I think the problem probably came from here: https://github.com/marcfehling/dealii/blob/406a071c79c0f0a92441288580f6f76149753b44/source/lac/petsc_solver.cc#L128 where the parameters used in the AMG solve were reset...
Thank you for this hint! Let us move further discussion on this topic to the deal.II issue https://github.com/dealii/dealii/issues/8565.