jewatkins opened this issue 3 years ago (status: Open)
A smaller problem for comparison, after one iteration:
humboldt, 3-20km, ML preconditioner, 20 ranks:
Total linear iterations in first nonlinear iteration: 25
Solution average after nonlinear convergence: 0.1072498668
humboldt, 3-20km, ML preconditioner, 2 ranks:
Total linear iterations in first nonlinear iteration: 23
Solution average after nonlinear convergence: 0.107249664344
The Humboldt results look reasonable to me.
I checked the difference between the Jacobian in the 2-rank case vs. the 20-rank case:
max value: 3.7253e-09
norm: 1.7459e-08
The results are similar if I use a homogeneous DBC on the basalside with no BC on the lateral boundary: 3.7253e-09 and 1.6862e-08.
Is this an absolute difference or a relative one?
absolute
OK. It might be pointing to a wrongly assembled Jacobian.
Meanwhile, I double-checked a basic problem (SteadyHeat2D), running it with 1 and 32 ranks: the NOX residuals start to differ (1e-3 relative diff) only at the 3rd nonlinear iteration. HOWEVER, the linear solver history is quite different already with the 1st Jacobian: 16 (serial) vs 43 (32 ranks) iterations.
Edit: this is a Tpetra run, with Belos solver and Ifpack2 prec (ILUT with fill-in 1).
However, J_serial-J_parallel ~ 1e-16.
So this is interesting: the Jacobian is basically the same, and yet we get very different convergence histories...
I'm not an expert on Ifpack, but: when using several ranks, some subdomains contain no Dirichlet rows. Could this degrade solver performance (due to singular or nearly singular local matrices)?
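As a toy numpy sketch of the general effect being hypothesized here: splitting the preconditioner into more independent subdomain solves typically costs extra Krylov iterations, qualitatively matching the 16 (serial) vs 43 (32-rank) observation. This uses plain (non-overlapping) block Jacobi with a hand-written PCG, not Ifpack2's overlapping ILUT; the matrix, size, and block counts are made up for illustration.

```python
import numpy as np

def laplacian_1d(n):
    # 1D Poisson matrix (Dirichlet rows already eliminated), SPD
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def block_jacobi_prec(A, nblocks):
    # One diagonal block per "rank"; precompute each block inverse
    n = A.shape[0]
    bounds = np.linspace(0, n, nblocks + 1).astype(int)
    blocks = [(s, e, np.linalg.inv(A[s:e, s:e]))
              for s, e in zip(bounds[:-1], bounds[1:])]
    def apply(r):
        z = np.empty_like(r)
        for s, e, Minv in blocks:
            z[s:e] = Minv @ r[s:e]
        return z
    return apply

def pcg(A, b, prec, rtol=1e-8, maxit=500):
    # Textbook preconditioned conjugate gradient; returns iteration count
    x = np.zeros_like(b)
    r = b - A @ x
    z = prec(r)
    p = z.copy()
    rz = r @ z
    normb = np.linalg.norm(b)
    for k in range(1, maxit + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < rtol * normb:
            return k
        z = prec(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return maxit

A = laplacian_1d(64)
b = np.ones(64)
# More "ranks" (blocks) -> weaker preconditioner -> more iterations
iters = {nb: pcg(A, b, block_jacobi_prec(A, nb)) for nb in (1, 4, 16)}
print(iters)
```

With a single block the preconditioner is an exact solve and CG converges in one iteration; cutting the same matrix into more blocks only degrades the iteration count, even though the assembled matrix is bit-for-bit identical in every case.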
Using MueLu (instead of Ifpack2) gives the same linear iteration counts in serial and on 32 ranks. For instance:
serial:
*******************************************************
***** Belos Iterative Solver: Block Gmres
***** Maximum Iterations: 100
***** Block Size: 1
***** Residual Test:
***** Test 1 : Belos::StatusTestImpResNorm<>: (2-Norm Res Vec) / (2-Norm Res0), tol = 0.0001
*******************************************************
Iter 0, [ 1] : 1.000000e+00
Iter 9, [ 1] : 3.158400e-05
vs 32 ranks:
*******************************************************
***** Belos Iterative Solver: Block Gmres
***** Maximum Iterations: 100
***** Block Size: 1
***** Residual Test:
***** Test 1 : Belos::StatusTestImpResNorm<>: (2-Norm Res Vec) / (2-Norm Res0), tol = 0.0001
*******************************************************
Iter 0, [ 1] : 1.000000e+00
Iter 9, [ 1] : 2.945532e-05
This does not explain the ALI test failures, where the preconditioner was ML or MueLu, but it at least confirms that the core assembly in Albany is not buggy. It also suggests that the choice of preconditioner type might be affecting convergence in some cases.
Edit: for completeness, these are the Ifpack2 and MueLu params:
Ifpack2:
  Overlap: 1
  Prec Type: ILUT
  Ifpack2 Settings:
    'fact: drop tolerance': 0.00000000000000000e+00
    'fact: ilut level-of-fill': 1.00000000000000000e+00
    'fact: level-of-fill': 1
MueLu:
  verbosity: none
  max levels: 5
  'coarse: max size': 512
  multigrid algorithm: sa
  'aggregation: type': uncoupled
  'smoother: type': RELAXATION
  'smoother: params':
    'relaxation: type': Jacobi
    'relaxation: sweeps': 1
    'relaxation: damping factor': 2.5e-01
Thanks for those results. FYI, if I use relative difference it's 1.0446e-13 and 2.8920e-13.
1e-13 is not quite machine precision, but it might be due to accumulated roundoff, so it may still be a roundoff-level difference. This is good, since it suggests that we're not messing up the assembly (phew).
My best guess is that we need to be careful with our preconditioner specifications. I just tried input_fo_gis_unstructT.yaml, and I also get the same linear convergence history for serial and 32 ranks. That test uses Ifpack2, but the dof ordering is column-wise, so each rank does contain a piece of the BC and the local matrices are not singular (note: I am talking a bit out of my @$$ here, since I don't really know how the Ifpack preconditioner works, and the singularity of the local matrices might not be relevant).
@jewatkins could you remind me which input file and input mesh you used?
I'm using a modified version of https://github.com/ikalash/ali-perf-tests/blob/master/perf_tests/humboldt-3-20km/input_albany_Velocity_MueLu_Wedge.yaml where I switch to "Use Serial Mesh: true" and Epetra/AztecOO/ML. Mesh files are here: https://github.com/ikalash/ali-perf-tests/tree/master/meshes/humboldt-3-20km
I could add the exact input file I'm using if you want to try running it.
Thanks! No need to add the exact input file.
So, I tried that input file, with the meshes in /home/projects/albany/ali-perf-tests-meshes. Use Serial Mesh causes little change in the convergence history (the serial mesh gives 1 fewer linear solver iteration at each solve, but the residuals look fairly similar). Furthermore, serial vs 32 ranks is fairly similar: I get about 2-3 more iterations with 32 ranks for each linear solve, compared to the serial case.
Is this what you are observing for that mesh/input file? Note: I'm not using SFad, I'm using DFad (the default in Albany).
OK, so we don't get significant changes on the Humboldt mesh. @jewatkins what solver are you using on the Greenland mesh? Belos or AztecOO? Does the choice of the solver make a difference? Have you tried the Thwaites mesh? Maybe there is something wrong with the Greenland meshes.
> So, I tried that input file, with the meshes in /home/projects/albany/ali-perf-tests-meshes. Use Serial Mesh causes little change in the convergence history (the serial mesh gives 1 fewer linear solver iteration at each solve, but the residuals look fairly similar). Furthermore, serial vs 32 ranks is fairly similar: I get about 2-3 more iterations with 32 ranks for each linear solve, compared to the serial case. Is this what you are observing for that mesh/input file? Note: I'm not using SFad, I'm using DFad (the default in Albany).
Yes, that sounds about right. I used 2 and 20 ranks and these are the results I got:
humboldt, 3-20km, ML preconditioner, 20 ranks:
Total linear iterations in first nonlinear iteration: 25
Solution average after nonlinear convergence: 0.1072498668
humboldt, 3-20km, ML preconditioner, 2 ranks:
Total linear iterations in first nonlinear iteration: 23
Solution average after nonlinear convergence: 0.107249664344
> OK, so we don't get significant changes on the Humboldt mesh. @jewatkins what solver are you using on the Greenland mesh? Belos or AztecOO? Does the choice of the solver make a difference?
On the Greenland mesh, I've tried Belos/AztecOO, muelu/ml and it didn't make a difference (I still had an inconsistency).
> Have you tried the Thwaites mesh? Maybe there is something wrong with the Greenland meshes.
I tried 8 ranks and 80 ranks with Thwaites and did not see a large discrepancy (linear iterations: 95/96, solution average: -162.729919149/-162.729919148). I've only seen a large difference when running the Greenland 1-7km and 1-10km. Maybe the large variation in the domain of those meshes causes the error to be more noticeable? There's still a small difference in the Humboldt case so the error could still be there, just not as noticeable.
Uhm, Thwaites is a reasonably sized ice sheet with non-trivial dynamics. So I suspect there's something wrong with those meshes that makes the problem very ill-conditioned. Can you put this on hold for a week or so? I'm now working on a new Greenland initialization, and we can target that when it's ready.
Another interesting bit of information. For the Humboldt case, if I use similar boundary conditions and ML settings to the nightly test (it must be both), I get exactly the same number of iterations and solution average. I.e., if I add this:
Dirichlet BCs:
  DBC on NS extruded_boundary_node_set_3 for DOF U0 prescribe Field: velocity
  DBC on NS extruded_boundary_node_set_3 for DOF U1 prescribe Field: velocity
ML:
  Base Method Defaults: none
  ML Settings:
    default values: SA
    ML output: 0
    'repartition: enable': 1
    'repartition: max min ratio': 1.32699999999999995e+00
    'repartition: min per proc': 600
    'repartition: Zoltan dimensions': 2
    'repartition: start level': 4
    'semicoarsen: number of levels': 2
    'semicoarsen: coarsen rate': 12
    'smoother: sweeps': 4
    'smoother: type': Chebyshev
    'smoother: Chebyshev eig boost': 1.19999999999999995e+00
    'smoother: sweeps (level 0)': 1
    'smoother: sweeps (level 1)': 4
    'smoother: type (level 0)': line Jacobi
    'smoother: type (level 1)': line Jacobi
    'smoother: damping factor': 5.50000000000000044e-01
    'smoother: pre or post': both
    'coarse: type': Amesos-KLU
    'coarse: pre or post': pre
    'coarse: sweeps': 4
I get this for both:
number of linear iterations on first nonlinear iteration: 17
mean value of final solution: 2.66149625626
What BCs were you using before for Humboldt? (B.t.w., when I was telling you not to prescribe any Dirichlet BC on the lateral boundary, I was referring to Greenland, not to Humboldt.)
It might be that some ranks pick up a high-speed region, while others pick up slow-speed ones. This is especially possible for highly non-uniform meshes.
> Uhm, Thwaites is a reasonably sized ice sheet with non-trivial dynamics. So I suspect there's something wrong with those meshes that makes the problem very ill-conditioned. Can you put this on hold for a week or so? I'm now working on a new Greenland initialization, and we can target that when it's ready.
I'm okay with this. Since there are cases where I'm able to get the same number of iterations and average solution between a low-rank and a high-rank run, I'll assume this is some sort of issue with the mesh and continue trying to improve the GPU runs (working on cases that work, like Thwaites). I recall seeing the same issue there, but maybe not; I can't find the runs.
> What BCs were you using before for Humboldt? (B.t.w., when I was telling you not to prescribe any Dirichlet BC on the lateral boundary, I was referring to Greenland, not to Humboldt.)
I was using this:
LandIce BCs:
  Number: 2
  BC 0:
    Type: Basal Friction
    Side Set Name: basalside
    Cubature Degree: 4
    Basal Friction Coefficient:
      Type: Exponent Of Given Field
      Given Field Variable Name: basal_friction_log
      Zero Beta On Floating Ice: true
  BC 1:
    Type: Lateral
    Cubature Degree: 4
    Side Set Name: extruded_boundary_side_set_1
Something similar is done in the nightly input file too, but it has that extra term on extruded_boundary_node_set_3. What's the difference between extruded_boundary_node_set_1 and extruded_boundary_node_set_3?
The Humboldt mesh is obtained by cutting out a piece of the Greenland ice sheet. On the internal (artificial) boundary (node_set_3) we prescribe the velocity obtained from a full Greenland simulation. On the real ice margin (side_set_1) we prescribe the usual lateral condition, which accounts for ocean backpressure if the ice is partially submerged in the ocean.
> Uhm, Thwaites is a reasonably sized ice sheet with non-trivial dynamics. So I suspect there's something wrong with those meshes that makes the problem very ill-conditioned. Can you put this on hold for a week or so? I'm now working on a new Greenland initialization, and we can target that when it's ready.

> I'm okay with this. Since there are cases where I'm able to get the same number of iterations and average solution between a low-rank and a high-rank run, I'll assume this is some sort of issue with the mesh and continue trying to improve the GPU runs (working on cases that work, like Thwaites). I recall seeing the same issue there, but maybe not; I can't find the runs.
Yeah, I think it's probably the best course of action.
Ah okay, thanks for the info. I should update my Humboldt cases.
I ran into an interesting result when comparing linear convergence and solution average for different numbers of MPI ranks in Albany Land Ice:
The difference in linear convergence and solution average seems pretty large. My nonlinear tolerance is set to 1.0e-6 and my linear tolerance to 1.0e-8; I didn't see much of a difference when tightening them. I also didn't see much of a difference when I tried the same experiment on smaller problems. I'm still investigating what might cause this, but let me know if anyone has any thoughts.
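One reason the reported solution average can drift with the rank count is simply that iterative solves stop at a finite tolerance: two runs that take different iteration paths agree only to roughly the looser tolerance, amplified by the conditioning of the system. A minimal numpy sketch (hand-written CG on a 1D Laplacian; the matrix, size, and tolerances are made up for illustration and have nothing to do with the actual ALI systems):

```python
import numpy as np

def cg(A, b, rtol, maxit=10000):
    # Unpreconditioned conjugate gradient, stopping on relative residual
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rr = r @ r
    normb = np.linalg.norm(b)
    for _ in range(maxit):
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < rtol * normb:
            break
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

n = 200
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)  # 1D Laplacian, SPD
b = np.sin(np.linspace(0.0, np.pi, n))

# Same system, two stopping tolerances: the solution means agree
# only up to a tolerance-and-conditioning-dependent gap.
x_loose = cg(A, b, 1e-6)
x_tight = cg(A, b, 1e-10)
avg_gap = abs(x_loose.mean() - x_tight.mean())
print(avg_gap)
```

This doesn't rule out an assembly bug, but it does mean small mean-value discrepancies between rank counts are expected whenever the iteration histories differ, and the gap grows with ill-conditioning, consistent with the problem being worst on the Greenland meshes.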