Note-taking on a medium-sized 3D channel flow problem with a Reynolds number of 1 to anticipate what FSP options to use for the big kahuna.
| Re | dofs | cpu | -pc_fieldsplit_schur_fact_type | -pc_fieldsplit_schur_precondition | solve time (s) |
|---|---|---|---|---|---|
| 1 | 655360 | 32 | full | a11 | 50.525 |
| 1 | 655360 | 32 | lower | a11 | 39.341 |
| 1 | 655360 | 32 | upper | a11 | 37.435 |
| 1 | 655360 | 32 | diag | a11 | 66.413 |
| 1 | 655360 | 32 | full | selfp | 47.303 |
| 1 | 655360 | 32 | lower | selfp | 37.095 |
| 1 | 655360 | 32 | upper | selfp | 35.902 |
| 1 | 655360 | 32 | diag | selfp | 61.088 |
Conclusions:

- `selfp` performs slightly better than `a11` for all factorization types
- `lower` and `upper` factorizations perform better than `full`, which performs better than `diag`; `upper` seems to have a slight edge on `lower` (the `upper` + `selfp` combination is sketched below)
- `-pc_fieldsplit_schur_precondition full` runs out of memory

I want to investigate whether we are solving a symmetric form with respect to A01 and A10 and, if not, whether switching to a symmetric form improves convergence.
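For concreteness, a minimal sketch of how the best-performing combination from the table (`upper` factorization with `selfp`) could be set in a MOOSE FSP block. The split names, variable names, and the bare inner-block settings are assumptions modeled on the fuller example later in this thread, not the exact input used for these timings:

```
[Preconditioning]
  [FSP]
    type = FSP
    topsplit = 'up'
    [up]
      # Schur field split over velocity and pressure; 'upper' + 'selfp' was fastest in the table above
      splitting = 'u p'
      splitting_type = schur
      petsc_options_iname = '-pc_fieldsplit_schur_fact_type -pc_fieldsplit_schur_precondition'
      petsc_options_value = 'upper selfp'
    []
    [u]
      # A00 (velocity) block; its solver/PC options would go here
      vars = 'u v'
    []
    [p]
      # Schur complement (pressure) block; its solver/PC options would go here
      vars = 'pressure'
    []
  []
[]
```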
My general experience:

- `-pc_jacobi_type rowmax`; `-pc_jacobi_type rowsum` appears to be terrible
- `full` factorization does not compare as poorly to `lower`/`upper` as for non-advection-dominated problems; in fact it may be faster
- `diag` and `blockdiag` are equivalent, and in these cases when using `selfp`, Ainv just corresponds to the reciprocal of A's diagonal, which for advection-dominated problems leads to Sp poorly approximating S (written out below). E.g. even when doing first-order upwind with a Reynolds number of 2.2 in our `2d-rc-no-slip.i` channel flow problem, it takes 50-60 linear iterations to solve the Schur complement with `-pc_type lu` for both the outer Schur PC type and the A00 PC type

Transferring this comment to https://github.com/idaholab/moose/discussions/24809. Any additional findings will go there.
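For context on the `selfp` point above: per the PETSc documentation for `-pc_fieldsplit_schur_precondition`, `selfp` assembles an explicit approximation to the Schur complement by replacing the inverse of the full A00 block with the inverse of its diagonal (block notation below is mine):

$$
S = A_{11} - A_{10} A_{00}^{-1} A_{01},
\qquad
S_p = A_{11} - A_{10}\,\mathrm{diag}(A_{00})^{-1} A_{01}
$$

When A00 is advection dominated, its diagonal is a poor stand-in for its inverse, which is consistent with the 50-60 Schur linear iterations observed above.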
Adding a note that I had posted on slack but never added here:
I barely got within the memory limits of my workstation and was able to solve a 1.3 million cell 3D channel flow problem (5.2 million dofs) with 64 CPUs in 338 seconds (90 seconds spent in setup) with field split. I will need to take this to the HPC to see how much bigger I can go.
How many nonlinear iterations did it take to solve your problem (asking to compare with SIMPLE, which needs 400 to solve a problem of similar size, Re=220)?
I don't remember exactly. Somewhere between 3 and 6, although the 248 second solve reported was for a problem with Reynolds ~1.
Testing out some things. 696.907 second solve for 7002001 dofs with 63 processes, using distributed mesh and the hybrid discretization in https://github.com/idaholab/moose/pull/23986, for the lid-driven problem with Re=1 and these solver parameters:
```
[Problem]
  type = NavierStokesProblem
  mass_matrix = 'mass'
  L_matrix = 'L'
  extra_tag_matrices = 'mass L'
[]

[Preconditioning]
  active = FSP
  [FSP]
    type = FSP
    topsplit = 'up'
    [up]
      splitting = 'u p'
      splitting_type = schur
      petsc_options_iname = '-pc_fieldsplit_schur_fact_type -pc_fieldsplit_schur_precondition -ksp_gmres_restart -ksp_type -ksp_pc_side -ksp_rtol'
      petsc_options_value = 'full self 300 fgmres right 1e-4'
    []
    [u]
      vars = 'u v'
      petsc_options = '-ksp_monitor'
      petsc_options_iname = '-pc_type -pc_hypre_type -ksp_type -ksp_rtol -ksp_gmres_restart -ksp_pc_side'
      petsc_options_value = 'hypre boomeramg gmres 1e-2 300 right'
    []
    [p]
      vars = 'pressure'
      petsc_options = '-pc_lsc_scale_diag -ksp_monitor -lsc_ksp_monitor'
      petsc_options_iname = '-ksp_type -ksp_gmres_restart -ksp_rtol -pc_type -ksp_pc_side -lsc_pc_type -lsc_ksp_type -lsc_ksp_pc_side -lsc_ksp_rtol'
      petsc_options_value = 'fgmres 300 1e-2 lsc right hypre gmres right 1e-1'
    []
  []
[]
```
With identical solve options, Q2Q1 Taylor-Hood elements (5768003 dofs), Re=1, and distributed mesh, the solve time is 170.904 s. The appreciable difference between the Q2Q1 and hybrid CG-DG performance is due to the time spent in the A-block solves: hypre BoomerAMG was taking about 3-4 iterations for Q2Q1 and ~10 iterations with hybrid CG-DG.
Copying from slack notes: according to @grmnptr OpenFOAM solves a 3 million dof problem in 273 seconds on a single process
Solving Q2Q1 at Re=1 with 3 million dofs on a single process gives a solve time of 742 seconds, so a factor of 2.7 slower. I do not know the setup of the OpenFOAM case, so this is likely not an apples-to-apples comparison.
Using Q2Q1 elements, I was able to solve a 70 million dof problem (70,034,582 to be exact; n=2789 elements, 2 dimensions) using 3,504 procs on Sawtooth in 278.778 seconds with Re=1. Reading the title post again, the prospective users wanted to do 70 million cells, so assuming that's 3D, that would be roughly 280 million dofs. Time to request another interactive session!
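As a sanity check on that dof count (assuming `n = 2789` means a 2789 x 2789 quad mesh, which is my reading rather than something stated explicitly): 2D Q2Q1 Taylor-Hood carries two biquadratic velocity components and one bilinear pressure, so

$$
2\,(2n+1)^2 + (n+1)^2 = 2 \cdot 5579^2 + 2790^2 = 62{,}250{,}482 + 7{,}784{,}100 = 70{,}034{,}582,
$$

which matches the reported count exactly.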
Going up to 280 million dofs on Sawtooth, I got crashes and messages possibly related to MPI. I am not too motivated to dig into that right now, so for the moment my record is 70 million. I don't really see this as a MOOSE issue, so I'm closing. We can always re-open if someone wants to.
Reason
We have HPC people who want to see if they can use NS (Pronghorn) to solve a problem with 70 million cells. The mesh is complex. To start, we will see how big a simple channel problem we can solve using field split.
Design
Take a 3D channel and see how big we can go.
Impact
Test what we're capable of doing in NS