Note-taking on a medium-sized 3D channel flow problem with a Reynolds number of 1 to anticipate what FSP options to use for the big kahuna.
| Re | dofs | cpu | -pc_fieldsplit_schur_fact_type | -pc_fieldsplit_schur_precondition | solve time (s) |
|---|---|---|---|---|---|
| 1 | 655360 | 32 | full | a11 | 50.525 |
| 1 | 655360 | 32 | lower | a11 | 39.341 |
| 1 | 655360 | 32 | upper | a11 | 37.435 |
| 1 | 655360 | 32 | diag | a11 | 66.413 |
| 1 | 655360 | 32 | full | selfp | 47.303 |
| 1 | 655360 | 32 | lower | selfp | 37.095 |
| 1 | 655360 | 32 | upper | selfp | 35.902 |
| 1 | 655360 | 32 | diag | selfp | 61.088 |
Conclusions:

- `selfp` performs slightly better than `a11` for all factorization types
- `lower` and `upper` factorizations perform better than `full`, which performs better than `diag`; `upper` seems to have a slight edge on `lower` (the `upper` + `selfp` combination is sketched below)
- `-pc_fieldsplit_schur_precondition full` runs out of memory

I want to investigate whether we are solving a symmetric form with respect to A01 and A10 and, if not, whether switching to a symmetric form improves convergence.
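For concreteness, a minimal sketch of how the best-performing combination from the table (`upper` factorization with `selfp`) could be set in a MOOSE FSP block. The split names, variable names, and the bare inner-block settings are assumptions modeled on the fuller example later in this thread, not the exact input used for these timings:

```
[Preconditioning]
  [FSP]
    type = FSP
    topsplit = 'up'
    [up]
      # Schur field split over velocity and pressure; 'upper' + 'selfp' was fastest in the table above
      splitting = 'u p'
      splitting_type = schur
      petsc_options_iname = '-pc_fieldsplit_schur_fact_type -pc_fieldsplit_schur_precondition'
      petsc_options_value = 'upper selfp'
    []
    [u]
      # A00 (velocity) block; its solver/PC options would go here
      vars = 'u v'
    []
    [p]
      # Schur complement (pressure) block; its solver/PC options would go here
      vars = 'pressure'
    []
  []
[]
```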
My general experience:

- `-pc_jacobi_type rowmax`; `-pc_jacobi_type rowsum` appears to be terrible
- `full` factorization does not compare as poorly to `lower`/`upper` as for non-advection-dominated problems; in fact it may be faster
- `diag` and `blockdiag` are equivalent, and in these cases when using `selfp`, Ainv just corresponds to the reciprocal of A's diagonal, which for advection-dominated problems leads to Sp poorly approximating S (written out below). E.g. even when doing first-order upwind with a Reynolds number of 2.2 in our `2d-rc-no-slip.i` channel flow problem, it takes 50-60 linear iterations to solve the Schur complement with `-pc_type lu` for both the outer Schur PC type and the A00 PC type

Transferring this comment to https://github.com/idaholab/moose/discussions/24809. Any additional findings will go there.
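For context on the `selfp` point above: per the PETSc documentation for `-pc_fieldsplit_schur_precondition`, `selfp` assembles an explicit approximation to the Schur complement by replacing the inverse of the full A00 block with the inverse of its diagonal (block notation below is mine):

$$
S = A_{11} - A_{10} A_{00}^{-1} A_{01},
\qquad
S_p = A_{11} - A_{10}\,\mathrm{diag}(A_{00})^{-1} A_{01}
$$

When A00 is advection dominated, its diagonal is a poor stand-in for its inverse, which is consistent with the 50-60 Schur linear iterations observed above.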
Adding a note that I had posted on slack but never added here:
I barely got within the memory limits of my workstation and was able to solve a 1.3 million cell 3D channel flow problem (5.2 million dofs) with 64 CPUs in 338 seconds (90 seconds spent in setup) with field split. I will need to take this to the HPC to see how much bigger I can go.
How many nonlinear iterations did it take to solve your problem (asking to compare with SIMPLE, which needs 400 to solve a problem of similar size, Re=220)?
I don't remember exactly. Somewhere between 3 and 6, although the 248 second solve reported was for a problem with Reynolds ~1.
Testing out some things. 696.907 second solve for 7002001 dofs with 63 processes, using distributed mesh and the hybrid discretization in https://github.com/idaholab/moose/pull/23986, for the lid-driven problem with Re=1 and these solver parameters:
```
[Problem]
  type = NavierStokesProblem
  mass_matrix = 'mass'
  L_matrix = 'L'
  extra_tag_matrices = 'mass L'
[]

[Preconditioning]
  active = FSP
  [FSP]
    type = FSP
    topsplit = 'up'
    [up]
      splitting = 'u p'
      splitting_type = schur
      petsc_options_iname = '-pc_fieldsplit_schur_fact_type -pc_fieldsplit_schur_precondition -ksp_gmres_restart -ksp_type -ksp_pc_side -ksp_rtol'
      petsc_options_value = 'full self 300 fgmres right 1e-4'
    []
    [u]
      vars = 'u v'
      petsc_options = '-ksp_monitor'
      petsc_options_iname = '-pc_type -pc_hypre_type -ksp_type -ksp_rtol -ksp_gmres_restart -ksp_pc_side'
      petsc_options_value = 'hypre boomeramg gmres 1e-2 300 right'
    []
    [p]
      vars = 'pressure'
      petsc_options = '-pc_lsc_scale_diag -ksp_monitor -lsc_ksp_monitor'
      petsc_options_iname = '-ksp_type -ksp_gmres_restart -ksp_rtol -pc_type -ksp_pc_side -lsc_pc_type -lsc_ksp_type -lsc_ksp_pc_side -lsc_ksp_rtol'
      petsc_options_value = 'fgmres 300 1e-2 lsc right hypre gmres right 1e-1'
    []
  []
[]
```
With identical solve options, Q2Q1 Taylor-Hood elements (5768003 dofs), Re=1, and distributed mesh, the solve time is 170.904 s. The appreciable difference between the Q2Q1 and hybrid CG-DG performance is due to the time spent in the A-block solves: hypre BoomerAMG was taking about 3-4 iterations for Q2Q1 and ~10 iterations with hybrid CG-DG.
Copying from slack notes: according to @grmnptr OpenFOAM solves a 3 million dof problem in 273 seconds on a single process
Solving Q2Q1 at Re=1 with 3 million dofs on a single process gives a solve time of 742 seconds, so a factor of 2.7 slower. I do not know the setup of the OpenFOAM case, so this is likely not an apples-to-apples comparison.
Using Q2Q1 elements, I was able to solve a 70 million dof problem (70,034,582 to be exact; n=2789 elements, 2 dimensions) using 3,504 procs on Sawtooth in 278.778 seconds with Re=1. Reading the title post again, the prospective users wanted to do 70 million cells, so assuming that's 3D, that would be roughly 280 million dofs. Time to request another interactive session!
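As a sanity check on that dof count (assuming `n = 2789` means a 2789 x 2789 quad mesh, which is my reading rather than something stated explicitly): 2D Q2Q1 Taylor-Hood carries two biquadratic velocity components and one bilinear pressure, so

$$
2\,(2n+1)^2 + (n+1)^2 = 2 \cdot 5579^2 + 2790^2 = 62{,}250{,}482 + 7{,}784{,}100 = 70{,}034{,}582,
$$

which matches the reported count exactly.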
Going up to 280 million dofs on Sawtooth, I got crashes and messages possibly related to MPI. I am not too motivated to dig into that right now, so for the moment my record is 70 million. I don't really see this as a MOOSE issue, so I'm closing. We can always re-open if someone wants to.
Reason
We have HPC people who want to see if they can use NS (Pronghorn) to solve a problem with 70 million cells. The mesh is complex. To start, we will see how big a simple channel problem we can solve using field split.
Design
Take a 3D channel and see how big we can go.
Impact
Test what we're capable of doing in NS