jerett-cc / ryujin

High-performance high-order finite element solver for hyperbolic conservation equations
https://conservation-laws.org

Build issue? MPI_ERR_TRUNCATE error on cases that work on W machine #16

Open jerett-cc opened 7 months ago

jerett-cc commented 7 months ago

I get the following output when I run test cases that we have already tested successfully on @bangerth's machine.

here 1
here 1
here 1
here 1
here 1
here 1
here 2
here 2
here 2
here 2
here 2
here 2
here 3
before levels
here 3
before levels
here 3
here 3
here 3
before levels
here 3
before levels
before levels
before levels
after levels, before time_loops
after levels, before time_loops
end default construction
here 4
end default construction
here 4
after levels, before time_loops
after levels, before time_loops
px: 6
px: 6
after levels, before time_loops
end default construction
here 4
end default construction
here 4
end default construction
here 4
after levels, before time_loops
px: 6
px: 6
px: 6
end default construction
here 4
px: 6
[INFO] Setting up Structures in App at level 1
[INFO] Setting up Structures in App at level 1
[INFO] Setting up Structures in App at level 1
[INFO] Setting up Structures in App at level 1
[INFO] Setting up Structures in App at level 1
[INFO] Setting up Structures in App at level 1
[INFO] Setting up Structures in App at level 0
[INFO] Setting up Structures in App at level 0
[INFO] Setting up Structures in App at level 0
[INFO] Setting up Structures in App at level 0
[INFO] Setting up Structures in App at level 0
[INFO] Setting up Structures in App at level 0
[up:36743] *** An error occurred in MPI_Waitall
[up:36743] *** reported by process [14508577067677188097,5]
[up:36743] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
[up:36743] *** MPI_ERR_TRUNCATE: message truncated
[up:36743] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[up:36743] ***    and potentially your MPI job)
[up.math.colostate.edu:36736] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[up.math.colostate.edu:36736] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
^Cmpirun: abort is already in progress...hit ctrl-c again to forcibly terminate

This happens when running

mpirun -n 6 high-order-euler

with the following prm file:

subsection MPI Parameters
  set px = 6
  set Time Bricks = 4
  set Start Time = 0.0
  set Stop Time = 5.0
  set cfactor = 2 ## 2 is default
  set max_iter = 1
end
subsection MGRIT
  set mgrit refinements = 0, 1
  set print_solution = false
end

subsection OfflineData
end
subsection TimeLoop
  set basename                      = cylinder
  set enable checkpointing          = false
  set enable compute error          = false
  set enable compute quantities     = false
  set enable output full            = true
  set enable output levelsets       = false
  set error normalize               = false
  set error quantities              = rho, m_1, m_2, E
  set output checkpoint multiplier  = 1
  set output full multiplier        = 1
  set output granularity            = 1
  set output levelsets multiplier   = 1
  set output quantities multiplier  = 1
  set refinement timepoints         = 
  set resume                        = false
  set terminal show rank throughput = false
  set terminal update interval      = 5
end
subsection Equation
  set gamma                   = 1.4
  set reference density       = 1
  set vacuum state relaxation = 10000
end
subsection Discretization
  set geometry            = cylinder
  set mesh distortion     = 0
  set mesh repartitioning = false
end
subsection InitialValues
  set configuration = uniform
  set direction     = 1, 0
  set perturbation  = 0
  set position      = 1, 0
  subsection astro jet
    set jet width               = 0.05
    set primitive ambient right = 5, 0, 0.4127
    set primitive jet state     = 5, 30, 0.4127
  end
  subsection uniform
    set primitive state = 1.4, 3, 1
  end
end
subsection HyperbolicModule
  set cfl with boundary dofs        = false
  set limiter iterations            = 2
  set limiter newton max iterations = 2
  set limiter newton tolerance      = 1e-10
  set limiter relaxation factor     = 1
end
subsection TimeIntegrator
  set cfl max               = 0.9
  set cfl min               = 0.45
  set cfl recovery strategy = bang bang control
  set time stepping scheme  = erk 33
end
subsection VTUOutput
  set manifolds                  = 
  set schlieren beta             = 10
  set schlieren quantities       = rho
  set schlieren recompute bounds = true
  set use mpi io                 = true
  set vorticity quantities       = 
  set vtu output quantities      = rho, m_1, m_2, E
end
subsection Quantities
  set boundary manifolds           = 
  set clear statistics on writeout = true
  set interior manifolds           = 
end
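
Note that the abort above is reported on a communicator that was split from MPI_COMM_WORLD ("MPI COMMUNICATOR 3 SPLIT FROM 0"). For orientation only, here is a minimal sketch of a conventional space/time split driven by px; this is an assumption about the layout, not ryujin's actual code. With mpirun -n 6 and px = 6 it leaves a single spatial group of 6 ranks and a time communicator of size 1 on every rank, even though Time Bricks = 4 are requested.

#include <mpi.h>

// Hedged sketch (not ryujin code): a common way a space-time solver splits
// the world communicator into a spatial communicator of size px and a time
// communicator.
void split_space_time(const MPI_Comm comm_world, const int px,
                      MPI_Comm &comm_x, MPI_Comm &comm_t)
{
  int world_rank;
  MPI_Comm_rank(comm_world, &world_rank);

  // Ranks with the same value of world_rank / px do the spatial work together ...
  MPI_Comm_split(comm_world, /*color=*/world_rank / px,
                 /*key=*/world_rank, &comm_x);
  // ... while ranks with the same world_rank % px form the time communicator.
  MPI_Comm_split(comm_world, /*color=*/world_rank % px,
                 /*key=*/world_rank, &comm_t);
}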
jerett-cc commented 7 months ago

The executable high-order-euler is built by entering the /ryujin directory and typing make debug, which differs from how @bangerth builds the debug executable on his machine.

jerett-cc commented 7 months ago

This issue may be related to the number of time bricks. For example, suppose the spatial communicator has 5 processes and the global communicator also has only 5 processes, but Num_Time = 4: then each of the 4 bricks is asked to use the same communicator.

I suspect that because all of the bricks share the same communicator, their messages clash: one brick posts communication assuming it has the communicator to itself, another brick does the same, and the messages get mismatched (a minimal sketch of this kind of clash follows the parameter file below). As evidence for this hunch, consider that running

mpirun -n 5 high-order-euler test.prm 5 1 3 5

with test.prm:

subsection App
  set print_solution = false
  set Time Bricks = 1
  set Start Time = 0.0
  set Stop Time = 5.0
  set cfactor = 2 # 2 is Xbraid default
  set max_iter = 1
end

subsection OfflineData
end
subsection TimeLoop
  set basename                      = cylinder
  set enable checkpointing          = false
  set enable compute error          = false
  set enable compute quantities     = false
  set enable output full            = true
  set enable output levelsets       = false
  set error normalize               = false
  set error quantities              = rho, m_1, m_2, E
  set output checkpoint multiplier  = 1
  set output full multiplier        = 1
  set output granularity            = 1
  set output levelsets multiplier   = 1
  set output quantities multiplier  = 1
  set refinement timepoints         = 
  set resume                        = false
  set terminal show rank throughput = false
  set terminal update interval      = 5
end
subsection Equation
  set gamma                   = 1.4
  set reference density       = 1
  set vacuum state relaxation = 10000
end
subsection Discretization
  set geometry            = cylinder
  set mesh distortion     = 0
  set mesh repartitioning = false
end
subsection InitialValues
  set configuration = uniform
  set direction     = 1, 0
  set perturbation  = 0
  set position      = 1, 0
  subsection astro jet
    set jet width               = 0.05
    set primitive ambient right = 5, 0, 0.4127
    set primitive jet state     = 5, 30, 0.4127
  end
  subsection uniform
    set primitive state = 1.4, 3, 1
  end
end
subsection HyperbolicModule
  set cfl with boundary dofs        = false
  set limiter iterations            = 2
  set limiter newton max iterations = 2
  set limiter newton tolerance      = 1e-10
  set limiter relaxation factor     = 1
end
subsection TimeIntegrator
  set cfl max               = 0.9
  set cfl min               = 0.45
  set cfl recovery strategy = bang bang control
  set time stepping scheme  = erk 33
end
subsection VTUOutput
  set manifolds                  = 
  set schlieren beta             = 10
  set schlieren quantities       = rho
  set schlieren recompute bounds = true
  set use mpi io                 = true
  set vorticity quantities       = 
  set vtu output quantities      = rho, m_1, m_2, E
end
subsection Quantities
  set boundary manifolds           = 
  set clear statistics on writeout = true
  set interior manifolds           = 
end

does not produce an MPI error. The only change relative to the broken run reported above is that test.prm sets Time Bricks = 1.
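
To illustrate the suspected failure mode, here is a minimal standalone MPI program; it is not ryujin code, and the "brick" labels are purely hypothetical. Two logical bricks reuse the same tag on the same communicator, so the receive posted for the short message gets matched against the longer one, and MPI_Waitall aborts with MPI_ERR_TRUNCATE exactly as in the backtrace above.

#include <mpi.h>
#include <vector>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int tag = 0; // both "bricks" reuse the same tag on the same communicator

  if (rank == 0)
    {
      // "Brick B" sends 8 doubles ...
      std::vector<double> big(8, 1.0);
      MPI_Send(big.data(), 8, MPI_DOUBLE, /*dest=*/1, tag, MPI_COMM_WORLD);
    }
  else if (rank == 1)
    {
      // ... but rank 1 posted this receive on behalf of "brick A", which only
      // expected 4.  MPI matches messages by (source, tag, communicator), so
      // it cannot tell the two bricks apart: the long message lands here and,
      // with MPI_ERRORS_ARE_FATAL, MPI_Waitall aborts with MPI_ERR_TRUNCATE.
      std::vector<double> small(4);
      MPI_Request request;
      MPI_Irecv(small.data(), 4, MPI_DOUBLE, /*source=*/0, tag,
                MPI_COMM_WORLD, &request);
      MPI_Waitall(1, &request, MPI_STATUSES_IGNORE);
    }

  MPI_Finalize();
  return 0;
}

(Assuming a standard MPI installation, compiling this with mpicxx and running it on two ranks reproduces an MPI_ERR_TRUNCATE / MPI_ERRORS_ARE_FATAL abort of the same form as above.)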

jerett-cc commented 7 months ago

Adding an MPI_Barrier(comm_x) does not help. In app we wrote:

void prepare_mg_objects()
{
  for (unsigned int lvl = 0; lvl < refinement_levels.size(); lvl++)
    {
      if (dealii::Utilities::MPI::this_mpi_process(comm_t) == 0)
        {
          std::cout << "[INFO] Preparing Structures in App at level "
                    << refinement_levels[lvl] << std::endl;
        }
      levels[lvl]->prepare();
      std::cout << "Level " + std::to_string(refinement_levels[lvl]) + " prepared." << std::endl;
      // Synchronize all spatial ranks before moving on to the next level.
      MPI_Barrier(comm_x);
    }
  // Set the last variables in app.
  n_fine_dofs          = levels[0]->offline_data->dof_handler().n_dofs();
  n_locally_owned_dofs = levels[0]->offline_data->n_locally_owned();
}

The barrier seems to NOT solve the issue here.

jerett-cc commented 7 months ago

Similarly,

MPI_Barrier(MPI_COMM_WORLD);

in the same spot does NOT fix the issue.
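
A possible reason the barriers do not help: MPI matches messages by (source, tag, communicator), and a barrier only orders execution, so a receive can still be matched against another brick's send on the shared comm_x. Purely as a sketch of the general MPI pattern, with hypothetical names (n_bricks, brick_comms) and not as a patch against ryujin or XBraid, one way to keep brick traffic separate would be to hand each brick a private duplicate of the spatial communicator:

#include <mpi.h>
#include <vector>

// Hypothetical helper: give each time brick a private duplicate of the
// spatial communicator, so its messages can never be matched against
// another brick's traffic.
std::vector<MPI_Comm> make_brick_communicators(MPI_Comm comm_x,
                                               const unsigned int n_bricks)
{
  std::vector<MPI_Comm> brick_comms(n_bricks);
  for (unsigned int b = 0; b < n_bricks; ++b)
    // MPI_Comm_dup is collective over comm_x, so every rank of comm_x must
    // call it the same number of times and in the same order.
    MPI_Comm_dup(comm_x, &brick_comms[b]);
  return brick_comms;
}

Whether the MGRIT driver can actually be given per-brick communicators is a separate question; this only illustrates that duplicated communicators cannot cross-match messages.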

bangerth commented 7 months ago

That's a bummer. I would leave it in anyway, though. It does not hurt, and might help.

jerett-cc commented 3 weeks ago

NEED to test this with the current (Aug. 22nd 2024) state of the code.