Parallel p4est simulation crashes when ranks have zero elements

benegee commented 5 months ago

Apparently this has been around before, see the great analysis in https://github.com/trixi-framework/Trixi.jl/issues/1096. However, I am still facing the following issue:

Example to reproduce

Take https://github.com/trixi-framework/Trixi.jl/blob/main/examples/p4est_3d_dgsem/elixir_advection_amr.jl and apply the following changes:

initial_refinement_level = 1
tspan = (0.0, 30.0)

base_level = 0,
med_level = 1, med_threshold = 10.0,
max_level = 2, max_threshold = 50.0)

Run on 2 MPI ranks.

Error 1

The simulation fails at https://github.com/trixi-framework/Trixi.jl/blob/fc15c3873d2c0174db1f0a38aab13ff5e6132379/src/callbacks_step/analysis_dg3d_parallel.jl#L17 because get_node_vars tries to access the first element, which does not exist on one of the two ranks.

This can be avoided by using something like

  l2_error = zero(func(zero(SVector{nvariables(equations), eltype(u)}), equations))

Error 2

Next, the simulation fails at https://github.com/trixi-framework/Trixi.jl/blob/fc15c3873d2c0174db1f0a38aab13ff5e6132379/src/callbacks_step/analysis_dg3d_parallel.jl#L73-L77 because the index of the first element is passed to the anonymous function https://github.com/trixi-framework/Trixi.jl/blob/fc15c3873d2c0174db1f0a38aab13ff5e6132379/src/callbacks_step/analysis_dg3d.jl#L259-L264

A quick fix was to replace the body by something like

if nelements(mesh, dg, cache) == 0
  zero(eltype(u))
else
  # see above
end

Things then work as expected. But how would I solve this properly?

benegee commented 5 months ago

Sorry to bother you, @JoshuaLampert @lchristm @ranocha @sloede, I saw you have gone through this before. Do you have a suggestion how to fix the issues described above?

sloede commented 5 months ago

This is a conceptually hard problem:

We want to compute the integral over the output of a somewhat arbitrary function
The only way to assess the output type of such a function is by actually calling it
To be able to call it, we have to feed it somewhat sensible data (structures) in u
This data (structure) for u does not exist in a rank has zero elements.

My main question right now is: how did this ever work in the first place? Other than that, the only "clean" solution I see at the moment is to exclude zero-element ranks from the integral computations, which would require creating a new MPI communicator. While this is conceptually something that we should be doing anyways (and need to think about at some point in the intermediate future), it is very much non-trivial to implement.

Any other ideas/thoughts on this?

lchristm commented 5 months ago

I am also a bit puzzled as to how this is working so far at all. I agree that the solution proposed by @sloede would be the cleanest and most reasonable. A quick and dirty fix may be something like the following:

Assuming that

root is always non-empty (which is true as far as I can tell) and
the return types of the functions do not change over time,

we could determine the return types during callback initialization on root and distribute them via MPI.bcast. All processes could store this information somewhere and then use that instead of calling the function like zero(func(something...)).

trixi-framework / Trixi.jl