trixi-framework / Trixi.jl

Trixi.jl: Adaptive high-order numerical simulations of conservation laws in Julia
https://trixi-framework.github.io/Trixi.jl
MIT License

Parallel p4est simulation crashes when ranks have zero elements #1973

Open · benegee opened this issue 3 weeks ago

benegee commented 3 weeks ago

Apparently this issue has come up before; see the great analysis in https://github.com/trixi-framework/Trixi.jl/issues/1096. However, I am still facing the following problem:

Example to reproduce

Take https://github.com/trixi-framework/Trixi.jl/blob/main/examples/p4est_3d_dgsem/elixir_advection_amr.jl and apply the following changes:
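In essence: make the mesh so coarse that there are fewer elements than MPI ranks, and run with two ranks. A hypothetical minimal variant of the mesh setup (a sketch, not the exact diff; the coordinate values are illustrative):

  # Hypothetical minimal setup: one tree and no initial refinement give a
  # single element in total, so with two MPI ranks one rank owns zero elements.
  trees_per_dimension = (1, 1, 1)
  mesh = P4estMesh(trees_per_dimension, polydeg = 3,
                   coordinates_min = (-5.0, -5.0, -5.0),
                   coordinates_max = (5.0, 5.0, 5.0),
                   initial_refinement_level = 0)

  # Launched with, e.g., MPI.jl's wrapper script:
  #   mpiexecjl -n 2 julia --project=. elixir_advection_amr.jl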

Error 1

The simulation fails at https://github.com/trixi-framework/Trixi.jl/blob/fc15c3873d2c0174db1f0a38aab13ff5e6132379/src/callbacks_step/analysis_dg3d_parallel.jl#L17 because get_node_vars tries to access the first element, which does not exist on one of the two ranks.

This can be avoided by using something like

  l2_error = zero(func(zero(SVector{nvariables(equations), eltype(u)}), equations))
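For context, the failing line initializes the error norms from the solution values of the first element, whereas the replacement builds the zero value from an all-zero node state, so no element data is needed. Roughly (a sketch, not verbatim Trixi code):

  using StaticArrays: SVector

  # Before (sketch of analysis_dg3d_parallel.jl, line 17): reads element 1,
  # which does not exist on ranks without elements
  l2_error = zero(func(get_node_vars(u, equations, dg, 1, 1, 1, 1), equations))

  # After: evaluate func on an all-zero node state instead of real data
  l2_error = zero(func(zero(SVector{nvariables(equations), eltype(u)}), equations))
  linf_error = copy(l2_error)

One caveat: this assumes func can be evaluated at an all-zero state, which is not true for all quantities (e.g., anything involving division by density or a logarithm).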

Error 2

Next, the simulation fails at https://github.com/trixi-framework/Trixi.jl/blob/fc15c3873d2c0174db1f0a38aab13ff5e6132379/src/callbacks_step/analysis_dg3d_parallel.jl#L73-L77 because the index of the first element is passed to the anonymous function defined at https://github.com/trixi-framework/Trixi.jl/blob/fc15c3873d2c0174db1f0a38aab13ff5e6132379/src/callbacks_step/analysis_dg3d.jl#L259-L264.

A quick fix was to replace the body by something like

if nelements(mesh, dg, cache) == 0
  zero(eltype(u))
else
  # see above
end
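In context, the guard sits inside the closure that integrate passes to integrate_via_indices, which on empty ranks is still evaluated once with index 1 to initialize the integral. A sketch (signature abbreviated, not verbatim):

  function integrate(func::Func, u, mesh::P4estMesh{3},
                     equations, dg::DG, cache; normalize = true) where {Func}
      integrate_via_indices(u, mesh, equations, dg, cache;
                            normalize = normalize) do u, i, j, k, element, equations, dg
          if nelements(mesh, dg, cache) == 0
              # No elements on this rank: only the type of the result matters
              return zero(eltype(u))
          else
              u_local = get_node_vars(u, equations, dg, i, j, k, element)
              return func(u_local, equations)
          end
      end
  end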

Things then work as expected. But how would I solve this properly?

benegee commented 1 week ago

Sorry to bother you, @JoshuaLampert @lchristm @ranocha @sloede, but I saw that you have gone through this before. Do you have a suggestion for how to fix the issues described above?

sloede commented 1 week ago

This is a conceptually hard problem:

My main question right now is: how did this ever work in the first place? Other than that, the only "clean" solution I see at the moment is to exclude zero-element ranks from the integral computations, which would require creating a new MPI communicator. While this is conceptually something we should be doing anyway (and will need to think about at some point in the intermediate future), it is very much non-trivial to implement.
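To make this concrete: with MPI.jl, such a sub-communicator could be created via MPI.Comm_split, roughly like this (a hypothetical sketch; the helper name is made up, and the communicator would have to be rebuilt after every load-balancing or AMR step):

  using MPI

  # Hypothetical helper: ranks with zero elements pass `nothing` as the color,
  # which maps to MPI_UNDEFINED, so they receive MPI.COMM_NULL back and would
  # simply skip the integral computation and the associated reduction.
  function element_owners_comm(comm::MPI.Comm, n_local_elements::Integer)
      color = n_local_elements > 0 ? 0 : nothing
      return MPI.Comm_split(comm, color, MPI.Comm_rank(comm))
  end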

Any other ideas/thoughts on this?

lchristm commented 1 week ago

I am also a bit puzzled as to how this has been working at all so far. I agree that the solution proposed by @sloede would be the cleanest and most reasonable one. A quick and dirty fix could be something like the following:

Assuming that the return types do not depend on the local solution data (i.e., they are identical on all ranks and do not change during the simulation), we could determine them during callback initialization on root and distribute them via MPI.bcast. All processes could store this information somewhere and then use it instead of calling the function like zero(func(something...)).
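Roughly, such an initialization could look like this (a hypothetical sketch; it assumes the root rank owns at least one element):

  using MPI

  # Hypothetical: root evaluates the integrand once on real data to obtain a
  # correctly typed zero value and broadcasts it; all ranks cache the result
  # during callback initialization instead of calling zero(func(...)) locally.
  function broadcast_zero_value(func, u, equations, dg, comm::MPI.Comm)
      zero_val = if MPI.Comm_rank(comm) == 0
          zero(func(get_node_vars(u, equations, dg, 1, 1, 1, 1), equations))
      else
          nothing
      end
      # MPI.bcast can serialize arbitrary Julia objects from root to all ranks
      return MPI.bcast(zero_val, comm; root = 0)
  end

The broadcast value could then be stored in the analysis callback's cache, so the hot loops stay type-stable.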