Open benegee opened 5 months ago
Sorry to bother you, @JoshuaLampert @lchristm @ranocha @sloede, I saw you have gone through this before. Do you have a suggestion how to fix the issues described above?
This is a conceptually hard problem:
u
u
does not exist in a rank has zero elements.My main question right now is: how did this ever work in the first place? Other than that, the only "clean" solution I see at the moment is to exclude zero-element ranks from the integral computations, which would require creating a new MPI communicator. While this is conceptually something that we should be doing anyways (and need to think about at some point in the intermediate future), it is very much non-trivial to implement.
Any other ideas/thoughts on this?
I am also a bit puzzled as to how this is working so far at all. I agree that the solution proposed by @sloede would be the cleanest and most reasonable. A quick and dirty fix may be something like the following:
Assuming that
we could determine the return types during callback initialization on root and distribute them via MPI.bcast
. All processes could store this information somewhere and then use that instead of calling the function like zero(func(something...))
.
Apparently this has been around before, see the great analysis in https://github.com/trixi-framework/Trixi.jl/issues/1096. However, I am still facing the following issue:
Example to reproduce
Take https://github.com/trixi-framework/Trixi.jl/blob/main/examples/p4est_3d_dgsem/elixir_advection_amr.jl and apply the following changes:
initial_refinement_level = 1
tspan = (0.0, 30.0)
Run on 2 MPI ranks.
Error 1
The simulation fails at https://github.com/trixi-framework/Trixi.jl/blob/fc15c3873d2c0174db1f0a38aab13ff5e6132379/src/callbacks_step/analysis_dg3d_parallel.jl#L17 because
get_node_vars
tries to access the first element, which does not exist on one of the two ranks.This can be avoided by using something like
Error 2
Next, the simulation fails at https://github.com/trixi-framework/Trixi.jl/blob/fc15c3873d2c0174db1f0a38aab13ff5e6132379/src/callbacks_step/analysis_dg3d_parallel.jl#L73-L77 because the index of the first element is passed to the anonymous function https://github.com/trixi-framework/Trixi.jl/blob/fc15c3873d2c0174db1f0a38aab13ff5e6132379/src/callbacks_step/analysis_dg3d.jl#L259-L264
A quick fix was to replace the body by something like
Things then work as expected. But how would I solve this properly?