heplesser closed this issue 6 years ago.
This is not a problem with `0 GetStatus` reading a number from the GRNG, but rather a parallel-computing problem. When a script is run with, for example, two MPI processes and calls `0 GetStatus` only on process 0, that process will, while accumulating the status dictionary, at some point want to update the delay extrema, and thus has to communicate with the other processes. This communication is a single collective MPI call in the kernel.
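In spirit, that update is a collective exchange of each rank's local delay extrema. A minimal sketch, with hypothetical names (`update_delay_extrema`, `local_min`, `local_max`) rather than the actual NEST source:

```cpp
#include <mpi.h>
#include <vector>

// Hypothetical sketch of a delay-extrema update: every rank contributes its
// local (min, max) delay and receives the pairs from all other ranks.
// MPI_Allgather is collective: it returns only once ALL ranks have called it.
void update_delay_extrema(double local_min, double local_max)
{
  int num_ranks = 0;
  MPI_Comm_size(MPI_COMM_WORLD, &num_ranks);

  double local[2] = { local_min, local_max };
  std::vector<double> all(2 * num_ranks);
  MPI_Allgather(local, 2, MPI_DOUBLE, all.data(), 2, MPI_DOUBLE, MPI_COMM_WORLD);
  // ... reduce `all` to the global min/max delay ...
}
```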
This causes it to deadlock. However, the other process, process 1, will go into `Simulate`, where it too updates the delay extrema, resolving this deadlock. Process 1 then continues to the check for synchronized GRNGs, where it tries to gather the random numbers from all processes with `MPI_Allgather`. This creates a new deadlock.
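The synchrony check itself can be sketched like this (again with made-up names; an illustration of the idea, not NEST's implementation):

```cpp
#include <mpi.h>
#include <stdexcept>
#include <vector>

// Hypothetical sketch of a GRNG synchrony check: every rank draws one number
// from its global RNG and gathers the draws from all ranks. If the global
// RNGs are in sync, all draws must be identical.
void check_grng_synchrony(unsigned long my_draw)
{
  int num_ranks = 0;
  MPI_Comm_size(MPI_COMM_WORLD, &num_ranks);

  std::vector<unsigned long> draws(num_ranks);
  MPI_Allgather(&my_draw, 1, MPI_UNSIGNED_LONG,
                draws.data(), 1, MPI_UNSIGNED_LONG, MPI_COMM_WORLD);

  for (const unsigned long d : draws)
    if (d != my_draw)
      throw std::runtime_error("Global RNGs of the virtual processes are out of sync.");
}
```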
Process 0 has at this point returned from the `GetStatus` call, goes into `Simulate`, and again tries to update the delay extrema. In doing so, it also calls `MPI_Allgather`. This call is what process 1 is waiting for, but it does not deliver the random number from process 0. Rather, process 1 receives the minimum delay, or possibly even a garbage value, which of course does not equal its own random number. Process 1 therefore throws the error.
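The whole failure sequence can be reproduced outside NEST with nothing more than mismatched collectives. The following self-contained program mimics the call pattern of the two ranks; it is an illustration of the mechanism, not NEST code:

```cpp
#include <mpi.h>
#include <cstdio>

// MPI matches collectives on a communicator purely by call order. Rank 0
// issues one extra MPI_Allgather (its lone "GetStatus"), so rank 1's
// RNG-check gather pairs with rank 0's delay gather and receives a delay.
int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  double delay = 0.1;      // stand-in for a local delay extremum
  double rng_draw = 42.0;  // stand-in for a GRNG draw, identical on all ranks
  double recv[2];          // sized for the two-rank case: run with mpiexec -n 2

  if (rank == 0)
  {
    // "0 GetStatus": delay-extrema update on rank 0 only (first deadlock,
    // resolved when rank 1 reaches its extrema update inside Simulate).
    MPI_Allgather(&delay, 1, MPI_DOUBLE, recv, 1, MPI_DOUBLE, MPI_COMM_WORLD);
  }
  // "Simulate": delay-extrema update on every rank...
  MPI_Allgather(&delay, 1, MPI_DOUBLE, recv, 1, MPI_DOUBLE, MPI_COMM_WORLD);
  // ...followed by the GRNG synchrony check on every rank. On rank 1 this
  // gather pairs with rank 0's extrema update and so receives a delay value.
  MPI_Allgather(&rng_draw, 1, MPI_DOUBLE, recv, 1, MPI_DOUBLE, MPI_COMM_WORLD);

  if (recv[1 - rank] != rng_draw)
  {
    std::printf("rank %d: expected %f, got %f -- GRNG check fails\n",
                rank, rng_draw, recv[1 - rank]);
    MPI_Abort(MPI_COMM_WORLD, 1);  // mirrors NEST raising the error on rank 1
  }

  MPI_Finalize();
  return 0;
}
```

Run with `mpiexec -n 2`: rank 1 reports receiving the delay value where it expected the shared RNG draw, while rank 0 is still blocked in its own synchrony check until `MPI_Abort` tears the job down.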
@hakonsbm Thank you for the detailed analysis!
I do not see any technical way in which we could guard against this type of problem: MPI by definition requires that all ranks stay in step, so that MPI communication operations on different ranks match each other. It is also impossible, in all generality, to detect that a user is performing a specific operation on only a single rank and to prevent that directly. The check for GRNG synchrony is a relatively easy-to-implement and sensitive test for users trying to work around NEST's built-in parallelization.
The documentation of `Rank` explicitly states that "[i]t is highly discouraged to use this function to write rank-dependent code in a simulation script as this can break NEST in funny ways of which dead-locks are the nicest."
I am therefore closing this issue as wontfix.
Nilton Kamiji first reported this on the NEST User mailing list on 24 March 2018. To reproduce, run the script with at least two MPI processes; it fails with an error about unsynchronized global RNGs. It passes when run with a single MPI process.
If both ranks individually run `0 GetStatus`, the script passes (see the sketch below). The problem is most likely that `0 GetStatus` somewhere reads a number from the GRNG, although I cannot see any reason why it should do so.

Note: One should never perform operations on only a single rank or a subset of ranks, since this will quite likely upset NEST's parallelization logic. But a pure read operation such as `GetStatus` should be safe.
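In terms of the hypothetical reproducer above, the passing case corresponds to dropping the rank guard, so that every rank issues the extra gather and the collectives pair up one-to-one again:

```cpp
// Passing variant: no `if (rank == 0)` guard, all ranks issue all three
// gathers, so each collective pairs with its counterpart on the other rank.
MPI_Allgather(&delay, 1, MPI_DOUBLE, recv, 1, MPI_DOUBLE, MPI_COMM_WORLD);    // "0 GetStatus" on all ranks
MPI_Allgather(&delay, 1, MPI_DOUBLE, recv, 1, MPI_DOUBLE, MPI_COMM_WORLD);    // Simulate: extrema update
MPI_Allgather(&rng_draw, 1, MPI_DOUBLE, recv, 1, MPI_DOUBLE, MPI_COMM_WORLD); // Simulate: GRNG check
```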