Closed: martinruefenacht closed this issue 4 years ago
After investigating, the bug appears to originate in the parameters being passed into the MPI_Init_thread() call.
I chose a random failing test to try to resolve manually. The test was failing on OpenMPI 3.1.2. The code of the failing test was:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argument_count, char **argument_list)
{
    int* argument_count_arg_NULL = NULL;
    char*** argument_list_arg_NULL = NULL;
    int required_arg_MPI_THREAD_SINGLE = MPI_THREAD_SINGLE;
    int* provided_out; // uninitialized pointer handed to MPI_Init_thread as the out parameter

    // start point for start-end test
    int return_MPI_Init_thread = MPI_Init_thread(argument_count_arg_NULL, argument_list_arg_NULL, required_arg_MPI_THREAD_SINGLE, provided_out);
    printf("return_MPI_Init_thread %i\n", return_MPI_Init_thread);
    printf("argument_count_arg_NULL %p\n", argument_count_arg_NULL);
    printf("argument_list_arg_NULL %p\n", argument_list_arg_NULL);
    printf("provided_out %p\n", provided_out);
    if(return_MPI_Init_thread != MPI_SUCCESS)
    {
        exit(return_MPI_Init_thread);
    }
    // end point for start-end test

    int return_MPI_Finalize = MPI_Finalize();
    printf("return_MPI_Finalize %i\n", return_MPI_Finalize);
    if(return_MPI_Finalize != MPI_SUCCESS)
    {
        exit(return_MPI_Finalize);
    }
    return 0;
}
The program compiled, but manually running mpiexec on the executable produced the following error:
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node whitwell exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
By manually changing int* provided_out to int provided_out, and the function call to int return_MPI_Init_thread = MPI_Init_thread(argument_count_arg_NULL, argument_list_arg_NULL, required_arg_MPI_THREAD_SINGLE, &provided_out);, I got no compilation errors and no runtime errors.
I believe this can be fixed by changing the parameter for the MPI_Init_thread function in our MPI database; however, before doing that I wanted to validate my fix on a different system. Upon testing this on my MacBook Pro running MPICH 3.2.1, the original error never occurred: the original generated test passed without issue. Just for the sake of testing, I went ahead and manually applied the same fix to my MacBook's test, and it still compiled and ran.
The specified type for the provided_out parameter of MPI_Init_thread is causing issues on some MPI implementations. It can be fixed by removing the pointer from the variable declaration and passing the variable's address in the call, as shown in the before/after snippets below. Either way, this is a discrepancy between MPI implementations. If OpenMPI is correct, then we have a bug and should fix it by modifying our database of MPI functions. If MPICH is correct, then we may have found a bug in OpenMPI and might not have to change anything on our end. (Strictly speaking, passing an uninitialized pointer is undefined behavior in C, so either outcome is permitted by the standard.)
Before:

    int* provided_out;
    // start point for start-end test
    int return_MPI_Init_thread = MPI_Init_thread(argument_count_arg_NULL, argument_list_arg_NULL, required_arg_MPI_THREAD_SINGLE, provided_out);

After:

    int provided_out;
    // start point for start-end test
    int return_MPI_Init_thread = MPI_Init_thread(argument_count_arg_NULL, argument_list_arg_NULL, required_arg_MPI_THREAD_SINGLE, &provided_out);
Upon further investigation, it appears that MPI generally expects the address of a caller-owned variable (rather than a separately declared, possibly uninitialized pointer) for parameters that have an out direction. I no longer think this would be a database change, but rather a change in the sampler.
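For illustration only (this snippet is mine, not taken from the test generator; it assumes a standard MPI installation), the same address-of pattern appears throughout the MPI API for out parameters:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int provided;
        // The out parameter receives the address of a caller-owned variable.
        MPI_Init_thread(NULL, NULL, MPI_THREAD_SINGLE, &provided);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank); // same pattern: &rank, never an uninitialized int*

        printf("provided %d, rank %d\n", provided, rank);
        MPI_Finalize();
        return 0;
    }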
Yes, I think the actual database is correct, but the way we generate a variable is not. We would need to create an int, then take a pointer to it and pass that pointer... or use the shorthand of & directly in the argument list. For explicitness I would prefer the two-variable approach (one variable being a pointer), roughly as sketched below. (Explicit is better than implicit.)
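A minimal sketch of what the sampler could emit under the two-variable approach (the provided_out_arg name is illustrative, not the generator's actual naming scheme):

    int provided_out;                       // storage for the out value
    int* provided_out_arg = &provided_out;  // explicit pointer to that storage
    // start point for start-end test
    int return_MPI_Init_thread = MPI_Init_thread(argument_count_arg_NULL, argument_list_arg_NULL, required_arg_MPI_THREAD_SINGLE, provided_out_arg);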
Does this solve the 136 error we used to get?
Yes, this did solve the error on my machine.
OK, after testing manually, your proposed explicit fix does seem to resolve the problem. I'll start implementing a fix.
The fix is ready; however, it might not actually be our bug.
The error code returned is "139" (exit status 128 + signal 11, i.e. the same segmentation fault as above).
The MPI_Init_MPI_Finalize tests are succeeding correctly (sometimes, see #23).
The MPI_Init_thread_MPI_Finalize tests, for the most part, are not.