pnnl / ExaGO

High-performance power grid optimization for stochastic, security-constrained, and multi-period ACOPF problems.

cray-mpich complains about MPI_LOR on MPI_CXX_BOOL #100

Closed nkoukpaizan closed 11 months ago

nkoukpaizan commented 12 months ago

Summary

I was trying to build the current develop branch on Frontier, and saw several functionality tests failing (e.g., FUNCTIONALITY_TEST_PFLOW_TESTSUITE_1_proc). A backtrace shows:


    [ExaGO] Creating PFlow Functionality Test
    [ExaGO] Creating PFlow Functionality Test
    terminate called after throwing an instance of 'ExaGOError'
      what():  Error in is_true_somewhere for MPI_Allreduce

    Program received signal SIGABRT, Aborted.
    0x00007fffe8371cbb in raise () from /lib64/libc.so.6
    (gdb) backtrace
    #0  0x00007fffe8371cbb in raise () from /lib64/libc.so.6
    #1  0x00007fffe8373355 in abort () from /lib64/libc.so.6
    #2  0x00007fffe8d2b5b9 in __gnu_cxx::__verbose_terminate_handler () at ../../../../cpe-gcc-12.2.0-202211182106.97b1815c41a72/libstdc++-v3/libsupc++/vterminate.cc:95
    #3  0x00007fffe8d36bea in __cxxabiv1::__terminate (handler=) at ../../../../cpe-gcc-12.2.0-202211182106.97b1815c41a72/libstdc++-v3/libsupc++/eh_terminate.cc:48
    #4  0x00007fffe8d36c55 in std::terminate () at ../../../../cpe-gcc-12.2.0-202211182106.97b1815c41a72/libstdc++-v3/libsupc++/eh_terminate.cc:58
    #5  0x00007fffe8d36ea7 in __cxxabiv1::__cxa_throw (obj=, tinfo=0x216300 , dest=0x2999d0 <ExaGOError::~ExaGOError()>)
        at ../../../../cpe-gcc-12.2.0-202211182106.97b1815c41a72/libstdc++-v3/libsupc++/eh_throw.cc:98
    #6  0x0000000000299054 in is_true_somewhere (flag=false, comm=1140850688) at /lustre/orion/scratch/nkouk/csc359/ExaGO/tests/functionality/pflow/../toml_utils.h:75
    #7  0x000000000029a993 in PflowFunctionalityTests::ensure_options_are_consistent (this=0x7fffffff5cc0, testcase=..., presets=...)
        at /lustre/orion/scratch/nkouk/csc359/ExaGO/tests/functionality/pflow/selfcheck.cpp:52
    #8  0x0000000000299c6c in FunctionalityTestContext::run_all_test_cases (this=0x7fffffff5cc0)
        at /lustre/orion/scratch/nkouk/csc359/ExaGO/tests/functionality/pflow/../toml_utils.h:133
    #9  0x0000000000299371 in main (argc=2, argv=0x7fffffff61e8) at /lustre/orion/scratch/nkouk/csc359/ExaGO/tests/functionality/pflow/selfcheck.cpp:196

It points to the `MPI_Allreduce` [here](https://github.com/pnnl/ExaGO/blob/839af0dc95d4cf8007922fa26a2641d2c69ffbae/tests/functionality/toml_utils.h#L73). I created a small reproducer to perform an `MPI_Allreduce` `MPI_LOR` on `MPI_CXX_BOOL`, and got a more verbose output:

> Fatal error in PMPI_Allreduce: Invalid MPI_Op, error stack:
> PMPI_Allreduce(497).....: MPI_Allreduce(sbuf=0x7fffffff59e3, rbuf=0x7fffffff59e2, count=1, datatype=dtype=0x4c000133, op=MPI_LOR, comm=MPI_COMM_WORLD) failed
> MPIR_LOR_check_dtype(92): MPI_Op MPI_LOR operation not defined for this datatype



The same operation works with `MPI_C_BOOL`. Unless I am missing something, this looks like a bug in the MPI implementation to me: both `MPI_CXX_BOOL` and `MPI_LOR` are in the MPI 3.1 standard. If the MPI implementations on the other platforms work as I expect, a workaround would be to use `MPI_C_BOOL`.

CC: @pelesh @rothpc @cameronrutherford @bjpalmer

bjpalmer commented 11 months ago

You can implement this with integers and MPI_SUM. Not as elegant, but it should work everywhere.
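
For readers following along, here is a minimal sketch of what the integer approach could look like. It is not the fix that was merged into ExaGO. The names `flag_to_int`, `any_from_sum`, `simulate_is_true_somewhere`, `is_true_somewhere_int`, and the `SKETCH_HAVE_MPI` macro are all hypothetical and chosen only for illustration. Each rank contributes 1 if its flag is set and 0 otherwise, the contributions are summed with `MPI_SUM` on `MPI_INT`, and a positive total means the flag was true on at least one rank. The MPI wrapper is compiled only when `mpi.h` is found, so the pure logic can be checked without an MPI installation.

```cpp
#include <stdexcept>
#include <vector>

// Only pull in MPI when its header is available, so the pure logic below
// still compiles on a machine without an MPI installation.
#if defined(__has_include)
#if __has_include(<mpi.h>)
#include <mpi.h>
#define SKETCH_HAVE_MPI 1  // hypothetical macro, used only in this sketch
#endif
#endif

// Encode a rank's local flag as an integer contribution to the sum.
inline int flag_to_int(bool flag) { return flag ? 1 : 0; }

// A positive sum means at least one rank contributed a true flag.
inline bool any_from_sum(int total) { return total > 0; }

// Serial stand-in for the reduction: each vector entry plays the role of
// one rank's flag. Useful for checking the logic without launching MPI.
inline bool simulate_is_true_somewhere(const std::vector<bool>& flags) {
  int total = 0;
  for (bool f : flags) {
    total += flag_to_int(f);
  }
  return any_from_sum(total);
}

#ifdef SKETCH_HAVE_MPI
// Hypothetical replacement for a logical-OR reduction on MPI_CXX_BOOL:
// reduce integers with MPI_SUM instead, then test whether the sum is positive.
inline bool is_true_somewhere_int(bool flag, MPI_Comm comm) {
  int local = flag_to_int(flag);
  int total = 0;
  const int err = MPI_Allreduce(&local, &total, 1, MPI_INT, MPI_SUM, comm);
  if (err != MPI_SUCCESS) {
    throw std::runtime_error("MPI_Allreduce failed in is_true_somewhere_int");
  }
  return any_from_sum(total);
}
#endif
```

With MPI available, a call such as `is_true_somewhere_int(local_flag, MPI_COMM_WORLD)` would stand in for the `MPI_LOR` reduction. The sum stays within `int` range as long as the number of ranks does, which holds for any realistic communicator size.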

bjpalmer commented 11 months ago

I implemented a fix using integers and created an MR for this.

nkoukpaizan commented 11 months ago

@bjpalmer Thanks for the fix. Your PR has been merged and I filed a bug report for the underlying cray-mpich issue.