zhaohaifei closed this issue 1 year ago.
Stream synchronization is hard, partly because of sometimes poor documentation and partly because of inconsistencies between the different implementations. The MPI-direct methods are sometimes blocking and sometimes not.
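To make the ordering problem concrete: a GPU stream is an in-order queue whose work runs asynchronously, so a buffer packed on a stream is only safe to read from another API (such as an MPI call) once that work has finished. The sketch below is a CPU-only analogy, not HIP, MPI or heFFTe code. The hypothetical `toy_stream` class stands in for a HIP stream, its `synchronize()` plays the role of `hipStreamSynchronize`, and copying into `sent` stands in for an MPI send.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical CPU-only stand-in for a GPU stream: tasks run asynchronously
// on one worker thread, one at a time, in the order they were enqueued.
class toy_stream {
public:
    toy_stream() : worker_([this] { run(); }) {}
    ~toy_stream() {
        {
            std::lock_guard<std::mutex> lock(m_);
            stop_ = true;
        }
        cv_.notify_all();
        worker_.join(); // the worker drains any remaining tasks before exiting
    }
    // Enqueue work and return immediately (like a kernel launch).
    void enqueue(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m_);
            tasks_.push_back(std::move(task));
        }
        cv_.notify_all();
    }
    // Block until every task enqueued so far has finished.
    void synchronize() {
        std::unique_lock<std::mutex> lock(m_);
        done_cv_.wait(lock, [this] { return tasks_.empty() && !busy_; });
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(m_);
        while (true) {
            cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
            if (tasks_.empty()) return; // stop requested and nothing left to do
            std::function<void()> task = std::move(tasks_.front());
            tasks_.pop_front();
            busy_ = true;
            lock.unlock();
            task(); // run the task without holding the lock
            lock.lock();
            busy_ = false;
            done_cv_.notify_all();
        }
    }
    std::mutex m_;
    std::condition_variable cv_, done_cv_;
    std::deque<std::function<void()>> tasks_;
    bool busy_ = false;
    bool stop_ = false;
    std::thread worker_; // declared last so it starts after the members above exist
};

// Usage: enqueue a "pack" step, synchronize, then hand the buffer to communication.
std::vector<double> pack_then_send(std::size_t n) {
    std::vector<double> send_buffer(n, 0.0);
    toy_stream stream;
    // Returns immediately; the packing runs later on the worker thread.
    stream.enqueue([&send_buffer, n] {
        for (std::size_t i = 0; i < n; ++i) send_buffer[i] = static_cast<double>(i);
    });
    // Without this call, the copy below would race with the packing task
    // and could read a partially filled buffer.
    stream.synchronize();
    std::vector<double> sent(send_buffer.begin(), send_buffer.end()); // stand-in for an MPI send
    return sent;
}
```

The same reasoning applies in the other direction: if a communication call is non-blocking, its buffer must not be reused or read until that call has completed.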
CMAKE_CXX_COMPILER to clang++ and CMake should figure out the hipcc setup. Thank you.
Hello, I built heFFTe 2.3.0 with the ROCm backend and ran the unit tests. The heffte_streams test reports the following error:
Start testing: Aug 16 15:16 CST
12/22 Testing: heffte_streams_np6
12/22 Test: heffte_streams_np6
Command: "/public/home/knight_wp/openmpi-5.0.0rc12/install/bin/mpiexec" "-n" "6" "/public/home/knight_wp/heffte-2.3.0/build/test/test_streams"
Directory: /public/home/knight_wp/heffte-2.3.0/build/test
"heffte_streams_np6" start time: Aug 16 15:16 CST
Output:
-------------------------------------------------------------------------------
ccomplex -np 6 test heffte::fft3d (stream) pass
zcomplex -np 6 test heffte::fft3d (stream) pass
float -np 6 test heffte::fft3d_r2c (stream) pass
double -np 6 test heffte::fft3d_r2c (stream) pass
error magnitude: 0.294887
error magnitude: 0.29489
error magnitude: 0.294889
error magnitude: 0.294887
terminate called after throwing an instance of 'std::runtime_error'
what(): mpi rank = 0 test -np 6 test heffte::fft3d (stream) in file: /public/home/knight_wp/heffte-2.3.0/test/test_fft3d.h line: 283
[a06r3n04:15025] Process received signal
[a06r3n04:15025] Signal: Aborted (6)
[a06r3n04:15025] Signal code: (-6)
[a06r3n04:15025] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2b84f305b5d0]
[a06r3n04:15025] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2b84fcd57207]
[a06r3n04:15025] [ 2] /lib64/libc.so.6(abort+0x148)[0x2b84fcd588f8]
[a06r3n04:15025] [ 3] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0x99203)[0x2b84fc9e0203]
[a06r3n04:15025] [ 4] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4c76)[0x2b84fc9ebc76]
[a06r3n04:15025] [ 5] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4ce1)[0x2b84fc9ebce1]
[a06r3n04:15025] [ 6] terminate called after throwing an instance of 'std::runtime_error'
/public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4f35)[0x2b84fc9ebf35]
[a06r3n04:15025] [ 7] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x42c167]
[a06r3n04:15025] [ 8] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40bee0]
[a06r3n04:15025] [ 9] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40bb30]
[a06r3n04:15025] [10] /lib64/libc.so.6(libc_start_main+0xf5)[0x2b84fcd433d5]
[a06r3n04:15025] [11] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40b0f8]
[a06r3n04:15025] End of error message
what(): mpi rank = 4 test -np 6 test heffte::fft3d (stream) in file: /public/home/knight_wp/heffte-2.3.0/test/test_fft3d.h line: 283
terminate called after throwing an instance of 'std::runtime_error'
[a06r3n04:15029] Process received signal
[a06r3n04:15029] Signal: Aborted (6)
[a06r3n04:15029] Signal code: (-6)
what(): mpi rank = 3 test -np 6 test heffte::fft3d (stream) in file: /public/home/knight_wp/heffte-2.3.0/test/test_fft3d.h line: 283
[a06r3n04:15028] Process received signal
[a06r3n04:15028] Signal: Aborted (6)
[a06r3n04:15028] Signal code: (-6)
[a06r3n04:15029] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2ab27ffc95d0]
[a06r3n04:15029] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2ab289cc5207]
[a06r3n04:15029] [ 2] [a06r3n04:15028] [ 0] /lib64/libc.so.6(abort+0x148)[0x2ab289cc68f8]
[a06r3n04:15029] [ 3] /lib64/libpthread.so.0(+0xf5d0)[0x2ae3196a55d0]
[a06r3n04:15028] [ 1] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0x99203)[0x2ab28994e203]
[a06r3n04:15029] [ 4] /lib64/libc.so.6(gsignal+0x37)[0x2ae3233a1207]
[a06r3n04:15028] [ 2] /lib64/libc.so.6(abort+0x148)[0x2ae3233a28f8]
[a06r3n04:15028] [ 3] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4c76)[0x2ab289959c76]
[a06r3n04:15029] [ 5] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4ce1)[0x2ab289959ce1]
[a06r3n04:15029] [ 6] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0x99203)[0x2ae32302a203]
[a06r3n04:15028] [ 4] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4f35)[0x2ab289959f35]
[a06r3n04:15029] [ 7] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x42c167]
[a06r3n04:15029] [ 8] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40bee0]
/public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4c76)[0x2ae323035c76]
[a06r3n04:15028] [ 5] terminate called after throwing an instance of 'std::runtime_error'
[a06r3n04:15029] [ 9] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40bb30]
[a06r3n04:15029] [10] /lib64/libc.so.6(libc_start_main+0xf5)[0x2ab289cb13d5]
[a06r3n04:15029] [11] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40b0f8]
[a06r3n04:15029] End of error message
/public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4ce1)[0x2ae323035ce1]
[a06r3n04:15028] [ 6] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4f35)[0x2ae323035f35]
[a06r3n04:15028] [ 7] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x42c167]
[a06r3n04:15028] [ 8] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40bee0]
[a06r3n04:15028] [ 9] what(): mpi rank = 1 test -np 6 test heffte::fft3d (stream) in file: /public/home/knight_wp/heffte-2.3.0/test/test_fft3d.h line: 283
/public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40bb30]
[a06r3n04:15028] [10] /lib64/libc.so.6(libc_start_main+0xf5)[0x2ae32338d3d5]
[a06r3n04:15028] [11] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40b0f8]
[a06r3n04:15028] End of error message
[a06r3n04:15026] Process received signal
[a06r3n04:15026] Signal: Aborted (6)
[a06r3n04:15026] Signal code: (-6)
[a06r3n04:15026] [ 0] terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
/lib64/libpthread.so.0(+0xf5d0)[0x2ad73a2775d0]
[a06r3n04:15026] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2ad743f73207]
[a06r3n04:15026] [ 2] what(): mpi rank = 2 test -np 6 test heffte::fft3d (stream) in file: /public/home/knight_wp/heffte-2.3.0/test/test_fft3d.h line: 283
/lib64/libc.so.6(abort+0x148)[0x2ad743f748f8]
[a06r3n04:15026] [ 3] what(): mpi rank = 5 test -np 6 test heffte::fft3d (stream) in file: /public/home/knight_wp/heffte-2.3.0/test/test_fft3d.h line: 283
/public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0x99203)[0x2ad743bfc203]
[a06r3n04:15026] [ 4] [a06r3n04:15027] Process received signal
[a06r3n04:15027] Signal: Aborted (6)
[a06r3n04:15027] Signal code: (-6)
/public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4c76)[0x2ad743c07c76]
[a06r3n04:15026] [ 5] [a06r3n04:15030] Process received signal
[a06r3n04:15030] Signal: Aborted (6)
[a06r3n04:15030] Signal code: (-6)
/public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4ce1)[0x2ad743c07ce1]
[a06r3n04:15026] [ 6] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4f35)[0x2ad743c07f35]
[a06r3n04:15026] [ 7] [a06r3n04:15027] [ 0] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x42c167]
[a06r3n04:15026] [ 8] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40bee0]
[a06r3n04:15026] [ 9] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40bb30]
[a06r3n04:15026] /lib64/libpthread.so.0(+0xf5d0)[0x2b9587dc05d0]
[a06r3n04:15027] [ 1] [10] /lib64/libc.so.6(libc_start_main+0xf5)[0x2ad743f5f3d5]
[a06r3n04:15026] [11] /lib64/libc.so.6(gsignal+0x37)[0x2b9591abc207]
[a06r3n04:15027] [ 2] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40b0f8]
[a06r3n04:15026] End of error message
[a06r3n04:15030] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2b61769615d0]
[a06r3n04:15030] [ 1] /lib64/libc.so.6(abort+0x148)[0x2b9591abd8f8]
[a06r3n04:15027] [ 3] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0x99203)[0x2b9591745203]
[a06r3n04:15027] [ 4] /lib64/libc.so.6(gsignal+0x37)[0x2b618065d207]
[a06r3n04:15030] [ 2] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4c76)[0x2b9591750c76]
[a06r3n04:15027] [ 5] /lib64/libc.so.6(abort+0x148)[0x2b618065e8f8]
[a06r3n04:15030] [ 3] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0x99203)[0x2b61802e6203]
[a06r3n04:15030] [ 4] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4ce1)[0x2b9591750ce1]
[a06r3n04:15027] [ 6] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4c76)[0x2b61802f1c76]
[a06r3n04:15030] [ 5] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4f35)[0x2b9591750f35]
[a06r3n04:15027] [ 7] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x42c167]
[a06r3n04:15027] [ 8] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40bee0]
/public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4ce1)[0x2b61802f1ce1]
[a06r3n04:15030] [ 6] [a06r3n04:15027] [ 9] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40bb30]
[a06r3n04:15027] [10] /lib64/libc.so.6(libc_start_main+0xf5)[0x2b9591aa83d5]
[a06r3n04:15027] [11] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40b0f8]
[a06r3n04:15027] End of error message
/public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4f35)[0x2b61802f1f35]
[a06r3n04:15030] [ 7] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x42c167]
[a06r3n04:15030] [ 8] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40bee0]
[a06r3n04:15030] [ 9] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40bb30]
[a06r3n04:15030] [10] /lib64/libc.so.6(libc_start_main+0xf5)[0x2b61806493d5]
[a06r3n04:15030] [11] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40b0f8]
[a06r3n04:15030] End of error message
prterun noticed that process rank 0 with PID 0 on node a06r3n04 exited on signal 6 (Aborted).
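Every rank prints an error magnitude of roughly 0.29 and then throws std::runtime_error from test_fft3d.h line 283, which suggests the failure comes from a numerical check in the test (the stream-based result differs from the reference) rather than from a crash inside the library. A check with that shape can be sketched as below. This is a hypothetical reconstruction, not the actual code in test_fft3d.h; the max-absolute-difference metric, the tolerance parameter, and the helper names `max_error` and `check_result` are assumptions.

```cpp
#include <algorithm>
#include <complex>
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not heFFTe's test code): the largest absolute
// difference between a computed result and a reference.
template <typename T>
double max_error(const std::vector<std::complex<T>>& result,
                 const std::vector<std::complex<T>>& reference) {
    if (result.size() != reference.size())
        throw std::invalid_argument("size mismatch");
    double err = 0.0;
    for (std::size_t i = 0; i < result.size(); ++i)
        err = std::max(err, static_cast<double>(std::abs(result[i] - reference[i])));
    return err;
}

// Hypothetical check: print the error and throw if it exceeds the tolerance,
// with a message shaped like the ones in the log above.
template <typename T>
void check_result(const std::vector<std::complex<T>>& result,
                  const std::vector<std::complex<T>>& reference,
                  double tolerance, int rank, const std::string& test_name,
                  const char* file, int line) {
    double err = max_error(result, reference);
    if (err > tolerance) {
        std::printf("error magnitude: %g\n", err);
        throw std::runtime_error("mpi rank = " + std::to_string(rank) + " test " + test_name +
                                 " in file: " + file + " line: " + std::to_string(line));
    }
}
```

An uncaught exception like this one ends in std::terminate, which is why each rank then reports "Signal: Aborted (6)" and prints a backtrace.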