icl-utk-edu / heffte

Multi stream calculation error #34

Closed by zhaohaifei 1 year ago

zhaohaifei commented 1 year ago

Hello, I built heFFTe 2.3.0 with the ROCm backend and ran the unit tests. The heffte_streams test fails with the following error:

Start testing: Aug 16 15:16 CST

12/22 Testing: heffte_streams_np6
12/22 Test: heffte_streams_np6
Command: "/public/home/knight_wp/openmpi-5.0.0rc12/install/bin/mpiexec" "-n" "6" "/public/home/knight_wp/heffte-2.3.0/build/test/test_streams"
Directory: /public/home/knight_wp/heffte-2.3.0/build/test
"heffte_streams_np6" start time: Aug 16 15:16 CST
Output:


                          heffte::fft streams

------------------------------------------------------------------------------- ccomplex -np 6 test heffte::fft3d (stream) pass zcomplex -np 6 test heffte::fft3d (stream) pass float -np 6 test heffte::fft3d_r2c (stream) pass double -np 6 test heffte::fft3d_r2c (stream) pass error magnitude: 0.294887 error magnitude: 0.29489 error magnitude: 0.294889 error magnitude: 0.294887 terminate called after throwing an instance of 'std::runtime_error' what(): mpi rank = 0 test -np 6 test heffte::fft3d (stream) in file: /public/home/knight_wp/heffte-2.3.0/test/test_fft3d.h line: 283 [a06r3n04:15025] Process received signal [a06r3n04:15025] Signal: Aborted (6) [a06r3n04:15025] Signal code: (-6) [a06r3n04:15025] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2b84f305b5d0] [a06r3n04:15025] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2b84fcd57207] [a06r3n04:15025] [ 2] /lib64/libc.so.6(abort+0x148)[0x2b84fcd588f8] [a06r3n04:15025] [ 3] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0x99203)[0x2b84fc9e0203] [a06r3n04:15025] [ 4] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4c76)[0x2b84fc9ebc76] [a06r3n04:15025] [ 5] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4ce1)[0x2b84fc9ebce1] [a06r3n04:15025] [ 6] terminate called after throwing an instance of 'std::runtime_error' /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4f35)[0x2b84fc9ebf35] [a06r3n04:15025] [ 7] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x42c167] [a06r3n04:15025] [ 8] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40bee0] [a06r3n04:15025] [ 9] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40bb30] [a06r3n04:15025] [10] /lib64/libc.so.6(libc_start_main+0xf5)[0x2b84fcd433d5] [a06r3n04:15025] [11] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40b0f8] [a06r3n04:15025] End of error message what(): mpi rank = 4 test -np 6 test heffte::fft3d (stream) in file: 
/public/home/knight_wp/heffte-2.3.0/test/test_fft3d.h line: 283 terminate called after throwing an instance of 'std::runtime_error' [a06r3n04:15029] Process received signal [a06r3n04:15029] Signal: Aborted (6) [a06r3n04:15029] Signal code: (-6) what(): mpi rank = 3 test -np 6 test heffte::fft3d (stream) in file: /public/home/knight_wp/heffte-2.3.0/test/test_fft3d.h line: 283 [a06r3n04:15028] Process received signal [a06r3n04:15028] Signal: Aborted (6) [a06r3n04:15028] Signal code: (-6) [a06r3n04:15029] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2ab27ffc95d0] [a06r3n04:15029] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2ab289cc5207] [a06r3n04:15029] [ 2] [a06r3n04:15028] [ 0] /lib64/libc.so.6(abort+0x148)[0x2ab289cc68f8] [a06r3n04:15029] [ 3] /lib64/libpthread.so.0(+0xf5d0)[0x2ae3196a55d0] [a06r3n04:15028] [ 1] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0x99203)[0x2ab28994e203] [a06r3n04:15029] [ 4] /lib64/libc.so.6(gsignal+0x37)[0x2ae3233a1207] [a06r3n04:15028] [ 2] /lib64/libc.so.6(abort+0x148)[0x2ae3233a28f8] [a06r3n04:15028] [ 3] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4c76)[0x2ab289959c76] [a06r3n04:15029] [ 5] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4ce1)[0x2ab289959ce1] [a06r3n04:15029] [ 6] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0x99203)[0x2ae32302a203] [a06r3n04:15028] [ 4] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4f35)[0x2ab289959f35] [a06r3n04:15029] [ 7] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x42c167] [a06r3n04:15029] [ 8] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40bee0] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4c76)[0x2ae323035c76] [a06r3n04:15028] [ 5] terminate called after throwing an instance of 'std::runtime_error' [a06r3n04:15029] [ 9] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40bb30] [a06r3n04:15029] [10] 
/lib64/libc.so.6(libc_start_main+0xf5)[0x2ab289cb13d5] [a06r3n04:15029] [11] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40b0f8] [a06r3n04:15029] End of error message /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4ce1)[0x2ae323035ce1] [a06r3n04:15028] [ 6] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4f35)[0x2ae323035f35] [a06r3n04:15028] [ 7] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x42c167] [a06r3n04:15028] [ 8] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40bee0] [a06r3n04:15028] [ 9] what(): mpi rank = 1 test -np 6 test heffte::fft3d (stream) in file: /public/home/knight_wp/heffte-2.3.0/test/test_fft3d.h line: 283 /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40bb30] [a06r3n04:15028] [10] /lib64/libc.so.6(libc_start_main+0xf5)[0x2ae32338d3d5] [a06r3n04:15028] [11] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40b0f8] [a06r3n04:15028] End of error message [a06r3n04:15026] Process received signal [a06r3n04:15026] Signal: Aborted (6) [a06r3n04:15026] Signal code: (-6) [a06r3n04:15026] [ 0] terminate called after throwing an instance of 'std::runtime_error' terminate called after throwing an instance of 'std::runtime_error' /lib64/libpthread.so.0(+0xf5d0)[0x2ad73a2775d0] [a06r3n04:15026] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2ad743f73207] [a06r3n04:15026] [ 2] what(): mpi rank = 2 test -np 6 test heffte::fft3d (stream) in file: /public/home/knight_wp/heffte-2.3.0/test/test_fft3d.h line: 283 /lib64/libc.so.6(abort+0x148)[0x2ad743f748f8] [a06r3n04:15026] [ 3] what(): mpi rank = 5 test -np 6 test heffte::fft3d (stream) in file: /public/home/knight_wp/heffte-2.3.0/test/test_fft3d.h line: 283 /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0x99203)[0x2ad743bfc203] [a06r3n04:15026] [ 4] [a06r3n04:15027] Process received signal [a06r3n04:15027] Signal: Aborted (6) [a06r3n04:15027] Signal code: (-6) 
/public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4c76)[0x2ad743c07c76] [a06r3n04:15026] [ 5] [a06r3n04:15030] Process received signal [a06r3n04:15030] Signal: Aborted (6) [a06r3n04:15030] Signal code: (-6) /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4ce1)[0x2ad743c07ce1] [a06r3n04:15026] [ 6] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4f35)[0x2ad743c07f35] [a06r3n04:15026] [ 7] [a06r3n04:15027] [ 0] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x42c167] [a06r3n04:15026] [ 8] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40bee0] [a06r3n04:15026] [ 9] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40bb30] [a06r3n04:15026] /lib64/libpthread.so.0(+0xf5d0)[0x2b9587dc05d0] [a06r3n04:15027] [ 1] [10] /lib64/libc.so.6(libc_start_main+0xf5)[0x2ad743f5f3d5] [a06r3n04:15026] [11] /lib64/libc.so.6(gsignal+0x37)[0x2b9591abc207] [a06r3n04:15027] [ 2] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40b0f8] [a06r3n04:15026] End of error message [a06r3n04:15030] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2b61769615d0] [a06r3n04:15030] [ 1] /lib64/libc.so.6(abort+0x148)[0x2b9591abd8f8] [a06r3n04:15027] [ 3] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0x99203)[0x2b9591745203] [a06r3n04:15027] [ 4] /lib64/libc.so.6(gsignal+0x37)[0x2b618065d207] [a06r3n04:15030] [ 2] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4c76)[0x2b9591750c76] [a06r3n04:15027] [ 5] /lib64/libc.so.6(abort+0x148)[0x2b618065e8f8] [a06r3n04:15030] [ 3] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0x99203)[0x2b61802e6203] [a06r3n04:15030] [ 4] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4ce1)[0x2b9591750ce1] [a06r3n04:15027] [ 6] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4c76)[0x2b61802f1c76] [a06r3n04:15030] [ 5] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4f35)[0x2b9591750f35] 
[a06r3n04:15027] [ 7] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x42c167] [a06r3n04:15027] [ 8] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40bee0] /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4ce1)[0x2b61802f1ce1] [a06r3n04:15030] [ 6] [a06r3n04:15027] [ 9] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40bb30] [a06r3n04:15027] [10] /lib64/libc.so.6(libc_start_main+0xf5)[0x2b9591aa83d5] [a06r3n04:15027] [11] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40b0f8] [a06r3n04:15027] End of error message /public/software/compiler/gnu/gcc-9.3.0/lib64/libstdc++.so.6(+0xa4f35)[0x2b61802f1f35] [a06r3n04:15030] [ 7] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x42c167] [a06r3n04:15030] [ 8] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40bee0] [a06r3n04:15030] [ 9] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40bb30] [a06r3n04:15030] [10] /lib64/libc.so.6(libc_start_main+0xf5)[0x2b61806493d5] [a06r3n04:15030] [11] /public/home/knight_wp/heffte-2.3.0/build/test/test_streams[0x40b0f8] [a06r3n04:15030] End of error message

prterun noticed that process rank 0 with PID 0 on node a06r3n04 exited on signal 6 (Aborted).

Test time = 7.80 sec
----------------------------------------------------------
Test Failed.
"heffte_streams_np6" end time: Aug 16 15:16 CST
"heffte_streams_np6" time elapsed: 00:00:07
----------------------------------------------------------
End testing: Aug 16 15:16 CST

When I changed the multi-stream computation to run on the default stream instead, the unit test passed. Is there perhaps a data dependency between the streams that causes the operations to run in the wrong order?

For reference, this is the cmake command I used to build heFFTe:

cmake .. -DCMAKE_BUILD_TYPE=Release \
    -DBUILD_SHARED_LIBS=ON \
    -DCMAKE_INSTALL_PREFIX=/public/home/knight_wp/heffte-2.3.0/install/ \
    -DHeffte_ENABLE_AVX=ON \
    -DHeffte_ENABLE_ROCM=ON \
    -DCMAKE_CXX_COMPILER=hipcc \
    -DHeffte_ROCM_ROOT=/public/software/compiler/rocm/dtk-23.04 \

mkstoyanov commented 1 year ago

Stream synchronization is hard to get right, because the documentation is sometimes poor and the different implementations are inconsistent. The MPI-direct methods are sometimes blocking and sometimes not.
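
To illustrate the kind of ordering hazard described here, the following is a toy sketch in plain C++, not heFFTe or HIP code. `ToyStream`, `toy_exchange`, and `compare_orderings` are hypothetical names invented for this example. The toy stream defers all queued work until `synchronize()` is called, so the stale read is deterministic; on a real GPU the unsynchronized case would be a race with unpredictable results. The point is only that a communication step which reads a buffer without waiting for the stream can see incomplete data.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <utility>
#include <vector>

// Toy stand-in for a GPU stream (hypothetical, not a HIP or heFFTe type):
// enqueued work is only guaranteed to have run after synchronize().
class ToyStream {
public:
    void enqueue(std::function<void()> work) { pending_.push_back(std::move(work)); }

    // Run all queued work in order, then clear the queue.
    void synchronize() {
        for (auto& work : pending_) work();
        pending_.clear();
    }

private:
    std::vector<std::function<void()>> pending_;
};

// Stand-in for a communication call that reads the send buffer immediately,
// without knowing anything about the stream that fills it.
double toy_exchange(const std::vector<double>& send_buffer) {
    assert(!send_buffer.empty());
    return std::accumulate(send_buffer.begin(), send_buffer.end(), 0.0);
}

// Returns {result without synchronization, result with synchronization}.
std::pair<double, double> compare_orderings() {
    std::vector<double> buffer(8, 0.0);
    ToyStream stream;

    // Queue the "computation" that fills the buffer.
    stream.enqueue([&buffer]() {
        for (double& x : buffer) x = 1.0;
    });

    // Communication issued before the stream is synchronized: the queued
    // work has not run yet, so the buffer still holds its initial zeros.
    double unsynchronized = toy_exchange(buffer);

    // Wait for the stream, then communicate: the buffer is now complete.
    stream.synchronize();
    double synchronized = toy_exchange(buffer);

    return {unsynchronized, synchronized};
}
```

Calling `compare_orderings()` returns a stale sum for the first element and the correct sum for the second. In a HIP build the analogous wait would be a call such as `hipStreamSynchronize` before handing the buffer to MPI; whether and where that is needed in this particular test is an assumption, not something confirmed in this thread.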

zhaohaifei commented 1 year ago

Thank you.