Closed mkstoyanov closed 1 year ago
I'm still working out some bugs with our github runner container. Let me fix this compiler issue and I'll rerun the checks.
@G-Ragghianti another quick question.
Which version of CUDA was used in the test that was failing?
For CUDA 12 we have this: https://docs.nvidia.com/cuda/cufft/#free-memory-requirement
The first program call to any cuFFT function causes the initialization of the cuFFT kernels. This can fail if there is not enough free memory on the GPU. It is advisable to initialize cufft first (e.g. by creating a plan) and then allocating memory.
Calling cuFFT within an MPI environment causes every MPI rank to initialize cuFFT at the same time, and the combined initialization runs out of memory. The hack makes cuFFT calls without MPI, but on each MPI rank in sequence. That way every rank initializes cuFFT separately, with the ability to use all available GPU RAM.
Note that the actual test, outside of the cuFFT overhead, will use only a few MB of RAM.
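To make the "one rank at a time" idea concrete, here is a minimal sketch of the sequencing logic only. It is not the actual heFFTe code: `warmup_in_sequence` is a hypothetical helper, and the use of a barrier between turns is my assumption about how the ranks could be serialized. The cuFFT and MPI calls are passed in as callables so the ordering can be read (and tested) on its own.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper, not part of heFFTe.
// Lets exactly one rank run `init` per turn, so that the ranks sharing a GPU
// never initialize cuFFT simultaneously.
//   rank, num_ranks : assumed to come from MPI_Comm_rank / MPI_Comm_size
//   init            : assumed to create and destroy a small cuFFT plan
//   barrier         : assumed to wrap MPI_Barrier on the same communicator
inline void warmup_in_sequence(int rank, int num_ranks,
                               const std::function<void()> &init,
                               const std::function<void()> &barrier) {
    assert(rank >= 0 && rank < num_ranks);
    for (int turn = 0; turn < num_ranks; ++turn) {
        if (turn == rank) {
            init();  // only this rank touches cuFFT during its turn
        }
        barrier();   // all ranks wait here until the current turn is finished
    }
}
```

In a real MPI program, `init` could be something like a lambda that calls `cufftPlan1d` on a tiny size followed by `cufftDestroy`, and `barrier` a lambda that calls `MPI_Barrier(MPI_COMM_WORLD)`. Each rank then pays the cuFFT initialization cost while the other ranks are idle, which matches the intent described above.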
The GitHub runner uses CUDA 11.4.
The failed spack-gpu_nvidia job executed on a DXG2 with A100s (82GB GPU RAM) and failed with an OOM error:
24/24 Test: heffte_longlong_np4
Command: "/tmp/heffte/spack/opt/spack/linux-rocky8-zen2/gcc-9.5.0/openmpi-4.1.5-thhcvl5ee66gn6bjzr4vbh4eyqkdv4ph/bin/mpiexec"
"-n" "4" "/tmp/heffte/heffte/spack-build-jyo3mr5/test/test_longlong"
Directory: /tmp/heffte/heffte/spack-build-jyo3mr5/test
"heffte_longlong_np4" start time: Mar 16 15:40 UTC
Output:
----------------------------------------------------------
--------------------------------------------------------------------------------
heffte::fft class
--------------------------------------------------------------------------------
float -np 4 test int/long long<stock> pass
double -np 4 test int/long long<stock> pass
float -np 4 test int/long long<stock> pass
double -np 4 test int/long long<stock> pass
float -np 4 test int/long long<stock> pass
double -np 4 test int/long long<stock> pass
terminate called after throwing an instance of 'std::runtime_error'
what(): cudaMalloc() failed with message: out of memory
[b83948dccf17:103835] *** Process received signal ***
[b83948dccf17:103835] Signal: Aborted (6)
[b83948dccf17:103835] Signal code: (-6)
[b83948dccf17:103835] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x7febf1de2cf0]
[b83948dccf17:103835] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7febf1a58aff]
[b83948dccf17:103835] [ 2] /lib64/libc.so.6(abort+0x127)[0x7febf1a2bea5]
[b83948dccf17:103835] [ 3] /spack/opt/spack/linux-rocky8-x86_64/gcc-8.5.0/gcc-9.5.0-fozxtd2ai2fu2wlr3mrii35ggn7fbbt6/lib64/libstdc++.so.6(+0xa1fd3)[0x7febf262bfd3]
[b83948dccf17:103835] [ 4] /spack/opt/spack/linux-rocky8-x86_64/gcc-8.5.0/gcc-9.5.0-fozxtd2ai2fu2wlr3mrii35ggn7fbbt6/lib64/libstdc++.so.6(+0xad6f6)[0x7febf26376f6]
[b83948dccf17:103835] [ 5] /spack/opt/spack/linux-rocky8-x86_64/gcc-8.5.0/gcc-9.5.0-fozxtd2ai2fu2wlr3mrii35ggn7fbbt6/lib64/libstdc++.so.6(+0xad761)[0x7febf2637761]
[b83948dccf17:103835] [ 6] /spack/opt/spack/linux-rocky8-x86_64/gcc-8.5.0/gcc-9.5.0-fozxtd2ai2fu2wlr3mrii35ggn7fbbt6/lib64/libstdc++.so.6(+0xad9b5)[0x7febf26379b5]
[b83948dccf17:103835] [ 7] /tmp/heffte/heffte/spack-build-jyo3mr5/test/test_longlong(_ZN6heffte4cuda11check_errorE9cudaErrorPKc+0xb5)[0x40e655]
[b83948dccf17:103835] [ 8] /tmp/heffte/heffte/spack-build-jyo3mr5/libheffte.so.2(_ZN6heffte10gpu_warmupEv+0x183)[0x7fec08e760a3]
[b83948dccf17:103835] [ 9] /tmp/heffte/heffte/spack-build-jyo3mr5/test/test_longlong[0x40cbf1]
[b83948dccf17:103835] [10] /tmp/heffte/heffte/spack-build-jyo3mr5/test/test_longlong[0x40b5b3]
[b83948dccf17:103835] [11] /lib64/libc.so.6(__libc_start_main+0xe5)[0x7febf1a44d85]
[b83948dccf17:103835] [12] /tmp/heffte/heffte/spack-build-jyo3mr5/test/test_longlong[0x40b64e]
[b83948dccf17:103835] *** End of error message ***
The issue is with the environment variable CTEST_PARALLEL_LEVEL,
which is being set by either the system or spack; see here:
https://spack.readthedocs.io/en/latest/_modules/spack/build_systems/cmake.html
This is causing multiple tests to run on top of each other and the GPU cannot handle it.
It does reduce the memory usage, but it is not the main source of the problem.