icl-utk-edu / heffte

BSD 3-Clause "New" or "Revised" License

hack to reduce gpu memory usage #12

Closed mkstoyanov closed 1 year ago

mkstoyanov commented 1 year ago
G-Ragghianti commented 1 year ago

I'm still working out some bugs with our github runner container. Let me fix this compiler issue and I'll rerun the checks.

mkstoyanov commented 1 year ago

@G-Ragghianti another quick question.

Which version of CUDA was used in the test that was failing?

For CUDA 12 we have this: https://docs.nvidia.com/cuda/cufft/#free-memory-requirement

The first program call to any cuFFT function causes the initialization of the cuFFT kernels. This can fail if there is not enough free memory on the GPU. It is advisable to initialize cufft first (e.g. by creating a plan) and then allocating memory.

Calling cuFFT within an MPI environment causes every MPI rank to initialize cuFFT at the same time, thus running out of memory. The hack makes cuFFT calls without MPI, but on each MPI rank in sequence; that way every rank initializes cuFFT separately and can use all available GPU RAM while doing so.

Note that the actual test, outside of the cuFFT overhead, will use only a few MB of RAM.
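
The rank-by-rank initialization described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not heFFTe's actual code: `sequential_warmup`, `SimpleBarrier`, and `simulate` are hypothetical names, threads stand in for MPI ranks so the sketch runs without MPI or CUDA, and the `init` callback is only a placeholder for creating and destroying a small cuFFT plan. In a real MPI program the barrier callback would presumably wrap `MPI_Barrier`.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Stand-in for an MPI barrier (assumption: lets the sketch run without MPI).
class SimpleBarrier {
public:
    explicit SimpleBarrier(int count) : threshold_(count) {}

    // Blocks until `count` threads have called wait(), then releases them all.
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        const int generation = generation_;
        if (++waiting_ == threshold_) {
            waiting_ = 0;
            ++generation_;
            cv_.notify_all();
        } else {
            cv_.wait(lock, [&] { return generation != generation_; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    int threshold_;
    int waiting_ = 0;
    int generation_ = 0;
};

// Each rank runs init() only on its own turn; every rank synchronizes after
// each turn, so no two ranks initialize at the same time.
template <typename Barrier, typename Init>
void sequential_warmup(int rank, int num_ranks, Barrier&& barrier, Init&& init) {
    for (int turn = 0; turn < num_ranks; ++turn) {
        if (turn == rank) init();
        barrier();
    }
}

// Simulates `num_ranks` ranks as threads and returns the order in which
// the ranks ran their initialization.
std::vector<int> simulate(int num_ranks) {
    SimpleBarrier barrier(num_ranks);
    std::vector<int> order;
    std::vector<std::thread> ranks;
    for (int r = 0; r < num_ranks; ++r) {
        ranks.emplace_back([&, r] {
            sequential_warmup(
                r, num_ranks,
                [&] { barrier.wait(); },
                // Placeholder for the real work, e.g. building and
                // destroying a small cuFFT plan on this rank.
                [&] { order.push_back(r); });
        });
    }
    for (auto& t : ranks) t.join();
    return order;
}
```

Calling `simulate(4)` records the ranks in the order 0, 1, 2, 3, because the barrier after each turn keeps later ranks from starting early. The cost is that start-up time grows linearly with the number of ranks, which seems acceptable for a test that otherwise uses only a few MB.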

G-Ragghianti commented 1 year ago

The github runner uses cuda 11.4

G-Ragghianti commented 1 year ago

The failed spack-gpu_nvidia job ran on a DXG2 with A100s (82GB GPU RAM) and failed with an OOM error:

24/24 Test: heffte_longlong_np4
Command: "/tmp/heffte/spack/opt/spack/linux-rocky8-zen2/gcc-9.5.0/openmpi-4.1.5-thhcvl5ee66gn6bjzr4vbh4eyqkdv4ph/bin/mpiexec"
 "-n" "4" "/tmp/heffte/heffte/spack-build-jyo3mr5/test/test_longlong"
Directory: /tmp/heffte/heffte/spack-build-jyo3mr5/test
"heffte_longlong_np4" start time: Mar 16 15:40 UTC
Output:
----------------------------------------------------------

--------------------------------------------------------------------------------
                               heffte::fft class
--------------------------------------------------------------------------------

     float                  -np 4  test int/long long<stock>              pass
    double                  -np 4  test int/long long<stock>              pass
     float                  -np 4  test int/long long<stock>              pass
    double                  -np 4  test int/long long<stock>              pass
     float                  -np 4  test int/long long<stock>              pass
    double                  -np 4  test int/long long<stock>              pass
terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaMalloc() failed with message: out of memory
[b83948dccf17:103835] *** Process received signal ***
[b83948dccf17:103835] Signal: Aborted (6)
[b83948dccf17:103835] Signal code:  (-6)
[b83948dccf17:103835] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x7febf1de2cf0]
[b83948dccf17:103835] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7febf1a58aff]
[b83948dccf17:103835] [ 2] /lib64/libc.so.6(abort+0x127)[0x7febf1a2bea5]
[b83948dccf17:103835] [ 3] /spack/opt/spack/linux-rocky8-x86_64/gcc-8.5.0/gcc-9.5.0-fozxtd2ai2fu2wlr3mrii35ggn7fbbt6/lib64/libstdc++.so.6(+0xa1fd3)[0x7febf262bfd3]
[b83948dccf17:103835] [ 4] /spack/opt/spack/linux-rocky8-x86_64/gcc-8.5.0/gcc-9.5.0-fozxtd2ai2fu2wlr3mrii35ggn7fbbt6/lib64/libstdc++.so.6(+0xad6f6)[0x7febf26376f6]
[b83948dccf17:103835] [ 5] /spack/opt/spack/linux-rocky8-x86_64/gcc-8.5.0/gcc-9.5.0-fozxtd2ai2fu2wlr3mrii35ggn7fbbt6/lib64/libstdc++.so.6(+0xad761)[0x7febf2637761]
[b83948dccf17:103835] [ 6] /spack/opt/spack/linux-rocky8-x86_64/gcc-8.5.0/gcc-9.5.0-fozxtd2ai2fu2wlr3mrii35ggn7fbbt6/lib64/libstdc++.so.6(+0xad9b5)[0x7febf26379b5]
[b83948dccf17:103835] [ 7] /tmp/heffte/heffte/spack-build-jyo3mr5/test/test_longlong(_ZN6heffte4cuda11check_errorE9cudaErrorPKc+0xb5)[0x40e655]
[b83948dccf17:103835] [ 8] /tmp/heffte/heffte/spack-build-jyo3mr5/libheffte.so.2(_ZN6heffte10gpu_warmupEv+0x183)[0x7fec08e760a3]
[b83948dccf17:103835] [ 9] /tmp/heffte/heffte/spack-build-jyo3mr5/test/test_longlong[0x40cbf1]
[b83948dccf17:103835] [10] /tmp/heffte/heffte/spack-build-jyo3mr5/test/test_longlong[0x40b5b3]
[b83948dccf17:103835] [11] /lib64/libc.so.6(__libc_start_main+0xe5)[0x7febf1a44d85]
[b83948dccf17:103835] [12] /tmp/heffte/heffte/spack-build-jyo3mr5/test/test_longlong[0x40b64e]
[b83948dccf17:103835] *** End of error message ***

mkstoyanov commented 1 year ago

The issue is with the environment variable CTEST_PARALLEL_LEVEL, which is being set by either the system or spack; see here:

https://spack.readthedocs.io/en/latest/_modules/spack/build_systems/cmake.html

This causes multiple tests to run concurrently on the same GPU, which then runs out of memory.
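
One way to make this situation visible from inside a test binary is to inspect the variable directly, since tests launched by ctest inherit its environment. The sketch below is only an assumption about how such a check could look: `parse_parallel_level`, `ctest_parallel_level`, and `tests_may_overlap` are hypothetical helpers that are not part of heFFTe, and treating an unset or unparsable value as 1 is a simplification that may not match every CTest version.

```cpp
#include <climits>
#include <cstdlib>

// Hypothetical helper: interpret a CTEST_PARALLEL_LEVEL-style string.
// Simplifying assumption: unset, empty, or non-numeric values count as 1.
int parse_parallel_level(const char* value) {
    if (value == nullptr || *value == '\0') return 1;
    char* end = nullptr;
    const long level = std::strtol(value, &end, 10);
    if (*end != '\0' || level < 1) return 1;  // trailing junk or non-positive
    if (level > INT_MAX) return INT_MAX;      // clamp very large values
    return static_cast<int>(level);
}

// Reads the value from the environment of the current process.
int ctest_parallel_level() {
    return parse_parallel_level(std::getenv("CTEST_PARALLEL_LEVEL"));
}

// True when several tests may be sharing the GPU at the same time.
bool tests_may_overlap() {
    return ctest_parallel_level() > 1;
}
```

A GPU test could call `tests_may_overlap()` at start-up and print a warning, which would have pointed at the parallel execution rather than at cuFFT when the out-of-memory error appeared.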

mkstoyanov commented 1 year ago

The hack does reduce the memory usage, but that is not the main source of the problem.