kokkos / kokkos

Kokkos C++ Performance Portability Programming Ecosystem: The Programming Model - Parallel Execution and Memory Abstraction
https://kokkos.org
Other
1.85k stars 421 forks source link

MDRange tuner creates an illegal range in Z dimension for CUDA back end (and probably others) #7105

Closed khuck closed 2 weeks ago

khuck commented 2 months ago

Describe the bug

The Kokkos internals tuning for the Cuda back-end has a constructor method for declaring the ranges of potential values for the block.x,y,z dimensions of deep copies. The sets of values are [1,2,4,8,16,32,64,128,256,512,1024] for all three dimensions. They are set here:

https://github.com/kokkos/kokkos/blob/2d7715239700f50169bc50a96a234b05c28c9a2e/core/src/Kokkos_Tuners.hpp#L565-L567

Unfortunately, the Z range should only go up to 64, see: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications-technical-specifications-per-compute-capability

This gets exposed by the deep copy of a 3D view.

Please include the following for a minimal reproducer

  1. Compilers (with versions) - GCC is fine
  2. Kokkos release or commit used (i.e., the sha1 number) current main/develop
  3. Platform, architecture and backend - x86_64 with cuda
  4. CMake configure command
    cmake -B build \
    -DCMAKE_BUILD_TYPE=Debug \
    -DCMAKE_CXX_COMPILER=g++ \
    -DCMAKE_C_COMPILER=gcc \
    -DCMAKE_INSTALL_PREFIX=`pwd`/install \
    -DKokkos_ENABLE_TUNING=ON \
    -DKokkos_ENABLE_OPENMP=ON \
    -DKokkos_ENABLE_SERIAL=ON \
    -DKokkos_ENABLE_CUDA=ON \
    -DKokkos_ENABLE_CUDA_LAMBDA=ON \
    -DKokkos_ARCH_VOLTA70=ON \
    -DKokkos_ARCH_NATIVE=ON \
    -DKokkos_ENABLE_COMPILER_WARNINGS=ON \
    -DKokkos_ENABLE_DEPRECATED_CODE_4=ON \
    -DKokkos_ENABLE_DEBUG_BOUNDS_CHECK=ON \
    -DKokkos_ENABLE_DEPRECATION_WARNINGS=OFF \
  5. Output from CMake configure command (the usual)
  6. Minimum, complete code needed to reproduce the bug - this program: https://github.com/khuck/apex-kokkos-tuning/blob/main/tests/deep_copy.cpp
  7. Command line needed to reproduce the bug
    apex_exec --apex:kokkos-tuning /home/users/khuck/src/apex-kokkos-tuning/build/tests/deep_copy --kokkos-tune-internals

    ...or any other tuner that will search into the range where 128-1024 could be used for the block Z dimension.

  8. KokkosCore_config.h header file (generated during the build)
  9. Please provide any additional relevant error logs - this doesn't get caught unless the -DKokkos_ENABLE_COMPILER_WARNINGS=ON -DKokkos_ENABLE_DEBUG_BOUNDS_CHECK=ON is set at configure time.
masterleinad commented 2 months ago

this doesn't get caught unless the -DKokkos_ENABLE_COMPILER_WARNINGS=ON is set at configure time.

What do compiler warnings have to do with the issue? I would rather have expected that you get a Cuda error.

khuck commented 2 months ago

oops! I meant -DKokkos_ENABLE_DEBUG_BOUNDS_CHECK. There is a CUDA error, but it doesn't get caught unless this setting is enabled. The cudaGetLastError() doesn't happen otherwise.

cedricchevalier19 commented 2 months ago

A colleague of mine ran in a similar issue last week.

khuck commented 2 months ago

A colleague of mine ran in a similar issue last week.

@cedricchevalier19 Was the colleague using tuning? or was this this just running a CUDA back end? Thanks!

khuck commented 1 month ago

@vlkale BTW, a quick and dirty fix for this is to change this line: https://github.com/kokkos/kokkos/blob/2d7715239700f50169bc50a96a234b05c28c9a2e/core/src/Cuda/Kokkos_Cuda_Parallel_MDRange.hpp#L114 to:

        const dim3 block(m_rp.m_tile[0], m_rp.m_tile[1],
          ((m_rp.m_tile[2] > 64) ? 64 : m_rp.m_tile[2]));

...and maybe there's a better way to interrogate that 64 value, rather than hard-coding it. You can also change the assertions: https://github.com/kokkos/kokkos/blob/2d7715239700f50169bc50a96a234b05c28c9a2e/core/src/Cuda/Kokkos_Cuda_Parallel_MDRange.hpp#L115-L117 to be:

        KOKKOS_ASSERT(block.x > 0 && block.x < 1025);
        KOKKOS_ASSERT(block.y > 0 && block.y < 1025);
        KOKKOS_ASSERT(block.z > 0 && block.z < 65);

...again, with interrogated values not hard-coded limits.

khuck commented 1 month ago

...and while we're adding assertions, you can also add:

        KOKKOS_ASSERT(grid.x > 0); 
        KOKKOS_ASSERT(grid.y > 0 && grid.y < 65536);
        KOKKOS_ASSERT(grid.z > 0 && grid.z < 65536);

...before the CudaParallelLaunch() call, probably with interrogated limits (not hard-coded).

vlkale commented 1 month ago

@vlkale BTW, a quick and dirty fix for this is to change this line:

https://github.com/kokkos/kokkos/blob/2d7715239700f50169bc50a96a234b05c28c9a2e/core/src/Cuda/Kokkos_Cuda_Parallel_MDRange.hpp#L114

to:

        const dim3 block(m_rp.m_tile[0], m_rp.m_tile[1],
          ((m_rp.m_tile[2] > 64) ? 64 : m_rp.m_tile[2]));

...and maybe there's a better way to interrogate that 64 value, rather than hard-coding it. You can also change the assertions:

https://github.com/kokkos/kokkos/blob/2d7715239700f50169bc50a96a234b05c28c9a2e/core/src/Cuda/Kokkos_Cuda_Parallel_MDRange.hpp#L115-L117

to be:

        KOKKOS_ASSERT(block.x > 0 && block.x < 1025);
        KOKKOS_ASSERT(block.y > 0 && block.y < 1025);
        KOKKOS_ASSERT(block.z > 0 && block.z < 65);

...again, with interrogated values not hard-coded limits.

I am good with this quick fix. Yes, we definitely want the interrogated values.

ajpowelsnl commented 1 month ago

@khuck - might you be able to raise a PR addressing this issue?

khuck commented 3 weeks ago

@ajpowelsnl is there a kokkos call equivalent to const auto maxblocks = m_rp.space().cuda_device_prop().maxBlockSize;? I cannot find in the kokkos source code where m_rp.space().cuda_device_prop().maxGridSize is defined ...

khuck commented 3 weeks ago

never mind, I found https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html#structcudaDeviceProp_192d195493a9d36b2d827aaf3ffd89f1e

ajpowelsnl commented 3 weeks ago

Hi @khuck - in Kokkos develop, is this definition helpful?

ajpowelsnl commented 3 weeks ago

never mind, I found https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html#structcudaDeviceProp_192d195493a9d36b2d827aaf3ffd89f1e

Have a look here 😄

khuck commented 3 weeks ago

@ajpowelsnl OK, I put together a commit here - is this more or less what you have in mind? I added a lot of assertions, and I have tested it with 2-6 dimensions. https://github.com/kokkos/kokkos/compare/master...khuck:kokkos:master

If that looks OK, I'll make a PR