intel / llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.

Atomic Exchange CUDA Error #16037

Open GaryHuan9 opened 2 weeks ago

GaryHuan9 commented 2 weeks ago

Describe the bug

Hey! I am learning to use SYCL, but I encountered a little issue when using sycl::atomic_ref::exchange. Things work fine on the CPU, but when I switched to the GPU, even a very simple test (see below) crashes with a CUDA error. Other atomic primitives such as store and load work fine.

To reproduce

  1. Include a code snippet that is as short as possible
#include <sycl.hpp>

int main()
{
    sycl::queue queue(sycl::gpu_selector_v);
    std::cout << "Device: " << queue.get_device().get_info<sycl::info::device::name>() << std::endl;

    queue.submit([&](sycl::handler& diana)
    {
        sycl::stream out(1024, 256, diana);

        diana.parallel_for(1, [=](sycl::id<> id)
        {
            int memory = 3;
            sycl::atomic_ref<int,
                sycl::memory_order::relaxed,
                sycl::memory_scope::work_item> at(memory);

            int load = at.exchange(123);
            out << "id " << id << " load " << load << sycl::endl;
        });
    });

    queue.wait_and_throw();
}
  2. Specify the command which should be used to compile the program
icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda main.cpp 
  3. Specify the command which should be used to launch the program
./a.out
  4. Indicate what is wrong and what was expected

This is my output; it crashes, which is obviously not what one would expect.

Device: NVIDIA GeForce RTX 4090
<CUDA>[ERROR]: 
UR CUDA ERROR:
        Value:           719
        Name:            CUDA_ERROR_LAUNCH_FAILED
        Description:     unspecified launch failure
        Function:        urEnqueueMemBufferRead
        Source Location: /tmp/tmp.nlKu2FwFq5/intel-llvm-mirror/build/_deps/unified-runtime-src/source/adapters/cuda/enqueue.cpp:1777

terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  Native API failed. Native API returns: 2147483646 (UR_RESULT_ERROR_UNKNOWN)
Aborted (core dumped)

Environment

Intel(R) oneAPI DPC++/C++ Compiler 2025.0.0 (2025.0.0.20241008)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2025.0/bin/compiler
Configuration file: /opt/intel/oneapi/compiler/2025.0/bin/compiler/../icpx.cfg
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+

And output of sycl-ls --verbose:

[opencl:cpu][opencl:0] Intel(R) OpenCL, AMD Ryzen 9 3900X 12-Core Processor             OpenCL 3.0 (Build 0) [2024.18.10.0.08_160000]
[cuda:gpu][cuda:0] NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 4090 8.9 [CUDA 12.6]

Platforms: 2
Platform [#1]:
    Version  : OpenCL 3.0 LINUX
    Name     : Intel(R) OpenCL
    Vendor   : Intel(R) Corporation
    Devices  : 1
        Device [#0]:
        Type              : cpu
        Version           : OpenCL 3.0 (Build 0)
        Name              : AMD Ryzen 9 3900X 12-Core Processor            
        Vendor            : Intel(R) Corporation
        Driver            : 2024.18.10.0.08_160000
        Num SubDevices    : 0
        Num SubSubDevices : 0
        Aspects           : cpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations usm_system_allocations usm_atomic_host_allocations usm_atomic_shared_allocations atomic64 ext_oneapi_srgb ext_oneapi_native_assert ext_intel_legacy_image ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_tangle_group ext_oneapi_private_alloca
        info::device::sub_group_sizes: 4 8 16 32 64
        Architecture: x86_64
Platform [#2]:
    Version  : CUDA 12.6
    Name     : NVIDIA CUDA BACKEND
    Vendor   : NVIDIA Corporation
    Devices  : 1
        Device [#0]:
        Type              : gpu
        Version           : 8.9
        Name              : NVIDIA GeForce RTX 4090
        Vendor            : NVIDIA Corporation
        Driver            : CUDA 12.6
        UUID              : 1367131105491041301142711512019110415220878
        Num SubDevices    : 0
        Num SubSubDevices : 0
        Aspects           : gpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations ext_intel_pci_address usm_atomic_shared_allocations atomic64 ext_intel_device_info_uuid ext_oneapi_cuda_async_barrier ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width ext_oneapi_bindless_images ext_oneapi_bindless_images_shared_usm ext_oneapi_bindless_images_1d_usm ext_oneapi_bindless_images_2d_usm ext_oneapi_external_memory_import ext_oneapi_external_semaphore_import ext_oneapi_mipmap ext_oneapi_mipmap_anisotropy ext_oneapi_mipmap_level_reference ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_graph ext_oneapi_limited_graph ext_oneapi_cubemap ext_oneapi_cubemap_seamless_filtering ext_oneapi_bindless_sampled_image_fetch_1d_usm ext_oneapi_bindless_sampled_image_fetch_2d_usm ext_oneapi_bindless_sampled_image_fetch_2d ext_oneapi_bindless_sampled_image_fetch_3d ext_oneapi_queue_profiling_tag ext_oneapi_virtual_mem ext_oneapi_image_array ext_oneapi_unique_addressing_per_dim ext_oneapi_bindless_images_sample_1d_usm ext_oneapi_bindless_images_sample_2d_usm
        Images are not fully supported by the CUDA BE, their support is disabled by default. Their partial support can be activated by setting SYCL_PI_CUDA_ENABLE_IMAGE_SUPPORT environment variable at runtime.
        info::device::sub_group_sizes: 32
        Architecture: nvidia_gpu_sm_89
default_selector()      : gpu, NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 4090 8.9 [CUDA 12.6]
accelerator_selector()  : No device of requested type available. Please chec...
cpu_selector()          : cpu, Intel(R) OpenCL, AMD Ryzen 9 3900X 12-Core Processor             OpenCL 3.0 (Build 0) [2024.18.10.0.08_160000]
gpu_selector()          : gpu, NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 4090 8.9 [CUDA 12.6]
custom_selector(gpu)    : gpu, NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 4090 8.9 [CUDA 12.6]
custom_selector(cpu)    : cpu, Intel(R) OpenCL, AMD Ryzen 9 3900X 12-Core Processor             OpenCL 3.0 (Build 0) [2024.18.10.0.08_160000]
custom_selector(acc)    : No device of requested type available. Please chec...

Additional context

No response

Seanst98 commented 1 day ago

Thank you for the bug report! It's great to see people learning SYCL.

Your reproducer suffers from two problems:

The first showed us that we were missing an implementation of the work_item scope. It has generated a few internal discussions on how to properly handle work_item scopes, and, for now, we are going to let this scope fall back to a coarser-grained scope so that users won't run into the unhelpful error you were seeing.

You can find the change for this here: https://github.com/intel/llvm/pull/16172
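Alternatively, until that change lands, you could request a coarser scope that is already implemented, such as sycl::memory_scope::device, directly in the atomic_ref declaration. An untested sketch of that change to your reproducer (it addresses only the scope problem; the second problem below still applies):

// Same reproducer, but asking for a scope the CUDA backend already
// implements, so the missing work_item mapping is never hit.
sycl::atomic_ref<int,
    sycl::memory_order::relaxed,
    sycl::memory_scope::device> at(memory); // `memory` must still be non-private, see below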

As a current workaround, until the above PR is merged, you could try compiling the reproducer while specifying your device architecture:

icpx -fsycl -fsycl-targets=nvidia_gpu_sm_xx main.cpp

Specifying your architecture enables more of the scopes available on your device, and the work_item scope will fall back to a much coarser-grained scope at the system level. At least sm_60 is required for this, which your device supports.
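Since your sycl-ls output above reports Architecture: nvidia_gpu_sm_89 for the RTX 4090, the concrete command should be:

icpx -fsycl -fsycl-targets=nvidia_gpu_sm_89 main.cpp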

The second problem is that your reproducer attempts to apply an atomic_ref to memory that is private to the thread. This is disallowed on NVIDIA hardware, and you will run into address space errors.

Please try applying atomic_ref to device memory that is not private to the thread. For example:

#include <sycl/sycl.hpp>

int main() {

  sycl::queue queue(sycl::gpu_selector_v);
  std::cout << "Device: "
            << queue.get_device().get_info<sycl::info::device::name>()
            << std::endl;

  // Device USM allocation: visible to every work item, unlike the
  // private variable in the original reproducer.
  int *data = sycl::malloc_device<int>(1, queue);
  // Initialize before launching the kernel so this write does not race
  // with the atomic exchanges below.
  queue.fill(data, 0, 1).wait();

  queue.submit([&](sycl::handler &cgh) {
    sycl::stream out(1024, 256, cgh);

    cgh.parallel_for(10, [=](sycl::id<> id) {
      sycl::atomic_ref<int, sycl::memory_order::relaxed,
                       sycl::memory_scope::work_item,
                       sycl::access::address_space::generic_space>
          at(data[0]);

      int load = at.exchange(2);
      out << "id " << id << " load " << load << sycl::endl;
    });
  });

  queue.wait_and_throw();

  sycl::free(data, queue);
}
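For reference, here is an equivalent untested sketch that uses a buffer and accessor instead of USM; the initial value is copied in with the buffer, and the atomic_ref targets global device memory using the already implemented device scope:

#include <sycl/sycl.hpp>

int main() {
  sycl::queue queue(sycl::gpu_selector_v);

  // The buffer copies `init` to the device on first use and writes the
  // final value back when it is destroyed.
  int init = 0;
  sycl::buffer<int, 1> buf(&init, sycl::range<1>(1));

  queue.submit([&](sycl::handler &cgh) {
    sycl::accessor acc(buf, cgh, sycl::read_write);
    sycl::stream out(1024, 256, cgh);

    cgh.parallel_for(10, [=](sycl::id<> id) {
      // Accessor storage is global device memory, so the atomic_ref is
      // well defined here.
      sycl::atomic_ref<int, sycl::memory_order::relaxed,
                       sycl::memory_scope::device,
                       sycl::access::address_space::global_space>
          at(acc[0]);

      int load = at.exchange(2);
      out << "id " << id << " load " << load << sycl::endl;
    });
  });

  queue.wait_and_throw();
}

With either version, exactly one work item should print load 0 (the initial value) and the rest should print load 2, in a nondeterministic order.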