intel / llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.

[SYCL][CUDA][HIP] CUDA and HIP devices ignore required subgroup size kernel attribute #14357

Open · ayylol opened 1 week ago

ayylol commented 1 week ago

Describe the bug

The required subgroup size kernel attribute is ignored on CUDA and HIP devices. When querying the compile subgroup size of a kernel that has the required subgroup size attribute set, 0 is returned.

To reproduce

  1. Include a code snippet that is as short as possible
    
    #include <iostream>
    #include <sycl/sycl.hpp>

    using namespace sycl;

    class Kernel1;

    // Change this to a value that is inside sg_sizes
    const int SG_SIZE = 32;

    int main() {
      queue Q(gpu_selector_v);
      device D = Q.get_device();

      std::vector<size_t> sg_sizes = D.get_info<info::device::sub_group_sizes>();
      std::cout << "Supported subgroup sizes for " << D.get_info<info::device::name>()
                << ": { ";
      for (size_t size : sg_sizes) {
        std::cout << size << " ";
      }
      std::cout << "}" << std::endl;

      Q.submit([&](handler &h) {
         h.parallel_for<Kernel1>(
             nd_range<1>(512, 128),
             [=](nd_item<1> ndi) [[sycl::reqd_sub_group_size(SG_SIZE)]] {});
       }).wait();

      auto KernelID = get_kernel_id<Kernel1>();
      auto Kernel = get_kernel_bundle<bundle_state::executable>(Q.get_context(), {KernelID})
                        .get_kernel(KernelID);

      std::cout << "Tried to use subgroup size: " << SG_SIZE << std::endl;
      std::cout << "Actual used subgroup size: "
                << Kernel.get_info<info::kernel_device_specific::compile_sub_group_size>(D)
                << std::endl;
      std::cout << "Done" << std::endl;
      return 0;
    }

  2. Specify the command which should be used to compile the program:
     `clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda sg-bug.cpp -o sg-bug`
     (a HIP variant of this command is sketched after the observed output below)
  3. Specify the command which should be used to launch the program:
     `./sg-bug`
  4. Indicate what is wrong and what was expected:
     We expect the subgroup size set by `reqd_sub_group_size` to match the value reported by `compile_sub_group_size`. Instead, 0 is returned.

    Supported subgroup sizes for NVIDIA GeForce RTX 2060: { 32 }
    Tried to use subgroup size: 32
    Actual used subgroup size: 0
    Done
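
Since the same behaviour is reported on HIP, the analogous compile command for an AMD GPU would presumably be along the lines of `clang++ -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx906 sg-bug.cpp -o sg-bug` (the `--offload-arch` value is only an illustrative example and must match the actual device; this command is not part of the original report).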


### Environment

- OS: Linux
- Target device and vendor: NVIDIA GPU (similar behaviour was observed on HIP as well)
- Dependencies version:

    Platform [#3]:
        Version : CUDA 12.4
        Name    : NVIDIA CUDA BACKEND
        Vendor  : NVIDIA Corporation
        Devices : 1
        Device [#0]:
            Type              : gpu
            Version           : 7.5
            Name              : NVIDIA GeForce RTX 2060
            Vendor            : NVIDIA Corporation
            Driver            : CUDA 12.4
            UUID              : 862229814554125179692371413399212160130147
            Num SubDevices    : 0
            Num SubSubDevices : 0
            Aspects           : gpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations ext_intel_pci_address usm_atomic_shared_allocations atomic64 ext_intel_device_info_uuid ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width ext_oneapi_bindless_images ext_oneapi_bindless_images_shared_usm ext_oneapi_bindless_images_2d_usm ext_oneapi_interop_memory_import ext_oneapi_interop_semaphore_import ext_oneapi_mipmap ext_oneapi_mipmap_anisotropy ext_oneapi_mipmap_level_reference ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_graph ext_oneapi_limited_graph ext_oneapi_cubemap ext_oneapi_cubemap_seamless_filtering ext_oneapi_bindless_sampled_image_fetch_1d_usm ext_oneapi_bindless_sampled_image_fetch_2d_usm ext_oneapi_bindless_sampled_image_fetch_2d ext_oneapi_bindless_sampled_image_fetch_3d ext_oneapi_queue_profiling_tag
            Images are not fully supported by the CUDA BE, their support is disabled by default. Their partial support can be activated by setting SYCL_PI_CUDA_ENABLE_IMAGE_SUPPORT environment variable at runtime.
            info::device::sub_group_sizes: 32
            Architecture: nvidia_gpu_sm_75



### Additional context

The [SubGroup/attributes.cpp](https://github.com/intel/llvm/blob/sycl/sycl/test-e2e/SubGroup/attributes.cpp) e2e test is currently marked as unsupported on CUDA/HIP due to this behaviour.
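
For reference, marking an e2e test unsupported for those backends is typically done with a lit header at the top of the test file, roughly along these lines (a sketch only; the exact directive spelling and comment in attributes.cpp may differ):

    // UNSUPPORTED: cuda || hip
    // CUDA and HIP currently report a compile sub-group size of 0 for kernels
    // using [[sycl::reqd_sub_group_size(N)]], see #14357.
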
AlexeySachkov commented 3 days ago

This looks like a query bug to me. AFAIK, CUDA only supports sub-group size 32 and use of other sizes should lead to errors. But we should at least be able to apply the supported size and see that it indeed took effect, I think.
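
To make that expectation concrete, a minimal standalone check along these lines (a sketch; `RequestedSGSize` is a hypothetical example value, not part of the report) verifies whether a requested size appears in `info::device::sub_group_sizes` at all, whereas the bug above is that `compile_sub_group_size` reports 0 even for the one size the device does support:

    #include <algorithm>
    #include <iostream>
    #include <sycl/sycl.hpp>

    // Hypothetical example value; 32 is the only size CUDA devices are expected to support.
    constexpr size_t RequestedSGSize = 32;

    int main() {
      sycl::device D{sycl::gpu_selector_v};
      // info::device::sub_group_sizes lists every sub-group size the device supports.
      auto Sizes = D.get_info<sycl::info::device::sub_group_sizes>();
      bool Supported =
          std::find(Sizes.begin(), Sizes.end(), RequestedSGSize) != Sizes.end();
      std::cout << "Requested sub-group size " << RequestedSGSize
                << (Supported ? " is supported" : " is NOT supported") << " by "
                << D.get_info<sycl::info::device::name>() << std::endl;
      return Supported ? 0 : 1;
    }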