microsoft / WSL

Issues found on WSL
https://docs.microsoft.com/windows/wsl

No OpenCL platforms reported #6951

Open perrymacmurray opened 3 years ago

perrymacmurray commented 3 years ago

Windows Build Number

21382.1

WSL Version

Kernel Version

5.10.16.3

Distro Version

Ubuntu 20.04

Other Software

Inside WSL:
clinfo (for checking OpenCL platforms)
CUDA 11.3 (Docker container runs with NVIDIA_DISABLE_REQUIRE=1, as it otherwise thinks it's running 11.0)
Docker 20.10.6, build 370c289 (with custom container)
nvidia-docker2 2.5.0-1

On Windows: NVIDIA Graphics Driver for CUDA on WSL 470.14

Repro Steps

I installed the NVIDIA drivers and Docker according to NVIDIA's user guide. However, I am running an older version of nvidia-docker2 (and its dependencies), as suggested in a forum post here

I have also installed the CUDA on WSL driver here

Steps: Run clinfo (both in and outside of the Docker container)

Expected Behavior

clinfo should return the graphics card (in my case, GTX 970) as an OpenCL platform

Actual Behavior

clinfo reports 0 platforms available, both inside the container and just on WSL

Diagnostic Logs

cuda nvidia-container-cli glxinfo (from inside of container) glxinfo (from WSL, outside of container) wsl.etl
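
When clinfo reports 0 platforms, one quick check is whether any ICD files are registered for the OpenCL loader at all. The sketch below assumes the conventional Linux location /etc/OpenCL/vendors and treats the ocl-icd OCL_ICD_VENDORS override as a directory, which is a simplification; the function name list_icd_entries is made up for illustration.

```python
import os
from pathlib import Path

# Conventional directory scanned by OpenCL ICD loaders on Linux.
DEFAULT_VENDOR_DIR = "/etc/OpenCL/vendors"


def list_icd_entries(vendor_dir=None):
    """Return {icd_file_name: first line of the file} for one vendor directory."""
    # Assumption: OCL_ICD_VENDORS (an ocl-icd override) points at a directory.
    directory = Path(vendor_dir or os.environ.get("OCL_ICD_VENDORS", DEFAULT_VENDOR_DIR))
    entries = {}
    if not directory.is_dir():
        return entries
    for icd_file in sorted(directory.glob("*.icd")):
        lines = icd_file.read_text().strip().splitlines()
        # Each .icd file names the vendor library the loader should open.
        entries[icd_file.name] = lines[0].strip() if lines else ""
    return entries


if __name__ == "__main__":
    found = list_icd_entries()
    if not found:
        print("No .icd files found: the ICD loader has no vendor library to load.")
    for name, library in found.items():
        print(f"{name} -> {library}")
```

An empty result would be consistent with the "0 platforms" symptom, although it does not by itself explain why no vendor library was registered.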

Tongzhao9417 commented 9 months ago

@Bossach

Thanks for sharing! I followed your steps and it was almost successful. However, clinfo told me "unknown target CPU 'sm_89'". Here is my full clinfo output and the full benchmark output.

clinfo:

Number of platforms                               1
  Platform Name                                   Portable Computing Language
  Platform Vendor                                 The pocl project
  Platform Version                                OpenCL 3.0 PoCL 5.0  Linux, RelWithDebInfo, RELOC, SPIR, LLVM 14.0.0, SLEEF, CUDA, POCL_DEBUG
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_pocl_content_size
  Platform Extensions with Version                cl_khr_icd                                                       0x400000 (1.0.0)
                                                  cl_pocl_content_size                                             0x400000 (1.0.0)
  Platform Numeric Version                        0xc00000 (3.0.0)
  Platform Extensions function suffix             POCL
  Platform Host timer resolution                  0ns

  Platform Name                                   Portable Computing Language
Number of devices                                 1
  Device Name                                     NVIDIA GeForce RTX 4090
  Device Vendor                                   NVIDIA Corporation
  Device Vendor ID                                0x10de
  Device Version                                  OpenCL 3.0 PoCL HSTR: CUDA-sm_89
  Device Numeric Version                          0xc00000 (3.0.0)
  Driver Version                                  5.0
  Device OpenCL C Version                         OpenCL C 1.2 PoCL
  Device OpenCL C all versions                    OpenCL C                                                         0x400000 (1.0.0)
                                                  OpenCL C                                                         0x401000 (1.1.0)
                                                  OpenCL C                                                         0x402000 (1.2.0)
                                                  OpenCL C                                                         0xc00000 (3.0.0)
  Device OpenCL C features                        __opencl_c_images                                                0xc00000 (3.0.0)
                                                  __opencl_c_atomic_order_acq_rel                                  0xc00000 (3.0.0)
                                                  __opencl_c_atomic_order_seq_cst                                  0xc00000 (3.0.0)
                                                  __opencl_c_atomic_scope_device                                   0xc00000 (3.0.0)
                                                  __opencl_c_program_scope_global_variables                        0xc00000 (3.0.0)
                                                  __opencl_c_generic_address_space                                 0xc00000 (3.0.0)
                                                  __opencl_c_fp16                                                  0xc00000 (3.0.0)
                                                  __opencl_c_fp64                                                  0xc00000 (3.0.0)
  Latest comfornace test passed                   (n/a)
  Device Type                                     GPU
  Device Topology (NV)                            PCI-E, 0000:01:00.0
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               128
  Max clock frequency                             2595MHz
  Compute Capability (NV)                         8.9
  Device Partition                                (core)
    Max number of sub-devices                     1
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x64
  Max work group size                             1024
  Preferred work group size multiple (device)     32
=== CL_PROGRAM_BUILD_LOG ===
error: unknown target CPU 'sm_89'
Device NVIDIA GeForce RTX 4090 failed to build the program
  Preferred work group size multiple (kernel)     <getWGsizes:1504: create kernel : error -45>
  Warp size (NV)                                  32
  Max sub-groups per work group                   32
  Preferred / native vector sizes
    char                                                 1 / 1
    short                                                1 / 1
    int                                                  1 / 1
    long                                                 1 / 1
    half                                                 0 / 0        (cl_khr_fp16)
    float                                                1 / 1
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (cl_khr_fp16)
    Denormals                                     No
    Infinity and NANs                             No
    Round to nearest                              No
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              25756696576 (23.99GiB)
  Error Correction support                        No
  Max memory allocation                           6439174144 (5.997GiB)
  Unified memory for Host and Device              No
  Integrated memory (NV)                          No
  Shared Virtual Memory (SVM) capabilities        (core)
    Coarse-grained buffer sharing                 Yes
    Fine-grained buffer sharing                   Yes
    Fine-grained system sharing                   No
    Atomics                                       No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       4096 bits (512 bytes)
  Preferred alignment for atomics
    SVM                                           64 bytes
    Global                                        64 bytes
    Local                                         64 bytes
  Atomic memory capabilities                      relaxed, work-group scope
  Atomic fence capabilities                       relaxed, acquire/release, work-group scope
  Max size for global variable                    0
  Preferred total size of global vars             0
  Global Memory cache type                        None
  Image support                                   No
  Pipe support                                    No
  Max number of pipe args                         0
  Max active pipe reservations                    0
  Max pipe packet size                            0
  Local memory type                               Local
  Local memory size                               49152 (48KiB)
  Registers per block (NV)                        65536
  Max number of constant args                     8
  Max constant buffer size                        65536 (64KiB)
  Generic address space support                   Yes
  Max size of kernel argument                     4352 (4.25KiB)
  Queue properties (on host)
    Out-of-order execution                        No
    Profiling                                     Yes
  Device enqueue capabilities                     (n/a)
  Queue properties (on device)
    Out-of-order execution                        No
    Profiling                                     No
    Preferred size                                0
    Max size                                      0
  Max queues on device                            0
  Max events on device                            0
  Prefer user sync for interop                    Yes
  Profiling timer resolution                      1ns
  Execution capabilities
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Non-uniform work-groups                       No
    Work-group collective functions               No
    Sub-group independent forward progress        Yes
    Kernel execution timeout (NV)                 Yes
  Concurrent copy and kernel execution (NV)       Yes
    Number of async copy engines                  1
    IL version                                    (n/a)
    ILs with version                              (n/a)
    SPIR versions                                 (n/a)
  printf() buffer size                            16777216 (16MiB)
  Built-in kernels                                pocl.mul.i32;pocl.add.i32;pocl.dnn.conv2d_int8_relu;pocl.sgemm.local.f32;pocl.sgemm.tensor.f16f16f32;pocl.sgemm_ab.tensor.f16f16f32;pocl.abs.f32;pocl.add.i8;org.khronos.openvx.scale_image.nn.u8;org.khronos.openvx.scale_image.bl.u8;org.khronos.openvx.tensor_convert_depth.wrap.u8.f32
  Built-in kernels with version                   pocl.mul.i32                                                     0x402000 (1.2.0)
                                                  pocl.add.i32                                                     0x402000 (1.2.0)
                                                  pocl.dnn.conv2d_int8_relu                                        0x402000 (1.2.0)
                                                  pocl.sgemm.local.f32                                             0x402000 (1.2.0)
                                                  pocl.sgemm.tensor.f16f16f32                                      0x402000 (1.2.0)
                                                  pocl.sgemm_ab.tensor.f16f16f32                                   0x402000 (1.2.0)
                                                  pocl.abs.f32                                                     0x402000 (1.2.0)
                                                  pocl.add.i8                                                      0x402000 (1.2.0)
                                                  org.khronos.openvx.scale_image.nn.u8                             0x402000 (1.2.0)
                                                  org.khronos.openvx.scale_image.bl.u8                             0x402000 (1.2.0)
                                                  org.khronos.openvx.tensor_convert_depth.wrap.u8.f32              0x402000 (1.2.0)
  Device Extensions                               cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics     cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics     cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics     cl_khr_int64_extended_atomics cl_nv_device_attribute_query cl_khr_spir cl_khr_fp16 cl_khr_fp64
  Device Extensions with Version                  cl_khr_byte_addressable_store                                    0x400000 (1.0.0)
                                                  cl_khr_global_int32_base_atomics                                 0x400000 (1.0.0)
                                                  cl_khr_global_int32_extended_atomics                             0x400000 (1.0.0)
                                                  cl_khr_local_int32_base_atomics                                  0x400000 (1.0.0)
                                                  cl_khr_local_int32_extended_atomics                              0x400000 (1.0.0)
                                                  cl_khr_int64_base_atomics                                        0x400000 (1.0.0)
                                                  cl_khr_int64_extended_atomics                                    0x400000 (1.0.0)
                                                  cl_nv_device_attribute_query                                     0x400000 (1.0.0)
                                                  cl_khr_spir                                                      0x801000 (2.1.0)
                                                  cl_khr_fp16                                                      0x400000 (1.0.0)
                                                  cl_khr_fp64                                                      0x400000 (1.0.0)

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
  clCreateContext(NULL, ...) [default]            No platform
  clCreateContext(NULL, ...) [other]              Success [POCL]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 Portable Computing Language
    Device Name                                   NVIDIA GeForce RTX 4090
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 Portable Computing Language
    Device Name                                   NVIDIA GeForce RTX 4090
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 Portable Computing Language
    Device Name                                   NVIDIA GeForce RTX 4090

benchmark:

.-----------------------------------------------------------------------------.
|                       ______________   ______________                       |
|                       \   ________  | |  ________   /                       |
|                        \  \       | | | |       /  /                        |
|                         \  \      | | | |      /  /                         |
|                          \  \     | | | |     /  /                          |
|                           \  \_.-"  | |  "-._/  /                           |
|                            \    _.-" _ "-._    /                            |
|                             \.-" _.-" "-._ "-./                             |
|                               .-"  .-"-.  "-.                               |
|                               \  v"     "v  /                               |
|                                \  \     /  /                                |
|                                 \  \   /  /                                 |
|                                  \  \ /  /                                  |
|                                   \  '  /                                   |
|                                    \   /                                    |
|                                     \ /               FluidX3D Version 2.13 |
|                                      '     Copyright (c) Dr. Moritz Lehmann |
|-----------------------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID    0 | NVIDIA GeForce RTX 4090                                    |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA GeForce RTX 4090                                    |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 5.0 (Linux)                                                |
| OpenCL Version | OpenCL C 1.2 PoCL                                          |
| Compute Units  | 128 at 2595 MHz (16384 cores, 85.033 TFLOPs/s)             |
| Memory, Cache  | 24563 MB, 0 KB global / 48 KB local                        |
| Buffer Limits  | 6140 MB global, 64 KB constant                             |
|----------------'------------------------------------------------------------|
| Warning: error: unknown target CPU 'sm_89' Device NVIDIA GeForce RTX 4090   |
|          failed to build the program                                        |
| Error: OpenCL C code compilation failed with error code -11. Make sure      |
|        there are no errors in kernel.cpp.                                   |
'-----------------------------------------------------------------------------'

Bossach commented 9 months ago

@Tongzhao9417 Your LLVM doesn't know how to compile for your GPU. You can check which targets it supports with:

$ clang --target=nvptx -print-supported-cpus

Here --target=nvptx (or nvptx64) stands for "NVIDIA architecture", and the supported "CPUs" it lists are the specific GPU architectures. Output:

Debian clang version 14.0.6
Target: nvptx
Thread model: posix
InstalledDir: /usr/bin
Available CPUs for this target:

        sm_20
        sm_21
        sm_30
        sm_32
        sm_35
        sm_37
        sm_50
        sm_52
        sm_53
        sm_60
        sm_61
        sm_62
        sm_70
        sm_72
        sm_75
        sm_80
        sm_86

Use -mcpu or -mtune to specify the target's processor.
For example, clang --target=aarch64-unknown-linux-gui -mcpu=cortex-a35

You need a newer version of LLVM/clang. (I just checked: llvm-16 from the Debian repo has sm_89.) So this should fix your problem:

$ sudo apt install llvm-16 clang-16

Alternatively, the most recent versions are available from the llvm.org repo. You then have to do a clean rebuild of PoCL with the option -DWITH_LLVM_CONFIG=/usr/bin/llvm-config-16 (or your actual llvm-config path) in order to bind PoCL to the correct LLVM version.
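
The check described above can be scripted. The following is a minimal sketch, assuming the output format shown in this comment; the helper names parse_supported_cpus and clang_supports are made up for illustration, and the clang binary name is a parameter because versioned installs such as clang-16 may not be the default clang.

```python
import re
import subprocess


def parse_supported_cpus(text):
    """Extract the sm_XX names from `clang -print-supported-cpus` output."""
    return set(re.findall(r"^\s*(sm_\d+\w*)\s*$", text, flags=re.MULTILINE))


def clang_supports(target_cpu, clang="clang"):
    """Run clang and report whether it lists target_cpu for the nvptx target."""
    result = subprocess.run(
        [clang, "--target=nvptx", "-print-supported-cpus"],
        capture_output=True, text=True, check=False,
    )
    # Assumption: the list may land on stdout or stderr depending on the
    # clang version, so both streams are searched.
    return target_cpu in parse_supported_cpus(result.stdout + result.stderr)


# Shortened sample in the same layout as the clang 14 output above.
SAMPLE = """\
Available CPUs for this target:

        sm_80
        sm_86

Use -mcpu or -mtune to specify the target's processor.
"""

print("sm_89" in parse_supported_cpus(SAMPLE))  # False
```

Calling clang_supports("sm_89", clang="clang-16") after installing the newer packages would confirm whether that toolchain knows the RTX 4090 architecture before PoCL is rebuilt.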

Tongzhao9417 commented 8 months ago

Sorry for the late reply. I followed your steps and it worked for me.

Cheers!

olympichek commented 6 months ago

I compiled PoCL as described above and now clinfo works. But when I try to run an OpenCL application, I get an error:

 Build option -cl-std specified OpenCL C version 2.0,but device NVIDIA GeForce GTX 1080 Ti doesn't support that OpenCL C version

Does PoCL not support OpenCL 2.0?
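
The clinfo dump for the RTX 4090 earlier in this thread lists OpenCL C 1.0, 1.1, 1.2 and 3.0 under "Device OpenCL C all versions", with no 2.0 entry. The sketch below only illustrates that kind of check and is not PoCL's actual logic; it assumes the GTX 1080 Ti reports the same list through the same backend, which the thread does not show, and the function names are made up.

```python
import re


def parse_cl_std(build_options):
    """Return the (major, minor) requested via -cl-std=CLx.y, or None."""
    match = re.search(r"-cl-std=CL(\d+)\.(\d+)", build_options)
    return (int(match.group(1)), int(match.group(2))) if match else None


def device_accepts(build_options, supported_versions):
    """True if no -cl-std is given or the requested version is listed by the device."""
    requested = parse_cl_std(build_options)
    return requested is None or requested in supported_versions


# "Device OpenCL C all versions" from the RTX 4090 clinfo dump in this thread.
POCL_CUDA_VERSIONS = {(1, 0), (1, 1), (1, 2), (3, 0)}

print(device_accepts("-cl-std=CL2.0", POCL_CUDA_VERSIONS))  # False
print(device_accepts("-cl-std=CL1.2", POCL_CUDA_VERSIONS))  # True
```

If the application allows it, building its kernels with -cl-std=CL1.2 or with no -cl-std option may avoid the error, provided the kernels do not rely on OpenCL C 2.0 features.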

monkeyden commented 5 months ago

Absolute king. pocl-opencl-icd was the missing link for me. Ty, sir.

CLRafaelR commented 3 months ago

@Bossach

I really appreciate your brilliant solution!

I want to ask one question of you and of everyone who reacted to Bossach's comment and/or tried the solution (@husmen @joaomamede @Tongzhao9417 @olympichek @htao7 @kirse @kon332k): have you tried the PoCL verification tests for NVIDIA GPUs (../tools/scripts/run_cuda_tests), as documented in the "NVIDIA GPU support" page of the Portable Computing Language (PoCL) 6.0 documentation, and did all of the tests pass?

I basically followed Bossach's steps to install PoCL, and clinfo and clinfo -l now work like a charm. However, four tests failed when I ran the PoCL verification tests as shown below:

cd ~/pocl-6.0/build # move to my `build` directory
../tools/scripts/run_cuda_tests

# For rerunning the failed tests:
../tools/scripts/run_cuda_tests --rerun-failed --output-on-failure

Failed tests were:

The following tests FAILED:
          4 - kernel/test_as_type_loopvec (Failed)
        166 - regression/clSetKernelArg_overwriting_the_previous_kernel's_args_loopvec (Failed)
        208 - runtime/test_device_address (SEGFAULT)
        209 - runtime/test_svm (SEGFAULT)
Errors while running CTest

If anybody has run the verification tests, could you please tell us whether all of them pass, or which ones fail? It would also be very helpful if you could describe your runtime environment, settings, and PoCL build configuration.

I opened an issue on PoCL's repo: "../tools/scripts/run_cuda_tests Fails on WSL2 · Issue #1533 · pocl/pocl". Comments there are also appreciated; they would help the PoCL developers learn whether the test results on WSL2 are reproducible and improve PoCL.
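
To make results from different machines easier to compare, the CTest failure summary can be turned into structured data. This is a minimal sketch that assumes the summary layout shown in the comment above; parse_failed_tests is a made-up helper name.

```python
import re

# One summary line looks like: "  4 - kernel/test_as_type_loopvec (Failed)"
FAILED_LINE = re.compile(r"^\s*(\d+) - (.+) \(([^()]+)\)\s*$")


def parse_failed_tests(ctest_output):
    """Return [(test_number, test_name, status), ...] from a CTest failure summary."""
    failures = []
    in_summary = False
    for line in ctest_output.splitlines():
        if line.startswith("The following tests FAILED"):
            in_summary = True
            continue
        if in_summary:
            match = FAILED_LINE.match(line)
            if not match:
                break  # first non-matching line ends the summary
            failures.append((int(match.group(1)), match.group(2), match.group(3)))
    return failures


SUMMARY = """\
The following tests FAILED:
          4 - kernel/test_as_type_loopvec (Failed)
        166 - regression/clSetKernelArg_overwriting_the_previous_kernel's_args_loopvec (Failed)
        208 - runtime/test_device_address (SEGFAULT)
        209 - runtime/test_svm (SEGFAULT)
Errors while running CTest
"""

for number, name, status in parse_failed_tests(SUMMARY):
    print(number, status, name)
```

Comparing the resulting sets of test names between two setups shows which failures are shared and which are specific to one environment.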

Shazway commented 1 month ago

Hi, I saw the PoCL solution and jumped at the chance to try fixing this issue, but it didn't work for me. The cmake --build -j16 step worked fine, but when I got to exporting the OCL_ICD_VENDORS variable, there was no ocl-vendors folder.

Result of ls in build:
CMakeCache.txt CTestCustom.cmake cl_offline_compiler.sh config.h kernellib_hash.h pocl_opencl.h CMakeFiles CTestTestfile.cmake cmake_install.cmake config2.h lib pocl_version.h CPackConfig.cmake Makefile compile_commands.json examples pocl.pc poclu CPackSourceConfig.cmake bin compile_test_. include pocl_build_timestamp.h tests

Result of clinfo: Number of platforms 2. The output is too long to paste here, but it shows two Intel graphics platforms instead of one Intel and one NVIDIA.

Result of nvidia-smi:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.51.01              Driver Version: 565.90         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3050 ...    On  |   00000000:01:00.0 Off |                  N/A |
| N/A   73C    P0             52W /   75W |    1453MiB /   4096MiB |     59%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Any clues why?
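
One way to narrow this down is to search the build tree for any generated .icd file, since OCL_ICD_VENDORS needs to point at wherever one exists. The sketch below is only a search helper with made-up function names; it makes no assumption about where PoCL places the file.

```python
from pathlib import Path


def find_icd_files(build_dir):
    """Return every *.icd file below build_dir, sorted by path."""
    return sorted(Path(build_dir).rglob("*.icd"))


def candidate_vendor_dirs(build_dir):
    """Directories holding at least one .icd file (candidates for OCL_ICD_VENDORS)."""
    return sorted({path.parent for path in find_icd_files(build_dir)})


# Run from inside the build directory; prints nothing if no .icd file exists.
for directory in candidate_vendor_dirs("."):
    print(directory)
```

If nothing is printed, one possibility is that the build did not generate an ICD file at all, in which case the CMake configuration would be the next thing to check.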