preda / gpuowl

GPU Mersenne primality test.
GNU General Public License v3.0
127 stars 35 forks source link

Support rusticl runtime #276

Open chocolate42 opened 7 months ago

chocolate42 commented 7 months ago

I'm trying to run gpuowl on rusticl instead of rocm, it fails with this:

[f40@p95 gpuowl]$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp64 ./build-debug/gpuowl -prp 77936867
20240218 14:10:06  GpuOwl VERSION v7.5-2-gba227ce
20240218 14:10:06  GpuOwl VERSION v7.5-2-gba227ce
20240218 14:10:06  config: -prp 77936867 
20240218 14:10:06  device 0, unique id ''
20240218 14:10:06 77936867 FFT: 4M 1K:8:256 (18.58 bpw)
20240218 14:10:06 77936867 OpenCL args "-DEXP=77936867u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=8u -DAMDGPU=1 -DMM_CHAIN=1u -DMM2_CHAIN=2u -DWEIGHT_STEP=0.33644726404543274 -DIWEIGHT_STEP=-0.25174750481886216 -DIWEIGHTS={0,-0.44011820345520131,-0.37306474779553728,-0.29798072935699788,-0.21390437908665341,-0.11975874301407295,-0.014337887291734644,-0.44814572555075455,} -DFWEIGHTS={0,0.78609128957452257,0.5950610473469905,0.42446232150303748,0.2721098723818392,0.1360521812214803,0.014546452690911484,0.81207258201996746,}  -cl-std=CL2.0 -cl-finite-math-only "
20240218 14:10:06 77936867 ASM compilation failed, retrying compilation using NO_ASM
20240218 14:10:06 77936867 OpenCL compilation error -11 (args -DEXP=77936867u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=8u -DAMDGPU=1 -DMM_CHAIN=1u -DMM2_CHAIN=2u -DWEIGHT_STEP=0.33644726404543274 -DIWEIGHT_STEP=-0.25174750481886216 -DIWEIGHTS={0,-0.44011820345520131,-0.37306474779553728,-0.29798072935699788,-0.21390437908665341,-0.11975874301407295,-0.014337887291734644,-0.44814572555075455,} -DFWEIGHTS={0,0.78609128957452257,0.5950610473469905,0.42446232150303748,0.2721098723818392,0.1360521812214803,0.014546452690911484,0.81207258201996746,}  -cl-std=CL2.0 -cl-finite-math-only  -DNO_ASM=1)
20240218 14:10:06 77936867 input.cl:44:26: warning: unsupported OpenCL extension 'cl_khr_int64_base_atomics' - ignoring [-Wignored-pragmas]
input.cl:1494:29: error: call to 'atom_add' is ambiguous
Error executing LLVM compilation action.

20240218 14:10:06  Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at src/clwrap.cpp:245 build
20240218 14:10:06  Bye

gpuowl.cl says this:

// 64-bit atomics used in kernel sum64
// If 64-bit atomics aren't available, sum64() can be implemented with 32-bit
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

and the sum64 function looks simple enough:

KERNEL(256) sum64(global ulong* out, u32 sizeBytes, global ulong* in) {
  if (get_global_id(0) == 0) { out[0] = 0; }

  ulong sum = 0;
  for (i32 p = get_global_id(0); p < sizeBytes / sizeof(u64); p += get_global_size(0)) {
    sum += in[p];
  }
  sum = work_group_reduce_add(sum);
  if (get_local_id(0) == 0) { atom_add(&out[0], sum); }
}

If implementing sum64 with 32 bit atomics and sorting the atom_add ambiguity (which appears to be related) is all it takes to get gpuowl working on rusticl then cool. Rusticl is cross-platform and built into mesa (should be in OOTB for Ubuntu 24.04) and should be the way forwards for opencl on Linux (also mfakto is quicker under rusticl, for my hardware). I have no idea if atomic 64 bit int will ever be supported with rusticl, or if there are other hidden or vendor issues to be uncovered (can only test with RDNA3 iGPU 780M).

chocolate42 commented 6 months ago

I'm having a play trying to get this to work, but as I don't know OpenCL or why sum64 is used this is a crapshoot. My initial read was that the extension was needed for 64 bit ints to work at all but it appears to just be needed for atom_add. And atom_add is just used to return the sum value to the host, seemingly atomic in this context ensures host and device are in sync. So I tried replacing with the 32 bit version which should be present always?:

KERNEL(256) sum64(global ulong* out, u32 sizeBytes, global ulong* in) {
  if (get_global_id(0) == 0) { out[0] = 0; }

  ulong sum = 0;
  for (i32 p = get_global_id(0); p < sizeBytes / sizeof(u64); p += get_global_size(0)) {
    sum += in[p];
  }
  sum = work_group_reduce_add(sum);
  if (get_local_id(0) == 0) {
    uint hi = (sum>>32)&0xFFFFFFFF;
    uint lo = sum&0xFFFFFFFF;
    global uint* recast = (global uint*) (&out[0]);
    atomic_add(recast, lo);
    atomic_add(recast+1, hi);
  }
}

Which ran, directly into a rusticl panic. Probably this is some deep syntax crime anyway, at least splitting into multiple atomic ops seems bad form:

[f40@p95 gpuowl]$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp64 RUST_BACKTRACE=full ./build-debug/gpuowl -prp 77936867
20240219 11:33:26  GpuOwl VERSION ba227ce-dirty
20240219 11:33:26  GpuOwl VERSION ba227ce-dirty
20240219 11:33:26  config: -prp 77936867 
20240219 11:33:26  device 0, unique id ''
20240219 11:33:26 77936867 FFT: 4M 1K:8:256 (18.58 bpw)
20240219 11:33:26 77936867 OpenCL args "-DEXP=77936867u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=8u -DAMDGPU=1 -DMM_CHAIN=1u -DMM2_CHAIN=2u -DWEIGHT_STEP=0.33644726404543274 -DIWEIGHT_STEP=-0.25174750481886216 -DIWEIGHTS={0,-0.44011820345520131,-0.37306474779553728,-0.29798072935699788,-0.21390437908665341,-0.11975874301407295,-0.014337887291734644,-0.44814572555075455,} -DFWEIGHTS={0,0.78609128957452257,0.5950610473469905,0.42446232150303748,0.2721098723818392,0.1360521812214803,0.014546452690911484,0.81207258201996746,}  -cl-std=CL2.0 -cl-finite-math-only "
20240219 11:33:26 77936867 ASM compilation failed, retrying compilation using NO_ASM
thread '<unnamed>' panicked at ../mesa-23.3.5/src/gallium/frontends/rusticl/core/program.rs:260:13:
called `Option::unwrap()` on a `None` value
stack backtrace:
   0:     0x7f54dc3cf69c - <unknown>
   1:     0x7f54dc3f0230 - <unknown>
   2:     0x7f54dc3cd0dd - <unknown>
   3:     0x7f54dc3cf485 - <unknown>
   4:     0x7f54dc3d0a93 - <unknown>
   5:     0x7f54dc3d07ca - <unknown>
   6:     0x7f54dc3d0fc5 - <unknown>
   7:     0x7f54dc3d0e79 - <unknown>
   8:     0x7f54dc3cfb56 - <unknown>
   9:     0x7f54dc3d0c32 - <unknown>
  10:     0x7f54db46beb5 - <unknown>
  11:     0x7f54db46bf53 - <unknown>
  12:     0x7f54dc2d2ab0 - <unknown>
  13:     0x7f54dc35590b - <unknown>
  14:     0x7f54dc2f0e88 - <unknown>
  15:     0x7f54dc354cfe - <unknown>
  16:     0x7f54dc358f97 - <unknown>
  17:     0x7f54dc30d8f3 - <unknown>
  18:     0x7f54dc30d326 - <unknown>
  19:     0x7f54dd4d1bf3 - clBuildProgram
  20:     0x557c74db6b34 - build
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/clwrap.cpp:227:25
  21:     0x557c74db7e38 - _Z7compileP11_cl_contextP13_cl_device_idRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESA_RKSt6vectorIS8_SaIS8_EE
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/clwrap.cpp:275:10
  22:     0x557c74dac717 - compile
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/Gpu.cpp:317:22
  23:     0x557c74dacb8d - _ZN3GpuC2ERK4ArgsjjjjjjP13_cl_device_idbbO7Weights
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/Gpu.cpp:350:3
  24:     0x557c74daedf0 - _ZN3GpuC2ERK4ArgsjjjjjjP13_cl_device_idbb
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/Gpu.cpp:330:107
  25:     0x557c74db5ada - _ZSt11make_uniqueI3GpuJRK4ArgsRjS4_jS4_S4_S4_P13_cl_device_idRbS7_EENSt8__detail9_MakeUniqIT_E15__single_objectEDpOT0_
                               at /usr/include/c++/13.2.1/bits/unique_ptr.h:1070:30
  26:     0x557c74daf113 - _ZN3Gpu4makeEjRK4Args
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/Gpu.cpp:595:76
  27:     0x557c74dbaaec - _ZN4Task7executeERK4Args
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/Task.cpp:174:38
  28:     0x557c74da3ce0 - main
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/main.cpp:56:51
  29:     0x7f54dcf76cd0 - <unknown>
  30:     0x7f54dcf76d8a - __libc_start_main
  31:     0x557c74d985a5 - _start
  32:                0x0 - <unknown>
20240219 11:33:26  Unexpected exception
fatal runtime error: Rust panics must be rethrown
Aborted (core dumped)

But why do we need atomic at all right? if (get_local_id(0) == 0) { out[0] = sum; }. Well it still panics. So how about just ignoring the result of sum64. It's only used in vector<u32> Gpu::readAndCompress(ConstBuffer<int>& buf) from Gpu.cpp, and it's immediately checked against the CPU doing the sum. So sod it by removing the sum64 call and faking the result in readAndCompress() with:

    expectedSum = sum;
    if (sum != expectedSum || (allZero && nRetry == 0)) {

And just to be safe delete the return value in sum64 in case syntax messes anything up. That should completely bypass sum64 execution, unless it's being called somehow from other kernels not by a name that I can grep. However still a panic:

[f40@p95 gpuowl]$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp64 RUST_BACKTRACE=full ./build-debug/gpuowl -prp 77936867 -use NO_ASM
20240219 11:44:35  GpuOwl VERSION ba227ce-dirty
20240219 11:44:35  GpuOwl VERSION ba227ce-dirty
20240219 11:44:35  config: -prp 77936867 -use NO_ASM 
20240219 11:44:35  device 0, unique id ''
20240219 11:44:35 77936867 FFT: 4M 1K:8:256 (18.58 bpw)
20240219 11:44:35 77936867 OpenCL args "-DEXP=77936867u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=8u -DAMDGPU=1 -DMM_CHAIN=1u -DMM2_CHAIN=2u -DWEIGHT_STEP=0.33644726404543274 -DIWEIGHT_STEP=-0.25174750481886216 -DIWEIGHTS={0,-0.44011820345520131,-0.37306474779553728,-0.29798072935699788,-0.21390437908665341,-0.11975874301407295,-0.014337887291734644,-0.44814572555075455,} -DFWEIGHTS={0,0.78609128957452257,0.5950610473469905,0.42446232150303748,0.2721098723818392,0.1360521812214803,0.014546452690911484,0.81207258201996746,} -DNO_ASM=1  -cl-std=CL2.0 -cl-finite-math-only "
thread '<unnamed>' panicked at ../mesa-23.3.5/src/gallium/frontends/rusticl/core/program.rs:260:13:
called `Option::unwrap()` on a `None` value
stack backtrace:
   0:     0x7fc80997a69c - <unknown>
   1:     0x7fc80999b230 - <unknown>
   2:     0x7fc8099780dd - <unknown>
   3:     0x7fc80997a485 - <unknown>
   4:     0x7fc80997ba93 - <unknown>
   5:     0x7fc80997b7ca - <unknown>
   6:     0x7fc80997bfc5 - <unknown>
   7:     0x7fc80997be79 - <unknown>
   8:     0x7fc80997ab56 - <unknown>
   9:     0x7fc80997bc32 - <unknown>
  10:     0x7fc808a16eb5 - <unknown>
  11:     0x7fc808a16f53 - <unknown>
  12:     0x7fc80987dab0 - <unknown>
  13:     0x7fc80990090b - <unknown>
  14:     0x7fc80989be88 - <unknown>
  15:     0x7fc8098ffcfe - <unknown>
  16:     0x7fc809903f97 - <unknown>
  17:     0x7fc8098b88f3 - <unknown>
  18:     0x7fc8098b8326 - <unknown>
  19:     0x7fc80aa7cbf3 - clBuildProgram
  20:     0x55972f405a5f - build
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/clwrap.cpp:222:27
  21:     0x55972f406e2a - _Z7compileP11_cl_contextP13_cl_device_idRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESA_RKSt6vectorIS8_SaIS8_EE
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/clwrap.cpp:275:10
  22:     0x55972f3fb709 - compile
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/Gpu.cpp:317:22
  23:     0x55972f3fbb7f - _ZN3GpuC2ERK4ArgsjjjjjjP13_cl_device_idbbO7Weights
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/Gpu.cpp:350:3
  24:     0x55972f3fdde2 - _ZN3GpuC2ERK4ArgsjjjjjjP13_cl_device_idbb
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/Gpu.cpp:330:107
  25:     0x55972f404acc - _ZSt11make_uniqueI3GpuJRK4ArgsRjS4_jS4_S4_S4_P13_cl_device_idRbS7_EENSt8__detail9_MakeUniqIT_E15__single_objectEDpOT0_
                               at /usr/include/c++/13.2.1/bits/unique_ptr.h:1070:30
  26:     0x55972f3fe105 - _ZN3Gpu4makeEjRK4Args
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/Gpu.cpp:595:76
  27:     0x55972f409ade - _ZN4Task7executeERK4Args
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/Task.cpp:174:38
  28:     0x55972f3f2ce0 - main
                               at /home/f40/distrobox/rusticl-test/gpuowl/src/main.cpp:56:51
  29:     0x7fc80a521cd0 - <unknown>
  30:     0x7fc80a521d8a - __libc_start_main
  31:     0x55972f3e75a5 - _start
  32:                0x0 - <unknown>
20240219 11:44:35  Unexpected exception
fatal runtime error: Rust panics must be rethrown
Aborted (core dumped)

That leads me to think that the runtime panics are entirely unrelated to sum64 hackery? They've only been exposed now because the hackery let the kernel compile.

Sorry for being verbose, seemed sensible to let you know how I'm stabbing in the dark as there's a good chance I'm misunderstanding how sum64 is used and crimes are being committed.

preda commented 6 months ago

Thanks, attempting to compile with rusticl is an useful exercise. And it's a good approach to simplify/remove the initial trouble bits (sum64) like you did just to get it to compile.

It seems what we hit now may be a rusticl bug. We don't have the rusticl stack-trace symbols, but there is a line# and we know it tries to unwrap a None. ./mesa-23.3.5/src/gallium/frontends/rusticl/core/program.rs:260:13: called Option::unwrap() on a None value

chocolate42 commented 6 months ago

Probably there is a bug in the fp64 implementation, it's still experimental after all. I'll look into it ability permitting and add an issue to rusticl's tracker if appropriate. Probably need to learn how to compile and use mesa first, if they reply I should be in a position to respond and test.

Actually the more I look at it the more I think it was premature to test fp64. It's available behind a flag but it seems that's for implementers to be able to test it as they implement, it's described as "in-progress" on their tracker: https://mesamatrix.net/

Have been peeking at rusticl's code to see how hard it would be to implement the extension, but I'm struggling to even find atomic_add right now. I'm thinking that Core CL functionality might not be done via rust/rusticl but directly in spirv or something (unclear). All I can find in the rust codebase is exposing the 32 bit atomic support not an implementation.

int64 atomics is one of the few extensions that clover supports that rusticl doesn't. Extensions have been implemented on rusticl on a priority basis, maybe very few programs make use of it and it's just low priority. But maybe we'll get lucky and it'll have higher priority just because they want to fully succeed clover.

preda commented 6 months ago

Which rusticl version are you using? if under Ubuntu, could be obtained as the version of the package mesa-opencl-icl. It might also be obtained with "gpuowl -h" or clinfo as the driver version.

I'm using rusticl 23.2.1-1ubuntu3.1~22.04.2 and it does not produce the nice error messages you're seeing :) (it fails, but I can't see where)

chocolate42 commented 6 months ago

Mesa 23.3.5 the latest arch packages. But I'm not even on arch host this is through a temp arch environment with distrobox which you could do too. It's as simple as distrobox create --name whatever --image archlinux, distrobox enter whatever, then you're in a shell of a base arch environment that you update install rusticl etc as normal. Much less painful than installing all these different often conflicting toolchains direct to host or in a vm.

chocolate42 commented 6 months ago

This is with the latest sources:

📦[f40@f39 f39]$ RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp64 meson devenv -C ../mesabuild ../f39/gpuowl/build-debug/gpuowl -prp 77936867
20240222 12:49:34  GpuOwl VERSION v7.5-6-gd522826
20240222 12:49:34  GpuOwl VERSION v7.5-6-gd522826
20240222 12:49:34  config: -prp 77936867 
20240222 12:49:34  device 0, unique id '', driver '24.1.0-devel'
20240222 12:49:34 77936867 FFT: 4M 1K:8:256 (18.58 bpw)
20240222 12:49:34 77936867 OpenCL args "-DEXP=77936867u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=8u -DAMDGPU=1 -DMM_CHAIN=1u -DMM2_CHAIN=2u -DWEIGHT_STEP=0.33644726404543274 -DIWEIGHT_STEP=-0.25174750481886216 -DIWEIGHTS={0,-0.44011820345520131,-0.37306474779553728,-0.29798072935699788,-0.21390437908665341,-0.11975874301407295,-0.014337887291734644,-0.44814572555075455,} -DFWEIGHTS={0,0.78609128957452257,0.5950610473469905,0.42446232150303748,0.2721098723818392,0.1360521812214803,0.014546452690911484,0.81207258201996746,}  -cl-std=CL2.0 -cl-finite-math-only "
20240222 12:49:34 77936867 ASM compilation failed, retrying compilation using NO_ASM
SPIR-V parsing FAILED:
    In file ../mesa-main-2024-02-22/src/compiler/spirv/vtn_variables.c:2341
    Initializer for CrossWorkgroup variable 3 not yet supported in Mesa.
    164196 bytes into the SPIR-V binary

which corresponds to this in vtn_variables.c:

      case SpvStorageClassCrossWorkgroup:
         vtn_assert(b->options->environment == NIR_SPIRV_OPENCL);
         vtn_fail("Initializer for CrossWorkgroup variable %u "
                  "not yet supported in Mesa.",
                  vtn_id_for_value(b, val));
         break;

Possibly I didn't use correct compiler settings to build mesa, possibly this is just where rusticl is at.

Here's the gist of how I compiled mesa

# mesa source in mesa-main-2024-02-22
mkdir mesabuild
cd mesabuild
# specific to amd
meson setup ../mesa-main-2024-02-22 --libdir lib64 -Dgallium-rusticl=true -Dllvm=enabled -Drust_std=2021 -Dgallium-drivers=radeonsi,swrast
meson compile -C ./
# mesa built to mesabuild

# to use mesabuild
RUSTICL_ENABLE=radeonsi RUSTICL_FEATURES=fp64 meson devenv -C wherever/mesabuild relative/to/mesabuild/dir/gpuowl/build-debug/gpuowl -prp 77936867

To do this yourself you'll need recent toolchains, at least meson 1.3.1 so either compile youself or take the easy route and build in a Fedora 39 environment which has a recent enough meson. If in a F39 environment you can also get most of the build dependencies easily by installing the builddep plugin to dnf sudo dnf install "dnf-command(builddep)" followed by sudo dnf builddep mesa which installs all dependencies that were needed to build the mesa in the fedora repo.

chocolate42 commented 6 months ago

If I'm interpreting things correctly, rusticl doesn't yet support the opencl 2.0 extension that adds things like work_group_reduce_add()

https://registry.khronos.org/OpenCL/sdk/3.0/docs/man/html/workGroupFunctions.html

Corresponding to this feature on the tracker:

https://mesamatrix.net/#RusticlOpenCL2.0_Extension__Workgroup_Collective_Functions

preda commented 4 months ago

This is in standby from my POV, maybe in 6 months we'll have another go.

chocolate42 commented 1 month ago

Tried this again with latest mesa. Now meson 1.4.0+ is required to build mesa, which I got by upgrading to Fedora 40 (alternatively build meson from source). Also needed to add PyYAML (pip install PyYAML) and cbindgen (dnf install cbindgen) to the environment I used before to get mesa to successfully compile.

20240721 11:50:22  GpuOwl VERSION v7.5-8-gb400b88
20240721 11:50:22  GpuOwl VERSION v7.5-8-gb400b88
20240721 11:50:22  config: -prp 77936867 
20240721 11:50:23  device 0, unique id '', driver '24.3.0-devel'
20240721 11:50:23 77936867 FFT: 4M 1K:8:256 (18.58 bpw)
20240721 11:50:23 77936867 OpenCL args "-DEXP=77936867u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=8u -DAMDGPU=1 -DMM_CHAIN=1u -DMM2_CHAIN=2u -DWEIGHT_STEP=0.33644726404543274 -DIWEIGHT_STEP=-0.25174750481886216 -DIWEIGHTS={0,-0.44011820345520131,-0.37306474779553728,-0.29798072935699788,-0.21390437908665341,-0.11975874301407295,-0.014337887291734644,-0.44814572555075455,} -DFWEIGHTS={0,0.78609128957452257,0.5950610473469905,0.42446232150303748,0.2721098723818392,0.1360521812214803,0.014546452690911484,0.81207258201996746,}  -cl-std=CL2.0 -cl-finite-math-only "
20240721 11:50:23 77936867 ASM compilation failed, retrying compilation using NO_ASM
SPIR-V parsing FAILED:
    In file ../mesa-main-2024-07-21/mesa-main/src/compiler/spirv/vtn_variables.c:2354
    Initializer for CrossWorkgroup variable 3 not yet supported in Mesa.
    164196 bytes into the SPIR-V binary

Same blocker as before. Will try again in a few months.