
out-of-bounds access in oneAPI reverse_time_migration #43

Closed: frasercrmck closed this issue 8 months ago

frasercrmck commented 10 months ago

I've been running the oneAPI reverse_time_migration workload with the GPU algorithm and have seen an out-of-bounds access in the SecondOrderComputationKernel, specifically this access into current:

  front[HALF_LENGTH_] = current[idx + v_end];

I'm running:

LD_LIBRARY_PATH=~/work/intel-llvm/build/install/lib ONEAPI_DEVICE_SELECTOR=opencl:cpu ./build/Engine -p workloads/bp_model/computation_parameters.json

I've tested with three OpenCL devices:

Used parameters :
        order of stencil used : 16
        boundary length used : 20
        source frequency : 20
        dt relaxation coefficient : 0.9
        block factor in x-direction : 64
        block factor in z-direction : 8
        block factor in y-direction : 1
        Using GPU Device - Slice z + STATIC x Hybrid

Device Info:
        Using SYCL device         : Intel(R) FPGA Emulation Device (Driver version 2023.16.11.0.22_160000)
        Platform                  : Intel(R) FPGA Emulation Platform for OpenCL(TM)
        Vendor                    : Intel(R) Corporation
        Max compute units         : 24
        Using OpenCL library      : /opt/intel/oneapi/redist/lib/libOpenCL.so.1.2
        Using OneAPI SYCL library : /home/fraser/work/intel-llvm/build/install/lib/libsycl.so.7.2.0-8
        Using Level Zero library  : WARNING! Level zero library not found! L0 backend may not be available!
        Window mode : enabled
                Left window : 1300
                Right window : 600
                NO WINDOW IN Y-axis
                NO WINDOW IN Z-axis

 Running on Intel(R) FPGA Emulation Device
 The Device Max Work Group Size is : 67108864
 The Device Max EUCount is : 24

and

Device Info:
        Using SYCL device         : 12th Gen Intel(R) Core(TM) i9-12900K (Driver version 2023.16.11.0.22_160000)
        Platform                  : Intel(R) OpenCL
        Vendor                    : Intel(R) Corporation
        Max compute units         : 24
        Using OpenCL library      : /opt/intel/oneapi/redist/lib/libOpenCL.so.1.2
        Using OneAPI SYCL library : /home/fraser/work/intel-llvm/build/install/lib/libsycl.so.7.2.0-8
        Using Level Zero library  : WARNING! Level zero library not found! L0 backend may not be available!
        Window mode : enabled
                Left window : 1300
                Right window : 600
                NO WINDOW IN Y-axis
                NO WINDOW IN Z-axis

 Running on 12th Gen Intel(R) Core(TM) i9-12900K
 The Device Max Work Group Size is : 8192
 The Device Max EUCount is : 24

And also with our own OpenCL implementation.

All three implementations show the same results, and the same values for all of the relevant variables.

The benchmark itself prints:

GridBox Report
==============================
Actual Grid Size:
- nx    : 3253
- ny    : 1
- nz    : 622
- nt    : 0

Logical Grid Size:
- nx    : 3253
- ny    : 1
- nz    : 622

Actual Window Size:
- wnx   : 1232
- wny   : 1
- wnz   : 624

Logical Window Size:
- wnx   : 1182
- wny   : 1
- wnz   : 622

Computation Grid Size:
- x elements    : 1216
- y elements    : 1
- z elements    : 608

Cell Dimensions:
- dx    : 21.0938
- dy    : 1
- dz    : 21.0938
- dt    : 0.0020567

Wave Fields:
- Count : 3
- Names :
        1. Wave Pressure Curr Z
        2. Wave Pressure Prev Z
        3. Wave Pressure Prev Z

Parameters:
- Count : 1
- Names :
        1. Parameter Velocity Z

To condense the problematic access somewhat, in the kernel we have:

  cgh.parallel_for(workgroup_range, [=](nd_item<2> it) {
    int idx = it.get_global_id(1) + hl +
            (it.get_global_id(0) * idx_range + hl) * nx;

    for (int i = 0; i < idx_range; i++) {
      idx += nx;
      front[HALF_LENGTH_] = current[idx + v_end];
    }
  });

All of the other variables involved in the computation (v_end, nx, idx_range, hl) are constants.

For all of the OpenCL implementations I've tried, I added debug printing; they all show the same values:

idx_range: 8
v_end: 9856
nx: 1232
hl: 8
workgroup size: 76 * 1216 = 92416

And the size of the current buffer is 3075072 bytes, so 768768 floats.

In practice there are many work-items which are out of bounds, but if we take the last work-item in the global range, we can see that:

base idx:
  idx = 1215 + 8 + (75 * 8 + 8) * 1232 => 750279
base access index (without any pre-increments to idx):
  idx + v_end = 750279 + 9856 => 760135
base access index (last iteration of the loop):
  idx + v_end + (idx_range * nx) = 760135 + (8 * 1232) => 769991

This is larger than the size of current and so is out of bounds.
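
To double-check that arithmetic, here is a tiny standalone program using the values from the debug output above (the buffer size is the reported 3075072 bytes, i.e. 768768 floats):

  #include <cstdio>

  int main() {
    // Constants observed via debug printing in the kernel.
    const int idx_range = 8, v_end = 9856, nx = 1232, hl = 8;
    const int buffer_floats = 3075072 / sizeof(float); // 768768

    // Base index for the last work-item: global id (75, 1215).
    int idx = 1215 + hl + (75 * idx_range + hl) * nx; // 750279

    // Index read on the final loop iteration, after idx_range
    // pre-increments of idx by nx.
    int last_access = idx + v_end + idx_range * nx; // 769991

    std::printf("last access: %d, buffer size: %d floats\n",
                last_access, buffer_floats); // 769991 > 768768: out of bounds
    return 0;
  }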

I don't know how this algorithm is supposed to work, so forgive my speculating, but is the increment idx += nx before the access correct? If idx were instead incremented after the access into current, there would be no out-of-bounds access.
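
To illustrate (this is only a sketch of my speculation, not necessarily the intended algorithm), the reordered loop would be:

  for (int i = 0; i < idx_range; i++) {
    front[HALF_LENGTH_] = current[idx + v_end];
    idx += nx; // advance to the next row only after the read
  }

With that ordering, the largest index the last work-item reads is 750279 + 9856 + 7 * 1232 = 768759, which stays within the 768768-float buffer.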

If I've missed any relevant details, please let me know. I'm happy to contribute a PR but I think someone better-versed in the algorithm might need to help steer me in the right direction. Thanks!

mgrabban commented 10 months ago

Hello @frasercrmck, thanks for filing this issue. We will take a look and get back to you.

jgtong commented 9 months ago

Greetings @frasercrmck

Looking at the issue, you are running the GPU version of the algorithm on the CPU, which is not tested. Can you try running the CPU version of the algorithm? You can do this by passing in computation_parameters_cpu.json as the input configuration.

frasercrmck commented 9 months ago

Hi @jgtong, thanks for getting in touch.

I was a bit confused about the CPU algorithm initially, because there are still comments in the source code (including the line below the one you linked) suggesting the CPU version isn't supported.

Unfortunately I'm unable to run the CPU version of the algorithm, because it crashes somewhere inside DPC++.

Adding padding to ensure alignment of each row
Engine: /home/fraser/work/intel-llvm/sycl/source/detail/scheduler/commands.cpp:2184: void sycl::_V1::detail::adjustNDRangePerKernel(sycl::_V1::detail::NDRDescT&, sycl::_V1::detail::pi::PiKernel, const sycl::_V1::detail::device_impl&): Assertion `NDR.LocalSize[0] == 0' failed.

I hadn't properly investigated this - I just saw that it crashed and moved on to the GPU version of the algorithm - but it seems like it's trying to enqueue a kernel with a global size of zero and a non-zero local size. DPC++ asserts on this here.

If you look here, the global_nd_range is the one causing problems.

These are the global/local values I'm seeing locally:

compute_nx = 0
compute_nz = 608
mpParameters->GetBlockX() = 1
mpParameters->GetBlockZ() = 8

I don't know if this is a bug in DPC++ or a bug in reverse_time_migration, sorry.
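
For reference, here is a minimal SYCL snippet with the same launch shape, using the values above (whether a zero global size should be allowed here is exactly the open question):

  #include <sycl/sycl.hpp>

  int main() {
    sycl::queue q;
    // Observed values: compute_nx == 0, compute_nz == 608,
    // BlockX == 1, BlockZ == 8.
    sycl::range<2> global(0, 608);
    sycl::range<2> local(1, 8);
    // DPC++ hits the adjustNDRangePerKernel assertion on this
    // combination of a zero global size and a non-zero local size.
    q.parallel_for(sycl::nd_range<2>(global, local),
                   [=](sycl::nd_item<2>) {}).wait();
    return 0;
  }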

jgtong commented 9 months ago

Greetings @frasercrmck

Thanks for your response. May I close this issue? If you are experiencing an internal compiler failure, then I think an issue should be filed on their GitHub.

Naghasan commented 9 months ago

Hi @jgtong

@frasercrmck and I are wondering if compute_nx = 0 is actually expected. SYCL-spec-wise, this means you are skipping the kernel invocation.

Is that the intention?

jgtong commented 9 months ago

@Naghasan and @frasercrmck ,

Thanks for letting me know of the issue. The variables compute_nx and compute_nz should not be zero. The compute settings in computation_parameters_cpu.json are incorrect.

In order to get the RTM-CPU algorithm to run, please modify computation_parameters.json, changing the algorithm and device values to cpu. I successfully ran this using the following compiler:

clang version 18.0.0git (https://github.com/intel/llvm.git 89327e0a96e9b480e807211be79749741eb78fc1)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/jaytong/sycl_workspace/llvm/build/bin

To run, launch the command:

ONEAPI_DEVICE_SELECTOR=opencl:cpu ./bin/Engine -p workloads/bp_model/computation_parameters.json

Please note that the compute parameters are not optimized for the RTM-CPU algorithm.

I hope this helps. I am going to update the README file accordingly.

frasercrmck commented 9 months ago

Thanks @jgtong. That does indeed fix the "zero problem".

However, now I'm seeing it run the same kernel with the following ND range, which throws an exception:

compute_nx = 3237 compute_nz = 606
mpParameters->GetBlockX() = 64 mpParameters->GetBlockZ() = 8

Exception caught: Global work size {606, 3237, 1} is not evenly divisible by local work-group size {8, 64, 1}. Non-uniform work-groups are not allowed by default. Underlying OpenCL 2.x implementation supports this feature and to enable it, build device program with -cl-std=CL2.0 -54 (PI_ERROR_INVALID_WORK_GROUP_SIZE)
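
For clarity, this is the divisibility requirement the runtime is complaining about (a trivial check using the values above):

  #include <cstdio>

  int main() {
    const int compute_nx = 3237, compute_nz = 606;
    const int block_x = 64, block_z = 8;
    // An nd_range requires each global dimension to be a multiple of
    // the corresponding local dimension; neither dimension satisfies
    // that here.
    std::printf("compute_nx %% block_x = %d\n", compute_nx % block_x); // 37
    std::printf("compute_nz %% block_z = %d\n", compute_nz % block_z); // 6
    return 0;
  }
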
jgtong commented 9 months ago

@frasercrmck ,

Your compute_{nx|nz} values are different from what I got in my run of the CPU algorithm. The values below, which are from my run, are divisible by the block sizes:

compute_nx: 1216
compute_nz: 608
blockX: 64
blockZ: 8

Are you using the same configuration file as I suggested earlier (computation_parameters.json)?

Lastly, we currently do not support the CPU versions of all of our workloads in Velocity-Bench. We regularly test our workloads on the GPU, so the functionality on the CPU may not be correct.

frasercrmck commented 9 months ago

Hi @jgtong ,

Yes, I do see those parameter values used twice, but the third time the kernel is called they change to the ones I mentioned earlier, at which point it crashes.

compute_nx = 1216 compute_nz = 608
mpParameters->GetBlockX() = 64 mpParameters->GetBlockZ() = 8
compute_nx = 1216 compute_nz = 608
mpParameters->GetBlockX() = 64 mpParameters->GetBlockZ() = 8
compute_nx = 3237 compute_nz = 606
mpParameters->GetBlockX() = 64 mpParameters->GetBlockZ() = 8

The first two times it enters the if (enable_window) case here, and on the third and fatal time it takes the other path:

        if (enable_window) {
          nx = apGridBox->GetActualWindowSize(X_AXIS);
          nz = apGridBox->GetActualWindowSize(Z_AXIS);
          compute_nx = apGridBox->GetComputationGridSize(X_AXIS);
          compute_nz = apGridBox->GetComputationGridSize(Z_AXIS);
        } else {
          nx = apGridBox->GetActualGridSize(X_AXIS);
          nz = apGridBox->GetActualGridSize(Z_AXIS);
          compute_nx = (nx - 2 * mpParameters->GetHalfLength());
          compute_nz = (nz - 2 * mpParameters->GetHalfLength());
        }

This is using our own OpenCL implementation (OCK), so perhaps our device's parameters are changing how the benchmark behaves. For what it's worth, the else branch matches the failing values: nx - 2 * half_length = 3253 - 16 = 3237 and nz - 2 * half_length = 622 - 16 = 606.

Just to confirm, I am using a modified bp_model/computation_parameters_cpu.json as you suggested:

% diff -u workloads/bp_model/computation_parameters.json workloads/bp_model/computation_parameters_cpu.json
--- workloads/bp_model/computation_parameters.json      2023-12-07 18:11:19.992932795 +0000
+++ workloads/bp_model/computation_parameters_cpu.json  2024-01-24 08:47:20.937355558 +0000
@@ -5,8 +5,8 @@
     "source-frequency": 20,
     "isotropic-radius": 5,
     "dt-relax": 0.9,
-    "algorithm": "gpu",
-    "device": "gpu",
+    "algorithm": "cpu",
+    "device": "cpu",
     "cache-blocking": {
       "block-x": 64,
       "block-z": 8,
jgtong commented 9 months ago

@frasercrmck ,

Thanks for the diagnosis. I understand the issue now: on my side the runtime did not bail out when compute_{nx,nz} were 3237 and 606, respectively. I ran this on a Xeon 8480+ server.

One humble suggestion is to modify WaveFieldsMemoryHandler.cpp to use different values of BlockX and BlockZ. For instance, whenever execution takes the else branch of the enable_window condition, you can set BlockX and BlockZ to 1. See below:

+    int iBlockX(0), iBlockZ(0);
     if (enable_window) {
       nx = apGridBox->GetActualWindowSize(X_AXIS);
       nz = apGridBox->GetActualWindowSize(Z_AXIS);
       compute_nx = apGridBox->GetComputationGridSize(X_AXIS);
       compute_nz = apGridBox->GetComputationGridSize(Z_AXIS);
+      iBlockX = mpParameters->GetBlockX();
+      iBlockZ = mpParameters->GetBlockZ();
     } else {
       nx = apGridBox->GetActualGridSize(X_AXIS);
       nz = apGridBox->GetActualGridSize(Z_AXIS);
       compute_nx = (nx - 2 * mpParameters->GetHalfLength());
       compute_nz = (nz - 2 * mpParameters->GetHalfLength());
+      iBlockX = 1; 
+      iBlockZ = 1; 
+
     }

And then at the kernel launch:

     OneAPIBackend::GetInstance()->GetDeviceQueue()->submit([&](handler &cgh) {
       auto global_range = range<2>(compute_nx, compute_nz);
-      auto local_range = range<2>(mpParameters->GetBlockX(), mpParameters->GetBlockZ());
+      auto local_range = range<2>(iBlockX, iBlockZ);
       auto starting_offset = id<2>(mpParameters->GetHalfLength(), mpParameters->GetHalfLength());
       auto global_nd_range = nd_range<2>(global_range, local_range, starting_offset);
       float *curr_base = ptr;

With a 1 x 1 local range, any global size is evenly divisible, so the non-uniform work-group error can no longer occur. I ran this modified version and was able to get a correct output.

I hope this helps

frasercrmck commented 8 months ago

Hi @jgtong. Thanks for the suggestion. I can indeed confirm that this change makes the benchmark pass for me. Thank you!

Will this change be committed to the repo?

jgtong commented 8 months ago

@frasercrmck , Awesome! This fix will be pushed in after some thorough testing.

jgtong commented 8 months ago

This has been fixed in PR: https://github.com/oneapi-src/Velocity-Bench/pull/49