frasercrmck closed this issue 8 months ago
Hello @frasercrmck, thanks for filing this issue. We will take a look and get back to you.
Greetings @frasercrmck
Looking at the issue, you are running the GPU version of the algorithm on the CPU, which is not tested. Can you try running the CPU version of the algorithm? You can do this by passing in computational_parameters_cpu.json as the input configuration.
Hi @jgtong, thanks for getting in touch.
I was a bit confused about the CPU algorithm initially, because there are still comments in the source code (including the line below the one you linked) suggesting the CPU version isn't supported.
Unfortunately I'm unable to run the CPU version of the algorithm, because it crashes somewhere inside DPC++.
Adding padding to ensure alignment of each row
Engine: /home/fraser/work/intel-llvm/sycl/source/detail/scheduler/commands.cpp:2184: void sycl::_V1::detail::adjustNDRangePerKernel(sycl::_V1::detail::NDRDescT&, sycl::_V1::detail::pi::PiKernel, const sycl::_V1::detail::device_impl&): Assertion `NDR.LocalSize[0] == 0' failed.
I hadn't properly investigated this - I just saw that it crashed and moved on to the GPU version of the algorithm - but it seems like it's trying to enqueue a kernel with a global size of zero and a non-zero local size. DPC++ asserts on this here.
If you look here, the global_nd_range is the one causing problems.
These are the global/local values I'm seeing locally:
compute_nx = 0
compute_nz = 608
mpParameters->GetBlockX() = 1
mpParameters->GetBlockZ() = 8
I don't know if this is a bug in DPC++ or a bug in reverse_time_migration, sorry.
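For reference, here is a minimal standalone sketch of the kind of launch that ends up being made. This is my own reproducer built from the values above, not code taken from the benchmark:
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q{sycl::cpu_selector_v};

  // Values observed above: compute_nx = 0, compute_nz = 608, block sizes 1 and 8.
  const size_t compute_nx = 0, compute_nz = 608;
  const size_t block_x = 1, block_z = 8;

  q.submit([&](sycl::handler &cgh) {
    auto global_range = sycl::range<2>(compute_nx, compute_nz);
    auto local_range = sycl::range<2>(block_x, block_z);
    // A zero global size paired with a non-zero local size: this is what
    // appears to trip the adjustNDRangePerKernel assertion in my DPC++ build.
    cgh.parallel_for(sycl::nd_range<2>(global_range, local_range),
                     [=](sycl::nd_item<2>) { /* no-op */ });
  });
  q.wait();
}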
Greetings @frasercrmck
Thanks for your response. May I close this issue? If you are experiencing an internal compiler failure, then I think an issue should be filed in their GitHub repository.
Hi @jgtong
@frasercrmck and I are wondering if compute_nx = 0 is actually expected. SYCL spec-wise, this means you are skipping the kernel invocation. Is that the intention?
@Naghasan and @frasercrmck,
Thanks for letting me know of the issue. The variables compute_nx and compute_nz should not be zero. The compute settings in computational_parameters_cpu.json are incorrect.
In order to get the RTM-CPU algorithm to run, please modify computation_parameters.json, changing the algorithm and device values to cpu. I successfully ran this on the system using the following compiler:
clang version 18.0.0git (https://github.com/intel/llvm.git 89327e0a96e9b480e807211be79749741eb78fc1)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/jaytong/sycl_workspace/llvm/build/bin
To run, launch the command:
ONEAPI_DEVICE_SELECTOR=opencl:cpu ./bin/Engine -p workloads/bp_model/computation_parameters.json
Please note that the compute parameters are not optimized for the RTM-CPU algorithm.
I hope this helps, and I will update the README file accordingly.
Thanks @jgtong. That does indeed fix the "zero problem".
However, now I'm seeing it run the same kernel with the following ND range, which throws an exception:
compute_nx = 3237 compute_nz = 606
mpParameters->GetBlockX() = 64 mpParameters->GetBlockZ() = 8
Exception caught: Global work size {606, 3237, 1} is not evenly divisible by local work-group size {8, 64, 1}. Non-uniform work-groups are not allowed by default. Underlying OpenCL 2.x implementation supports this feature and to enable it, build device program with -cl-std=CL2.0 -54 (PI_ERROR_INVALID_WORK_GROUP_SIZE)
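(For reference: 3237 = 64 × 50 + 37 and 606 = 8 × 75 + 6, so neither global dimension is a multiple of the corresponding local size, which is why the runtime rejects the work-group size.)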
@frasercrmck,
Your compute_{nx|nz} values are different from what I get in my run of the CPU algorithm. The values I got, shown below, are divisible by the block sizes:
compute_nx: 1216
compute_nz: 608
blockX: 64
blockZ: 8
Are you using the same configuration file as I suggested earlier (computation_parameters.json)?
Lastly, we do not currently support the CPU versions of all of our workloads in Velocity-Bench. We regularly test our workloads on the GPU, so the CPU functionality may not be correct.
Hi @jgtong,
Yes, I do see those parameter values used twice, but the third time the kernel is called they change to the ones I mentioned earlier, at which point it crashes.
compute_nx = 1216 compute_nz = 608
mpParameters->GetBlockX() = 64 mpParameters->GetBlockZ() = 8
compute_nx = 1216 compute_nz = 608
mpParameters->GetBlockX() = 64 mpParameters->GetBlockZ() = 8
compute_nx = 3237 compute_nz = 606
mpParameters->GetBlockX() = 64 mpParameters->GetBlockZ() = 8
The first two times it enters the if (enable_window) case here, and on the third and fatal time it takes the other path:
if (enable_window) {
nx = apGridBox->GetActualWindowSize(X_AXIS);
nz = apGridBox->GetActualWindowSize(Z_AXIS);
compute_nx = apGridBox->GetComputationGridSize(X_AXIS);
compute_nz = apGridBox->GetComputationGridSize(Z_AXIS);
} else {
nx = apGridBox->GetActualGridSize(X_AXIS);
nz = apGridBox->GetActualGridSize(Z_AXIS);
compute_nx = (nx - 2 * mpParameters->GetHalfLength());
compute_nz = (nz - 2 * mpParameters->GetHalfLength());
}
This is using our own OpenCL implementation (OCK), so perhaps our device's parameters are changing how the benchmark behaves.
Just to confirm, I am using a modified bp_model/computation_parameters_cpu.json as you suggested:
% diff -u workloads/bp_model/computation_parameters.json workloads/bp_model/computation_parameters_cpu.json
--- workloads/bp_model/computation_parameters.json 2023-12-07 18:11:19.992932795 +0000
+++ workloads/bp_model/computation_parameters_cpu.json 2024-01-24 08:47:20.937355558 +0000
@@ -5,8 +5,8 @@
"source-frequency": 20,
"isotropic-radius": 5,
"dt-relax": 0.9,
- "algorithm": "gpu",
- "device": "gpu",
+ "algorithm": "cpu",
+ "device": "cpu",
"cache-blocking": {
"block-x": 64,
"block-z": 8,
@frasercrmck,
Thanks for the diagnosis. I understand the issue: the runtime did not bail out when compute_{nx,nz} are 3237 and 606, respectively. I ran this on a Xeon 8480+ server.
One humble suggestion is to modify the WaveFieldsMemoryHandler.cpp file by setting different values of BlockX and BlockZ. For instance, whenever execution takes the else branch of the enable_window condition, you can set BlockX and BlockZ to 1 (any global range is evenly divisible by a work-group size of 1, so the non-uniform work-group restriction no longer applies). See below:
+ int iBlockX(0), iBlockZ(0);
if (enable_window) {
nx = apGridBox->GetActualWindowSize(X_AXIS);
nz = apGridBox->GetActualWindowSize(Z_AXIS);
compute_nx = apGridBox->GetComputationGridSize(X_AXIS);
compute_nz = apGridBox->GetComputationGridSize(Z_AXIS);
+ iBlockX = mpParameters->GetBlockX();
+ iBlockZ = mpParameters->GetBlockZ();
} else {
nx = apGridBox->GetActualGridSize(X_AXIS);
nz = apGridBox->GetActualGridSize(Z_AXIS);
compute_nx = (nx - 2 * mpParameters->GetHalfLength());
compute_nz = (nz - 2 * mpParameters->GetHalfLength());
+ iBlockX = 1;
+ iBlockZ = 1;
+
}
And then at the kernel launch:
OneAPIBackend::GetInstance()->GetDeviceQueue()->submit([&](handler &cgh) {
auto global_range = range<2>(compute_nx, compute_nz);
- auto local_range = range<2>(mpParameters->GetBlockX(), mpParameters->GetBlockZ());
+ auto local_range = range<2>(iBlockX, iBlockZ);
auto starting_offset = id<2>(mpParameters->GetHalfLength(), mpParameters->GetHalfLength());
auto global_nd_range = nd_range<2>(global_range, local_range, starting_offset);
float *curr_base = ptr;
I ran this modified version and was able to get correct output. I hope this helps.
Hi @jgtong. Thanks for the suggestion. I can indeed confirm that this change makes the benchmark pass for me. Thank you!
Will this change be committed to the repo?
@frasercrmck, awesome! This fix will be pushed after some thorough testing.
This has been fixed in PR: https://github.com/oneapi-src/Velocity-Bench/pull/49
I've been running reverse_time_migration with the oneAPI GPU workload/algorithm and have seen an out-of-bounds access in the SecondOrderComputationKernel, specifically this access into current:
I'm running:
I've tested with three OpenCL devices:
and
And also with our own OpenCL implementation.
All three implementations show the same results, and the same values for all of the relevant variables.
The benchmark itself prints:
To condense the problematic access somewhat, in the kernel we have:
All of the other variables involved in the computation (v_end, nx, idx_range, hl) are constants.
For all of the OpenCL implementations I've tried, I have added debug printing - they show the same values:
And the size of the current buffer is 3075072 bytes, so 768768 floats.
In practice there are many work-items which are out of bounds, but if we take the last work-item in the work-group, we can see that:
This is larger than the size of current and so is out of bounds.
I don't know how this algorithm is supposed to work, so forgive my speculating, but is the increment idx += nx before the access correct? If it was incremented after the access into current, there would be no out-of-bounds access (see the sketch at the end of this post).
If I've missed any relevant details, please let me know. I'm happy to contribute a PR, but I think someone better-versed in the algorithm might need to help steer me in the right direction. Thanks!
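To illustrate what I mean, here is a hypothetical sketch of the access pattern - not the benchmark's actual kernel code, just my reading of it. The names idx, nx, idx_range, and current are taken from the discussion above; the loop structure is my guess:
// Increment *before* the access (what I believe the kernel does today): the
// final iteration can index one row of nx floats past the end of `current`.
for (int i = 0; i < idx_range; ++i) {
    idx += nx;
    float value = current[idx];
    // ... use value ...
}

// Increment *after* the access: every read stays within the buffer.
for (int i = 0; i < idx_range; ++i) {
    float value = current[idx];
    idx += nx;
    // ... use value ...
}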