Could you attach the GPU assembly code generated by the compilers?
OpenSYCL generates the following code:
Intel DPC++ generates the following code:
Does the code show the same performance gap if malloc_device is used instead of malloc_shared?
> Does the code show the same performance gap if malloc_device is used instead of malloc_shared?
Yes, both OpenSYCL and DPC++ show performance degradation using malloc_device(), likely because of an issue related to memory channel/bank conflicts. From my observation, using the HIP backend, malloc_device() is 1 MiB aligned, while malloc_shared() is 4 KiB aligned. However, allocating 3 separate arrays, each with 1 MiB alignment via malloc_device(), seems to cause a memory channel/bank conflict on the GFX906, which appears to be a poorly documented problem. Thus, counterintuitively, malloc_device() is slower than malloc_shared() in some cases.
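(A quick way to observe the alignment difference described above; this is a hypothetical snippet, not taken from the issue, and assumes an existing sycl::queue Q:)

// Print the raw pointers to observe the alignment difference described
// above (observed on the HIP backend: 1 MiB for malloc_device(),
// 4 KiB for malloc_shared()).
float *d = sycl::malloc_device<float>(1024, Q);
float *s = sycl::malloc_shared<float>(1024, Q);
printf("malloc_device: %p, malloc_shared: %p\n", (void *)d, (void *)s);
sycl::free(d, Q);
sycl::free(s, Q);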
On OpenSYCL, I found it can be worked around by using a single large malloc_device() allocation, with manual padding between the three arrays (a sketch follows below). However, I have not been able to reproduce the same workaround with DPC++ yet. I'm not sure whether the performance degradation seen in DPC++ has the same root cause or is a separate problem.
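(A minimal sketch of the workaround; the names, array size, and padding size here are illustrative, not taken from the issue:)

// Hypothetical sketch of the single-allocation workaround: one large
// malloc_device() block instead of three separate allocations, with manual
// padding between the arrays so that the three base addresses no longer
// share the same 1 MiB alignment / memory channel mapping.
size_t n   = 16777216;               // elements per array (illustrative)
size_t pad = 4096 / sizeof(float);   // padding in elements; tune per device
float *base = sycl::malloc_device<float>(3 * (n + pad), Q);
float *A = base;
float *B = base + (n + pad);
float *C = base + 2 * (n + pad);
// ... use A, B, C as before, then release with a single sycl::free(base, Q).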
I got a better and even more minimalist example to show the memory bandwidth problem. Memory channel/bank conflict is ruled out because both compilers use malloc_device() with the same memory alignment.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <sycl/sycl.hpp>

void benchmark(
    sycl::float4* __restrict array,
    sycl::range<1> global_size,
    sycl::range<1> local_size,
    sycl::queue Q
)
{
    int timesteps = 1000;
    sycl::range global_range{global_size[0]};
    sycl::range local_range{local_size[0]};

    auto t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < timesteps; i++) {
        Q.submit([&](sycl::handler &h) {
            h.parallel_for<class Bandwidth>(
                sycl::nd_range<1>{global_range, local_range}, [=](sycl::nd_item<1> item) {
                    // read one float4, increment it, write it back
                    uint32_t i = item.get_global_id()[0];
                    sycl::float4 elem = array[i];
                    elem += 1;
                    array[i] = elem;
                }
            );
        });
    }
    Q.wait_and_throw();
    auto t2 = std::chrono::high_resolution_clock::now();

    double dt = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count() / 1e6;
    size_t bytes_per_iteration = global_size[0] * sizeof(sycl::float4);
    bytes_per_iteration *= 2;  // read + write
    fprintf(stderr, "speed: %.0f MB/s\n",
        (double) bytes_per_iteration * timesteps / dt / 1024 / 1024
    );
}

int main(int argc, char** argv)
{
    size_t wg_min = 10240;
    size_t wg_max = 65536;

    sycl::queue Q({sycl::property::queue::in_order()});
    sycl::range local_size{1024};

    /*
     * Note that the pointer returned by sycl::malloc_device() on the AMD
     * HIP backend is always 1 MiB aligned.
     */
    size_t size = 1024 * wg_max;  // 2^26 float4 elements = 1 GiB
    sycl::float4 *buf = sycl::malloc_device<sycl::float4>(size, Q);
    Q.memset(buf, 0, sizeof(sycl::float4) * size);

    for (size_t wg_num = wg_min; wg_num < wg_max; wg_num += 128) {
        sycl::range global_size{1024 * wg_num};
        printf("workgroups: %zu, p: %p\n", wg_num, buf);
        benchmark(buf, global_size, local_size, Q);
        benchmark(buf, global_size, local_size, Q);
        benchmark(buf, global_size, local_size, Q);
        Q.wait_and_throw();
    }
    return 0;
}
Using OpenSYCL, typical bandwidth is around 700 to 800 GB/s. But using Intel DPC++, the bandwidth is only 500 GB/s.
Just to add, and thanks to @biergaizi for bringing this up with me: we've verified the lower performance on our end as well. Here's BabelStream compiled with ICPX compared to quite a few other models on an MI100 (bandwidth normalised to theoretical peak; a value of 1 is 1228 GB/s, higher is better):
The only compiler that showed a severe bandwidth limit is ICPX, and we see identical results from a Radeon VII as well. APU and some RDNA numbers are blocked by https://github.com/intel/llvm/issues/11203, which will be resolved if we merge https://github.com/intel/llvm/pull/11254 (thanks @al42and!). The ICPX used here is 2023.2.1 with the Codeplay vendor plugin. StdPar on ICPX uses the oneDPL library.
(FYI, @tomdeakin)
@biergaizi Could you please try with the latest changes? We attempted to close the gap by getting rid of unnecessary stack stores with commit 00cf4c29740b2ec8d027e59e49e376a211599a58 (see intel/llvm/pull/11674).
We found that a lot of the perf degradation came from the above, and to mitigate it for now, we have provided a way to switch off the culprit compiler transformation by compiling with -mllvm -enable-global-offset=false.
We are looking into more possible causes for such performance degradation.
Thanks!
> We found that a lot of the perf degradation came from the above, and to mitigate it for now, we have provided a way to switch off the culprit compiler transformation by compiling with -mllvm -enable-global-offset=false.
I confirm that -mllvm -enable-global-offset=false fixed the performance issue in both cases.
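(For reference, the workaround is just the original reproduction command from the report with the flag appended:)

clang++ test1.cpp -o test1_dpcpp.elf -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx906 --rocm-device-lib-path=/usr/lib/amdgcn/bitcode/ -Ofast -march=native -Wall -Wextra -mllvm -enable-global-offset=false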
The original low-performance object code was:
; _ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth
s_add_u32 s0, s0, s11
s_load_dword s11, s[4:5], 0x4
s_load_dwordx2 s[8:9], s[6:7], 0x0
s_addc_u32 s1, s1, 0
v_mov_b32_e32 v4, 0
s_waitcnt lgkmcnt(0)
s_and_b32 s4, s11, 0xffff
s_mul_i32 s4, s4, s10
v_add_u32_e32 v3, s4, v0
v_lshlrev_b64 v[0:1], 4, v[3:4]
v_mov_b32_e32 v2, s9
v_add_co_u32_e32 v5, vcc, s8, v0
v_addc_co_u32_e32 v6, vcc, v2, v1, vcc
global_load_dwordx4 v[0:3], v[5:6], off
s_nop 0
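; the buffer_store_dword instructions below are the unnecessary stack stores
; of the implicit global offset (disabled by -mllvm -enable-global-offset=false)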
buffer_store_dword v4, off, s[0:3], 0 offset:8
buffer_store_dword v4, off, s[0:3], 0 offset:4
buffer_store_dword v4, off, s[0:3], 0 offset:12
s_waitcnt vmcnt(3)
v_add_f32_e32 v3, 1.0, v3
v_add_f32_e32 v2, 1.0, v2
v_add_f32_e32 v1, 1.0, v1
v_add_f32_e32 v0, 1.0, v0
global_store_dwordx4 v[5:6], v[0:3], off
s_endpgm
; _ZTSZZ9benchmarkPN4sycl3_V13vecIfLi4EEENS0_5rangeILi1EEES5_NS0_5queueEENKUlRNS0_7handlerEE_clES8_E9Bandwidth
s_load_dword s8, s[4:5], 0x4
s_load_dwordx4 s[12:15], s[6:7], 0x0
s_load_dword s9, s[6:7], 0x10
s_add_u32 s0, s0, s11
s_addc_u32 s1, s1, 0
s_waitcnt lgkmcnt(0)
s_and_b32 s4, s8, 0xffff
s_mul_i32 s4, s4, s10
v_add_u32_e32 v0, s14, v0
v_add_u32_e32 v0, s4, v0
v_mov_b32_e32 v1, 0
v_lshlrev_b64 v[0:1], 4, v[0:1]
v_mov_b32_e32 v2, s13
v_add_co_u32_e32 v4, vcc, s12, v0
v_addc_co_u32_e32 v5, vcc, v2, v1, vcc
global_load_dwordx4 v[0:3], v[4:5], off
v_mov_b32_e32 v6, s9
v_mov_b32_e32 v7, s15
v_mov_b32_e32 v8, s14
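; again, the stack stores of the implicit global offset; compare with the
; flag-disabled listing below, which has none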
buffer_store_dword v6, off, s[0:3], 0 offset:12
buffer_store_dword v7, off, s[0:3], 0 offset:8
buffer_store_dword v8, off, s[0:3], 0 offset:4
s_waitcnt vmcnt(3)
v_add_f32_e32 v3, 1.0, v3
v_add_f32_e32 v2, 1.0, v2
v_add_f32_e32 v1, 1.0, v1
v_add_f32_e32 v0, 1.0, v0
global_store_dwordx4 v[4:5], v[0:3], off
s_endpgm
After using -mllvm -enable-global-offset=false, the high-performance object code is:
s_load_dword s2, s[4:5], 0x4
s_load_dwordx2 s[0:1], s[6:7], 0x0
v_mov_b32_e32 v1, 0
s_waitcnt lgkmcnt(0)
s_and_b32 s2, s2, 0xffff
s_mul_i32 s2, s2, s8
v_add_u32_e32 v0, s2, v0
v_lshlrev_b64 v[0:1], 4, v[0:1]
v_mov_b32_e32 v2, s1
v_add_co_u32_e32 v4, vcc, s0, v0
v_addc_co_u32_e32 v5, vcc, v2, v1, vcc
global_load_dwordx4 v[0:3], v[4:5], off
s_waitcnt vmcnt(0)
v_add_f32_e32 v3, 1.0, v3
v_add_f32_e32 v2, 1.0, v2
v_add_f32_e32 v1, 1.0, v1
v_add_f32_e32 v0, 1.0, v0
global_store_dwordx4 v[4:5], v[0:3], off
s_endpgm
This was consistent with my AMDGPU micro-benchmarking experience that global_load_dwordx4 is a high-performance pattern.
Running the executables showed that both examples now run at above 700 GB/s. Thus, I confirm that the workaround works.
If anyone else is investigating memory bandwidth on AMD GPUs, here are my own findings on another related memory bandwidth issue of general interest, which I tracked down in a downstream project. Basically, in addition to global_load_dwordx4 in isolation, another high-performance pattern on AMDGPU is four contiguous global_load_dwordx4 instructions with offsets 0, 16, 32, 48 (with the offset either incrementing or decrementing), all pointing into the same 64-byte cache line, such as:
global_load_dwordx4 v[15:18], v[19:20], off
global_load_dwordx4 v[11:14], v[19:20], off offset:16
global_load_dwordx4 v[7:10], v[19:20], off offset:32
global_load_dwordx4 v[3:6], v[19:20], off offset:48
If, for some reason, the compiler doesn't generate the 4 contiguous loads but instead interleaves them with other instructions, the performance degradation can be significant. I recently pinpointed an OpenCL performance regression from upstream LLVM 15 to LLVM 16: LLVM started to interleave loads and compute instructions more aggressively, causing the degradation. The same kind of problem also affects SYCL.
More information can be found at: https://github.com/AdaptiveCpp/AdaptiveCpp/issues/1143
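(For illustration only; this snippet is not from the issue. A kernel where each work-item touches four consecutive float4 elements, i.e. exactly one 64-byte cache line, is the kind of source that can compile to the four contiguous loads shown above, provided the backend scheduler keeps the loads together:)

#include <sycl/sycl.hpp>

// Hypothetical sketch: each work-item processes four consecutive
// sycl::float4 elements (64 bytes, one cache line). Whether the backend
// emits four back-to-back global_load_dwordx4 instructions depends on
// instruction scheduling, which is exactly what regressed between
// LLVM 15 and LLVM 16.
void four_loads(sycl::float4 *array, sycl::range<1> global,
                sycl::range<1> local, sycl::queue &Q)
{
    Q.submit([&](sycl::handler &h) {
        h.parallel_for<class FourLoads>(
            sycl::nd_range<1>{global, local}, [=](sycl::nd_item<1> item) {
                size_t base = item.get_global_id()[0] * 4;
                sycl::float4 a = array[base + 0]; // byte offset  0
                sycl::float4 b = array[base + 1]; // byte offset 16
                sycl::float4 c = array[base + 2]; // byte offset 32
                sycl::float4 d = array[base + 3]; // byte offset 48
                array[base + 0] = a + 1.0f;
                array[base + 1] = b + 1.0f;
                array[base + 2] = c + 1.0f;
                array[base + 3] = d + 1.0f;
            });
    });
}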
@biergaizi Thanks for the proactive response!
If you would like to open a new issue with your latter findings on memory bandwidth performance degradation, along with a reproducer, we can track it and look into it.
Otherwise, let me know if we can further assist on this one; if not, we'd be happy to close it. Also double-checking with @tom91136. :)
I am closing this now, but please feel free to reopen if the originally discussed issue isn't fully addressed; as suggested, please file a new ticket for any of your further findings so we can track and address them. Thanks a lot for these!
Describe the bug
When targeting AMD HIP gfx906, simple SYCL code compiled by DPC++ often shows poor memory bandwidth, 50% below the hardware's peak performance. Profiling shows that the memory write traffic is amplified by 300% for an unclear reason, and this is suspected to be the cause.

To Reproduce
The problem seems to exist in many simple memory read-write kernels. To illustrate the point, consider the following A[X] = A[X] * B[X] + C[X] vector triad kernel, test1.cpp.
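(test1.cpp itself was attached to the original report and is not reproduced here; a minimal kernel of that shape would look roughly like this hypothetical sketch:)

#include <sycl/sycl.hpp>

// Hypothetical sketch of an A[X] = A[X] * B[X] + C[X] triad kernel;
// the actual test1.cpp may differ in details such as sizing and timing.
void triad(float *A, const float *B, const float *C, size_t n, sycl::queue &Q)
{
    Q.submit([&](sycl::handler &h) {
        h.parallel_for<class Triad>(sycl::range<1>{n}, [=](sycl::id<1> X) {
            A[X] = A[X] * B[X] + C[X];
        });
    }).wait();
}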
Compile the code via Intel DPC++ using the command:
clang++ test1.cpp -o test1_dpcpp.elf -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx906 --rocm-device-lib-path=/usr/lib/amdgcn/bitcode/ -Ofast -march=native -Wall -Wextra
Running test1_dpcpp.elf shows that the performance is 50% too low.
Compile the same code via OpenSYCL using the command:
syclcc test1.cpp -o test1_opensycl.elf --opensycl-targets=hip:gfx906 --rocm-device-lib-path=/usr/lib/amdgcn/bitcode/ -Ofast -march=native -Wall -Wextra --save-temps
Running test1_opensycl.elf shows bandwidth close to the realizable peak memory bandwidth of the AMD Radeon Pro VII / Instinct MI50.
To profile the kernels via AMD rocprof, create a file named rocprof_input.txt with the following content.
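(The exact file from the original report is not reproduced here; based on the metrics discussed below, it would contain a counter line roughly like this hypothetical reconstruction:)

# Hypothetical reconstruction of rocprof_input.txt; the original file was
# attached to the report. WRITE_SIZE reports kilobytes written and
# L2CacheHit the L2 hit rate, matching the metrics discussed below.
pmc : WRITE_SIZE L2CacheHit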
Then run rocprof -i rocprof_input.txt -o rocprof_dpcpp.csv ./test1_dpcpp.elf and rocprof -i rocprof_input.txt -o rocprof_opensycl.csv ./test1_opensycl.elf.
According to rocprof_opensycl.csv, in each iteration around 65536 kilobytes are written into memory, as expected for an FP32 array with 16777216 elements (16777216 × 4 bytes = 65536 KiB). The L2CacheHit rate is around 0%, as expected, since there's no data reuse.
But according to rocprof_dpcpp.csv, in each iteration around 159336 kilobytes are written into memory (not counting the first three iterations, which are memset()). This is about 300% of the theoretical value. The L2CacheHit rate is around 48%, also indicating that somehow there are redundant loads or stores.

Environment (please complete the following information):
- GPU: gfx906 (Radeon Pro VII / Instinct MI50)
- Compiler: clang version 17.0.0 (https://github.com/intel/llvm.git 8ea3e8eb65b863dfacb3c970d4403ae322e8d02e)