zjin-lcf opened this issue 3 years ago
Replacing `sycl::native::powr(a, b)` with `a + b` shows only a slight performance improvement:
```
➜ minkowski-sycl git:(master) ✗ ./main_powr 10
Problem size: c(512,2048) = a(512,1024) * b(1024,2048)
Minkowski distance with p = 1
Average kernel execution time: 0.067641 (s)
PASS
Minkowski distance with p = 2
Average kernel execution time: 0.060487 (s)
PASS
Minkowski distance with p = 3
Average kernel execution time: 0.060484 (s)
PASS
Minkowski distance with p = 4
Average kernel execution time: 0.060478 (s)
PASS
```
```
➜ minkowski-sycl git:(master) ✗ ./main_sum 10
Problem size: c(512,2048) = a(512,1024) * b(1024,2048)
Minkowski distance with p = 1
Average kernel execution time: 0.066280 (s)
Fail - The result is incorrect for element: [0, 0], expected: 0.460009, but found: 1025.46
...
Minkowski distance with p = 2
Average kernel execution time: 0.060351 (s)
Fail - The result is incorrect for element: [0, 0], expected: 0.0166156, but found: 2048.96
...
Minkowski distance with p = 3
Average kernel execution time: 0.060350 (s)
Fail - The result is incorrect for element: [0, 0], expected: 0.00571904, but found: 3072.79
...
Minkowski distance with p = 4
Average kernel execution time: 0.060347 (s)
Fail - The result is incorrect for element: [0, 0], expected: 0.00341466, but found: 4096.71
```
This means the performance gap between CUDA and SYCL is likely due to something other than `powr`. Also note that using `sycl::powr` with the `-ffast-math` flag will transform this call into `sycl::native::powr`, which in turn calls `__nv_fast_powf`. So we don't need to use the `sycl::native` namespace explicitly when compiling with `-ffast-math`.
In DPC++, C++ standard library functions called from kernel code (such as `std::pow` or `powf`) do not respond to the `-ffast-math` flag. It is therefore better to use the functions in the `sycl` namespace if you want the implementation to be chosen based on that flag. Perhaps we should add `-ffast-math` support for the C++ standard library functions as well. What are your thoughts?
To recap, the performance difference between CUDA and SYCL is not related to `sycl::powr`. Should we investigate the performance difference between CUDA and SYCL further?
Thank you very much for your answer. Please see the changes in https://github.com/zjin-lcf/HeCBench/commit/dd16a935c4d198a4cd84d8b540411ca0e6050233:
```cpp
for (int i = 0; i < N; i++) {
  sum += sycl::native::powr(sycl::fabs(A[row * N + i] - B[i * P + col]), p);
  //sum += sycl::pow(sycl::fabs(A[row * N + i] - B[i * P + col]), p);
}
C[row * P + col] = sycl::native::powr(sum, one_over_p);
//C[row * P + col] = sycl::pow(sum, one_over_p);
```
The average kernel execution time drops from ~0.007 s to ~0.002 s. The improvement comes from replacing `sycl::pow()` with `sycl::native::powr()`. Since the CUDA compiler can convert `powf()` to a fast `powr` using the compiler options `-O3 --use_fast_math`, I hope that the SYCL compiler might convert `sycl::pow()` to `sycl::native::powr()` using the compiler options `-O3 -ffast-math`.
https://github.com/zjin-lcf/HeCBench/tree/master/minkowski-cuda
https://github.com/zjin-lcf/HeCBench/tree/master/minkowski-sycl
That's great. You can see from this patch, https://github.com/intel/llvm/pull/5801/files, that `-ffast-math` will only make `sycl::powr` fast, not `sycl::pow`. This is probably a bug. Can you confirm that switching from `sycl::native::powr` to `sycl::powr` gives the same performance with the `-ffast-math` flag?
Yes, it gives the same (similar) performance.
Great. Unfortunately there is no fast equivalent of `sycl::pow`, only of `sycl::powr`, since `sycl::powr(x, y)` requires `x >= 0`, which `sycl::pow` does not. This means we can only expect speedups from `-ffast-math` when using `sycl::powr`; we therefore cannot change the behaviour of `sycl::pow` under `-ffast-math`.

Do you have any other questions?
Yes. Does the CUDA compiler apply some advanced optimization that changes the behavior of the `pow()` function when fast math is enabled?
The CUDA compiler defaults to FTZ (flush-to-zero) implementations of most math functions, which can give significant speedups. In DPC++ we can turn FTZ on with the compiler flag `-fdenormal-fp-math-f32=preserve-sign`. There is currently a bug in DPC++ that prevents this flag from being used with the CUDA backend, but I am working on it at the moment and will update this issue when it is fixed.
See the performance difference in native CUDA (FTZ off vs. FTZ on):
```
➜ minkowski-cuda git:(master) ✗ ./ftz-false-main 10
Problem size: c(512,2048) = a(512,1024) * b(1024,2048)
Minkowski distance with p = 1
Average kernel execution time: 0.017983 (s)
PASS
Minkowski distance with p = 2
Average kernel execution time: 0.017114 (s)
PASS
Minkowski distance with p = 3
Average kernel execution time: 0.017897 (s)
PASS
Minkowski distance with p = 4
Average kernel execution time: 0.019500 (s)
PASS
➜ minkowski-cuda git:(master) ✗ ./ftz-true-main 10
Problem size: c(512,2048) = a(512,1024) * b(1024,2048)
Minkowski distance with p = 1
Average kernel execution time: 0.018164 (s)
PASS
Minkowski distance with p = 2
Average kernel execution time: 0.016961 (s)
PASS
Minkowski distance with p = 3
Average kernel execution time: 0.016552 (s)
PASS
Minkowski distance with p = 4
Average kernel execution time: 0.018630 (s)
PASS
```
Hi @zjin-lcf, here is a fix for FTZ: https://github.com/intel/llvm/pull/7616
Is there a significant performance difference between the CUDA math function `powf()` and `sycl::pow()` on an NVIDIA GPU (e.g. V100)? The fast-math option is enabled when building both the CUDA and SYCL programs. Thanks for your investigation.
https://github.com/zjin-lcf/oneAPI-DirectProgramming/tree/master/minkowski-sycl
https://github.com/zjin-lcf/oneAPI-DirectProgramming/tree/master/minkowski-cuda