intel / llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.
Other
1.22k stars 732 forks source link

[SYCL][MATH][CUDA] powf() vs sycl::pow() #4595

Open zjin-lcf opened 3 years ago

zjin-lcf commented 3 years ago

Is there a significant performance difference between the math function "powf()" in CUDA and sycl::pow() in SYCL on an Nvidia GPU (e.g. V100)? The "fast math" option is enabled when building the CUDA and SYCL programs. Thanks for your investigation.

https://github.com/zjin-lcf/oneAPI-DirectProgramming/tree/master/minkowski-sycl https://github.com/zjin-lcf/oneAPI-DirectProgramming/tree/master/minkowski-cuda

hdelan commented 1 year ago

Replacing sycl::native::powr(a, b) with a + b shows only a slight performance improvement:

➜  minkowski-sycl git:(master) ✗ ./main_powr 10
Problem size: c(512,2048) = a(512,1024) * b(1024,2048)
Minkowski distance with p = 1
Average kernel execution time: 0.067641 (s)
PASS
Minkowski distance with p = 2
Average kernel execution time: 0.060487 (s)
PASS
Minkowski distance with p = 3
Average kernel execution time: 0.060484 (s)
PASS
Minkowski distance with p = 4
Average kernel execution time: 0.060478 (s)
PASS
➜  minkowski-sycl git:(master) ✗ ./main_sum 10
Problem size: c(512,2048) = a(512,1024) * b(1024,2048)
Minkowski distance with p = 1
Average kernel execution time: 0.066280 (s)
Fail - The result is incorrect for element: [0, 0], expected: 0.460009, but found: 1025.46
...
Minkowski distance with p = 2
Average kernel execution time: 0.060351 (s)
Fail - The result is incorrect for element: [0, 0], expected: 0.0166156, but found: 2048.96
...
Minkowski distance with p = 3
Average kernel execution time: 0.060350 (s)
Fail - The result is incorrect for element: [0, 0], expected: 0.00571904, but found: 3072.79
...
Minkowski distance with p = 4
Average kernel execution time: 0.060347 (s)
Fail - The result is incorrect for element: [0, 0], expected: 0.00341466, but found: 4096.71

Meaning that the performance gap between CUDA and SYCL is likely due to something other than powr. Also note that using sycl::powr with the -ffast-math flag will transform this call into sycl::native::powr which in turn calls __nv_fast_powf. So we don't need to explicitly use sycl::native namespace if we are using the -ffast-math flag.

In DPC++, CXX stdlib funcs called from kernel code (such as std::pow, or powf) will not change to respond to the -ffast-math flag. Meaning it is better to use funcs in the SYCL namespace if you wish for the funcs implementation to be chosen based on the use of the flag. Perhaps we should add this -ffast-math support for CXX stdlib funcs. What are your thoughts?

To recap, the perf difference between CUDA and SYCL is not related to sycl::powr. Should we investigate the performance difference between CUDA and SYCL some more?

zjin-lcf commented 1 year ago

Thank you very much for your answer. Please see the changes https://github.com/zjin-lcf/HeCBench/commit/dd16a935c4d198a4cd84d8b540411ca0e6050233

            for (int i = 0; i < N; i++) {
              sum += sycl::native::powr(sycl::fabs(A[row * N + i] - B[i * P + col]), p);
              //sum += sycl::pow(sycl::fabs(A[row * N + i] - B[i * P + col]), p);
            }
            C[row * P + col] = sycl::native::powr(sum, one_over_p);
            //C[row * P + col] = sycl::pow(sum, one_over_p);

The kernel execution time reduces from ~0.007 to ~0.002. The performance improvement is due to the replacement of sycl::pow() with sycl::native::powr. Since the cuda compiler can convert powf() to powr() using the compiler option "-O3 --use_fast_math" in the cuda program, I hope that the sycl compiler might convert sycl::pow() to sycl::native_powr() using the compiler option "-O3 -ffast-math" in the sycl program.

https://github.com/zjin-lcf/HeCBench/tree/master/minkowski-cuda https://github.com/zjin-lcf/HeCBench/tree/master/minkowski-sycl

hdelan commented 1 year ago

That's great. You can see from this patch https://github.com/intel/llvm/pull/5801/files the -ffast-math will only make sycl::powr fast and not sycl::pow. This is probably a bug. Can you confirm that switching from sycl::native::powr to sycl::powr gives the same performance with the -ffast-math flag?

zjin-lcf commented 1 year ago

Yes, it gives the same (similar) performance.

hdelan commented 1 year ago

Great. Unfortunately there is no fast equivalent of sycl::pow only sycl::powr, since sycl::powr(x,y) requires x >=0, which sycl::pow does not require. This means that we can only expect speedups with -ffast-math when using sycl::powr. We therefore cannot change the behaviour of sycl::pow with -ffast-math.

Do you have any other questions?

zjin-lcf commented 1 year ago

Yes. Does the cuda compiler apply some advanced optimization to change the behavior of the pow() function when fast math is enabled ?

hdelan commented 1 year ago

The CUDA compiler defaults to FTZ implementations of most math funcs, which can give significant speedups. We can choose to turn FTZ on in DPC++ by using the compiler flag -fdenormal-fp-math-f32=preserve-sign. There is currently a bug in DPC++ that prevents this flag from being used with CUDA backend, but I am working on it at the moment and will update this issue when it is fixed.

hdelan commented 1 year ago

See perf difference in native CUDA:

minkowski-cuda git:(master) ✗ ./ftz-false-main 10 
Problem size: c(512,2048) = a(512,1024) * b(1024,2048)
Minkowski distance with p = 1
Average kernel execution time: 0.017983 (s)
PASS
Minkowski distance with p = 2
Average kernel execution time: 0.017114 (s)
PASS
Minkowski distance with p = 3
Average kernel execution time: 0.017897 (s)
PASS
Minkowski distance with p = 4
Average kernel execution time: 0.019500 (s)
PASS
➜  minkowski-cuda git:(master) ✗ ./ftz-true-main 10 
Problem size: c(512,2048) = a(512,1024) * b(1024,2048)
Minkowski distance with p = 1
Average kernel execution time: 0.018164 (s)
PASS
Minkowski distance with p = 2
Average kernel execution time: 0.016961 (s)
PASS
Minkowski distance with p = 3
Average kernel execution time: 0.016552 (s)
PASS
Minkowski distance with p = 4
Average kernel execution time: 0.018630 (s)
PASS
hdelan commented 1 year ago

Hi @zjin-lcf here is a fix for FTZ https://github.com/intel/llvm/pull/7616