oneapi-src / Velocity-Bench


SVM: Unusual Time Results on Higher-End Nvidia GPUs #67

Closed cl-bautch closed 4 months ago

cl-bautch commented 4 months ago

After testing this benchmark on both the RTX A5000 and L40 GPUs, the (higher-end) L40 shows significantly slower results, which indicates something is wrong. All other benchmarks run much faster on the L40 than on the RTX A5000, as expected. The results below were obtained with both the CUDA implementation and the SYCL implementation with the Nvidia backend.

Results from 10 trials of SYCL with the Nvidia backend:

|            | RTX A5000 Mean (ms) | RTX A5000 StDev (ms) | L40 Mean (ms) | L40 StDev (ms) |
|------------|---------------------|----------------------|---------------|----------------|
| Training   | 258.60              | 9.18                 | 817.38        | 8.95           |
| Loading    | 74.91               | 0.20                 | 64.77         | 6.71           |
| Processing | 267.19              | 9.19                 | 826.56        | 8.94           |
| Storing    | 1.29                | 0.03                 | 1.31          | 0.17           |
| Total      | 343.41              | 9.21                 | 892.63        | 10.65          |

RTX A5000: `CC=icpx CXX=icpx cmake -DUSE_NVIDIA_BACKEND=TRUE -DUSE_SM=86 ../`
L40: `CC=icpx CXX=icpx cmake -DUSE_NVIDIA_BACKEND=TRUE -DUSE_SM=89 ../`

Results from 10 trials of CUDA:

|            | RTX A5000 Mean (ms) | RTX A5000 StDev (ms) | L40 Mean (ms) | L40 StDev (ms) |
|------------|---------------------|----------------------|---------------|----------------|
| Training   | 216.90              | 3.74                 | 415.16        | 3.91           |
| Loading    | 77.04               | 0.69                 | 64.93         | 6.04           |
| Processing | 285.29              | 4.18                 | 500.93        | 4.11           |
| Storing    | 1.37                | 0.05                 | 1.15          | 0.14           |
| Total      | 363.69              | 4.41                 | 557.03        | 8.72           |

RTX A5000: `cmake -DUSE_SM=86 ../`
L40: `cmake -DUSE_SM=89 ../`

Environment

icpx: 2024.0.0.20231017
CUDA: 12.4
Driver: 550.90.07
Devices: NVIDIA RTX A5000 (SM=86) and NVIDIA L40 (SM=89)

KateBlueSky commented 4 months ago

@cl-bautch

Comparing the CUDA code on an RTX 3070 Ti vs. an H100 and profiling the kernel execution times, some of the kernels do run faster on the H100. The kernels that are not running faster need their execution configuration updated, in both the SYCL and CUDA versions, based on the hardware (e.g. H100, A100, or RTX 3070 Ti) the benchmark is being run on. I can work on updating the workload for this. [kernel execution time profiling screenshot]
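
For illustration, one way to make the execution configuration hardware-dependent is to derive it at runtime from the device properties rather than hard-coding it per GPU. This is only a minimal sketch of that idea; the names (`pickLaunchConfig`, the block-size and occupancy heuristics) are hypothetical and not the actual Velocity-Bench SVM code:

```cpp
#include <cuda_runtime.h>

struct LaunchConfig { int blocks; int threads; };

// Sketch: choose a launch configuration from the current device's properties
// instead of a fixed grid/block size tuned for one specific GPU.
LaunchConfig pickLaunchConfig(int workItems)
{
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);

    // Clamp the block size to what the device supports.
    int threads = 256;
    if (threads > prop.maxThreadsPerBlock) threads = prop.maxThreadsPerBlock;

    // Enough blocks to cover the work, but also give every SM several
    // blocks to schedule on GPUs with many SMs (e.g. H100).
    int blocksForWork      = (workItems + threads - 1) / threads;
    int blocksForOccupancy = prop.multiProcessorCount * 4;
    int blocks = blocksForWork < blocksForOccupancy ? blocksForWork
                                                    : blocksForOccupancy;
    if (blocks < 1) blocks = 1;

    return {blocks, threads};
}

// Hypothetical usage:
//   LaunchConfig cfg = pickLaunchConfig(n);
//   some_svm_kernel<<<cfg.blocks, cfg.threads>>>(...);
```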

KateBlueSky commented 4 months ago

Hi, after more investigation into this issue, I don't think it has anything to do with the workload itself. I think the issue is in CUDA context creation, which is much more expensive on the higher-end GPUs. For instance, when I run `nsys profile --stats=true ./svm_cuda a9a a.m`, the first CUDA call, `cudaEventCreate`, takes 67,369,773 ns on my RTX 3070 Ti versus 234,070,858 ns on the H100. This is more noticeable in the SVM workload because of its short runtime. As I mentioned in the previous comment, the kernels do run faster on the H100, but that isn't visible in the results because context creation dominates the total time.
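
If context creation is indeed the culprit, one way to confirm it (and to keep it out of the measured time) is to trigger context creation explicitly before the timed section. A minimal sketch using the common `cudaFree(0)` warm-up idiom, assuming a standalone test program rather than a change the SVM workload necessarily needs:

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main()
{
    // cudaFree(0) is a common idiom to force lazy CUDA context creation
    // eagerly, so the first "real" call (e.g. cudaEventCreate) does not
    // absorb the context setup cost.
    auto t0 = std::chrono::steady_clock::now();
    cudaFree(0);
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    std::printf("context init: %.1f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());

    // ... timed benchmark work would start here, with the context already up ...
    return 0;
}
```

Comparing that printed context-init time across the RTX A5000 / L40 / H100 should show whether the gap in the Total column is dominated by context setup rather than by the kernels themselves.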