oneapi-src / Velocity-Bench


SVM: Unusual Time Results on Higher-End Nvidia GPUs #67

Closed cl-bautch closed 4 months ago

cl-bautch commented 4 months ago

After testing this benchmark on both the RTX A5000 and L40 GPUs, the (higher-end) L40 shows significantly slower results, which indicates something is wrong. All other benchmarks run much faster on the L40 than on the RTX A5000, as expected. The results below were obtained with both the CUDA implementation and the SYCL implementation with the Nvidia backend.

Results from 10 trials of SYCL with the Nvidia backend:

|            | RTX A5000 Mean (ms) | RTX A5000 StDev (ms) | L40 Mean (ms) | L40 StDev (ms) |
|------------|---------------------|----------------------|---------------|----------------|
| Training   | 258.60              | 9.18                 | 817.38        | 8.95           |
| Loading    | 74.91               | 0.20                 | 64.77         | 6.71           |
| Processing | 267.19              | 9.19                 | 826.56        | 8.94           |
| Storing    | 1.29                | 0.03                 | 1.31          | 0.17           |
| Total      | 343.41              | 9.21                 | 892.63        | 10.65          |

RTX A5000: `CC=icpx CXX=icpx cmake -DUSE_NVIDIA_BACKEND=TRUE -DUSE_SM=86 ../`
L40: `CC=icpx CXX=icpx cmake -DUSE_NVIDIA_BACKEND=TRUE -DUSE_SM=89 ../`

Results from 10 trials of CUDA:

|            | RTX A5000 Mean (ms) | RTX A5000 StDev (ms) | L40 Mean (ms) | L40 StDev (ms) |
|------------|---------------------|----------------------|---------------|----------------|
| Training   | 216.90              | 3.74                 | 415.16        | 3.91           |
| Loading    | 77.04               | 0.69                 | 64.93         | 6.04           |
| Processing | 285.29              | 4.18                 | 500.93        | 4.11           |
| Storing    | 1.37                | 0.05                 | 1.15          | 0.14           |
| Total      | 363.69              | 4.41                 | 557.03        | 8.72           |

RTX A5000: `cmake -DUSE_SM=86 ../`
L40: `cmake -DUSE_SM=89 ../`

Environment

icpx: 2024.0.0.20231017
CUDA: 12.4
Driver: 550.90.07
Devices: NVIDIA RTX A5000 (SM=86) and NVIDIA L40 (SM=89)

KateBlueSky commented 4 months ago

@cl-bautch

Comparing the CUDA code on an RTX 3070 Ti vs. an H100 and profiling the kernel execution times, some of the kernels do run faster on the H100. The kernels that are not running faster need their execution configuration updated, in both the SYCL and CUDA versions, based on the hardware (e.g. H100, A100, or RTX 3070 Ti) the benchmark is being run on. I can work on updating the workload for this. [kernel execution time profiling screenshot]
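
For illustration, one way to make the execution configuration hardware-dependent is to derive it at runtime from the device properties rather than hard-coding it per GPU. This is only a minimal sketch of that idea; the names (`pickLaunchConfig`, the block-size and occupancy heuristics) are hypothetical and not the actual Velocity-Bench SVM code:

```cpp
#include <cuda_runtime.h>

struct LaunchConfig { int blocks; int threads; };

// Sketch: choose a launch configuration from the current device's properties
// instead of a fixed grid/block size tuned for one specific GPU.
LaunchConfig pickLaunchConfig(int workItems)
{
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);

    // Clamp the block size to what the device supports.
    int threads = 256;
    if (threads > prop.maxThreadsPerBlock) threads = prop.maxThreadsPerBlock;

    // Enough blocks to cover the work, but also give every SM several
    // blocks to schedule on GPUs with many SMs (e.g. H100).
    int blocksForWork      = (workItems + threads - 1) / threads;
    int blocksForOccupancy = prop.multiProcessorCount * 4;
    int blocks = blocksForWork < blocksForOccupancy ? blocksForWork
                                                    : blocksForOccupancy;
    if (blocks < 1) blocks = 1;

    return {blocks, threads};
}

// Hypothetical usage:
//   LaunchConfig cfg = pickLaunchConfig(n);
//   some_svm_kernel<<<cfg.blocks, cfg.threads>>>(...);
```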

KateBlueSky commented 4 months ago

Hi, after more investigation into this issue, I don't think it has anything to do with the workload itself. I think the issue is in CUDA context creation, which is much more expensive on the higher-end GPUs. For instance, when I run `nsys profile --stats=true ./svm_cuda a9a a.m`, the first CUDA call, `cudaEventCreate`, takes 67,369,773 ns on my RTX 3070 Ti versus 234,070,858 ns on the H100. This is more noticeable in the SVM workload because of its short runtime. As I mentioned in the previous comment, the kernels do run faster on the H100, but that isn't visible in the results because context creation dominates the total time.
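
If context creation is indeed the culprit, one way to confirm it (and to keep it out of the measured time) is to trigger context creation explicitly before the timed section. A minimal sketch using the common `cudaFree(0)` warm-up idiom, assuming a standalone test program rather than a change the SVM workload necessarily needs:

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main()
{
    // cudaFree(0) is a common idiom to force lazy CUDA context creation
    // eagerly, so the first "real" call (e.g. cudaEventCreate) does not
    // absorb the context setup cost.
    auto t0 = std::chrono::steady_clock::now();
    cudaFree(0);
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    std::printf("context init: %.1f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());

    // ... timed benchmark work would start here, with the context already up ...
    return 0;
}
```

Comparing that printed context-init time across the RTX A5000 / L40 / H100 should show whether the gap in the Total column is dominated by context setup rather than by the kernels themselves.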