cl-bautch closed this issue 4 months ago
@cl-bautch
Comparing the CUDA code on the RTX 3070 Ti and the H100 and profiling the kernel execution times, some of the kernels do run faster on the H100. The kernels that are not running faster need their execution configuration updated for the hardware the benchmark is run on (e.g. H100, A100, or RTX 3070 Ti), in both the SYCL and CUDA versions. I can work on updating the workload for this.
Hi, after more investigation into this issue, I don't think it has anything to do with the workload itself. I think the issue comes from context creation on lower-end vs. higher-end Nvidia GPUs. For instance, when I run "nsys profile --stats=true ./svm_cuda a9a a.m", the first CUDA call, cudaEventCreate, takes 67,369,773 ns on my RTX 3070 Ti but 234,070,858 ns on the H100. I think this is more noticeable in the SVM workload because its runtime is so short. As I mentioned in the previous comment, the kernels do run faster on the H100, but that is not noticeable because context creation dominates the total time.
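To make the "context creation dominates" point concrete, here is a rough back-of-the-envelope sketch in Python. The two context-creation numbers are the cudaEventCreate timings quoted above. The "rest of the run" times are purely hypothetical placeholders, not measurements, chosen only to show how a GPU with much faster kernels can still lose end to end on a short workload.

```python
# First CUDA call (context creation) cost reported above, in nanoseconds.
CONTEXT_NS = {"RTX 3070 Ti": 67_369_773, "H100": 234_070_858}

# HYPOTHETICAL time for everything after context creation, in milliseconds.
# These are illustrative assumptions, not profiled values.
ASSUMED_REST_MS = {"RTX 3070 Ti": 100.0, "H100": 40.0}


def total_ms(context_ns, rest_ms):
    """End-to-end time: one-time context creation plus the rest of the run."""
    return context_ns / 1e6 + rest_ms


# Extra startup cost the H100 run pays before any kernel is launched.
startup_gap_ms = (CONTEXT_NS["H100"] - CONTEXT_NS["RTX 3070 Ti"]) / 1e6

totals = {
    gpu: total_ms(CONTEXT_NS[gpu], ASSUMED_REST_MS[gpu]) for gpu in CONTEXT_NS
}

print(f"startup gap: {startup_gap_ms:.1f} ms")
for gpu, ms in totals.items():
    print(f"{gpu}: {ms:.1f} ms total")
```

With these assumed numbers the H100 finishes the rest of the run 2.5x faster yet is still slower overall, because it would have to save more than the roughly 166.7 ms startup gap just to break even.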
After testing this benchmark on both the RTX A5000 and L40 GPUs, the (higher-end) L40 shows significantly slower results, which indicates something is wrong. All other benchmarks run much faster on the L40 than on the RTX A5000, as expected. These results were obtained with both the CUDA and the SYCL (Nvidia backend) implementations.
SYCL (Nvidia backend) build
RTX A5000:
CC=icpx CXX=icpx cmake -DUSE_NVIDIA_BACKEND=TRUE -DUSE_SM=86 ../
L40:
CC=icpx CXX=icpx cmake -DUSE_NVIDIA_BACKEND=TRUE -DUSE_SM=89 ../
CUDA build
RTX A5000:
cmake -DUSE_SM=86 ../
L40:
cmake -DUSE_SM=89 ../
Environment
icpx: 2024.0.0.20231017
CUDA: 12.4
Driver: 550.90.07
Device: NVIDIA RTX A5000 (SM=86) and NVIDIA L40 (SM=89)