@ManuelCostanzo Was this on a V100 GPU?
@jinz2014 I tried on an RTX 2070, RTX 3070, RTX 3090, V100, GTX 980, and GTX 1080. For some reason, the only GPU that gets similar times is the GTX 1080; on all the others, CUDA gets better times (and, of course, GFLOPS).
I couldn't reproduce the significant performance gap on a V100. The number of iterations is 10 for both programs. You observed gaps on many GPUs. I'm curious whether the execution time changes when the 2D shared-memory accesses in the SYCL program are converted to 1D accesses.
```
./main 16384 16384 16384 16384
[Matrix Multiply Using CUDA] - Starting...
Device Name: Tesla V100-DGXS-32GB
MatrixA(16384,16384), MatrixB(16384,16384)
Computing result using CUDA Kernel... done 3.271226
Performance= 2856.27 GFlop/s, Time= 3079.578 msec, Size= 8796093022208 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS
```

```
./main 16384 16384 16384 16384
[Matrix Multiply Using SYCL] - Starting...
Device Name: Tesla V100-DGXS-32GB
MatrixA(16384,16384), MatrixB(16384,16384)
Computing result using SYCL Kernel... done 3.248668
Performance= 2712.60 GFlop/s, Time= 3242.683 msec, Size= 8796093022208 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS
```
Hey, sorry, the code I pasted was wrong because in SYCL the nIter variable was 1 instead of 10. I've updated that.
Here are my results for RTX 3090:
```
CUDA_VISIBLE_DEVICES=0 ./mm_cu 16384 16384 16384 16384
[Matrix Multiply Using CUDA] - Starting...
Device Name: NVIDIA GeForce RTX 3090
MatrixA(16384,16384), MatrixB(16384,16384)
Computing result using CUDA Kernel... done 3.126466
Performance= 2933.73 GFlop/s, Time= 2998.263 msec, Size= 8796093022208 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS
```

```
CUDA_VISIBLE_DEVICES=0 SYCL_DEVICE_FILTER=gpu ./mm_sycl 16384 16384 16384 16384
[Matrix Multiply Using SYCL] - Starting...
Device Name: NVIDIA GeForce RTX 3090
MatrixA(16384,16384), MatrixB(16384,16384)
Computing result using SYCL Kernel... done 4.167054
Performance= 2099.78 GFlop/s, Time= 4189.050 msec, Size= 8796093022208 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS
```
Here are the results with V100:
```
CUDA_VISIBLE_DEVICES=1 ./mm_cu 16384 16384 16384 16384
[Matrix Multiply Using CUDA] - Starting...
Device Name: Tesla V100-PCIE-32GB
MatrixA(16384,16384), MatrixB(16384,16384)
Computing result using CUDA Kernel... done 3.377448
Performance= 2735.60 GFlop/s, Time= 3215.415 msec, Size= 8796093022208 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS
```

```
CUDA_VISIBLE_DEVICES=1 SYCL_DEVICE_FILTER=gpu ./mm_sycl 16384 16384 16384 16384
[Matrix Multiply Using SYCL] - Starting...
Device Name: Tesla V100-PCIE-32GB
MatrixA(16384,16384), MatrixB(16384,16384)
Computing result using SYCL Kernel... done 3.417807
Performance= 2580.01 GFlop/s, Time= 3409.328 msec, Size= 8796093022208 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS
```
It's true that the difference is smaller on the V100, but on the RTX it's ~900 GFLOPS, which is a lot in my opinion, just for a matrix multiplication. What do you think?
For 50 iterations on the V100 I'm getting ~3.8 s per iteration in SYCL and ~3.2 s in CUDA; ~600 ms isn't a small difference, I think:
```
[Matrix Multiply Using CUDA] - Starting...
Device Name: Tesla V100-PCIE-32GB
MatrixA(16384,16384), MatrixB(16384,16384)
Computing result using CUDA Kernel... done 3.382438
Performance= 2681.07 GFlop/s, Time= 3280.812 msec, Size= 8796093022208 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS
```

```
[Matrix Multiply Using SYCL] - Starting...
Device Name: Tesla V100-PCIE-32GB
MatrixA(16384,16384), MatrixB(16384,16384)
Computing result using SYCL Kernel... done 3.414617
Performance= 2288.59 GFlop/s, Time= 3843.451 msec, Size= 8796093022208 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS
```
"RTX is ~900gflops that is a lot in my opinion/~600msec isn't a little number" Yes, I will try to run programs on 3090.
Excellent, because I'm a bit lost with this. I would expect the performance to be the same, since the code is not complex. I'm suspicious about shared memory, but I don't know how to confirm that through profiling.
Can you try using accessor::get_pointer() in the SYCL code? I've noticed that accessors sometimes lead to slower CUDA code; I suspect the compiler fails to optimize them.
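Roughly what I have in mind (just a sketch; `As_acc`/`Bs_acc` stand in for the local accessors already in the kernel, and `ty`, `tx`, `k`, `Csub`, `BLOCK_SIZE` are the names from the sample):

```cpp
// Take raw float* pointers from the local accessors once...
float *As = As_acc.get_pointer();
float *Bs = Bs_acc.get_pointer();

// ...and index them manually inside the tile loop:
Csub += As[ty * BLOCK_SIZE + k] * Bs[k * BLOCK_SIZE + tx];
```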
Hi, I just tried that, and it was the same. I passed accessor.get_pointer() to the kernel and received the float*, but it didn't change anything :/
I also tried using a 1D accessor, but that didn't help either; the times were the same.
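For reference, the 1D variant I tried looked roughly like this (a sketch; `cgh` is the command-group handler and `BLOCK_SIZE` the tile width):

```cpp
// 1D local accessors instead of the 2D ones, with manual row-major indexing
sycl::local_accessor<float, 1> As(sycl::range<1>(BLOCK_SIZE * BLOCK_SIZE), cgh);
sycl::local_accessor<float, 1> Bs(sycl::range<1>(BLOCK_SIZE * BLOCK_SIZE), cgh);

// ...and inside the kernel:
// Csub += As[ty * BLOCK_SIZE + k] * Bs[k * BLOCK_SIZE + tx];
```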
Then a detailed profile of both codes would be interesting. Run the CUDA and SYCL code through the Nsight Compute profiler (`ncu --set=full`) and compare them. If, for example, the SYCL code loads more data than the CUDA code, that could explain the difference.
I ran on the RTX 3070 because that's where I have root access.
Here are the reports for just 1 iteration (I have nsys installed; I believe that will work, right?).
CUDA

```
SYCL_DEVICE_FILTER=gpu nsys profile --stats=true --trace=cuda,nvtx ./mm_cu 16384 16384 16384 16384

[4/7] Executing 'cudaapisum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
81.2 5317895575 3 1772631858.3 163084415.0 2877 5154808283 2930185475.8 cudaStreamSynchronize
12.1 792685438 3 264228479.3 235289108.0 230984087 326412243 53895720.1 cudaHostAlloc
6.6 431403952 3 143801317.3 142455426.0 129784667 159163859 14735766.0 cudaFreeHost
0.1 5628676 3 1876225.3 2086470.0 763250 2778956 1024167.8 cudaFree
0.0 2470952 3 823650.7 808338.0 807549 855065 27208.5 cudaMalloc
0.0 39071 3 13023.7 13754.0 3226 22091 9453.7 cudaMemcpyAsync
0.0 28239 1 28239.0 28239.0 28239 28239 0.0 cudaLaunchKernel
0.0 18927 1 18927.0 18927.0 18927 18927 0.0 cudaStreamCreateWithFlags
0.0 617 1 617.0 617.0 617 617 0.0 cuModuleGetLoadingMode
[5/7] Executing 'gpukernsum' stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) GridXYZ BlockXYZ Name
100.0 4821601335 1 4821601335.0 4821601335.0 4821601335 4821601335 0.0 1024 1024 1 16 16 1 void MatrixMulCUDA<(int)16>(float *, float *, float *, int, int)
[6/7] Executing 'gpumemtimesum' stats report
Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation
67.1 333230635 2 166615317.5 166615317.5 166328177 166902458 406078.0 [CUDA memcpy HtoD]
32.9 163077603 1 163077603.0 163077603.0 163077603 163077603 0.0 [CUDA memcpy DtoH]
[7/7] Executing 'gpumemsizesum' stats report
Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation
2147.484 2 1073.742 1073.742 1073.742 1073.742 0.000 [CUDA memcpy HtoD]
1073.742 1 1073.742 1073.742 1073.742 1073.742 0.000 [CUDA memcpy DtoH]
```
SYCL

```
SYCL_DEVICE_FILTER=gpu nsys profile --stats=true --trace=cuda,nvtx ./mm_sycl 16384 16384 16384 16384

[4/7] Executing 'cudaapisum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
81.3 7308632995 8 913579124.4 810.0 530 7308624297 2583988461.5 cuStreamSynchronize
8.2 740074949 3 246691649.7 246157512.0 241777811 252139626 5201517.1 cuMemAllocHost_v2
5.5 497126323 3 165708774.3 166567995.0 163039627 167518701 2359919.7 cuEventSynchronize
4.8 434799197 3 144933065.7 144337695.0 130999647 159461855 14240441.4 cuMemFreeHost
0.1 5608288 3 1869429.3 2109918.0 744995 2753375 1025560.2 cuMemFree_v2
0.0 2489266 3 829755.3 812702.0 810059 866505 31853.6 cuMemAlloc_v2
0.0 291732 1 291732.0 291732.0 291732 291732 0.0 cuModuleLoadDataEx
0.0 36600 3 12200.0 11363.0 6295 18942 6364.9 cuMemcpyAsync
0.0 26622 4 6655.5 5169.5 3636 12647 4213.0 cuStreamCreate
0.0 18077 6 3012.8 1609.0 605 8521 3139.4 cuEventRecord
0.0 16289 1 16289.0 16289.0 16289 16289 0.0 cuLaunchKernel
0.0 9160 4 2290.0 1484.0 1085 5107 1907.8 cuStreamDestroy_v2
0.0 5768 6 961.3 603.0 368 2126 763.3 cuEventCreate
0.0 2630 11 239.1 122.0 105 776 236.9 cuCtxSetCurrent
0.0 1842 4 460.5 457.0 351 577 101.0 cuEventDestroy_v2
[5/7] Executing 'gpukernsum' stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) GridXYZ BlockXYZ Name
100.0 7308627231 1 7308627231.0 7308627231.0 7308627231 7308627231 0.0 1024 1024 1 16 16 1 Typeinfo name for MatrixMultiply(int, char **, int, const sycl::_V1::range<(int)3> &, const sycl::_…
[6/7] Executing 'gpumemtimesum' stats report
Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation
67.2 334089782 2 167044891.0 167044891.0 166574324 167515458 665482.2 [CUDA memcpy HtoD]
32.8 163037802 1 163037802.0 163037802.0 163037802 163037802 0.0 [CUDA memcpy DtoH]
[7/7] Executing 'gpumemsizesum' stats report
Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation
2147.484 2 1073.742 1073.742 1073.742 1073.742 0.000 [CUDA memcpy HtoD]
1073.742 1 1073.742 1073.742 1073.742 1073.742 0.000 [CUDA memcpy DtoH]
```
nsys gives you an application-level overview and tracks CUDA API calls; ncu gives you a kernel profile but doesn't know what's happening outside of a CUDA kernel. What would be interesting is a kernel profile, so that we can see what's happening at the kernel level.
Hey, for now I don't have root access to install NCU; in the meantime, could someone run NCU, please? Thanks
@MarkusBuettner Hi, I was able to run NCU on the RTX 2070. I tried with size 8192.
CUDA
```
void MatrixMulCUDA<(int)16>(float *, float *, float *, int, int) (512, 512, 1)x(16, 16, 1), Context 1, Stream 13, Device 0, CC 7.5

Section: GPU Speed Of Light Throughput
Metric Name Metric Unit Metric Value
----------------------- ------------- -------------
DRAM Frequency cycle/nsecond 6.79
SM Frequency cycle/nsecond 1.41
Elapsed Cycles cycle 1932539592
Memory Throughput % 74.30
DRAM Throughput % 23.94
Duration second 1.36
L1/TEX Cache Throughput % 95.20
L2 Cache Throughput % 14.60
SM Active Cycles cycle 1927275789.33
Compute (SM) Throughput % 74.30
----------------------- ------------- -------------
INF Compute and Memory are well-balanced: To reduce runtime, both computation and memory traffic must be reduced.
Check both the Compute Workload Analysis and Memory Workload Analysis sections.
Section: GPU Speed Of Light Roofline Chart
INF The ratio of peak float (fp32) to double (fp64) performance on this device is 32:1. The kernel achieved 12%
of this device's fp32 peak performance and 0% of its fp64 peak performance. See the Kernel Profiling Guide
(https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#roofline) for more details on roofline
analysis.
Section: Compute Workload Analysis
-------------------- ----------- ------------
Metric Name Metric Unit Metric Value
-------------------- ----------- ------------
Executed Ipc Active inst/cycle 0.77
Executed Ipc Elapsed inst/cycle 0.77
Issue Slots Busy % 19.37
Issued Ipc Active inst/cycle 0.77
SM Busy % 26.83
-------------------- ----------- ------------
WRN All compute pipelines are under-utilized. Either this kernel is very small or it doesn't issue enough warps
per scheduler. Check the Launch Statistics and Scheduler Statistics sections for further details.
Section: Memory Workload Analysis
----------------- ------------ ------------
Metric Name Metric Unit Metric Value
----------------- ------------ ------------
Memory Throughput Gbyte/second 103.99
Mem Busy % 47.60
Max Bandwidth % 74.30
L1/TEX Hit Rate % 0.00
L2 Hit Rate % 48.49
Mem Pipes Busy % 74.30
----------------- ------------ ------------
Section: Memory Workload Analysis Tables
WRN The memory access pattern for loads from L1TEX to L2 is not optimal. The granularity of an L1TEX request to
L2 is a 128 byte cache line. That is 4 consecutive 32-byte sectors per L2 request. However, this kernel only
accesses an average of 2.0 sectors out of the possible 4 sectors per cache line. Check the Source Counters
section for uncoalesced loads and try to minimize how many cache lines need to be accessed per memory
request.
----- --------------------------------------------------------------------------------------------------------------
WRN The memory access pattern for stores from L1TEX to L2 is not optimal. The granularity of an L1TEX request to
L2 is a 128 byte cache line. That is 4 consecutive 32-byte sectors per L2 request. However, this kernel only
accesses an average of 2.0 sectors out of the possible 4 sectors per cache line. Check the Source Counters
section for uncoalesced stores and try to minimize how many cache lines need to be accessed per memory
request.
Section: Scheduler Statistics
---------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
---------------------------- ----------- ------------
One or More Eligible % 19.37
Issued Warp Per Scheduler 0.19
No Eligible % 80.63
Active Warps Per Scheduler warp 8.00
Eligible Warps Per Scheduler warp 0.61
---------------------------- ----------- ------------
WRN Every scheduler is capable of issuing one instruction per cycle, but for this kernel each scheduler only
issues an instruction every 5.2 cycles. This might leave hardware resources underutilized and may lead to
less optimal performance. Out of the maximum of 8 warps per scheduler, this kernel allocates an average of
8.00 active warps per scheduler, but only an average of 0.61 warps were eligible per cycle. Eligible warps
are the subset of active warps that are ready to issue their next instruction. Every cycle with no eligible
warp results in no instruction being issued and the issue slot remains unused. To increase the number of
eligible warps, avoid possible load imbalances due to highly different execution durations per warp.
Reducing stalls indicated on the Warp State Statistics and Source Counters sections can help, too.
Section: Warp State Statistics
---------------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
---------------------------------------- ----------- ------------
Warp Cycles Per Issued Instruction cycle 41.29
Warp Cycles Per Executed Instruction cycle 41.29
Avg. Active Threads Per Warp 32
Avg. Not Predicated Off Threads Per Warp 32.00
---------------------------------------- ----------- ------------
WRN On average, each warp of this kernel spends 19.8 cycles being stalled waiting for an MIO instruction queue to
be not full. This represents about 47.9% of the total average of 41.3 cycles between issuing two
instructions. This stall reason is high in cases of utilization of the MIO pipelines, which include special
math instructions, dynamic branches, as well as shared memory instructions. When caused by shared memory
accesses, trying to use fewer but wider loads can reduce pipeline pressure.
----- --------------------------------------------------------------------------------------------------------------
INF Check the Source Counters section for the top stall locations in your source based on sampling data. The
Kernel Profiling Guide (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#sampling) provides
more details on each stall reason.
Section: Instruction Statistics
---------------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
---------------------------------------- ----------- ------------
Avg. Executed Instructions Per Scheduler inst 373278492.44
Executed Instructions inst 53752102912
Avg. Issued Instructions Per Scheduler inst 373278530.94
Issued Instructions inst 53752108456
---------------------------------------- ----------- ------------
Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name Metric Unit Metric Value
-------------------------------- --------------- ---------------
Block Size 256
Function Cache Configuration CachePreferNone
Grid Size 262144
Registers Per Thread register/thread 39
Shared Memory Configuration Size Kbyte 32.77
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block Kbyte/block 2.05
Threads thread 67108864
Waves Per SM 1820.44
-------------------------------- --------------- ---------------
Section: Occupancy
------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM block 16
Block Limit Registers block 6
Block Limit Shared Mem block 16
Block Limit Warps block 4
Theoretical Active Warps per SM warp 32
Theoretical Occupancy % 100
Achieved Occupancy % 99.96
Achieved Active Warps Per SM warp 31.99
------------------------------- ----------- ------------
INF This kernel's theoretical occupancy is not impacted by any block limit.
Section: Source Counters
------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
------------------------- ----------- ------------
Branch Instructions Ratio % 0.02
Branch Instructions inst 1077936128
Branch Efficiency % 100
Avg. Divergent Branches 0
------------------------- ----------- ------------
```
SYCL

```
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
==PROF== Disconnected from process 1235
[1235] mm_sycl@127.0.0.1
Typeinfo name for MatrixMultiply(int, char **, int, const sycl::_V1::range<(int)3> &, const sycl::_V1::range<(int)3> &)::[lambda(sycl::_V1::handler &) (instance 1)]::operator ()(sycl::_V1::handler &) const::[lambda(sycl::_V1::nd_item<(int)3>) (instance 1)] (512, 512, 1)x(16, 16, 1), Context 1, Stream 15, Device 0, CC 7.5

Section: GPU Speed Of Light Throughput
Metric Name Metric Unit Metric Value
----------------------- ------------- -------------
DRAM Frequency cycle/nsecond 6.79
SM Frequency cycle/nsecond 1.41
Elapsed Cycles cycle 2449660647
Memory Throughput % 87.91
DRAM Throughput % 18.48
Duration second 1.73
L1/TEX Cache Throughput % 94.62
L2 Cache Throughput % 11.52
SM Active Cycles cycle 2443178755.81
Compute (SM) Throughput % 87.91
----------------------- ------------- -------------
INF The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device. To
further improve performance, work will likely need to be shifted from the most utilized to another unit.
Start by analyzing workloads in the Compute Workload Analysis section.
Section: GPU Speed Of Light Roofline Chart
INF The ratio of peak float (fp32) to double (fp64) performance on this device is 32:1. The kernel achieved 10%
of this device's fp32 peak performance and 0% of its fp64 peak performance. See the Kernel Profiling Guide
(https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#roofline) for more details on roofline
analysis.
Section: Compute Workload Analysis
-------------------- ----------- ------------
Metric Name Metric Unit Metric Value
-------------------- ----------- ------------
Executed Ipc Active inst/cycle 0.76
Executed Ipc Elapsed inst/cycle 0.76
Issue Slots Busy % 18.95
Issued Ipc Active inst/cycle 0.76
SM Busy % 30.93
-------------------- ----------- ------------
WRN All compute pipelines are under-utilized. Either this kernel is very small or it doesn't issue enough warps
per scheduler. Check the Launch Statistics and Scheduler Statistics sections for further details.
Section: Memory Workload Analysis
----------------- ------------ ------------
Metric Name Metric Unit Metric Value
----------------- ------------ ------------
Memory Throughput Gbyte/second 80.36
Mem Busy % 47.31
Max Bandwidth % 87.91
L1/TEX Hit Rate % 0.00
L2 Hit Rate % 49.65
Mem Pipes Busy % 87.91
----------------- ------------ ------------
Section: Memory Workload Analysis Tables
WRN The memory access pattern for loads from L1TEX to L2 is not optimal. The granularity of an L1TEX request to
L2 is a 128 byte cache line. That is 4 consecutive 32-byte sectors per L2 request. However, this kernel only
accesses an average of 2.0 sectors out of the possible 4 sectors per cache line. Check the Source Counters
section for uncoalesced loads and try to minimize how many cache lines need to be accessed per memory
request.
----- --------------------------------------------------------------------------------------------------------------
WRN The memory access pattern for stores from L1TEX to L2 is not optimal. The granularity of an L1TEX request to
L2 is a 128 byte cache line. That is 4 consecutive 32-byte sectors per L2 request. However, this kernel only
accesses an average of 2.0 sectors out of the possible 4 sectors per cache line. Check the Source Counters
section for uncoalesced stores and try to minimize how many cache lines need to be accessed per memory
request.
Section: Scheduler Statistics
---------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
---------------------------- ----------- ------------
One or More Eligible % 18.95
Issued Warp Per Scheduler 0.19
No Eligible % 81.05
Active Warps Per Scheduler warp 8.00
Eligible Warps Per Scheduler warp 0.71
---------------------------- ----------- ------------
WRN Every scheduler is capable of issuing one instruction per cycle, but for this kernel each scheduler only
issues an instruction every 5.3 cycles. This might leave hardware resources underutilized and may lead to
less optimal performance. Out of the maximum of 8 warps per scheduler, this kernel allocates an average of
8.00 active warps per scheduler, but only an average of 0.71 warps were eligible per cycle. Eligible warps
are the subset of active warps that are ready to issue their next instruction. Every cycle with no eligible
warp results in no instruction being issued and the issue slot remains unused. To increase the number of
eligible warps, avoid possible load imbalances due to highly different execution durations per warp.
Reducing stalls indicated on the Warp State Statistics and Source Counters sections can help, too.
Section: Warp State Statistics
---------------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
---------------------------------------- ----------- ------------
Warp Cycles Per Issued Instruction cycle 42.20
Warp Cycles Per Executed Instruction cycle 42.20
Avg. Active Threads Per Warp 32
Avg. Not Predicated Off Threads Per Warp 32.00
---------------------------------------- ----------- ------------
WRN On average, each warp of this kernel spends 22.7 cycles being stalled waiting for an MIO instruction queue to
be not full. This represents about 53.8% of the total average of 42.2 cycles between issuing two
instructions. This stall reason is high in cases of utilization of the MIO pipelines, which include special
math instructions, dynamic branches, as well as shared memory instructions. When caused by shared memory
accesses, trying to use fewer but wider loads can reduce pipeline pressure.
----- --------------------------------------------------------------------------------------------------------------
INF Check the Source Counters section for the top stall locations in your source based on sampling data. The
Kernel Profiling Guide (https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#sampling) provides
more details on each stall reason.
Section: Instruction Statistics
---------------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
---------------------------------------- ----------- ------------
Avg. Executed Instructions Per Scheduler inst 463048248.89
Executed Instructions inst 66678947840
Avg. Issued Instructions Per Scheduler inst 463048322.28
Issued Instructions inst 66678958409
---------------------------------------- ----------- ------------
Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name Metric Unit Metric Value
-------------------------------- --------------- ---------------
Block Size 256
Function Cache Configuration CachePreferNone
Grid Size 262144
Registers Per Thread register/thread 51
Shared Memory Configuration Size Kbyte 32.77
Driver Shared Memory Per Block byte/block 0
Dynamic Shared Memory Per Block Kbyte/block 2.05
Static Shared Memory Per Block byte/block 0
Threads thread 67108864
Waves Per SM 1820.44
-------------------------------- --------------- ---------------
Section: Occupancy
------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM block 16
Block Limit Registers block 4
Block Limit Shared Mem block 16
Block Limit Warps block 4
Theoretical Active Warps per SM warp 32
Theoretical Occupancy % 100
Achieved Occupancy % 99.97
Achieved Active Warps Per SM warp 31.99
------------------------------- ----------- ------------
INF This kernel's theoretical occupancy is not impacted by any block limit.
Section: Source Counters
------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
------------------------- ----------- ------------
Branch Instructions Ratio % 0.02
Branch Instructions inst 1077936128
Branch Efficiency % 100
Avg. Divergent Branches 0
------------------------- ----------- ------------
```
RTX 3070, size 16384:
CUDA

```
Section: GPU Speed Of Light Throughput
Metric Name Metric Unit Metric Value
----------------------- ------------- --------------
DRAM Frequency cycle/nsecond 6.79
SM Frequency cycle/nsecond 1.50
Elapsed Cycles cycle 11281589358
Memory Throughput % 79.46
DRAM Throughput % 38.08
Duration second 7.52
L1/TEX Cache Throughput % 79.47
L2 Cache Throughput % 20.05
SM Active Cycles cycle 11280132003.54
Compute (SM) Throughput % 79.46
----------------------- ------------- --------------
INF Compute and Memory are well-balanced: To reduce runtime, both computation and memory traffic must be reduced.
Check both the Compute Workload Analysis and Memory Workload Analysis sections.
Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name Metric Unit Metric Value
-------------------------------- --------------- ---------------
Block Size 256
Function Cache Configuration CachePreferNone
Grid Size 1048576
Registers Per Thread register/thread 38
Shared Memory Configuration Size Kbyte 32.77
Driver Shared Memory Per Block Kbyte/block 1.02
Dynamic Shared Memory Per Block byte/block 0
Static Shared Memory Per Block Kbyte/block 2.05
Threads thread 268435456
Waves Per SM 3799.19
-------------------------------- --------------- ---------------
Section: Occupancy
------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM block 16
Block Limit Registers block 6
Block Limit Shared Mem block 10
Block Limit Warps block 6
Theoretical Active Warps per SM warp 48
Theoretical Occupancy % 100
Achieved Occupancy % 100.01
Achieved Active Warps Per SM warp 48.01
------------------------------- ----------- ------------
```
SYCL

```
Section: GPU Speed Of Light Throughput
Metric Name Metric Unit Metric Value
----------------------- ------------- --------------
DRAM Frequency cycle/nsecond 6.79
SM Frequency cycle/nsecond 1.50
Elapsed Cycles cycle 13643340229
Memory Throughput % 98.56
DRAM Throughput % 29.76
Duration second 9.10
L1/TEX Cache Throughput % 98.60
L2 Cache Throughput % 16.60
SM Active Cycles cycle 13638202932.96
Compute (SM) Throughput % 98.56
----------------------- ------------- --------------
INF The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device. To
further improve performance, work will likely need to be shifted from the most utilized to another unit.
Start by analyzing workloads in the Compute Workload Analysis section.
Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name Metric Unit Metric Value
-------------------------------- --------------- ---------------
Block Size 256
Function Cache Configuration CachePreferNone
Grid Size 1048576
Registers Per Thread register/thread 40
Shared Memory Configuration Size Kbyte 32.77
Driver Shared Memory Per Block Kbyte/block 1.02
Dynamic Shared Memory Per Block Kbyte/block 2.05
Static Shared Memory Per Block byte/block 0
Threads thread 268435456
Waves Per SM 3799.19
-------------------------------- --------------- ---------------
Section: Occupancy
------------------------------- ----------- ------------
Metric Name Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM block 16
Block Limit Registers block 6
Block Limit Shared Mem block 10
Block Limit Warps block 6
Theoretical Active Warps per SM warp 48
Theoretical Occupancy % 100
Achieved Occupancy % 100.04
Achieved Active Warps Per SM warp 48.02
------------------------------- ----------- ------------
```
Could you list the "ncu" command used for profiling the SYCL program?
Yes, I did the same for both SYCL and CUDA:
```
SYCL_DEVICE_FILTER=gpu /opt/nvidia/nsight-compute/2022.4.0/ncu ./mm_sycl 16384 16384 16384 16384
```
It may also be helpful to generate an ncu-rep file:
```
SYCL_DEVICE_FILTER=gpu /opt/nvidia/nsight-compute/2022.4.0/ncu --set=full -o mm_sycl ./mm_sycl 16384 16384 16384 16384
```
You can view the reports in ncu-ui even if you don't have a CUDA-capable device on your machine.
But from the logs you posted, it looks like the SYCL kernel executes about 24% more instructions (66,678,947,840 vs. 53,752,102,912 executed instructions). The interesting question now is why...
Also, is there a reason why you're using dimensions 1/2 instead of 0/1 in the SYCL kernel?
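To make that concrete, here is a sketch of the two indexing variants (assuming the usual mapping where the rightmost dimension varies fastest; `item_ct1` is the 3D nd_item from the ported code and `item2d` a hypothetical 2D one):

```cpp
// 3D nd_range, as in the ported code: dimension 2 plays the role of threadIdx.x
int tx = item_ct1.get_local_id(2);
int ty = item_ct1.get_local_id(1);

// 2D nd_range alternative: dimension 1 is then the fastest-varying one
// int tx = item2d.get_local_id(1);
// int ty = item2d.get_local_id(0);
```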
Hi @MarkusBuettner, there's no reason at all. I tried with the 0/1 dimensions, but it was the same.
I also posted on this forum: https://support.codeplay.com/t/poor-performance-on-matrix-multiplication/575 and got an answer that might solve the mystery. The question now is why.
Ah, that's interesting. I'd like to know why the clang compiler didn't optimize this.
On the other hand, I've noticed that Open SYCL (which also uses clang under the hood) sometimes generates slower code than Intel/Codeplay, probably because they use different flags?
I tried with Open SYCL, but I saw the same performance behavior. Maybe there is something inside clang that all of the implementations share.
Hi. I've been passed this by my colleague, and at first glance I'm inclined to agree with rbielski's initial assessment on the Codeplay forums, but I'll let you know what I find.
As we've said on the Codeplay forum, we suggest not using local accessors and instead using the Intel extension for local memory.
In short, the compiler benefits significantly from knowing the compile-time bounds of the local allocation in this case, as it can statically know the constant offsets for all memory accesses in the following offending loop:
```cpp
for (int k = 0; k < BLOCK_SIZE; ++k) {
  Csub += As[ty][k] * Bs[k][tx];
}
```
Using local accessors does not convey this information across the host/device boundary.
In `MatrixMulCUDA`, remove the `local_accessor` parameters `As` and `Bs`, and instead declare the following:
```cpp
using namespace sycl::ext::oneapi;
auto& As = *group_local_memory_for_overwrite<float[BLOCK_SIZE][BLOCK_SIZE]>(item_ct1.get_group());
auto& Bs = *group_local_memory_for_overwrite<float[BLOCK_SIZE][BLOCK_SIZE]>(item_ct1.get_group());
```
This should be enough to match the performance of CUDA - it does locally.
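For anyone who wants to see it in context, here is a rough sketch of the modified tile loop with those group-local arrays (assuming `BLOCK_SIZE`, `item_ct1`, the kernel arguments `A` and `B`, and the usual `aBegin`/`aEnd`/`aStep`/`bBegin`/`bStep`/`wA`/`wB` bookkeeping from the original sample are in scope):

```cpp
using namespace sycl::ext::oneapi;

// Group-local tiles with compile-time extents, replacing the local_accessor parameters.
auto &As = *group_local_memory_for_overwrite<float[BLOCK_SIZE][BLOCK_SIZE]>(item_ct1.get_group());
auto &Bs = *group_local_memory_for_overwrite<float[BLOCK_SIZE][BLOCK_SIZE]>(item_ct1.get_group());

const int tx = item_ct1.get_local_id(2);
const int ty = item_ct1.get_local_id(1);
float Csub = 0.0f;

for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
  // Each work-item loads one element of each tile into local memory.
  As[ty][tx] = A[a + wA * ty + tx];
  Bs[ty][tx] = B[b + wB * ty + tx];
  item_ct1.barrier(sycl::access::fence_space::local_space);

  // With the static extents, the compiler can resolve these offsets at compile time.
  for (int k = 0; k < BLOCK_SIZE; ++k) {
    Csub += As[ty][k] * Bs[k][tx];
  }
  item_ct1.barrier(sycl::access::fence_space::local_space);
}
```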
Hi! How are you?
I'm posting here because I have a question about a comparison between CUDA and SYCL (using oneAPI): CUDA is getting ~600 GFLOPS more than SYCL. I have the following CUDA code, which implements matrix multiplication.
To compile, you can use:

```
nvcc program.cu -o program_cuda -O3
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda program.cpp -o program_sycl -O3
```

To run, you can use:

```
./program 16384 16384 16384 16384
```
My theory is that CUDA's `__shared__` memory gets better performance than the local accessors in SYCL, but I'm not really sure.
Thank you very much.
CUDA CODE:
EQUIVALENT SYCL CODE: