Closed kuhar closed 11 months ago
I added code to prefetch the LHS and RHS in the hope of hiding latency. I'm seeing better numbers now:
-----------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------------
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[1x8]/Workgroup[64x1x1]/manual_time 53.5 us 12.5 us 11920 Bytes=314.242G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[1x16]/Workgroup[64x1x1]/manual_time 58.0 us 15.7 us 12337 Bytes=289.814G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[2x8]/Workgroup[64x1x1]/manual_time 60.7 us 16.4 us 11211 Bytes=276.524G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[2x16]/Workgroup[64x1x1]/manual_time 58.1 us 12.8 us 10386 Bytes=289.079G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x8]/Workgroup[64x1x1]/manual_time 56.1 us 14.0 us 12489 Bytes=299.303G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x16]/Workgroup[64x1x1]/manual_time 53.4 us 11.8 us 11683 Bytes=314.342G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[2x8]/Workgroup[64x2x1]/manual_time 64.4 us 13.4 us 9679 Bytes=260.648G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[2x16]/Workgroup[64x2x1]/manual_time 61.3 us 12.4 us 9736 Bytes=274.182G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x8]/Workgroup[64x2x1]/manual_time 67.9 us 16.9 us 10330 Bytes=247.387G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x16]/Workgroup[64x2x1]/manual_time 51.1 us 11.5 us 10830 Bytes=328.514G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x8]/Workgroup[64x4x1]/manual_time 79.2 us 19.5 us 9463 Bytes=212.14G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x16]/Workgroup[64x4x1]/manual_time 59.2 us 11.0 us 9671 Bytes=283.908G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[1x8]/Workgroup[64x1x1]/manual_time 76.9 us 14.8 us 7301 Bytes=873.317G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[1x16]/Workgroup[64x1x1]/manual_time 85.7 us 18.8 us 7442 Bytes=783.545G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[2x8]/Workgroup[64x1x1]/manual_time 77.7 us 10.6 us 7309 Bytes=864.229G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[2x16]/Workgroup[64x1x1]/manual_time 76.4 us 10.1 us 7115 Bytes=879.334G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x8]/Workgroup[64x1x1]/manual_time 80.4 us 16.7 us 6536 Bytes=835.583G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x16]/Workgroup[64x1x1]/manual_time 83.8 us 18.4 us 7601 Bytes=801.437G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[2x8]/Workgroup[64x2x1]/manual_time 102 us 10.8 us 6059 Bytes=657.937G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[2x16]/Workgroup[64x2x1]/manual_time 110 us 15.9 us 6062 Bytes=609.361G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x8]/Workgroup[64x2x1]/manual_time 103 us 17.6 us 6073 Bytes=651.854G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x16]/Workgroup[64x2x1]/manual_time 104 us 11.5 us 6162 Bytes=647.435G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x8]/Workgroup[64x4x1]/manual_time 146 us 11.9 us 4441 Bytes=459.739G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x16]/Workgroup[64x4x1]/manual_time 149 us 13.8 us 4380 Bytes=451.568G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[1x8]/Workgroup[64x1x1]/manual_time 358 us 11.2 us 1935 Bytes=751.019G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[1x16]/Workgroup[64x1x1]/manual_time 355 us 15.1 us 1958 Bytes=756.893G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[2x8]/Workgroup[64x1x1]/manual_time 361 us 11.4 us 1920 Bytes=744.092G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[2x16]/Workgroup[64x1x1]/manual_time 357 us 12.9 us 1948 Bytes=752.493G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x8]/Workgroup[64x1x1]/manual_time 372 us 12.4 us 1860 Bytes=722.532G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x16]/Workgroup[64x1x1]/manual_time 369 us 11.0 us 1879 Bytes=727.972G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[2x8]/Workgroup[64x2x1]/manual_time 431 us 15.9 us 1338 Bytes=622.646G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[2x16]/Workgroup[64x2x1]/manual_time 425 us 11.1 us 1337 Bytes=631.244G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x8]/Workgroup[64x2x1]/manual_time 415 us 10.7 us 1356 Bytes=647.791G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x16]/Workgroup[64x2x1]/manual_time 415 us 11.6 us 1358 Bytes=646.999G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x8]/Workgroup[64x4x1]/manual_time 596 us 11.2 us 960 Bytes=450.851G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x16]/Workgroup[64x4x1]/manual_time 595 us 10.9 us 967 Bytes=451.092G/s
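For readers unfamiliar with the prefetching idea mentioned above, here is a minimal CPU-side sketch in C++. It is not the GLSL from this PR: the function name `VmtPrefetched`, the constant `kTile`, and the row-major layout are all hypothetical, and it assumes vmt means a vector multiplied by a transposed matrix with i8 inputs and i32 accumulation. The point is only the loop shape: loads for the next tile are issued before the math on the current tile, so the load latency can overlap with compute.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical tile width (elements loaded per step); not taken from the PR.
constexpr std::size_t kTile = 16;

// Computes out[row] = sum over k of lhs[k] * rhs[row * K + k].
// rhs is an M x K int8 matrix stored row-major, so each output element reads
// one contiguous row (the "transposed" operand layout is an assumption).
std::vector<int32_t> VmtPrefetched(const std::vector<int8_t>& lhs,
                                   const std::vector<int8_t>& rhs,
                                   std::size_t M, std::size_t K) {
  assert(K >= kTile && K % kTile == 0);
  assert(lhs.size() == K && rhs.size() == M * K);
  std::vector<int32_t> out(M, 0);
  for (std::size_t row = 0; row < M; ++row) {
    const int8_t* rhs_row = rhs.data() + row * K;
    int8_t lhs_cur[kTile], rhs_cur[kTile];
    // Prologue: load the first tile before entering the main loop.
    for (std::size_t i = 0; i < kTile; ++i) {
      lhs_cur[i] = lhs[i];
      rhs_cur[i] = rhs_row[i];
    }
    int32_t acc = 0;
    for (std::size_t k = 0; k < K; k += kTile) {
      int8_t lhs_next[kTile] = {};
      int8_t rhs_next[kTile] = {};
      const bool has_next = k + kTile < K;
      // Issue the loads for the next tile first ...
      if (has_next) {
        for (std::size_t i = 0; i < kTile; ++i) {
          lhs_next[i] = lhs[k + kTile + i];
          rhs_next[i] = rhs_row[k + kTile + i];
        }
      }
      // ... then do the math on the tile that is already loaded.
      for (std::size_t i = 0; i < kTile; ++i) {
        acc += static_cast<int32_t>(lhs_cur[i]) * static_cast<int32_t>(rhs_cur[i]);
      }
      // Rotate the prefetched tile into place for the next iteration.
      if (has_next) {
        for (std::size_t i = 0; i < kTile; ++i) {
          lhs_cur[i] = lhs_next[i];
          rhs_cur[i] = rhs_next[i];
        }
      }
    }
    out[row] = acc;
  }
  return out;
}
```

On a CPU the compiler and hardware may reorder these loads anyway; the sketch is meant to show the prologue/steady-state structure, which is what lets a GPU overlap outstanding loads with arithmetic.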
New numbers with a wider load type:
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[1x16]/Workgroup[64x1x1]/manual_time 54.3 us 12.0 us 10363 Bytes=309.157G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[1x32]/Workgroup[64x1x1]/manual_time 52.0 us 10.9 us 12351 Bytes=322.95G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[1x64]/Workgroup[64x1x1]/manual_time 50.8 us 11.1 us 12038 Bytes=330.542G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[1x128]/Workgroup[64x1x1]/manual_time 60.9 us 11.0 us 10596 Bytes=275.744G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[2x16]/Workgroup[64x1x1]/manual_time 52.2 us 12.1 us 12118 Bytes=321.931G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[2x32]/Workgroup[64x1x1]/manual_time 54.5 us 12.2 us 12197 Bytes=308.301G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[2x64]/Workgroup[64x1x1]/manual_time 54.3 us 12.5 us 12236 Bytes=309.169G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[2x128]/Workgroup[64x1x1]/manual_time 59.4 us 11.5 us 11041 Bytes=282.758G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x16]/Workgroup[64x1x1]/manual_time 52.9 us 12.4 us 12068 Bytes=317.288G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x32]/Workgroup[64x1x1]/manual_time 53.0 us 11.6 us 12137 Bytes=317.187G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x64]/Workgroup[64x1x1]/manual_time 60.0 us 17.5 us 12092 Bytes=279.866G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x128]/Workgroup[64x1x1]/manual_time 62.9 us 15.3 us 11022 Bytes=267.132G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[2x16]/Workgroup[64x2x1]/manual_time 58.7 us 10.5 us 10954 Bytes=286.225G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[2x32]/Workgroup[64x2x1]/manual_time 52.8 us 10.4 us 10776 Bytes=318.315G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[2x64]/Workgroup[64x2x1]/manual_time 53.0 us 10.6 us 10903 Bytes=316.809G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[2x128]/Workgroup[64x2x1]/manual_time 61.2 us 10.6 us 9103 Bytes=274.38G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x16]/Workgroup[64x2x1]/manual_time 63.3 us 15.6 us 10332 Bytes=265.485G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x32]/Workgroup[64x2x1]/manual_time 49.8 us 11.0 us 10735 Bytes=337.464G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x64]/Workgroup[64x2x1]/manual_time 55.1 us 10.4 us 11037 Bytes=305.028G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x128]/Workgroup[64x2x1]/manual_time 60.7 us 11.7 us 9199 Bytes=276.889G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x16]/Workgroup[64x4x1]/manual_time 60.1 us 11.1 us 9309 Bytes=279.383G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x32]/Workgroup[64x4x1]/manual_time 63.9 us 12.7 us 9296 Bytes=262.777G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x64]/Workgroup[64x4x1]/manual_time 68.9 us 15.1 us 9552 Bytes=243.732G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x128]/Workgroup[64x4x1]/manual_time 81.5 us 10.2 us 7080 Bytes=206.221G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[1x16]/Workgroup[64x1x1]/manual_time 84.9 us 18.5 us 7350 Bytes=790.845G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[1x32]/Workgroup[64x1x1]/manual_time 72.9 us 13.2 us 7557 Bytes=921.461G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[1x64]/Workgroup[64x1x1]/manual_time 72.6 us 11.4 us 7456 Bytes=925.018G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[1x128]/Workgroup[64x1x1]/manual_time 74.4 us 14.2 us 7521 Bytes=902.598G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[2x16]/Workgroup[64x1x1]/manual_time 75.9 us 10.1 us 7211 Bytes=885.057G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[2x32]/Workgroup[64x1x1]/manual_time 76.5 us 11.2 us 7392 Bytes=877.929G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[2x64]/Workgroup[64x1x1]/manual_time 78.3 us 14.9 us 7486 Bytes=857.515G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[2x128]/Workgroup[64x1x1]/manual_time 74.5 us 10.5 us 7314 Bytes=901.18G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x16]/Workgroup[64x1x1]/manual_time 73.1 us 10.9 us 7144 Bytes=918.353G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x32]/Workgroup[64x1x1]/manual_time 70.2 us 10.6 us 7656 Bytes=956.002G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x64]/Workgroup[64x1x1]/manual_time 75.3 us 10.2 us 7480 Bytes=892.07G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x128]/Workgroup[64x1x1]/manual_time 76.0 us 10.8 us 6602 Bytes=883.558G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[2x16]/Workgroup[64x2x1]/manual_time 105 us 17.0 us 5917 Bytes=636.596G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[2x32]/Workgroup[64x2x1]/manual_time 91.3 us 10.5 us 6277 Bytes=735.314G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[2x64]/Workgroup[64x2x1]/manual_time 92.7 us 10.2 us 6228 Bytes=724.199G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[2x128]/Workgroup[64x2x1]/manual_time 103 us 15.4 us 5939 Bytes=650.338G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x16]/Workgroup[64x2x1]/manual_time 96.8 us 10.6 us 5986 Bytes=693.685G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x32]/Workgroup[64x2x1]/manual_time 97.4 us 10.0 us 6361 Bytes=689.412G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x64]/Workgroup[64x2x1]/manual_time 93.8 us 10.4 us 6072 Bytes=716.008G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x128]/Workgroup[64x2x1]/manual_time 94.3 us 10.5 us 5966 Bytes=712.163G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x16]/Workgroup[64x4x1]/manual_time 143 us 10.2 us 4475 Bytes=468.694G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x32]/Workgroup[64x4x1]/manual_time 143 us 10.4 us 4431 Bytes=469.864G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x64]/Workgroup[64x4x1]/manual_time 148 us 11.3 us 4247 Bytes=453.095G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x128]/Workgroup[64x4x1]/manual_time 147 us 10.3 us 4150 Bytes=457.114G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[1x16]/Workgroup[64x1x1]/manual_time 351 us 10.2 us 1968 Bytes=764.448G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[1x32]/Workgroup[64x1x1]/manual_time 342 us 11.0 us 2047 Bytes=785.825G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[1x64]/Workgroup[64x1x1]/manual_time 340 us 10.9 us 2022 Bytes=790.284G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[1x128]/Workgroup[64x1x1]/manual_time 342 us 14.7 us 2001 Bytes=785.858G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[2x16]/Workgroup[64x1x1]/manual_time 352 us 11.2 us 1952 Bytes=762.24G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[2x32]/Workgroup[64x1x1]/manual_time 344 us 10.9 us 2010 Bytes=781.62G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[2x64]/Workgroup[64x1x1]/manual_time 345 us 11.7 us 2020 Bytes=777.404G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[2x128]/Workgroup[64x1x1]/manual_time 343 us 11.7 us 2026 Bytes=782.997G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x16]/Workgroup[64x1x1]/manual_time 367 us 11.8 us 1851 Bytes=732.426G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x32]/Workgroup[64x1x1]/manual_time 357 us 10.4 us 1926 Bytes=751.48G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x64]/Workgroup[64x1x1]/manual_time 348 us 11.9 us 1985 Bytes=772.219G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x128]/Workgroup[64x1x1]/manual_time 346 us 11.0 us 2012 Bytes=776.623G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[2x16]/Workgroup[64x2x1]/manual_time 423 us 10.9 us 1304 Bytes=634.515G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[2x32]/Workgroup[64x2x1]/manual_time 416 us 10.4 us 1337 Bytes=645.222G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[2x64]/Workgroup[64x2x1]/manual_time 419 us 10.5 us 1225 Bytes=640.945G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[2x128]/Workgroup[64x2x1]/manual_time 425 us 10.4 us 1262 Bytes=631.165G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x16]/Workgroup[64x2x1]/manual_time 415 us 12.5 us 1276 Bytes=646.296G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x32]/Workgroup[64x2x1]/manual_time 407 us 10.4 us 1292 Bytes=659.612G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x64]/Workgroup[64x2x1]/manual_time 418 us 10.4 us 1285 Bytes=642.076G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x128]/Workgroup[64x2x1]/manual_time 427 us 10.9 us 1336 Bytes=629.435G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x16]/Workgroup[64x4x1]/manual_time 589 us 12.8 us 883 Bytes=456.136G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x32]/Workgroup[64x4x1]/manual_time 589 us 11.2 us 881 Bytes=455.663G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x64]/Workgroup[64x4x1]/manual_time 608 us 12.1 us 886 Bytes=441.749G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x128]/Workgroup[64x4x1]/manual_time 628 us 11.5 us 916 Bytes=427.491G/s
@antiagainst @qedawkins I'm pretty happy with this implementation. Should we merge?
Works for me. Can I give it a pass tomorrow first?
Hi, sorry to ask here, but what's special about RDNA3 in this test? I can't run this sample on an Nvidia 4070:
~/code/uVkCompute/build/benchmarks/vmt ./vmt_rdna3
2023-11-07T17:08:45+01:00
Running ./vmt_rdna3
Run on (32 X 5881 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 1024 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 8.08, 5.68, 2.31
WARNING CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
WARNING Library was built as DEBUG. Timings may be affected.
code/uVkCompute/benchmarks/vmt/vmt_main.cc:123: check error: destination buffer element (0) has incorrect value: expected to be 1404 but found -1
^ In shader: Tile[1x16], i8->i32
Abortado (`core' generado)
@oscarbg Note that, as of today, you can see the GLSL compile target here: https://github.com/google/uVkCompute/commit/3049af9a233ab6d49088f2c99e2623f0c2b5be04#diff-62da6f62b4091626b341c9d8333d332aee35c053ff57cacebbb57792b987702aR30
The name is more to communicate that it has been tuned and tested on RDNA3; in the future we may add more target-specific options to the GLSL.
@oscarbg Also, the check error in your log (destination buffer element (0) expected to be 1404 but found -1) indicates that one of the assumptions made in the GLSL does not hold on this target.
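To make the failing check more concrete, the following is a minimal sketch of how a CPU reference comparison for an i8->i32 vmt could look. It is not the code in `vmt_main.cc`; the names `ReferenceElement` and `FirstMismatch` and the row-major layout are hypothetical, and it assumes each output element is the dot product of the input vector with one matrix row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference for one output element:
// expected[row] = sum over k of lhs[k] * rhs[row * K + k], accumulated in int32.
int32_t ReferenceElement(const std::vector<int8_t>& lhs,
                         const std::vector<int8_t>& rhs,
                         std::size_t K, std::size_t row) {
  assert(lhs.size() == K && rhs.size() >= (row + 1) * K);
  int32_t acc = 0;
  for (std::size_t k = 0; k < K; ++k) {
    acc += static_cast<int32_t>(lhs[k]) * static_cast<int32_t>(rhs[row * K + k]);
  }
  return acc;
}

// Returns the index of the first element where the device output disagrees
// with the CPU reference, or -1 if every element matches.
std::ptrdiff_t FirstMismatch(const std::vector<int8_t>& lhs,
                             const std::vector<int8_t>& rhs,
                             const std::vector<int32_t>& device_out,
                             std::size_t M, std::size_t K) {
  assert(device_out.size() == M);
  for (std::size_t row = 0; row < M; ++row) {
    if (device_out[row] != ReferenceElement(lhs, rhs, K, row)) {
      return static_cast<std::ptrdiff_t>(row);
    }
  }
  return -1;
}
```

A mismatch at element 0, as in the log above, means the very first output already differs from the reference, so the problem is unlikely to be an edge case at the end of the buffer.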
Based on https://github.com/google/uVkCompute/pull/38 by @qedawkins, and earlier mmt by @kuhar.
Add benchmarks for vmt, with very similar supporting structure to the existing mmt benchmark.
Changes compared to #38:
The performance depends heavily on the problem size. On the 7900 XTX, I'm seeing numbers up to 945 GB/s at the 8k problem size.
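As a rough guide to reading the `Bytes=` counter in the benchmark logs, here is a small C++ sketch of the arithmetic. The traffic model is an assumption rather than something confirmed by the benchmark source: it counts the M x K int8 matrix read once, the K-element int8 vector read once, and M int32 results written once. The helper names `VmtBytes` and `GigabytesPerSecond` are hypothetical.

```cpp
#include <cstdint>

// Assumed traffic model: matrix (M*K bytes) + vector (K bytes) + output (4*M bytes).
constexpr std::uint64_t VmtBytes(std::uint64_t M, std::uint64_t K) {
  return M * K + K + 4 * M;
}

// Converts a byte count and a time in microseconds to GB/s (1 GB = 1e9 bytes).
inline double GigabytesPerSecond(std::uint64_t bytes, double microseconds) {
  return static_cast<double>(bytes) / (microseconds * 1e-6) / 1e9;
}
```

Under this model, `GigabytesPerSecond(VmtBytes(8192, 8192), 76.9)` comes out at roughly 873 GB/s, which is close to the `Bytes=873.317G/s` logged for the 8192x8192 `Tile[1x8]` row at 76.9 us; the small gap is consistent with the logged time being rounded.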