Open gregy4 opened 1 year ago
I even tried to use aocc compiler since support for zen4 is limited in gcc-12 but I ended with similar results.
array type | typeid name | dtype size | array size | avx512 sort | std sort | speed up |
---|---|---|---|---|---|---|
uniform random | j | 4 | 10000 | 590233 | 1038136 | 1.8 |
uniform random | j | 4 | 100000 | 8664066 | 13623417 | 1.6 |
uniform random | j | 4 | 1000000 | 117262111 | 161208450 | 1.4 |
uniform random | i | 4 | 10000 | 602964 | 1026301 | 1.7 |
uniform random | i | 4 | 100000 | 8858911 | 13546989 | 1.5 |
uniform random | i | 4 | 1000000 | 116628655 | 163222875 | 1.4 |
uniform random | f | 4 | 10000 | 609786 | 1541317 | 2.5 |
uniform random | f | 4 | 100000 | 8699332 | 20898121 | 2.4 |
uniform random | f | 4 | 1000000 | 119333844 | 245811136 | 2.1 |
uniform random | m | 8 | 10000 | 976378 | 1045651 | 1.1 |
uniform random | m | 8 | 100000 | 14633266 | 13486549 | 0.9 |
uniform random | m | 8 | 1000000 | 191530741 | 161761275 | 0.8 |
uniform random | l | 8 | 10000 | 978412 | 1026855 | 1.0 |
uniform random | l | 8 | 100000 | 14243476 | 13692690 | 1.0 |
uniform random | l | 8 | 1000000 | 188651173 | 161367916 | 0.9 |
uniform random | d | 8 | 10000 | 969286 | 1462486 | 1.5 |
uniform random | d | 8 | 100000 | 14418094 | 20359714 | 1.4 |
uniform random | d | 8 | 1000000 | 190162746 | 237778132 | 1.3 |
uniform random | t | 2 | 10000 | 402268 | 1040013 | 2.6 |
uniform random | t | 2 | 100000 | 5960407 | 13963662 | 2.3 |
uniform random | t | 2 | 1000000 | 80146876 | 146804863 | 1.8 |
uniform random | s | 2 | 10000 | 405211 | 1014493 | 2.5 |
uniform random | s | 2 | 100000 | 5957181 | 13663327 | 2.3 |
uniform random | s | 2 | 1000000 | 80590644 | 148901683 | 1.8 |
reverse | j | 4 | 10000 | 542799 | 126346 | 0.2 |
reverse | j | 4 | 100000 | 7797145 | 1489576 | 0.2 |
reverse | j | 4 | 1000000 | 98511309 | 16375909 | 0.2 |
reverse | i | 4 | 10000 | 520938 | 119398 | 0.2 |
reverse | i | 4 | 100000 | 7450528 | 1429902 | 0.2 |
reverse | i | 4 | 1000000 | 97871607 | 16308108 | 0.2 |
reverse | f | 4 | 10000 | 542155 | 315031 | 0.6 |
reverse | f | 4 | 100000 | 7771279 | 3420900 | 0.4 |
reverse | f | 4 | 1000000 | 100643359 | 36369022 | 0.4 |
reverse | m | 8 | 10000 | 866056 | 120451 | 0.1 |
reverse | m | 8 | 100000 | 12288924 | 1437381 | 0.1 |
reverse | m | 8 | 1000000 | 162711846 | 16352523 | 0.1 |
reverse | l | 8 | 10000 | 865872 | 121270 | 0.1 |
reverse | l | 8 | 100000 | 12402765 | 1483924 | 0.1 |
reverse | l | 8 | 1000000 | 163603746 | 16913736 | 0.1 |
reverse | d | 8 | 10000 | 898384 | 130437 | 0.1 |
reverse | d | 8 | 100000 | 12923145 | 1649754 | 0.1 |
reverse | d | 8 | 1000000 | 163598584 | 18106596 | 0.1 |
reverse | t | 2 | 10000 | 364963 | 117297 | 0.3 |
reverse | t | 2 | 100000 | 5114641 | 11812023 | 2.3 |
reverse | t | 2 | 1000000 | 71573971 | 64989841 | 0.9 |
reverse | s | 2 | 10000 | 364770 | 117121 | 0.3 |
reverse | s | 2 | 100000 | 5100520 | 8060481 | 1.6 |
reverse | s | 2 | 1000000 | 71520282 | 61821589 | 0.9 |
ordered | j | 4 | 10000 | 516910 | 155038 | 0.3 |
ordered | j | 4 | 100000 | 7505586 | 1914655 | 0.3 |
ordered | j | 4 | 1000000 | 97838487 | 22667913 | 0.2 |
ordered | i | 4 | 10000 | 515178 | 154467 | 0.3 |
ordered | i | 4 | 100000 | 7497252 | 1917751 | 0.3 |
ordered | i | 4 | 1000000 | 97166173 | 22689558 | 0.2 |
ordered | f | 4 | 10000 | 534312 | 352012 | 0.7 |
ordered | f | 4 | 100000 | 7743415 | 3883455 | 0.5 |
ordered | f | 4 | 1000000 | 99754393 | 42380284 | 0.4 |
ordered | m | 8 | 10000 | 854244 | 156010 | 0.2 |
ordered | m | 8 | 100000 | 12244765 | 1916253 | 0.2 |
ordered | m | 8 | 1000000 | 160971448 | 22707598 | 0.1 |
ordered | l | 8 | 10000 | 859099 | 158184 | 0.2 |
ordered | l | 8 | 100000 | 12303855 | 1932700 | 0.2 |
ordered | l | 8 | 1000000 | 162272349 | 22793251 | 0.1 |
ordered | d | 8 | 10000 | 878319 | 163183 | 0.2 |
ordered | d | 8 | 100000 | 12501567 | 2017044 | 0.2 |
ordered | d | 8 | 1000000 | 163338570 | 23704897 | 0.1 |
ordered | t | 2 | 10000 | 359487 | 153931 | 0.4 |
ordered | t | 2 | 100000 | 5092321 | 8177886 | 1.6 |
ordered | t | 2 | 1000000 | 71484822 | 62354659 | 0.9 |
ordered | s | 2 | 10000 | 359428 | 155119 | 0.4 |
ordered | s | 2 | 100000 | 5106033 | 10950057 | 2.1 |
ordered | s | 2 | 1000000 | 71537350 | 61921390 | 0.9 |
limitedrange | j | 4 | 10000 | 358186 | 179856 | 0.5 |
limitedrange | j | 4 | 100000 | 6984247 | 4882684 | 0.7 |
limitedrange | j | 4 | 1000000 | 172545601 | 49335466 | 0.3 |
limitedrange | i | 4 | 10000 | 874782 | 188181 | 0.2 |
limitedrange | i | 4 | 100000 | 10790883 | 4756131 | 0.4 |
limitedrange | i | 4 | 1000000 | 218877264 | 52580556 | 0.2 |
limitedrange | f | 4 | 10000 | 596326 | 1542762 | 2.6 |
limitedrange | f | 4 | 100000 | 8772957 | 20883406 | 2.4 |
limitedrange | f | 4 | 1000000 | 117754398 | 245358517 | 2.1 |
limitedrange | m | 8 | 10000 | 1428448 | 182871 | 0.1 |
limitedrange | m | 8 | 100000 | 12575439 | 4792626 | 0.4 |
limitedrange | m | 8 | 1000000 | 58792036 | 50422311 | 0.9 |
limitedrange | l | 8 | 10000 | 1042609 | 184293 | 0.2 |
limitedrange | l | 8 | 100000 | 5569605 | 4849209 | 0.9 |
limitedrange | l | 8 | 1000000 | 60339919 | 50194309 | 0.8 |
limitedrange | d | 8 | 10000 | 976828 | 1465632 | 1.5 |
limitedrange | d | 8 | 100000 | 14539702 | 19981831 | 1.4 |
limitedrange | d | 8 | 1000000 | 189751693 | 238011439 | 1.3 |
limitedrange | t | 2 | 10000 | 240822 | 206838 | 0.9 |
limitedrange | t | 2 | 100000 | 8734189 | 4837590 | 0.6 |
limitedrange | t | 2 | 1000000 | 103943574 | 50969911 | 0.5 |
limitedrange | s | 2 | 10000 | 458860 | 180045 | 0.4 |
limitedrange | s | 2 | 100000 | 12315523 | 4791978 | 0.4 |
limitedrange | s | 2 | 1000000 | 103630599 | 49719604 | 0.5 |
Zen 4 based CPUs perform poorly because AMD's implementation of compressstoreu
is extremely inefficient (~142 cycles latency / throughput of 54-72 according to uops.info. It is even worse than an 'emulated' version.
I've also run the benchmark on my 7700X CPU and also get extremely poor results for Zen 4.
Thankfully, replacing calls to compressstoreu
with their emulated versions significantly improves the results.
I'm going to push the fix on a fork and will make a PR.
Thanks for the explanation. It is good that the instruction can be emulated and results can be similar to intel speedup.
The answer for slow performance of AVX512 version of x86-simd-sort on Zen 4 is most probably explained in AMD manuals which could be found at: https://www.amd.com/en/search/documentation/hub.html#q=software%20optimization%20guide%20for%20the%20amd%20microarchitecture&f-amd_document_type=Software%20Optimization%20Guides
Software Optimization Guide for the AMD Zen4 Microarchitecture has following remark in "2.11.2 Code recommendations" chapter:
Avoid the memory destination form of COMPRESS instructions. These forms are implemented using microcode and achieve a lower store bandwidth than their register destination forms which use fastpath macro ops.
Software Optimization Guide for the AMD Zen5 Microarchitecture doesn't have any remark about COMPRESS instructions.
Could you add some code that disables the AVX512 version on Zen4, but keeps it enabled on Zen5 and future Zen architectures?
@tarsa Will this fix work for you? You would need to compile this library with a macro enabled.
@r-devulap I'm thinking more about adding Zen 4 AVX-512 exclusion to runtime dispatch, i.e. probably here: https://github.com/intel/x86-simd-sort/blob/59e298d8c9d1bee2cded744b9adbe31107ee220c/lib/x86simdsort.cpp
In short the idea is: if CPU has AVX-512, but is Zen 4, then use AVX2. Otherwise use AVX-512 on Zen 5 and future ones. That would have less negative performance consequences than solutions like https://github.com/openjdk/jdk/pull/16124
Restriction of the AVX512 sort acceleration to only Intel CPUs. A performance regression (due to micro-architectural differences) was reported for AMD Zen4 CPUs in the comments section of PR.
But AVX512 is faster?? Why would you use AVX2?
But AVX512 is faster?? Why would you use AVX2?
The patch from https://web.archive.org/web/20230319232625/https://github.com/natmaurice/x86-simd-sort/commit/41d03b2d8f3b62a2ee6a3a97a8da7f193a407026 is actually incomplete. It improves things if we only care about Zen 4, but the goal is to have single library that is well optimized for wide range of popular CPUs. If you define SW_VCOMPRESS then you gain performance on Zen 4, but lose performance on Zen 5 and other AVX-512 capable CPUs. If you don't define SW_VCOMPRESS then you get situation from this thread's original post. Therefore, to get the most optimized version in all cases, the code would need to be compiled twice, once with SW_VCOMPRESS defined and once without, and then the dispatch would need to choose between these two copies. That would increase maintenance costs, since there would need to be 2x testing of AVX-512 versions, and only improve performance for Zen 4, which is old generation of CPUs already. I'm not sure if it's worth the effort compared to the (at least simple sounding) fix described in https://github.com/intel/x86-simd-sort/issues/6#issuecomment-2506476505
Hello, I tried benchmark on 7950x cpu and performance is in some tests up to 2.3x faster but in other tests much slower (like 0.3x) compared to classical sorting. Is amd implementation of avx512 not so powerful (and your code is not suitable for zen4) or is it something else ?
Thanks, Jan