intel / x86-simd-sort

C++ template library for high performance SIMD based sorting algorithms
BSD 3-Clause "New" or "Revised" License
893 stars 59 forks source link

performance on amd 7950x ... #6

Open gregy4 opened 1 year ago

gregy4 commented 1 year ago

Hello, I tried benchmark on 7950x cpu and performance is in some tests up to 2.3x faster but in other tests much slower (like 0.3x) compared to classical sorting. Is amd implementation of avx512 not so powerful (and your code is not suitable for zen4) or is it something else ?

Thanks, Jan

gregy4 commented 1 year ago

I even tried to use aocc compiler since support for zen4 is limited in gcc-12 but I ended with similar results.

array type typeid name dtype size array size avx512 sort std sort speed up
uniform random j 4 10000 590233 1038136 1.8
uniform random j 4 100000 8664066 13623417 1.6
uniform random j 4 1000000 117262111 161208450 1.4
uniform random i 4 10000 602964 1026301 1.7
uniform random i 4 100000 8858911 13546989 1.5
uniform random i 4 1000000 116628655 163222875 1.4
uniform random f 4 10000 609786 1541317 2.5
uniform random f 4 100000 8699332 20898121 2.4
uniform random f 4 1000000 119333844 245811136 2.1
uniform random m 8 10000 976378 1045651 1.1
uniform random m 8 100000 14633266 13486549 0.9
uniform random m 8 1000000 191530741 161761275 0.8
uniform random l 8 10000 978412 1026855 1.0
uniform random l 8 100000 14243476 13692690 1.0
uniform random l 8 1000000 188651173 161367916 0.9
uniform random d 8 10000 969286 1462486 1.5
uniform random d 8 100000 14418094 20359714 1.4
uniform random d 8 1000000 190162746 237778132 1.3
uniform random t 2 10000 402268 1040013 2.6
uniform random t 2 100000 5960407 13963662 2.3
uniform random t 2 1000000 80146876 146804863 1.8
uniform random s 2 10000 405211 1014493 2.5
uniform random s 2 100000 5957181 13663327 2.3
uniform random s 2 1000000 80590644 148901683 1.8
reverse j 4 10000 542799 126346 0.2
reverse j 4 100000 7797145 1489576 0.2
reverse j 4 1000000 98511309 16375909 0.2
reverse i 4 10000 520938 119398 0.2
reverse i 4 100000 7450528 1429902 0.2
reverse i 4 1000000 97871607 16308108 0.2
reverse f 4 10000 542155 315031 0.6
reverse f 4 100000 7771279 3420900 0.4
reverse f 4 1000000 100643359 36369022 0.4
reverse m 8 10000 866056 120451 0.1
reverse m 8 100000 12288924 1437381 0.1
reverse m 8 1000000 162711846 16352523 0.1
reverse l 8 10000 865872 121270 0.1
reverse l 8 100000 12402765 1483924 0.1
reverse l 8 1000000 163603746 16913736 0.1
reverse d 8 10000 898384 130437 0.1
reverse d 8 100000 12923145 1649754 0.1
reverse d 8 1000000 163598584 18106596 0.1
reverse t 2 10000 364963 117297 0.3
reverse t 2 100000 5114641 11812023 2.3
reverse t 2 1000000 71573971 64989841 0.9
reverse s 2 10000 364770 117121 0.3
reverse s 2 100000 5100520 8060481 1.6
reverse s 2 1000000 71520282 61821589 0.9
ordered j 4 10000 516910 155038 0.3
ordered j 4 100000 7505586 1914655 0.3
ordered j 4 1000000 97838487 22667913 0.2
ordered i 4 10000 515178 154467 0.3
ordered i 4 100000 7497252 1917751 0.3
ordered i 4 1000000 97166173 22689558 0.2
ordered f 4 10000 534312 352012 0.7
ordered f 4 100000 7743415 3883455 0.5
ordered f 4 1000000 99754393 42380284 0.4
ordered m 8 10000 854244 156010 0.2
ordered m 8 100000 12244765 1916253 0.2
ordered m 8 1000000 160971448 22707598 0.1
ordered l 8 10000 859099 158184 0.2
ordered l 8 100000 12303855 1932700 0.2
ordered l 8 1000000 162272349 22793251 0.1
ordered d 8 10000 878319 163183 0.2
ordered d 8 100000 12501567 2017044 0.2
ordered d 8 1000000 163338570 23704897 0.1
ordered t 2 10000 359487 153931 0.4
ordered t 2 100000 5092321 8177886 1.6
ordered t 2 1000000 71484822 62354659 0.9
ordered s 2 10000 359428 155119 0.4
ordered s 2 100000 5106033 10950057 2.1
ordered s 2 1000000 71537350 61921390 0.9
limitedrange j 4 10000 358186 179856 0.5
limitedrange j 4 100000 6984247 4882684 0.7
limitedrange j 4 1000000 172545601 49335466 0.3
limitedrange i 4 10000 874782 188181 0.2
limitedrange i 4 100000 10790883 4756131 0.4
limitedrange i 4 1000000 218877264 52580556 0.2
limitedrange f 4 10000 596326 1542762 2.6
limitedrange f 4 100000 8772957 20883406 2.4
limitedrange f 4 1000000 117754398 245358517 2.1
limitedrange m 8 10000 1428448 182871 0.1
limitedrange m 8 100000 12575439 4792626 0.4
limitedrange m 8 1000000 58792036 50422311 0.9
limitedrange l 8 10000 1042609 184293 0.2
limitedrange l 8 100000 5569605 4849209 0.9
limitedrange l 8 1000000 60339919 50194309 0.8
limitedrange d 8 10000 976828 1465632 1.5
limitedrange d 8 100000 14539702 19981831 1.4
limitedrange d 8 1000000 189751693 238011439 1.3
limitedrange t 2 10000 240822 206838 0.9
limitedrange t 2 100000 8734189 4837590 0.6
limitedrange t 2 1000000 103943574 50969911 0.5
limitedrange s 2 10000 458860 180045 0.4
limitedrange s 2 100000 12315523 4791978 0.4
limitedrange s 2 1000000 103630599 49719604 0.5
natmaurice commented 1 year ago

Zen 4 based CPUs perform poorly because AMD's implementation of compressstoreu is extremely inefficient (~142 cycles latency / throughput of 54-72 according to uops.info. It is even worse than an 'emulated' version.

I've also run the benchmark on my 7700X CPU and also get extremely poor results for Zen 4.

Thankfully, replacing calls to compressstoreu with their emulated versions significantly improves the results. I'm going to push the fix on a fork and will make a PR.

gregy4 commented 1 year ago

Thanks for the explanation. It is good that the instruction can be emulated and results can be similar to intel speedup.

tarsa commented 1 week ago

The answer for slow performance of AVX512 version of x86-simd-sort on Zen 4 is most probably explained in AMD manuals which could be found at: https://www.amd.com/en/search/documentation/hub.html#q=software%20optimization%20guide%20for%20the%20amd%20microarchitecture&f-amd_document_type=Software%20Optimization%20Guides

Software Optimization Guide for the AMD Zen4 Microarchitecture has following remark in "2.11.2 Code recommendations" chapter:

Avoid the memory destination form of COMPRESS instructions. These forms are implemented using microcode and achieve a lower store bandwidth than their register destination forms which use fastpath macro ops.

Software Optimization Guide for the AMD Zen5 Microarchitecture doesn't have any remark about COMPRESS instructions.

Could you add some code that disables the AVX512 version on Zen4, but keeps it enabled on Zen5 and future Zen architectures?

ValeZAA commented 1 week ago

https://web.archive.org/web/20230319232625/https://github.com/natmaurice/x86-simd-sort/commit/41d03b2d8f3b62a2ee6a3a97a8da7f193a407026

OOPS

r-devulap commented 6 days ago

https://web.archive.org/web/20230319232625/https://github.com/natmaurice/x86-simd-sort/commit/41d03b2d8f3b62a2ee6a3a97a8da7f193a407026

@tarsa Will this fix work for you? You would need to compile this library with a macro enabled.

tarsa commented 3 days ago

@r-devulap I'm thinking more about adding Zen 4 AVX-512 exclusion to runtime dispatch, i.e. probably here: https://github.com/intel/x86-simd-sort/blob/59e298d8c9d1bee2cded744b9adbe31107ee220c/lib/x86simdsort.cpp

In short the idea is: if CPU has AVX-512, but is Zen 4, then use AVX2. Otherwise use AVX-512 on Zen 5 and future ones. That would have less negative performance consequences than solutions like https://github.com/openjdk/jdk/pull/16124

Restriction of the AVX512 sort acceleration to only Intel CPUs. A performance regression (due to micro-architectural differences) was reported for AMD Zen4 CPUs in the comments section of PR.

ValeZAA commented 3 days ago

But AVX512 is faster?? Why would you use AVX2?

tarsa commented 3 days ago

But AVX512 is faster?? Why would you use AVX2?

The patch from https://web.archive.org/web/20230319232625/https://github.com/natmaurice/x86-simd-sort/commit/41d03b2d8f3b62a2ee6a3a97a8da7f193a407026 is actually incomplete. It improves things if we only care about Zen 4, but the goal is to have single library that is well optimized for wide range of popular CPUs. If you define SW_VCOMPRESS then you gain performance on Zen 4, but lose performance on Zen 5 and other AVX-512 capable CPUs. If you don't define SW_VCOMPRESS then you get situation from this thread's original post. Therefore, to get the most optimized version in all cases, the code would need to be compiled twice, once with SW_VCOMPRESS defined and once without, and then the dispatch would need to choose between these two copies. That would increase maintenance costs, since there would need to be 2x testing of AVX-512 versions, and only improve performance for Zen 4, which is old generation of CPUs already. I'm not sure if it's worth the effort compared to the (at least simple sounding) fix described in https://github.com/intel/x86-simd-sort/issues/6#issuecomment-2506476505