performance on amd 7950x ...

gregy4 commented 1 year ago

Hello, I tried benchmark on 7950x cpu and performance is in some tests up to 2.3x faster but in other tests much slower (like 0.3x) compared to classical sorting. Is amd implementation of avx512 not so powerful (and your code is not suitable for zen4) or is it something else ?

Thanks, Jan

gregy4 commented 1 year ago

I even tried to use aocc compiler since support for zen4 is limited in gcc-12 but I ended with similar results.

array type	typeid name	dtype size	array size	avx512 sort	std sort	speed up
uniform random	j	4	10000	590233	1038136	1.8
uniform random	j	4	100000	8664066	13623417	1.6
uniform random	j	4	1000000	117262111	161208450	1.4
uniform random	i	4	10000	602964	1026301	1.7
uniform random	i	4	100000	8858911	13546989	1.5
uniform random	i	4	1000000	116628655	163222875	1.4
uniform random	f	4	10000	609786	1541317	2.5
uniform random	f	4	100000	8699332	20898121	2.4
uniform random	f	4	1000000	119333844	245811136	2.1
uniform random	m	8	10000	976378	1045651	1.1
uniform random	m	8	100000	14633266	13486549	0.9
uniform random	m	8	1000000	191530741	161761275	0.8
uniform random	l	8	10000	978412	1026855	1.0
uniform random	l	8	100000	14243476	13692690	1.0
uniform random	l	8	1000000	188651173	161367916	0.9
uniform random	d	8	10000	969286	1462486	1.5
uniform random	d	8	100000	14418094	20359714	1.4
uniform random	d	8	1000000	190162746	237778132	1.3
uniform random	t	2	10000	402268	1040013	2.6
uniform random	t	2	100000	5960407	13963662	2.3
uniform random	t	2	1000000	80146876	146804863	1.8
uniform random	s	2	10000	405211	1014493	2.5
uniform random	s	2	100000	5957181	13663327	2.3
uniform random	s	2	1000000	80590644	148901683	1.8
reverse	j	4	10000	542799	126346	0.2
reverse	j	4	100000	7797145	1489576	0.2
reverse	j	4	1000000	98511309	16375909	0.2
reverse	i	4	10000	520938	119398	0.2
reverse	i	4	100000	7450528	1429902	0.2
reverse	i	4	1000000	97871607	16308108	0.2
reverse	f	4	10000	542155	315031	0.6
reverse	f	4	100000	7771279	3420900	0.4
reverse	f	4	1000000	100643359	36369022	0.4
reverse	m	8	10000	866056	120451	0.1
reverse	m	8	100000	12288924	1437381	0.1
reverse	m	8	1000000	162711846	16352523	0.1
reverse	l	8	10000	865872	121270	0.1
reverse	l	8	100000	12402765	1483924	0.1
reverse	l	8	1000000	163603746	16913736	0.1
reverse	d	8	10000	898384	130437	0.1
reverse	d	8	100000	12923145	1649754	0.1
reverse	d	8	1000000	163598584	18106596	0.1
reverse	t	2	10000	364963	117297	0.3
reverse	t	2	100000	5114641	11812023	2.3
reverse	t	2	1000000	71573971	64989841	0.9
reverse	s	2	10000	364770	117121	0.3
reverse	s	2	100000	5100520	8060481	1.6
reverse	s	2	1000000	71520282	61821589	0.9
ordered	j	4	10000	516910	155038	0.3
ordered	j	4	100000	7505586	1914655	0.3
ordered	j	4	1000000	97838487	22667913	0.2
ordered	i	4	10000	515178	154467	0.3
ordered	i	4	100000	7497252	1917751	0.3
ordered	i	4	1000000	97166173	22689558	0.2
ordered	f	4	10000	534312	352012	0.7
ordered	f	4	100000	7743415	3883455	0.5
ordered	f	4	1000000	99754393	42380284	0.4
ordered	m	8	10000	854244	156010	0.2
ordered	m	8	100000	12244765	1916253	0.2
ordered	m	8	1000000	160971448	22707598	0.1
ordered	l	8	10000	859099	158184	0.2
ordered	l	8	100000	12303855	1932700	0.2
ordered	l	8	1000000	162272349	22793251	0.1
ordered	d	8	10000	878319	163183	0.2
ordered	d	8	100000	12501567	2017044	0.2
ordered	d	8	1000000	163338570	23704897	0.1
ordered	t	2	10000	359487	153931	0.4
ordered	t	2	100000	5092321	8177886	1.6
ordered	t	2	1000000	71484822	62354659	0.9
ordered	s	2	10000	359428	155119	0.4
ordered	s	2	100000	5106033	10950057	2.1
ordered	s	2	1000000	71537350	61921390	0.9
limitedrange	j	4	10000	358186	179856	0.5
limitedrange	j	4	100000	6984247	4882684	0.7
limitedrange	j	4	1000000	172545601	49335466	0.3
limitedrange	i	4	10000	874782	188181	0.2
limitedrange	i	4	100000	10790883	4756131	0.4
limitedrange	i	4	1000000	218877264	52580556	0.2
limitedrange	f	4	10000	596326	1542762	2.6
limitedrange	f	4	100000	8772957	20883406	2.4
limitedrange	f	4	1000000	117754398	245358517	2.1
limitedrange	m	8	10000	1428448	182871	0.1
limitedrange	m	8	100000	12575439	4792626	0.4
limitedrange	m	8	1000000	58792036	50422311	0.9
limitedrange	l	8	10000	1042609	184293	0.2
limitedrange	l	8	100000	5569605	4849209	0.9
limitedrange	l	8	1000000	60339919	50194309	0.8
limitedrange	d	8	10000	976828	1465632	1.5
limitedrange	d	8	100000	14539702	19981831	1.4
limitedrange	d	8	1000000	189751693	238011439	1.3
limitedrange	t	2	10000	240822	206838	0.9
limitedrange	t	2	100000	8734189	4837590	0.6
limitedrange	t	2	1000000	103943574	50969911	0.5
limitedrange	s	2	10000	458860	180045	0.4
limitedrange	s	2	100000	12315523	4791978	0.4
limitedrange	s	2	1000000	103630599	49719604	0.5

natmaurice commented 1 year ago

Zen 4 based CPUs perform poorly because AMD's implementation of compressstoreu is extremely inefficient (~142 cycles latency / throughput of 54-72 according to uops.info. It is even worse than an 'emulated' version.

I've also run the benchmark on my 7700X CPU and also get extremely poor results for Zen 4.

Thankfully, replacing calls to compressstoreu with their emulated versions significantly improves the results. I'm going to push the fix on a fork and will make a PR.

gregy4 commented 1 year ago

Thanks for the explanation. It is good that the instruction can be emulated and results can be similar to intel speedup.

tarsa commented 1 week ago

The answer for slow performance of AVX512 version of x86-simd-sort on Zen 4 is most probably explained in AMD manuals which could be found at: https://www.amd.com/en/search/documentation/hub.html#q=software%20optimization%20guide%20for%20the%20amd%20microarchitecture&f-amd_document_type=Software%20Optimization%20Guides

Software Optimization Guide for the AMD Zen4 Microarchitecture has following remark in "2.11.2 Code recommendations" chapter:

Avoid the memory destination form of COMPRESS instructions. These forms are implemented using microcode and achieve a lower store bandwidth than their register destination forms which use fastpath macro ops.

Software Optimization Guide for the AMD Zen5 Microarchitecture doesn't have any remark about COMPRESS instructions.

Could you add some code that disables the AVX512 version on Zen4, but keeps it enabled on Zen5 and future Zen architectures?

ValeZAA commented 1 week ago

https://web.archive.org/web/20230319232625/https://github.com/natmaurice/x86-simd-sort/commit/41d03b2d8f3b62a2ee6a3a97a8da7f193a407026

OOPS

r-devulap commented 6 days ago

https://web.archive.org/web/20230319232625/https://github.com/natmaurice/x86-simd-sort/commit/41d03b2d8f3b62a2ee6a3a97a8da7f193a407026

@tarsa Will this fix work for you? You would need to compile this library with a macro enabled.

tarsa commented 3 days ago

@r-devulap I'm thinking more about adding Zen 4 AVX-512 exclusion to runtime dispatch, i.e. probably here: https://github.com/intel/x86-simd-sort/blob/59e298d8c9d1bee2cded744b9adbe31107ee220c/lib/x86simdsort.cpp

In short the idea is: if CPU has AVX-512, but is Zen 4, then use AVX2. Otherwise use AVX-512 on Zen 5 and future ones. That would have less negative performance consequences than solutions like https://github.com/openjdk/jdk/pull/16124

Restriction of the AVX512 sort acceleration to only Intel CPUs. A performance regression (due to micro-architectural differences) was reported for AMD Zen4 CPUs in the comments section of PR.

ValeZAA commented 3 days ago

But AVX512 is faster?? Why would you use AVX2?

tarsa commented 3 days ago

But AVX512 is faster?? Why would you use AVX2?

The patch from https://web.archive.org/web/20230319232625/https://github.com/natmaurice/x86-simd-sort/commit/41d03b2d8f3b62a2ee6a3a97a8da7f193a407026 is actually incomplete. It improves things if we only care about Zen 4, but the goal is to have single library that is well optimized for wide range of popular CPUs. If you define SW_VCOMPRESS then you gain performance on Zen 4, but lose performance on Zen 5 and other AVX-512 capable CPUs. If you don't define SW_VCOMPRESS then you get situation from this thread's original post. Therefore, to get the most optimized version in all cases, the code would need to be compiled twice, once with SW_VCOMPRESS defined and once without, and then the dispatch would need to choose between these two copies. That would increase maintenance costs, since there would need to be 2x testing of AVX-512 versions, and only improve performance for Zen 4, which is old generation of CPUs already. I'm not sure if it's worth the effort compared to the (at least simple sounding) fix described in https://github.com/intel/x86-simd-sort/issues/6#issuecomment-2506476505

intel / x86-simd-sort

performance on amd 7950x ... #6