slavonnet opened this issue 3 years ago
@slavonnet It should be possible to have a half-width (128-bit) version of the AVX2 code, as VPMOVZXDQ is available in AVX; it would be near-identical to a hypothetical SSE4.1 version. Not sure if the SSE4.1 variant would significantly improve upon the SSSE3 variant, but it would also support pre-AVX CPUs such as Nehalem & Westmere, and would probably have the same performance as a pure AVX version on Sandy Bridge and Ivy Bridge.
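For illustration, a minimal sketch of what such a half-width inner loop could look like with SSE4.1 intrinsics (native variant, no byteswap; the function name, accumulator layout, and surrounding context are my assumptions, and the real code would still need the mixing-matrix fixup afterwards):

```c
#include <smmintrin.h>  /* SSE4.1: _mm_cvtepu32_epi64 (PMOVZXDQ) */
#include <stdint.h>

static void
fletcher4_sse41_sketch(const uint32_t *ip, const uint32_t *ipend,
    __m128i acc[4])
{
	for (; ip < ipend; ip += 2) {
		/* Zero-extend two 32-bit words into two 64-bit lanes;
		 * _mm_loadl_epi64 + _mm_cvtepu32_epi64 fuse into a
		 * single memory-operand PMOVZXDQ. */
		__m128i t = _mm_cvtepu32_epi64(
		    _mm_loadl_epi64((const __m128i *)ip));
		acc[0] = _mm_add_epi64(acc[0], t);       /* a += data */
		acc[1] = _mm_add_epi64(acc[1], acc[0]);  /* b += a */
		acc[2] = _mm_add_epi64(acc[2], acc[1]);  /* c += b */
		acc[3] = _mm_add_epi64(acc[3], acc[2]);  /* d += c */
	}
}
```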
Did a quick test (performance only, might be buggy) using (V)PMOVZXDQ. Both the SSE4.1 version (which you can see at https://github.com/rdolbeau/zfs/tree/test_fletcher_sse41) and the AVX version (three-operand VEX-encoded, otherwise identical) are slower than the SSSE3 version on my Skylake. Unfortunately I can't test on more relevant cores at this time (Penryn, Nehalem, Westmere).
Mhm, not sure if I did something wrong here. QEMU+libvirt, Intel i7 950 host. The "copy host CPU model" checkbox is flaky and results in Nehalem-IBRS + mitigations.
prawn@zfstest (SSH) ~/zfs % git status 2021-04-25 14:45:59
On branch test_fletcher_sse41
Your branch is up to date with 'origin/test_fletcher_sse41'.
nothing to commit, working tree clean
prawn@zfstest (SSH) ~/zfs % sudo modprobe zfs 2021-04-25 14:46:03
prawn@zfstest (SSH) ~/zfs % sudo dmesg | tail 2021-04-25 14:46:09
[ 3.880516] snd_hda_codec_generic hdaudioC0D0: hp_outs=0 (0x0/0x0/0x0/0x0/0x0)
[ 3.880516] snd_hda_codec_generic hdaudioC0D0: mono: mono_out=0x0
[ 3.880517] snd_hda_codec_generic hdaudioC0D0: inputs:
[ 3.880519] snd_hda_codec_generic hdaudioC0D0: Line=0x5
[ 3.991406] Adding 1046524k swap on /dev/vda5. Priority:-2 extents:1 across:1046524k FS
[ 57.839621] spl: loading out-of-tree module taints kernel.
[ 57.839702] spl: module verification failed: signature and/or required key missing - tainting kernel
[ 57.843599] znvpair: module license 'CDDL' taints kernel.
[ 57.843600] Disabling lock debugging due to kernel taint
[ 58.020532] ZFS: Loaded module v2.1.99-1, ZFS pool version 5000, ZFS filesystem version 5
prawn@zfstest (SSH) ~/zfs % cat /proc/spl/kstat/zfs/vdev_raidz_bench 2021-04-25 14:46:14
19 0 0x01 -1 0 57838749451 540677310219
implementation gen_p gen_pq gen_pqr rec_p rec_q rec_r rec_pq rec_pr rec_qr rec_pqr
original 380337826 170409366 58926396 733551197 158418062 23918681 61475849 10388186 10690930 6386729
scalar 843940973 201056038 103169881 828440439 289788706 195564322 133491180 106538859 81844513 64254287
sse2 1446517564 644942247 416882187 1654562117 664735990 489993791 324060407 332214787 176674345 98479344
ssse3 1504613526 708577283 417956118 1663217733 1064512845 585129534 580990950 446771209 324301481 284787549
fastest ssse3 ssse3 ssse3 ssse3 ssse3 ssse3 ssse3 ssse3 ssse3 ssse3
prawn@zfstest (SSH) ~/zfs % lscpu 2021-04-25 14:54:12
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 40 bits physical, 48 bits virtual
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 26
Model name: Intel Core i7 9xx (Nehalem Core i7, IBRS update)
Stepping: 3
CPU MHz: 3074.304
BogoMIPS: 6148.60
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
L3 cache: 16384K
NUMA node0 CPU(s): 0,1
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni ssse3 cx16 sse4_1 sse4_2 x2apic popcnt hypervisor lahf_lm cpuid_fault pti ibrs ibpb
Uhm, nevermind...
prawn@zfstest (SSH) ~/zfs % cat /proc/spl/kstat/zfs/fletcher_4_bench 2021-04-25 14:55:14
0 0 0x01 -1 0 57687143798 950948263021
implementation native byteswap
scalar 4249791401 3814151349
superscalar 5619224016 4319551557
superscalar4 5944738069 2336538552
sse2 9130490813 5083391760
ssse3 8759600437 8270570418
sse4_1 8447685480 6720833233
fastest sse2 ssse3
I can re-run this a bunch to make this more scientific if there's interest.
@InsanePrawn Thanks, Bloomfield is Nehalem and was a potential target, but the numbers are also not good. A bit surprised that sse4_1/byteswap is so low, but overall I didn't expect much; static analysis says it's a bad deal.
Innermost loop for ssse3_byteswap (courtesy of http://www.maqao.org/, using ~/maqao.intel64.2.13.2/maqao.intel64 cqa proc=Nehalem_Core_i5i7 ./lib/libzpool/.libs/zfs_fletcher_sse.o conf=expert fct-loops=fletcher_4_ssse3_byteswap, showing just the pipeline view):
Instruction | Nb FU | P0 | P1 | P2 | P3 | P4 | P5 | Latency | Recip. throughput
-----------------------------------------------------------------------------------------------------------------
MOVDQU (%RSI),%XMM5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 1
PSHUFB %XMM7,%XMM5 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 1 | 0.50
MOVDQA %XMM5,%XMM6 | 1 | 0.33 | 0.33 | 0 | 0 | 0 | 0.33 | 1 | 0.33
PUNPCKLDQ %XMM4,%XMM5 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 1 | 0.50
PUNPCKHDQ %XMM4,%XMM6 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 1 | 0.50
PADDQ %XMM5,%XMM0 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 1 | 0.50
PADDQ %XMM0,%XMM1 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 1 | 0.50
PADDQ %XMM1,%XMM2 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 1 | 0.50
PADDQ %XMM2,%XMM3 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 1 | 0.50
PADDQ %XMM6,%XMM0 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 1 | 0.50
PADDQ %XMM0,%XMM1 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 1 | 0.50
PADDQ %XMM1,%XMM2 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 1 | 0.50
PADDQ %XMM2,%XMM3 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 1 | 0.50
ADD $0x10,%RSI | 1 | 0.33 | 0.33 | 0 | 0 | 0 | 0.33 | 1 | 0.33
CMP %RSI,%RDX | 1 | 0.33 | 0.33 | 0 | 0 | 0 | 0.33 | 1 | 0.33
JA 170 <fletcher_4_ssse3_byteswap+0x30> | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2
and for sse41_byteswap:
Instruction | Nb FU | P0 | P1 | P2 | P3 | P4 | P5 | Latency | Recip. throughput
-----------------------------------------------------------------------------------------------------------------
PMOVZXDQ (%RSI),%XMM5 | 1 | 0.50 | 0 | 1 | 0 | 0 | 0.50 | 1 | 2
PMOVZXDQ 0x8(%RSI),%XMM6 | 1 | 0.50 | 0 | 1 | 0 | 0 | 0.50 | 1 | 2
PSHUFB %XMM7,%XMM5 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 1 | 0.50
PSHUFB %XMM7,%XMM6 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 1 | 0.50
PADDQ %XMM5,%XMM0 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 1 | 0.50
PADDQ %XMM0,%XMM1 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 1 | 0.50
PADDQ %XMM1,%XMM2 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 1 | 0.50
PADDQ %XMM2,%XMM3 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 1 | 0.50
PADDQ %XMM6,%XMM0 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 1 | 0.50
PADDQ %XMM0,%XMM1 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 1 | 0.50
PADDQ %XMM1,%XMM2 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 1 | 0.50
PADDQ %XMM2,%XMM3 | 1 | 0.50 | 0 | 0 | 0 | 0 | 0.50 | 1 | 0.50
ADD $0x10,%RSI | 1 | 0.33 | 0.33 | 0 | 0 | 0 | 0.33 | 1 | 0.33
CMP %RSI,%RDX | 1 | 0.33 | 0.33 | 0 | 0 | 0 | 0.33 | 1 | 0.33
JA 268 <fletcher_4_sse41_byteswap+0x28> | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2
In the end the dispatch is more expensive for the sse4_1 variant.
The PMOVZXDQ instructions are too costly and ultimately slow things down. I get similar results for Westmere_Core_i3i5i7, Xeon_E5_v1 and Core_i7X_Xeon_E5E7_v2. MAQAO also has a favorable view of the ssse3 code for Skylake_SP, matching what I observe.
SSE4.1/AVX-specific code doesn't look useful for Fletcher.
My code has an issue because I use Intel syntax while ZFS uses AT&T syntax (see the comparison below). I also found other %ymm (256-bit) variants for AVX and will test them.
The performance test must also include parallel 2-8 thread runs to test the SIMD buffers.
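As an aside on the syntax issue mentioned above, the operand order is reversed between the two; a standalone GCC inline-assembly demo (not from the ZFS code):

```c
/* The same PADDQ in both syntaxes. */
static inline void
paddq_syntax_demo(void)
{
	/* AT&T syntax (GNU as default): source first, destination last,
	 * registers written %%xmmN inside extended asm. */
	__asm__ volatile ("paddq %%xmm5, %%xmm0" ::: "xmm0");

	/* Intel syntax: destination first, no register prefix. */
	__asm__ volatile (".intel_syntax noprefix\n\t"
	    "paddq xmm0, xmm5\n\t"
	    ".att_syntax prefix" ::: "xmm0");
}
```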
My system with 2 x Xeon E5-2690 v1 CPUs and mitigations=off:
[root@vm2 ~]# cat /proc/spl/kstat/zfs/fletcher_4_bench
0 0 0x01 -1 0 24258688081 281261279095611
implementation native byteswap
scalar 4276507017 3143277029
superscalar 5100232862 3786241790
superscalar4 4556284887 3076965513
sse2 7762220122 4088350313
ssse3 7784347987 7103313199
fastest ssse3 ssse3
[root@vm2 ~]# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
stepping : 7
microcode : 0x71a
cpu MHz : 3292.417
cache size : 20480 KB
physical id : 0
siblings : 16
core id : 0
cpu cores : 8
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags : vnmi preemption_timer invvpid ept_x_only ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5785.86
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
[root@vm2 ~]# dmesg | head
[ 0.000000] microcode: microcode updated early to revision 0x71a, date = 2020-03-24
[ 0.000000] Linux version 5.11.16-1.el8.elrepo.x86_64 (mockbuild@f42943b2f9e24912992d9f9b7db73026) (gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), GNU ld version 2.30-79.el8) #1 SMP Mon Apr 19 19:16:48 EDT 2021
[ 0.000000] Command line: BOOT_IMAGE=(hd21,gpt2)/vmlinuz-5.11.16-1.el8.elrepo.x86_64 root=UUID=83065019-cf8c-47b5-90d2-d72ba16df093 ro crashkernel=auto rhgb quiet mitigations=off selinux=0 modprobe.blacklist=mgag200 msr.allow_writes=on pcie_aspm=force skew_tick=1
[root@vm2 ~]# cat /proc/spl/kstat/zfs/vdev_raidz_bench
19 0 0x01 -1 0 25258001729 281395275808357
implementation gen_p gen_pq gen_pqr rec_p rec_q rec_r rec_pq rec_pr rec_qr rec_pqr
original 369707027 139524961 55726791 612280552 129902033 16157233 46981353 9312673 9586634 6373360
scalar 873846034 206399629 93925673 870044070 287716950 180184898 126633031 97933168 67354075 51850443
sse2 1684885258 582927679 314949353 1695501872 601495478 433250179 354324812 309365266 201036194 136993355
ssse3 1706435127 584593287 312356393 1665893489 706222687 516015063 443601538 365662864 264869614 212480609
fastest ssse3 ssse3 sse2 sse2 ssse3 ssse3 ssse3 ssse3 ssse3 ssse3
@slavonnet Not sure what benefit you expect from AVX code in the RAID-Z computations over SSSE3; I don't think there are instructions that would help, so the gain would be just fewer movdqa, which are probably squashed during the register-renaming stage of execution anyway.
Please take a look at https://github.com/simd-everywhere/simde. It looks like a good choice for backward compatibility. It only needs a simple reformatting of the ASM code into C SIMD functions, plus a wrapper function for when the platform does not support the features. In addition, it has vector and OpenMP defines.
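For illustration, a sketch of how the inner loop could be written against SIMDe (hedged: the header path and the simde_-prefixed names follow SIMDe's convention of mirroring the Intel intrinsics; the fletcher4 framing is mine):

```c
#include <simde/x86/sse4.1.h>  /* portable SSE4.1; falls back to scalar */
#include <stdint.h>

/* Same shape as a native SSE4.1 loop, but compiles and runs on any
 * platform SIMDe supports (ARM, POWER, plain C, ...). */
static void
fletcher4_simde_sketch(const uint32_t *ip, const uint32_t *ipend,
    simde__m128i acc[4])
{
	for (; ip < ipend; ip += 2) {
		simde__m128i t = simde_mm_cvtepu32_epi64(
		    simde_mm_loadl_epi64((const simde__m128i *)ip));
		acc[0] = simde_mm_add_epi64(acc[0], t);
		acc[1] = simde_mm_add_epi64(acc[1], acc[0]);
		acc[2] = simde_mm_add_epi64(acc[2], acc[1]);
		acc[3] = simde_mm_add_epi64(acc[3], acc[2]);
	}
}
```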
The critical loop with optimal AVX2 assembly looks like this:
.LBB0_2: # =>This Inner Loop Header: Depth=1
vpmovzxdq (%rsi), %ymm4 # ymm4 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
addq $16, %rsi
vpaddq %ymm4, %ymm0, %ymm0
vpaddq %ymm1, %ymm0, %ymm1
vpaddq %ymm2, %ymm1, %ymm2
vpaddq %ymm3, %ymm2, %ymm3
cmpq %rdx, %rsi
jb .LBB0_2
Changing the code to use AVX1 turns that into this:
.LBB0_2: # =>This Inner Loop Header: Depth=1
vpmovzxdq (%rsi), %xmm4 # xmm4 = mem[0],zero,mem[1],zero
vpmovzxdq 8(%rsi), %xmm5 # xmm5 = mem[0],zero,mem[1],zero
vextractf128 $1, %ymm0, %xmm6
vextractf128 $1, %ymm1, %xmm7
vextractf128 $1, %ymm2, %xmm8
addq $16, %rsi
vpaddq %xmm5, %xmm6, %xmm5
vpaddq %xmm4, %xmm0, %xmm4
vinsertf128 $1, %xmm5, %ymm4, %ymm0
vpaddq %xmm7, %xmm5, %xmm5
vextractf128 $1, %ymm3, %xmm7
vpaddq %xmm1, %xmm4, %xmm4
vinsertf128 $1, %xmm5, %ymm4, %ymm1
vpaddq %xmm2, %xmm4, %xmm4
vpaddq %xmm5, %xmm8, %xmm5
vinsertf128 $1, %xmm5, %ymm4, %ymm2
vpaddq %xmm3, %xmm4, %xmm3
vpaddq %xmm7, %xmm5, %xmm5
vinsertf128 $1, %xmm5, %ymm3, %ymm3
cmpq %rdx, %rsi
jb .LBB0_2
The problem is that Intel omitted support for vpmovzxdq and vpaddq operations on 256-bit registers from AVX1, so while AVX1 has the instructions, you need to do a vextractf128/vinsertf128 dance to use them on 256-bit values.
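Expressed with intrinsics, the dance for a single 256-bit addition of 64-bit lanes looks like this (a sketch; AVX1 has no _mm256_add_epi64, so the add goes through the 128-bit halves):

```c
#include <immintrin.h>

/* Emulate a 256-bit vpaddq under AVX1: split each ymm into 128-bit
 * halves, add with the SSE2 paddq, and reassemble. */
static inline __m256i
avx1_add_epi64(__m256i x, __m256i y)
{
	__m128i lo = _mm_add_epi64(_mm256_castsi256_si128(x),
	    _mm256_castsi256_si128(y));
	__m128i hi = _mm_add_epi64(_mm256_extractf128_si256(x, 1),
	    _mm256_extractf128_si256(y, 1));
	return (_mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1));
}
```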
An SSE4.1 version is more succinct:
.LBB0_2: # =>This Inner Loop Header: Depth=1
pmovzxdq 8(%rsi), %xmm1 # xmm1 = mem[0],zero,mem[1],zero
pmovzxdq (%rsi), %xmm9 # xmm9 = mem[0],zero,mem[1],zero
addq $16, %rsi
paddq %xmm1, %xmm3
paddq %xmm9, %xmm2
paddq %xmm3, %xmm5
paddq %xmm2, %xmm4
paddq %xmm4, %xmm6
paddq %xmm5, %xmm0
paddq %xmm0, %xmm8
paddq %xmm6, %xmm7
cmpq %rdx, %rsi
jb .LBB0_2
For completeness, here is a SSE2 version of the loop:
.LBB0_2: # =>This Inner Loop Header: Depth=1
movdqa (%rsi), %xmm7
addq $16, %rsi
movdqa %xmm7, %xmm1
punpckhdq %xmm8, %xmm1 # xmm1 = xmm1[2],xmm8[2],xmm1[3],xmm8[3]
punpckldq %xmm8, %xmm7 # xmm7 = xmm7[0],xmm8[0],xmm7[1],xmm8[1]
paddq %xmm1, %xmm3
paddq %xmm7, %xmm2
paddq %xmm3, %xmm5
paddq %xmm2, %xmm4
paddq %xmm4, %xmm6
paddq %xmm5, %xmm0
paddq %xmm0, %xmm9
paddq %xmm6, %xmm10
cmpq %rdx, %rsi
jb .LBB0_2
Note that I have a local version of the code that uses generic GNU C vector operations that I am feeding to Clang to produce these different versions. It will likely be in a PR in the near future. Interestingly, the AVX2 version above uses the same instructions as the hand-written Intel one, except the addq is placed earlier in the loop than where the Intel version placed it.
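To give an idea of what those generic vector operations look like (a sketch with my own naming, not the actual PR code):

```c
#include <stdint.h>

/* Four 64-bit lanes. With -mavx2, Clang turns the widening load into
 * vpmovzxdq and the lane-wise += into vpaddq on %ymm registers,
 * producing a loop like the AVX2 one above. */
typedef uint64_t fletcher4_v4 __attribute__((vector_size(32)));

static void
fletcher4_generic_sketch(const uint32_t *ip, const uint32_t *ipend,
    fletcher4_v4 *a, fletcher4_v4 *b, fletcher4_v4 *c, fletcher4_v4 *d)
{
	for (; ip < ipend; ip += 4) {
		fletcher4_v4 t = { ip[0], ip[1], ip[2], ip[3] };
		*a += t;
		*b += *a;
		*c += *b;
		*d += *c;
	}
}
```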
Anyway, it is fairly simple to answer the question of why Clang used the various instructions in different ways by looking them up:
https://www.felixcloutier.com/x86/pmovzx https://www.felixcloutier.com/x86/paddb:paddw:paddd:paddq
That said, since AVX1 can do at most 2 64-bit integer additions in parallel per instruction, I would not expect it to perform much better than the SSEx versions.
> Note that I have a local version of the code that uses generic GNU C vector operations that I am feeding to Clang to produce these different versions. It will likely be in a PR in the near future.
That PR is now open as #14234. It is trivial to modify the code to generate AVX assembly (remove the inline assembly first), but I doubt it will be an improvement over the current SSSE3 assembly.
I have devised an equation that predicts the maximum fletcher4 performance possible on a given microprocessor, which we can use to determine whether there is any benefit from doing an AVX rewrite.
max fletcher4 performance = MIN(max single core memory bandwidth, number of parallel 64-bit additions per cycle * clock speed)
We apply a 90% fudge factor to this to account for overhead from the mixing matrix and the register state save/restore needed inside the Linux kernel. So far, we have seen ~98% of predicted performance on the Ampere Altra and ~95% (+/- 1%) on Zen 3.
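As a concrete restatement of the model (a sketch; the function name and unit bookkeeping are mine):

```c
#include <stdio.h>

/* Predicted maximum fletcher4 throughput in GB/s, per the formula
 * above. Fletcher4 does 4 additions per 4-byte input word, so N
 * parallel 64-bit additions per cycle move N bytes per cycle. */
static double
fletcher4_max_gbps(double single_core_mem_bw_gbps,
    double adds_per_cycle, double clock_ghz)
{
	double compute_gbps = adds_per_cycle * clock_ghz;
	double limit = compute_gbps < single_core_mem_bw_gbps ?
	    compute_gbps : single_core_mem_bw_gbps;
	return (0.90 * limit);  /* the 90% fudge factor */
}

int
main(void)
{
	/* The i7-950 case worked through below: ~11.1 GB/s single-core
	 * bandwidth, 4 parallel additions/cycle, 3.33 GHz boost clock. */
	printf("%.1f GB/s\n", fletcher4_max_gbps(11.1, 4.0, 3.33));
	return (0);
}
```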
> Uhm, nevermind...
>
> prawn@zfstest (SSH) ~/zfs % cat /proc/spl/kstat/zfs/fletcher_4_bench 2021-04-25 14:55:14
> 0 0 0x01 -1 0 57687143798 950948263021
> implementation native byteswap
> scalar 4249791401 3814151349
> superscalar 5619224016 4319551557
> superscalar4 5944738069 2336538552
> sse2 9130490813 5083391760
> ssse3 8759600437 8270570418
> sse4_1 8447685480 6720833233
> fastest sse2 ssse3
>
> I can re-run this a bunch to make this more scientific if there's interest.
Above, we have numbers for the i7 950, so I will use that as the basis for these calculations.
https://ark.intel.com/content/www/us/en/ark/products/37150/intel-core-i7950-processor-8m-cache-3-06-ghz-4-80-gts-intel-qpi.html https://agner.org/optimize/instruction_tables.pdf
Unfortunately, without AVX2, we cannot perform parallel unsigned integer additions in AVX registers (correct me if I am wrong), so we are limited to SSE4.2 (and all prior SSE versions). Intel says it is Bloomfield, but Wikipedia says that is Nehalem, so I will use the tables for Nehalem. The reciprocal throughput according to Agner Fog is 1.5. This means that we can process 3 parallel 64-bit additions per cycle using the xmm registers.
The boost clock according to Intel is 3.33GHz. This gives us a limit of 10GB/sec per core before considering memory bandwidth. Intel's memory bandwidth numbers are for the entire CPU rather than what individual cores are capable of doing. I cannot find published data for the single core memory bandwidth, but let us assume that it can do that much. Then we apply our 90% fudge factor to set a litmus test of 9.0GB/sec, where we might be able to improve performance if the benchmark shows lower performance than this. The above fletcher benchmark shows 9.1GB/sec. That means that there is no room left here for improvement.
The previous post made a mistake. I used the number for a horizontal add, rather than a regular SIMD add. I also misinterpreted the meaning of the number.
The hardware can perform 4 parallel 64-bit additions per cycle using the xmm registers. Since fletcher4 does 4 additions per 4-byte input word, that is one word (4 bytes) per cycle, which at 3.33GHz gives us an upper limit of 13.32GB/sec before the memory bandwidth limit or the fudge factor. As for the single core memory bandwidth, I cannot find a number for it, but I can find one for the very similar i7-940, which is 11.1GB/sec:
https://web.archive.org/web/20090612193656/http://techreport.com/articles.x/17023/3
With the fudge factor, that gives us 10GB/sec, and the i7 950 above is doing a little below that, so perhaps there is room for improvement. At the moment, the code is using a 2-accumulator-stream fletcher4 implementation that has been unrolled once. The current NEON code uses the same approach, and SSE2 and NEON are very similar SIMD instruction sets, since they both operate on 128-bit vectors. In #14219, we found that switching to a carefully written 4-stream version boosted performance by 50% on the latest ARM NEON hardware. I suspect doing the same here would give a small boost.
Note that the assembly I gave above was for the 4-stream version. Unrolling the 2-stream version so that two iterations are done in one loop iteration makes it difficult to distinguish from the 4-stream version at a glance. It was not until my equation showed that we had room for improvement that I scrutinized it more closely and noticed the difference.
AVX2 and AVX are not so different. Please add support for AVX-only CPUs (Sandy Bridge, Ivy Bridge). I think you need only very small changes to support AVX.