openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Rewrite AVX2 to support AVX(-128) only feature #11788

Open slavonnet opened 3 years ago

slavonnet commented 3 years ago

AVX2 and AVX are not so different. Please add support for AVX-only CPUs (Sandy Bridge, Ivy Bridge). I think only very small changes are needed to support AVX.

slavonnet commented 3 years ago

https://github.com/openzfs/zfs/pull/11909

rdolbeau commented 3 years ago

@slavonnet It should be possible to have a half-width (128-bit) version of the AVX2 code, as VPMOVZXDQ is available in AVX; it would be near-identical to a hypothetical SSE4.1 version. Not sure if the SSE4.1 variant would significantly improve upon the SSSE3 variant, but it would also support pre-AVX CPUs such as Nehalem & Westmere, and would probably have the same performance as a pure AVX version on Sandy Bridge and Ivy Bridge.
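
For illustration, here is a minimal sketch (hypothetical names, not the actual ZFS implementation, and omitting the byteswap variant and FPU context handling) of what such a half-width loop could look like with SSE4.1 intrinsics; compiled with -mavx, the same intrinsics come out in the three-operand VEX-encoded form:

#include <smmintrin.h>	/* SSE4.1: _mm_cvtepu32_epi64 == PMOVZXDQ */
#include <stdint.h>
#include <stddef.h>

/* Two 64-bit lanes per accumulator; consumes four 32-bit words per pass.
 * Assumes size is a multiple of 16 bytes. */
static void
fletcher_4_sse41_sketch(const uint32_t *ip, size_t size, uint64_t acc[4][2])
{
	const uint32_t *end = ip + size / sizeof (uint32_t);
	__m128i a = _mm_loadu_si128((const __m128i *)acc[0]);
	__m128i b = _mm_loadu_si128((const __m128i *)acc[1]);
	__m128i c = _mm_loadu_si128((const __m128i *)acc[2]);
	__m128i d = _mm_loadu_si128((const __m128i *)acc[3]);

	for (; ip < end; ip += 4) {
		/* PMOVZXDQ: zero-extend two u32 words into two u64 lanes. */
		__m128i lo = _mm_cvtepu32_epi64(
		    _mm_loadl_epi64((const __m128i *)ip));
		__m128i hi = _mm_cvtepu32_epi64(
		    _mm_loadl_epi64((const __m128i *)(ip + 2)));

		a = _mm_add_epi64(a, lo);
		b = _mm_add_epi64(b, a);
		c = _mm_add_epi64(c, b);
		d = _mm_add_epi64(d, c);

		a = _mm_add_epi64(a, hi);
		b = _mm_add_epi64(b, a);
		c = _mm_add_epi64(c, b);
		d = _mm_add_epi64(d, c);
	}

	_mm_storeu_si128((__m128i *)acc[0], a);
	_mm_storeu_si128((__m128i *)acc[1], b);
	_mm_storeu_si128((__m128i *)acc[2], c);
	_mm_storeu_si128((__m128i *)acc[3], d);
}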

rdolbeau commented 3 years ago

Did a quick test (performance only, might be buggy) using (V)PMOVZXDQ. Both the SSE4.1 version (which you can see at https://github.com/rdolbeau/zfs/tree/test_fletcher_sse41) and the AVX version (three-operand VEX-encoded, otherwise identical) are slower than the SSSE3 version on my Skylake. Unfortunately I can't test on more relevant cores at this time (Penryn, Nehalem, Westmere).

InsanePrawn commented 3 years ago

Hmm, not sure if I did something wrong here. QEMU+libvirt, Intel i7 950 host. The 'copy host model' checkbox is flaky and results in nehalem-ibrs + mitigations.

prawn@zfstest (SSH) ~/zfs % git status                                                                                                                                                                                                2021-04-25 14:45:59
On branch test_fletcher_sse41
Your branch is up to date with 'origin/test_fletcher_sse41'.

nothing to commit, working tree clean
prawn@zfstest (SSH) ~/zfs % sudo modprobe zfs                                                                                                                                                                                         2021-04-25 14:46:03
prawn@zfstest (SSH) ~/zfs % sudo dmesg | tail                                                                                                                                                                                         2021-04-25 14:46:09
[    3.880516] snd_hda_codec_generic hdaudioC0D0:    hp_outs=0 (0x0/0x0/0x0/0x0/0x0)
[    3.880516] snd_hda_codec_generic hdaudioC0D0:    mono: mono_out=0x0
[    3.880517] snd_hda_codec_generic hdaudioC0D0:    inputs:
[    3.880519] snd_hda_codec_generic hdaudioC0D0:      Line=0x5
[    3.991406] Adding 1046524k swap on /dev/vda5.  Priority:-2 extents:1 across:1046524k FS
[   57.839621] spl: loading out-of-tree module taints kernel.
[   57.839702] spl: module verification failed: signature and/or required key missing - tainting kernel
[   57.843599] znvpair: module license 'CDDL' taints kernel.
[   57.843600] Disabling lock debugging due to kernel taint
[   58.020532] ZFS: Loaded module v2.1.99-1, ZFS pool version 5000, ZFS filesystem version 5
prawn@zfstest (SSH) ~/zfs % cat /proc/spl/kstat/zfs/vdev_raidz_bench                                                                                                                                                                  2021-04-25 14:46:14
19 0 0x01 -1 0 57838749451 540677310219
implementation   gen_p           gen_pq          gen_pqr         rec_p           rec_q           rec_r           rec_pq          rec_pr          rec_qr          rec_pqr         
original         380337826       170409366       58926396        733551197       158418062       23918681        61475849        10388186        10690930        6386729         
scalar           843940973       201056038       103169881       828440439       289788706       195564322       133491180       106538859       81844513        64254287        
sse2             1446517564      644942247       416882187       1654562117      664735990       489993791       324060407       332214787       176674345       98479344        
ssse3            1504613526      708577283       417956118       1663217733      1064512845      585129534       580990950       446771209       324301481       284787549       
fastest          ssse3           ssse3           ssse3           ssse3           ssse3           ssse3           ssse3           ssse3           ssse3           ssse3           
prawn@zfstest (SSH) ~/zfs % lscpu                                                                                                                                                                                                     2021-04-25 14:54:12
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       40 bits physical, 48 bits virtual
CPU(s):              2
On-line CPU(s) list: 0,1
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           2
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               26
Model name:          Intel Core i7 9xx (Nehalem Core i7, IBRS update)
Stepping:            3
CPU MHz:             3074.304
BogoMIPS:            6148.60
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0,1
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni ssse3 cx16 sse4_1 sse4_2 x2apic popcnt hypervisor lahf_lm cpuid_fault pti ibrs ibpb

InsanePrawn commented 3 years ago

Uhm, nevermind...

prawn@zfstest (SSH) ~/zfs % cat /proc/spl/kstat/zfs/fletcher_4_bench                                                                                                                                                                  2021-04-25 14:55:14
0 0 0x01 -1 0 57687143798 950948263021
implementation   native         byteswap       
scalar           4249791401     3814151349     
superscalar      5619224016     4319551557     
superscalar4     5944738069     2336538552     
sse2             9130490813     5083391760     
ssse3            8759600437     8270570418     
sse4_1           8447685480     6720833233     
fastest          sse2           ssse3   

I can re-run this a bunch to make this more scientific if there's interest.

rdolbeau commented 3 years ago

@InsanePrawn Thanks. Bloomfield is Nehalem and was a potential target, but the numbers are also not good. A bit surprised that sse4_1/byteswap is so low, but overall I didn't expect much; static analysis says it's a bad deal.

Innermost loop for ssse3_byteswap (courtesy of http://www.maqao.org/, using ~/maqao.intel64.2.13.2/maqao.intel64 cqa proc=Nehalem_Core_i5i7 ./lib/libzpool/.libs/zfs_fletcher_sse.o conf=expert fct-loops=fletcher_4_ssse3_byteswap, showing just the pipeline view):

Instruction                             | Nb FU | P0   | P1   | P2 | P3 | P4 | P5   | Latency | Recip. throughput                                                                                                                                                              
-----------------------------------------------------------------------------------------------------------------                                                                                                                                                              
MOVDQU (%RSI),%XMM5                     | 1     | 0    | 0    | 1  | 0  | 0  | 0    | 2       | 1                                                                                                                                                                              
PSHUFB %XMM7,%XMM5                      | 1     | 0.50 | 0    | 0  | 0  | 0  | 0.50 | 1       | 0.50                                                                                                                                                                           
MOVDQA %XMM5,%XMM6                      | 1     | 0.33 | 0.33 | 0  | 0  | 0  | 0.33 | 1       | 0.33                                                                                                                                                                           
PUNPCKLDQ %XMM4,%XMM5                   | 1     | 0.50 | 0    | 0  | 0  | 0  | 0.50 | 1       | 0.50                                                                                                                                                                           
PUNPCKHDQ %XMM4,%XMM6                   | 1     | 0.50 | 0    | 0  | 0  | 0  | 0.50 | 1       | 0.50                                                                                                                                                                           
PADDQ %XMM5,%XMM0                       | 1     | 0.50 | 0    | 0  | 0  | 0  | 0.50 | 1       | 0.50                                                                                                                                                                           
PADDQ %XMM0,%XMM1                       | 1     | 0.50 | 0    | 0  | 0  | 0  | 0.50 | 1       | 0.50                                                                                                                                                                           
PADDQ %XMM1,%XMM2                       | 1     | 0.50 | 0    | 0  | 0  | 0  | 0.50 | 1       | 0.50                                                                                                                                                                           
PADDQ %XMM2,%XMM3                       | 1     | 0.50 | 0    | 0  | 0  | 0  | 0.50 | 1       | 0.50                                                                                                                                                                           
PADDQ %XMM6,%XMM0                       | 1     | 0.50 | 0    | 0  | 0  | 0  | 0.50 | 1       | 0.50                                                                                                                                                                           
PADDQ %XMM0,%XMM1                       | 1     | 0.50 | 0    | 0  | 0  | 0  | 0.50 | 1       | 0.50                                                                                                                                                                           
PADDQ %XMM1,%XMM2                       | 1     | 0.50 | 0    | 0  | 0  | 0  | 0.50 | 1       | 0.50                                                                                                                                                                           
PADDQ %XMM2,%XMM3                       | 1     | 0.50 | 0    | 0  | 0  | 0  | 0.50 | 1       | 0.50                                                                                                                                                                           
ADD $0x10,%RSI                          | 1     | 0.33 | 0.33 | 0  | 0  | 0  | 0.33 | 1       | 0.33                                                                                                                                                                           
CMP %RSI,%RDX                           | 1     | 0.33 | 0.33 | 0  | 0  | 0  | 0.33 | 1       | 0.33                                                                                                                                                                           
JA 170 <fletcher_4_ssse3_byteswap+0x30> | 1     | 0    | 0    | 0  | 0  | 0  | 1    | 0       | 2   

and for sse41_byteswap:

Instruction                             | Nb FU | P0   | P1   | P2 | P3 | P4 | P5   | Latency | Recip. throughput
-----------------------------------------------------------------------------------------------------------------
PMOVZXDQ (%RSI),%XMM5                   | 1     | 0.50 | 0    | 1  | 0  | 0  | 0.50 | 1       | 2
PMOVZXDQ 0x8(%RSI),%XMM6                | 1     | 0.50 | 0    | 1  | 0  | 0  | 0.50 | 1       | 2
PSHUFB %XMM7,%XMM5                      | 1     | 0.50 | 0    | 0  | 0  | 0  | 0.50 | 1       | 0.50
PSHUFB %XMM7,%XMM6                      | 1     | 0.50 | 0    | 0  | 0  | 0  | 0.50 | 1       | 0.50
PADDQ %XMM5,%XMM0                       | 1     | 0.50 | 0    | 0  | 0  | 0  | 0.50 | 1       | 0.50
PADDQ %XMM0,%XMM1                       | 1     | 0.50 | 0    | 0  | 0  | 0  | 0.50 | 1       | 0.50
PADDQ %XMM1,%XMM2                       | 1     | 0.50 | 0    | 0  | 0  | 0  | 0.50 | 1       | 0.50
PADDQ %XMM2,%XMM3                       | 1     | 0.50 | 0    | 0  | 0  | 0  | 0.50 | 1       | 0.50
PADDQ %XMM6,%XMM0                       | 1     | 0.50 | 0    | 0  | 0  | 0  | 0.50 | 1       | 0.50
PADDQ %XMM0,%XMM1                       | 1     | 0.50 | 0    | 0  | 0  | 0  | 0.50 | 1       | 0.50
PADDQ %XMM1,%XMM2                       | 1     | 0.50 | 0    | 0  | 0  | 0  | 0.50 | 1       | 0.50
PADDQ %XMM2,%XMM3                       | 1     | 0.50 | 0    | 0  | 0  | 0  | 0.50 | 1       | 0.50
ADD $0x10,%RSI                          | 1     | 0.33 | 0.33 | 0  | 0  | 0  | 0.33 | 1       | 0.33
CMP %RSI,%RDX                           | 1     | 0.33 | 0.33 | 0  | 0  | 0  | 0.33 | 1       | 0.33
JA 268 <fletcher_4_sse41_byteswap+0x28> | 1     | 0    | 0    | 0  | 0  | 0  | 1    | 0       | 2

In the end the dispatch is more expensive for the sse4_1 variant.

The PMOVZXDQ instructions are too costly and ultimately slow things down. I get similar results for Westmere_Core_i3i5i7, Xeon_E5_v1 and Core_i7X_Xeon_E5E7_v2. Maqao also has a favorable view of the ssse3 code for Skylake_SP, matching what I observe.

SSE4.1/AVX-specific code doesn't look useful for Fletcher.

slavonnet commented 3 years ago

My code has an issue because I use Intel syntax while ZFS uses AT&T. I also found other %ymm (256-bit) variants for AVX and will test them.
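
(For reference, the two assembler syntaxes reverse the operand order, which makes it easy to port an instruction subtly wrong. A minimal hypothetical illustration in GNU C inline assembly, not code from the PR; build with -mavx:)

int
main(void)
{
	/* AT&T syntax (GNU as default): sources first, destination LAST,
	 * registers prefixed with %.  This computes ymm0 = ymm2 + ymm1. */
	__asm__ volatile ("vpaddq %%ymm1, %%ymm2, %%ymm0" ::: "xmm0");

	/* The same instruction in Intel syntax (destination FIRST):
	 *	vpaddq ymm0, ymm2, ymm1 */
	return (0);
}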

The performance test should also include parallel runs with 2-8 threads to exercise the SIMD buffers.

slavonnet commented 3 years ago

My system, with 2 x Xeon E5-2690 v1 CPUs and mitigations=off:

[root@vm2 ~]# cat /proc/spl/kstat/zfs/fletcher_4_bench
0 0 0x01 -1 0 24258688081 281261279095611
implementation   native         byteswap
scalar           4276507017     3143277029
superscalar      5100232862     3786241790
superscalar4     4556284887     3076965513
sse2             7762220122     4088350313
ssse3            7784347987     7103313199
fastest          ssse3          ssse3

[root@vm2 ~]# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 45
model name      : Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
stepping        : 7
microcode       : 0x71a
cpu MHz         : 3292.417
cache size      : 20480 KB
physical id     : 0
siblings        : 16
core id         : 0
cpu cores       : 8
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags       : vnmi preemption_timer invvpid ept_x_only ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips        : 5785.86
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

[root@vm2 ~]# dmesg | head
[    0.000000] microcode: microcode updated early to revision 0x71a, date = 2020-03-24
[    0.000000] Linux version 5.11.16-1.el8.elrepo.x86_64 (mockbuild@f42943b2f9e24912992d9f9b7db73026) (gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), GNU ld version 2.30-79.el8) #1 SMP Mon Apr 19 19:16:48 EDT 2021
[    0.000000] Command line: BOOT_IMAGE=(hd21,gpt2)/vmlinuz-5.11.16-1.el8.elrepo.x86_64 root=UUID=83065019-cf8c-47b5-90d2-d72ba16df093 ro crashkernel=auto rhgb quiet mitigations=off selinux=0 modprobe.blacklist=mgag200 msr.allow_writes=on pcie_aspm=force skew_tick=1

[root@vm2 ~]# cat /proc/spl/kstat/zfs/vdev_raidz_bench
19 0 0x01 -1 0 25258001729 281395275808357
implementation   gen_p           gen_pq          gen_pqr         rec_p           rec_q           rec_r           rec_pq          rec_pr          rec_qr          rec_pqr
original         369707027       139524961       55726791        612280552       129902033       16157233        46981353        9312673         9586634         6373360
scalar           873846034       206399629       93925673        870044070       287716950       180184898       126633031       97933168        67354075        51850443
sse2             1684885258      582927679       314949353       1695501872      601495478       433250179       354324812       309365266       201036194       136993355
ssse3            1706435127      584593287       312356393       1665893489      706222687       516015063       443601538       365662864       264869614       212480609
fastest          ssse3           ssse3           sse2            sse2            ssse3           ssse3           ssse3           ssse3           ssse3           ssse3

rdolbeau commented 3 years ago

@slavonnet Not sure what benefit you expect from AVX code in the RAID-Z computations over SSSE3; I don't think there are instructions that would help, so the gain would be just fewer movdqa instructions, which are probably squashed during the register-renaming stage of execution anyway.

slavonnet commented 3 years ago

Please look at https://github.com/simd-everywhere/simde. It looks like a good choice for backward compatibility. It would just need the ASM code reformatted into C SIMD functions, with a wrapper function used when the platform does not support a feature. In addition, it has vector and OpenMP defines.
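
(A sketch of the idea, assuming SIMDe's convention of mirroring the Intel intrinsics under a simde_ prefix; illustrative only, not a proposed patch:)

#include <simde/x86/sse2.h>

/* One portable accumulation step: becomes PADDQ on x86, the equivalent
 * NEON/VSX instruction elsewhere, or scalar code as a last resort, so a
 * single C source could stand in for the per-ISA assembly variants. */
static simde__m128i
fletcher_step(simde__m128i acc, simde__m128i data)
{
	return (simde_mm_add_epi64(acc, data));
}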

ryao commented 1 year ago

The critical loop with optimal AVX2 assembly looks like this:

.LBB0_2:                                # =>This Inner Loop Header: Depth=1
        vpmovzxdq       (%rsi), %ymm4           # ymm4 = mem[0],zero,mem[1],zero,mem[2],zero,mem[3],zero
        addq    $16, %rsi
        vpaddq  %ymm4, %ymm0, %ymm0
        vpaddq  %ymm1, %ymm0, %ymm1
        vpaddq  %ymm2, %ymm1, %ymm2
        vpaddq  %ymm3, %ymm2, %ymm3
        cmpq    %rdx, %rsi
        jb      .LBB0_2

Changing the code to use AVX1 turns that into this:

.LBB0_2:                                # =>This Inner Loop Header: Depth=1
        vpmovzxdq       (%rsi), %xmm4           # xmm4 = mem[0],zero,mem[1],zero
        vpmovzxdq       8(%rsi), %xmm5          # xmm5 = mem[0],zero,mem[1],zero
        vextractf128    $1, %ymm0, %xmm6
        vextractf128    $1, %ymm1, %xmm7
        vextractf128    $1, %ymm2, %xmm8
        addq    $16, %rsi
        vpaddq  %xmm5, %xmm6, %xmm5
        vpaddq  %xmm4, %xmm0, %xmm4
        vinsertf128     $1, %xmm5, %ymm4, %ymm0
        vpaddq  %xmm7, %xmm5, %xmm5
        vextractf128    $1, %ymm3, %xmm7
        vpaddq  %xmm1, %xmm4, %xmm4
        vinsertf128     $1, %xmm5, %ymm4, %ymm1
        vpaddq  %xmm2, %xmm4, %xmm4
        vpaddq  %xmm5, %xmm8, %xmm5
        vinsertf128     $1, %xmm5, %ymm4, %ymm2
        vpaddq  %xmm3, %xmm4, %xmm3
        vpaddq  %xmm7, %xmm5, %xmm5
        vinsertf128     $1, %xmm5, %ymm3, %ymm3
        cmpq    %rdx, %rsi
        jb      .LBB0_2

The problem is that Intel omitted support for vpmovzxdq and vpaddq operations on 256-bit registers in AVX1, so while AVX1 has the instructions, you need to do a vextractf128/vinsertf128 dance to use them at full width.
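
The same dance expressed as a hypothetical intrinsics helper (not code from the tree): emulating a 256-bit vpaddq on AVX1 by splitting the operation into xmm halves:

#include <immintrin.h>

/* AVX1 has 256-bit registers but no 256-bit integer add, so a "wide"
 * add must round-trip through the 128-bit halves.  Build with -mavx. */
static __m256i
paddq_avx1(__m256i a, __m256i b)
{
	__m128i lo = _mm_add_epi64(_mm256_castsi256_si128(a),
	    _mm256_castsi256_si128(b));
	__m128i hi = _mm_add_epi64(_mm256_extractf128_si256(a, 1),
	    _mm256_extractf128_si256(b, 1));

	return (_mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1));
}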

An SSE4.1 version is more succinct:

.LBB0_2:                                # =>This Inner Loop Header: Depth=1
        pmovzxdq        8(%rsi), %xmm1                  # xmm1 = mem[0],zero,mem[1],zero
        pmovzxdq        (%rsi), %xmm9                   # xmm9 = mem[0],zero,mem[1],zero
        addq    $16, %rsi
        paddq   %xmm1, %xmm3
        paddq   %xmm9, %xmm2
        paddq   %xmm3, %xmm5
        paddq   %xmm2, %xmm4
        paddq   %xmm4, %xmm6
        paddq   %xmm5, %xmm0
        paddq   %xmm0, %xmm8
        paddq   %xmm6, %xmm7
        cmpq    %rdx, %rsi
        jb      .LBB0_2

For completeness, here is a SSE2 version of the loop:

.LBB0_2:                                # =>This Inner Loop Header: Depth=1
        movdqa  (%rsi), %xmm7
        addq    $16, %rsi
        movdqa  %xmm7, %xmm1
        punpckhdq       %xmm8, %xmm1            # xmm1 = xmm1[2],xmm8[2],xmm1[3],xmm8[3]
        punpckldq       %xmm8, %xmm7            # xmm7 = xmm7[0],xmm8[0],xmm7[1],xmm8[1]
        paddq   %xmm1, %xmm3
        paddq   %xmm7, %xmm2
        paddq   %xmm3, %xmm5
        paddq   %xmm2, %xmm4
        paddq   %xmm4, %xmm6
        paddq   %xmm5, %xmm0
        paddq   %xmm0, %xmm9
        paddq   %xmm6, %xmm10
        cmpq    %rdx, %rsi
        jb      .LBB0_2

Note that I have a local version of the code that uses generic GNU C vector operations, which I am feeding to Clang to produce these different versions. It will likely be in a PR in the near future. Interestingly, the AVX2 version above uses the same instructions as the hand-written Intel one, except that the addq is placed earlier in the loop than where the Intel version placed it.
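
As a rough illustration of that approach (a minimal sketch in the same spirit, not the actual PR code), the whole 4-stream inner loop can be written once with GNU C vector extensions and retargeted purely by changing compiler flags (-mavx2, -mavx, -msse4.1, -msse2):

#include <stdint.h>
#include <stddef.h>

typedef uint64_t v4u64 __attribute__((vector_size(32)));

/* 4-stream fletcher4 inner loop; the u32 -> u64 widening in the braced
 * initializer is what Clang turns into (v)pmovzxdq, or into the
 * punpckldq/punpckhdq pair for plain SSE2. */
static void
fletcher_4_generic(const uint32_t *ip, size_t size,
    v4u64 *a, v4u64 *b, v4u64 *c, v4u64 *d)
{
	const uint32_t *end = ip + size / sizeof (uint32_t);
	v4u64 va = *a, vb = *b, vc = *c, vd = *d;

	for (; ip < end; ip += 4) {
		v4u64 t = { ip[0], ip[1], ip[2], ip[3] };

		va += t;
		vb += va;
		vc += vb;
		vd += vc;
	}

	*a = va; *b = vb; *c = vc; *d = vd;
}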

Anyway, it is fairly simple to answer the question of why Clang used various instructions in different ways by looking them up:

https://www.felixcloutier.com/x86/pmovzx https://www.felixcloutier.com/x86/paddb:paddw:paddd:paddq

That said, since the most 64-bit integer additions we can do in parallel per instruction with AVX1 is 2, I would not expect it to perform much better than the SSEx versions.

ryao commented 1 year ago

> Note that I have a local version of the code that uses generic GNU C vector operations, which I am feeding to Clang to produce these different versions. It will likely be in a PR in the near future.

That PR is now open as #14234. It is trivial to modify the code to generate AVX assembly (remove the inline assembly first), but I doubt it will be an improvement over the current SSSE3 assembly.

ryao commented 1 year ago

I have devised an equation that predicts the maximum fletcher4 performance possible on a given microprocessor, which we can use to decide whether there is any benefit to an AVX rewrite.

max fletcher4 performance = MIN(max single core memory bandwidth, number of parallel 64-bit additions per cycle * clock speed)

We apply a 90% fudge factor to this to account for overhead from the mixing matrix and the register state save/restore needed inside the Linux kernel. So far, we have seen ~98% of the predicted maximum on the Ampere Altra and ~95% (+/- 1%) on Zen 3.

> Uhm, nevermind...
>
> prawn@zfstest (SSH) ~/zfs % cat /proc/spl/kstat/zfs/fletcher_4_bench          2021-04-25 14:55:14
> 0 0 0x01 -1 0 57687143798 950948263021
> implementation   native         byteswap
> scalar           4249791401     3814151349
> superscalar      5619224016     4319551557
> superscalar4     5944738069     2336538552
> sse2             9130490813     5083391760
> ssse3            8759600437     8270570418
> sse4_1           8447685480     6720833233
> fastest          sse2           ssse3
>
> I can re-run this a bunch to make this more scientific if there's interest.

Above, we have numbers for the i7 950, so I will use that as the basis for these calculations.

https://ark.intel.com/content/www/us/en/ark/products/37150/intel-core-i7950-processor-8m-cache-3-06-ghz-4-80-gts-intel-qpi.html https://agner.org/optimize/instruction_tables.pdf

Unfortunately, without AVX2, we cannot perform parallel unsigned integer additions in AVX registers (correct me if I am wrong), so we are limited to SSE4.2 (and all prior SSE versions). Intel says it is Bloomfield, but Wikipedia says that is Nehalem, so I will use the tables for Nehalem. The reciprocal throughput according to Agner Fog is 1.5. This means that we can process 3 parallel 64-bit additions per cycle using the xmm registers.

The boost clock according to Intel is 3.33GHz. This gives us a limit of 10GB/sec per core before considering memory bandwidth. Intel's memory bandwidth numbers are for the entire CPU rather than what individual cores are capable of doing. I cannot find published data for the single-core memory bandwidth, but let us assume that it can do that much. Then we apply our 90% fudge factor to set a litmus test at 9.0GB/sec: we might be able to improve performance if the benchmark shows lower performance than this. The above fletcher benchmark shows 9.1GB/sec. That means that there is no room left here for improvement.

ryao commented 1 year ago

The previous post made a mistake. I used the number for a horizontal add, rather than a regular SIMD add. I also misinterpreted the meaning of the number.

The hardware can perform 4 parallel 64-bit additions per cycle using the xmm registers. That gives us an upper limit of 13.32GB/sec before the memory bandwidth limit or the fudge factor. As for the single core memory bandwidth, I cannot find a number for it, but I can find one for the very similar i7-940, which is 11.1GB/sec:

https://web.archive.org/web/20090612193656/http://techreport.com/articles.x/17023/3

With the fudge factor, that gives us 10GB/sec, and the i7-950 is doing a little below that, so perhaps there is room for improvement. At the moment, the code uses a 2-stream fletcher4 implementation that has been unrolled once. The current NEON code does the same, and SSE2 and NEON are very similar SIMD instruction sets since they both operate on 128-bit vectors. In #14219, we found that switching to a carefully written 4-stream version boosted performance by 50% on the latest ARM NEON hardware. I suspect doing the same here would give a small boost.
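
Spelled out, the corrected arithmetic behind that ceiling (a worked example; the 11.1GB/sec single-core figure is the i7-940 proxy cited above):

#include <stdio.h>

int
main(void)
{
	double adds_per_cycle = 4.0;	/* 2 PADDQ/cycle x 2 64-bit lanes */
	double clock_ghz = 3.33;	/* i7-950 boost clock */
	double mem_bw = 11.1;		/* GB/sec, single core (i7-940 proxy) */
	double fudge = 0.90;

	/* fletcher4 performs 4 additions per 4-byte input word, so the
	 * G-adds/sec rate equals the GB/sec byte rate. */
	double compute = adds_per_cycle * clock_ghz;	/* 13.32 GB/sec */
	double bound = (compute < mem_bw ? compute : mem_bw) * fudge;

	printf("predicted ceiling: %.2f GB/sec\n", bound);	/* ~9.99 */
	return (0);
}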

Note that the assembly I gave above was for the 4-stream version. Unrolling the 2-stream version so that two iterations are done in one loop iteration makes it difficult to distinguish from the 4-stream version at a glance. It was not until my equation showed that we had room for improvement that I scrutinized it more closely and noticed the difference.