marcoheisig / sb-simd

A convenient SIMD interface for SBCL.
MIT License

Non-VEX instructions are generated for F64.4-hsum #16

Closed bpecsek closed 2 years ago

bpecsek commented 2 years ago

Hi Marco,

I've been testing sb-simd with the nbody benchmark code (https://github.com/hanabi1224/Programming-Language-Benchmarks/blob/main/bench/algorithm/nbody/5.cl) and noticed that the F64.4-hsum function defined in define-fake-vops.lisp generates non-VEX instructions.

When I comment out the f64.4-hsum function in define-fake-vops.lisp, and therefore use the one defined in vector.lisp, the code runs about 5-8% faster.

The one defined via the %f64-hsum VOP generates VEX-only instructions and therefore runs quite a bit faster.

If you check the disassembly of the advance function, you can see that when the code in define-fake-vops.lisp is used, quite a few non-VEX instructions (MOVSD, ADDSD, etc.) are injected into the hot loop, while when the one in vector.lisp is used, only VEX instructions are generated.
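
For reference, the two listings below were obtained with the standard DISASSEMBLE function on the compiled advance function (the function itself is quoted in full further down in this thread):

(disassemble 'advance) ; prints the machine code of the hot loop, excerpted below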

f64.4-hsum in define-fake-vops.lisp used:

; 7C4:       488B5B01         MOV RBX, [RBX+1]
; 7C8:       C5FD104101       VMOVUPD YMM0, [RCX+1]
; 7CD:       C5FD104A01       VMOVUPD YMM1, [RDX+1]
; 7D2:       C5FD5CD9         VSUBPD YMM3, YMM0, YMM1
; 7D6:       C5E559C3         VMULPD YMM0, YMM3, YMM3
; 7DA:       C4E37D19C100     VEXTRACTF128 YMM1, YMM0, 0
; 7E0:       C4E37D19C001     VEXTRACTF128 YMM0, YMM0, 1
; 7E6:       C5F158C8         VADDPD XMM1, XMM1, XMM0
; 7EA:       C5FC28C1         VMOVAPS YMM0, YMM1
; 7EE:       C5E957D2         VXORPD XMM2, XMM2, XMM2
; 7F2:       F20F10D0         MOVSD XMM2, XMM0               <<------------
; 7F6:       C4E37905C101     VPERMILPD XMM0, XMM1, 1
; 7FC:       C5F157C9         VXORPD XMM1, XMM1, XMM1
; 800:       F20F10C8         MOVSD XMM1, XMM0               <<------------
; 804:       F20F58CA         ADDSD XMM1, XMM2               <<------------
; 808:       C4E27D19C1       VBROADCASTSD YMM0, XMM1
; 80D:       C5FD51C8         VSQRTPD YMM1, YMM0
; 811:       C5FD59C1         VMULPD YMM0, YMM0, YMM1
; 815:       C5FC100DE3FEFFFF VMOVUPS YMM1, [RIP-285]         ; [#x53AF8700]
; 81D:       C5F55EC0         VDIVPD YMM0, YMM1, YMM0
; 821:       C5FD105121       VMOVUPD YMM2, [RCX+33]
; 826:       C5FB104A41       VMOVSD XMM1, [RDX+65]
; 82B:       C4E27D19C9       VBROADCASTSD YMM1, XMM1
; 830:       C5F559C8         VMULPD YMM1, YMM1, YMM0
; 834:       C5E559C9         VMULPD YMM1, YMM3, YMM1
; 838:       C5ED5CC9         VSUBPD YMM1, YMM2, YMM1
; 83C:       C5FD114921       VMOVUPD [RCX+33], YMM1
; 841:       C5FD105221       VMOVUPD YMM2, [RDX+33]
; 846:       C5FB104941       VMOVSD XMM1, [RCX+65]
; 84B:       C4E27D19C9       VBROADCASTSD YMM1, XMM1
; 850:       C5F559C0         VMULPD YMM0, YMM1, YMM0
; 854:       C5E559C0         VMULPD YMM0, YMM3, YMM0
; 858:       C5ED58C0         VADDPD YMM0, YMM2, YMM0
; 85C:       C5FD114221       VMOVUPD [RDX+33], YMM0

f64.4-hsum in vector.lisp used:
; 760: L6:   488B53F9         MOV RDX, [RBX-7]
; 764:       488B5B01         MOV RBX, [RBX+1]
; 768:       C5FD104101       VMOVUPD YMM0, [RCX+1]
; 76D:       C5FD104A01       VMOVUPD YMM1, [RDX+1]
; 772:       C5FD5CD9         VSUBPD YMM3, YMM0, YMM1
; 776:       C5E559D3         VMULPD YMM2, YMM3, YMM3
; 77A:       C5F928C2         VMOVAPD XMM0, XMM2
; 77E:       C4E37D19D101     VEXTRACTF128 YMM1, YMM2, 1
; 784:       C5F958C1         VADDPD XMM0, XMM0, XMM1
; 788:       C5F915C8         VUNPCKHPD XMM1, XMM0, XMM0
; 78C:       C5FB58D1         VADDSD XMM2, XMM0, XMM1
; 790:       C4E27D19C2       VBROADCASTSD YMM0, XMM2
; 795:       C5FD51C8         VSQRTPD YMM1, YMM0
; 799:       C5FD59C1         VMULPD YMM0, YMM0, YMM1
; 79D:       C5FC100DFBFEFFFF VMOVUPS YMM1, [RIP-261]         ;[#x536736A0]
; 7A5:       C5F55EC0         VDIVPD YMM0, YMM1, YMM0
; 7A9:       C5FD105121       VMOVUPD YMM2, [RCX+33]
; 7AE:       C5FB104A41       VMOVSD XMM1, [RDX+65]
; 7B3:       C4E27D19C9       VBROADCASTSD YMM1, XMM1
; 7B8:       C5F559C8         VMULPD YMM1, YMM1, YMM0
; 7BC:       C5E559C9         VMULPD YMM1, YMM3, YMM1
; 7C0:       C5ED5CC9         VSUBPD YMM1, YMM2, YMM1
; 7C4:       C5FD114921       VMOVUPD [RCX+33], YMM1
; 7C9:       C5FD105221       VMOVUPD YMM2, [RDX+33]
; 7CE:       C5FB104941       VMOVSD XMM1, [RCX+65]
; 7D3:       C4E27D19C9       VBROADCASTSD YMM1, XMM1
; 7D8:       C5F559C0         VMULPD YMM0, YMM1, YMM0
; 7DC:       C5E559C0         VMULPD YMM0, YMM3, YMM0
; 7E0:       C5ED58C0         VADDPD YMM0, YMM2, YMM0
; 7E4:       C5FD114221       VMOVUPD [RDX+33], YMM0

Would it be possible to re-code the f64.4-hsum function in define-fake-vops.lisp to use the same code as defined in vector.lisp, or to simply use the one in vector.lisp?

Kindest Regards, Bela

marcoheisig commented 2 years ago

Hi Bela,

the MOVSD and the ADDSD instructions in your code are independent issues.

The non-VEX moves occur because VMOVSD and VMOVSS were still broken as of SBCL 2.1.11, so I couldn't use them in custom VOPs like sb-simd-avx::f64!-from-p256. The good news is that I already reported the bug (https://bugs.launchpad.net/sbcl/+bug/1953483) and a fix has already been committed. The only question is whether I should wait to use them until the release of SBCL 2.1.12, or whether I should use them right now at the price that you once again have to depend on a particular commit of SBCL. Any thoughts?

The other issue is simple - you are using incf instead of f64-incf, so you get SBCL's non-VEX-encoded + function instead of the current instruction set's VEX-encoded f64+ function. Admittedly, I hadn't exported those modifier macros until a minute ago (see https://github.com/marcoheisig/sb-simd/commit/d06cdee22151f178e7732b6737c7f9b08b678870).
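
For illustration, a minimal before/after sketch (the function names are hypothetical; it assumes the f64 type and the f64-incf modifier macro are exported from sb-simd-avx as described above):

(defun accumulate-slow (x y)
  (declare (type sb-simd-avx:f64 x y))
  (incf x y))                  ; cl:incf expands into cl:+, compiled to a non-VEX ADDSD

(defun accumulate-fast (x y)
  (declare (type sb-simd-avx:f64 x y))
  (sb-simd-avx:f64-incf x y))  ; expands into f64+, compiled to the VEX-encoded VADDSD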

Best regards, Marco

bpecsek commented 2 years ago

EDITED! Got it, kind of. The only difference between the two disassemblies above was whether the f64.4-hsum from vector.lisp or the one from define-fake-vops.lisp was used; with the vector.lisp version the generated code was VEX-only, with no MOVSDs or ADDSDs. Please note that there is no incf used in the advance function, only f64.4-incf.

;; Helper functions
(declaim (ftype (function (f64.4 f64.4) f64) dot)
         (inline dot length-sq length_))
(defun dot (a b)
  (f64.4-hsum (f64.4* a b)))

(declaim (ftype (function (f64.4) f64) length-sq  length_))
(defun length-sq (a)
  (dot a a))

(declaim (ftype (function (list u32) null) advance))
(defun advance (system n)
  (loop repeat n do
    (loop for (bi . rest) on system do
      (dolist (bj rest)
        (let* ((pd  (f64.4- (pos bi) (pos bj)))
               (dsq (f64.4  (length-sq pd)))
               (dst (f64.4-sqrt dsq))
               (mag (f64.4/ (f64.4* dsq dst)))
               (pd-mag (f64.4* pd mag)))
          (f64.4-decf (vel bi) (f64.4* pd-mag (mass bj)))
          (f64.4-incf (vel bj) (f64.4* pd-mag (mass bi))))))
    (loop for b in system do
      (f64.4-incf (pos b) (vel b)))))
CL-USER> (disassemble 'f64.4-hsum) ; as defined in define-fake-vops.lisp
; disassembly for F64.4-HSUM
; Size: 111 bytes. Origin: #x53AEE862                         ; F64.4-HSUM
; 62:       C4E37D19C100     VEXTRACTF128 YMM1, YMM0, 0
; 68:       C4E37D19C001     VEXTRACTF128 YMM0, YMM0, 1
; 6E:       C5F158C8         VADDPD XMM1, XMM1, XMM0
; 72:       C5FC28C1         VMOVAPS YMM0, YMM1
; 76:       C5E957D2         VXORPD XMM2, XMM2, XMM2
; 7A:       F20F10D0         MOVSD XMM2, XMM0     <<-----
; 7E:       C4E37905C101     VPERMILPD XMM0, XMM1, 1
; 84:       C5F157C9         VXORPD XMM1, XMM1, XMM1
; 88:       F20F10C8         MOVSD XMM1, XMM0     <<-----
; 8C:       F20F58CA         ADDSD XMM1, XMM2     <<-----
; 90:       4D896D28         MOV [R13+40], R13                ; thread.pseudo-atomic-bits
; 94:       498B5548         MOV RDX, [R13+72]                ; thread.boxed-tlab
; 98:       4883C210         ADD RDX, 16
; 9C:       493B5550         CMP RDX, [R13+80]
; A0:       7722             JNBE L2
; A2:       49895548         MOV [R13+72], RDX                ; thread.boxed-tlab
; A6:       4883C2FF         ADD RDX, -1
; AA: L0:   66C742F11D01     MOV WORD PTR [RDX-15], 285
; B0:       4D316D28         XOR [R13+40], R13                ; thread.pseudo-atomic-bits
; B4:       7401             JEQ L1
; B6:       F1               ICEBP
; B7: L1:   F20F114AF9       MOVSD [RDX-7], XMM1
; BC:       488BE5           MOV RSP, RBP
; BF:       F8               CLC
; C0:       5D               POP RBP
; C1:       C3               RET
; C2:       CC10             INT3 16                          ; Invalid argument count trap
; C4: L2:   6A10             PUSH 16
; C6:       E8A51BF1FE       CALL #x52A00470                  ; ALLOC-TRAMP
; CB:       5A               POP RDX
; CC:       80CA0F           OR DL, 15
; CF:       EBD9             JMP L0

CL-USER> (disassemble 'f64.4-hsum) ; as defined in vector.lisp
; disassembly for F64.4-HSUM
; Size: 87 bytes. Origin: #x53AFBE0E                          ; F64.4-HSUM
; 0E:       C5F928C2         VMOVAPD XMM0, XMM2
; 12:       C4E37D19D101     VEXTRACTF128 YMM1, YMM2, 1
; 18:       C5F958C1         VADDPD XMM0, XMM0, XMM1
; 1C:       C5F915C8         VUNPCKHPD XMM1, XMM0, XMM0
; 20:       C5FB58D1         VADDSD XMM2, XMM0, XMM1
; 24:       4D896D28         MOV [R13+40], R13                ; thread.pseudo-atomic-bits
; 28:       498B5548         MOV RDX, [R13+72]                ; thread.boxed-tlab
; 2C:       4883C210         ADD RDX, 16
; 30:       493B5550         CMP RDX, [R13+80]
; 34:       7722             JNBE L2
; 36:       49895548         MOV [R13+72], RDX                ; thread.boxed-tlab
; 3A:       4883C2FF         ADD RDX, -1
; 3E: L0:   66C742F11D01     MOV WORD PTR [RDX-15], 285
; 44:       4D316D28         XOR [R13+40], R13                ; thread.pseudo-atomic-bits
; 48:       7401             JEQ L1
; 4A:       F1               ICEBP
; 4B: L1:   F20F1152F9       MOVSD [RDX-7], XMM2
; 50:       488BE5           MOV RSP, RBP
; 53:       F8               CLC
; 54:       5D               POP RBP
; 55:       C3               RET
; 56:       CC10             INT3 16                          ; Invalid argument count trap
; 58: L2:   6A10             PUSH 16
; 5A:       E81146F0FE       CALL #x52A00470                  ; ALLOC-TRAMP
; 5F:       5A               POP RDX
; 60:       80CA0F           OR DL, 15
; 63:       EBD9             JMP L0

Both the MOVSD and the ADDSD instructions are clearly coming from the f64.4-hsum defined in define-fake-vops.lisp. They appear everywhere f64.4-hsum is used and disappear when the one in vector.lisp is used.

I would, however, wait until the next version of SBCL is out with the fixes. In the meantime, could you please comment out f64.4-hsum in define-fake-vops.lisp so that the one in vector.lisp is used instead? Could you please also remove the unnecessary (vzeroupper) from the %fxy.z-hsum VOPs?

bpecsek commented 2 years ago

I might have slightly misunderstood you regarding the "other issue". I did use incf and decf, since I was not aware of the fXY-incf and fXY-decf macros :) I have corrected them in the other functions where they are used. However, please note that those functions are executed only once, whereas the advance function is hot and executes f64.4-hsum 10 x n times.

bpecsek commented 2 years ago

Hi Marco,

I have written two replies on the GitHub issue thread. Most likely you've received them via email as well, but I have edited the first one considerably. However, I am not sure if you get notified about message edits, hence this email.

Kindest Regards, Bela


bpecsek commented 2 years ago

Hi Marco,

Any chance of acting on my request above? :)

Kindest Regards, Bela

marcoheisig commented 2 years ago

I just found (and fixed) the non-VEX ADDSD instruction.

bpecsek commented 2 years ago

Hi Marco,

It is still not quite right. It still generates 10 instructions, 2 of which are still non-VEX MOVSDs.

CL-USER> (disassemble 'sb-simd-avx:f64.4-hsum)
; disassembly for F64.4-HSUM
; Size: 111 bytes. Origin: #x53A1F732                         ; F64.4-HSUM
; 32:       C4E37D19C100     VEXTRACTF128 YMM1, YMM0, 0
; 38:       C4E37D19C001     VEXTRACTF128 YMM0, YMM0, 1
; 3E:       C5F158C8         VADDPD XMM1, XMM1, XMM0
; 42:       C5FC28C1         VMOVAPS YMM0, YMM1
; 46:       C5E957D2         VXORPD XMM2, XMM2, XMM2
; 4A:       F20F10D0         MOVSD XMM2, XMM0          <-----------
; 4E:       C4E37905C101     VPERMILPD XMM0, XMM1, 1
; 54:       C5F157C9         VXORPD XMM1, XMM1, XMM1
; 58:       F20F10C8         MOVSD XMM1, XMM0          <----------
; 5C:       C5EB58C1         VADDSD XMM0, XMM2, XMM1
; 60:       4D896D28         MOV [R13+40], R13                ; thread.pseudo-atomic-bits
; 64:       498B5548         MOV RDX, [R13+72]                ; thread.boxed-tlab
; 68:       4883C210         ADD RDX, 16
; 6C:       493B5550         CMP RDX, [R13+80]
; 70:       7722             JNBE L2
; 72:       49895548         MOV [R13+72], RDX                ; thread.boxed-tlab
; 76:       4883C2FF         ADD RDX, -1
; 7A: L0:   66C742F11D01     MOV WORD PTR [RDX-15], 285
; 80:       4D316D28         XOR [R13+40], R13                ; thread.pseudo-atomic-bits
; 84:       7401             JEQ L1
; 86:       F1               ICEBP
; 87: L1:   F20F1142F9       MOVSD [RDX-7], XMM0
; 8C:       488BE5           MOV RSP, RBP
; 8F:       F8               CLC
; 90:       5D               POP RBP
; 91:       C3               RET
; 92:       CC10             INT3 16                          ; Invalid argument count trap
; 94: L2:   6A10             PUSH 16
; 96:       E8D50CFEFE       CALL #x52A00470                  ; ALLOC-TRAMP
; 9B:       5A               POP RDX
; 9C:       80CA0F           OR DL, 15
; 9F:       EBD9             JMP L0

The one in vector.lisp (see the generated assembly a few messages up) is the fastest possible code for Intel as well as AMD CPUs, with only 5 instructions.

Kindest Regards, Bela

bpecsek commented 2 years ago

Hi Marco,

Happy New Year to you and your loved ones.

Any progress on this?

Should I send a PR that comments out the f64.4-hsum defined in define-fake-vops.lisp and also removes those unwanted (inst vzeroupper)'s in vector.lisp?

marcoheisig commented 2 years ago

I just pushed your version of f64.4-hsum in c9d90e89a455b41cdc0621150f4ec45d03639884.

I'd also like to make use of the vmovsd instruction that was just fixed in SBCL 2.2.0, which means sb-simd won't run on older versions of SBCL afterwards. Can you make sure that the programming language benchmarks use SBCL 2.2.0?
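
(A quick way to check which version a given SBCL image is running, e.g. in the benchmark CI logs, is the standard version query:)

(lisp-implementation-version) ; should report "2.2.0" or later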

bpecsek commented 2 years ago

Hi Marco,

Thanks for the patch. Are you planning to use the same approach for the other fx.y-hsum functions as well?

I sent a PR to the benchmark site for sbcl-2.2.0 six days ago, but the site maintainer has not accepted it yet.

I keep asking him but he is not replying to me at all.

I’ll try it again.

Kindest Regards, Bela


marcoheisig commented 2 years ago

Looks like sbcl-2.2.0 is now used on the programming language benchmark site. I assume that means I can finally use vmovsd instructions. I'm closing this issue.

As for the other hsum functions, I haven't decided whether I'll have a dedicated vop for them or not. With the vmov instructions, a fake vop should also work.
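
For instance, a fake-vop style definition could in principle be written purely in terms of other sb-simd operations, roughly along these lines (a hedged sketch, not the actual sb-simd source; it assumes the f64.4-values unpacker and a variadic f64+ are available in sb-simd-avx):

(defun f64.4-hsum-sketch (x)
  (declare (type sb-simd-avx:f64.4 x))
  ;; Unpack the four lanes and reduce them with scalar additions.
  (multiple-value-bind (a b c d) (sb-simd-avx:f64.4-values x)
    (sb-simd-avx:f64+ a b c d)))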

bpecsek commented 2 years ago

Hi Marco,

The site does use sbcl-2.2.0, so you can go ahead with anything.

However, there is a serious quarrel/issue with the site maintainer that I hope can be resolved soon.

He has accused SBCL of doing something improper without any proof, and no matter how much proof is provided to him by quite a few people, including the SBCL maintainer, that he is wrong, he is not accepting it. His only argument is that SBCL is too fast, so it must be doing only half of the heap allocations.

We'll see what's going to happen, but you can make the changes, since at least the existing code can take advantage of them.

Kindest Regards, Bela
