Closed bpecsek closed 2 years ago
Hi Bela,
the MOVSD and the ADDSD instructions in your code are independent issues.
The non-VEX moves occur because VMOVSD and VMOVSS were still broken as of SBCL 2.1.11, so I couldn't use them in custom VOPs like sb-simd-avx::f64!-from-p256. The good news is that I have already reported the bug (https://bugs.launchpad.net/sbcl/+bug/1953483) and a fix has already been committed. The only question is whether I should wait to use them until the release of SBCL 2.1.12, or use them right now at the price that you once again have to depend on a particular commit of SBCL. Any thoughts?
The other issue is simple - you are using incf instead of f64-incf, so you get SBCL's non-VEX encoded + function instead of the current instruction set's VEX encoded f64+ function. Admittedly, I hadn't exported those modifier macros until a minute ago (see https://github.com/marcoheisig/sb-simd/commit/d06cdee22151f178e7732b6737c7f9b08b678870).
Best regards, Marco
EDITED! Got it, kind of. The only change needed to generate the assembly listings below was using the f64.4-hsum from vector.lisp instead of the one from define-fake-vops.lisp; the resulting generated code was VEX-only, with no MOVSDs or ADDSDs. Please note that there is no incf used in the advance function, only f64.4-incf.
;; Helper functions
(declaim (ftype (function (f64.4 f64.4) f64) dot)
         (inline dot length-sq length_))
(defun dot (a b)
  (f64.4-hsum (f64.4* a b)))

(declaim (ftype (function (f64.4) f64) length-sq length_))
(defun length-sq (a)
  (dot a a))

(declaim (ftype (function (list u32) null) advance))
(defun advance (system n)
  (loop repeat n do
    (loop for (bi . rest) on system do
      (dolist (bj rest)
        (let* ((pd (f64.4- (pos bi) (pos bj)))
               (dsq (f64.4 (length-sq pd)))
               (dst (f64.4-sqrt dsq))
               (mag (f64.4/ (f64.4* dsq dst)))
               (pd-mag (f64.4* pd mag)))
          (f64.4-decf (vel bi) (f64.4* pd-mag (mass bj)))
          (f64.4-incf (vel bj) (f64.4* pd-mag (mass bi))))))
    (loop for b in system do
      (f64.4-incf (pos b) (vel b)))))
CL-USER> (disassemble 'f64.4-hsum) ; as defined in define-fake-vops.lisp
; disassembly for F64.4-HSUM
; Size: 111 bytes. Origin: #x53AEE862 ; F64.4-HSUM
; 62: C4E37D19C100 VEXTRACTF128 YMM1, YMM0, 0
; 68: C4E37D19C001 VEXTRACTF128 YMM0, YMM0, 1
; 6E: C5F158C8 VADDPD XMM1, XMM1, XMM0
; 72: C5FC28C1 VMOVAPS YMM0, YMM1
; 76: C5E957D2 VXORPD XMM2, XMM2, XMM2
; 7A: F20F10D0 MOVSD XMM2, XMM0 <<-----
; 7E: C4E37905C101 VPERMILPD XMM0, XMM1, 1
; 84: C5F157C9 VXORPD XMM1, XMM1, XMM1
; 88: F20F10C8 MOVSD XMM1, XMM0 <<-----
; 8C: F20F58CA ADDSD XMM1, XMM2 <<-----
; 90: 4D896D28 MOV [R13+40], R13 ; thread.pseudo-atomic-bits
; 94: 498B5548 MOV RDX, [R13+72] ; thread.boxed-tlab
; 98: 4883C210 ADD RDX, 16
; 9C: 493B5550 CMP RDX, [R13+80]
; A0: 7722 JNBE L2
; A2: 49895548 MOV [R13+72], RDX ; thread.boxed-tlab
; A6: 4883C2FF ADD RDX, -1
; AA: L0: 66C742F11D01 MOV WORD PTR [RDX-15], 285
; B0: 4D316D28 XOR [R13+40], R13 ; thread.pseudo-atomic-bits
; B4: 7401 JEQ L1
; B6: F1 ICEBP
; B7: L1: F20F114AF9 MOVSD [RDX-7], XMM1
; BC: 488BE5 MOV RSP, RBP
; BF: F8 CLC
; C0: 5D POP RBP
; C1: C3 RET
; C2: CC10 INT3 16 ; Invalid argument count trap
; C4: L2: 6A10 PUSH 16
; C6: E8A51BF1FE CALL #x52A00470 ; ALLOC-TRAMP
; CB: 5A POP RDX
; CC: 80CA0F OR DL, 15
; CF: EBD9 JMP L0
CL-USER> (disassemble 'f64.4-hsum) ; as defined in vector.lisp
; disassembly for F64.4-HSUM
; Size: 87 bytes. Origin: #x53AFBE0E ; F64.4-HSUM
; 0E: C5F928C2 VMOVAPD XMM0, XMM2
; 12: C4E37D19D101 VEXTRACTF128 YMM1, YMM2, 1
; 18: C5F958C1 VADDPD XMM0, XMM0, XMM1
; 1C: C5F915C8 VUNPCKHPD XMM1, XMM0, XMM0
; 20: C5FB58D1 VADDSD XMM2, XMM0, XMM1
; 24: 4D896D28 MOV [R13+40], R13 ; thread.pseudo-atomic-bits
; 28: 498B5548 MOV RDX, [R13+72] ; thread.boxed-tlab
; 2C: 4883C210 ADD RDX, 16
; 30: 493B5550 CMP RDX, [R13+80]
; 34: 7722 JNBE L2
; 36: 49895548 MOV [R13+72], RDX ; thread.boxed-tlab
; 3A: 4883C2FF ADD RDX, -1
; 3E: L0: 66C742F11D01 MOV WORD PTR [RDX-15], 285
; 44: 4D316D28 XOR [R13+40], R13 ; thread.pseudo-atomic-bits
; 48: 7401 JEQ L1
; 4A: F1 ICEBP
; 4B: L1: F20F1152F9 MOVSD [RDX-7], XMM2
; 50: 488BE5 MOV RSP, RBP
; 53: F8 CLC
; 54: 5D POP RBP
; 55: C3 RET
; 56: CC10 INT3 16 ; Invalid argument count trap
; 58: L2: 6A10 PUSH 16
; 5A: E81146F0FE CALL #x52A00470 ; ALLOC-TRAMP
; 5F: 5A POP RDX
; 60: 80CA0F OR DL, 15
; 63: EBD9 JMP L0
Both the MOVSD and the ADDSD instructions are clearly coming from the f64.4-hsum defined in define-fake-vops.lisp. They appear everywhere f64.4-hsum is used and disappear when the one in vector.lisp is used instead.
I would wait until the next version of SBCL is out with the fixes; in the meantime, though, could you please comment out f64.4-hsum in define-fake-vops.lisp so that the one in vector.lisp is used instead? Could you please also remove the unnecessary (vzeroupper) from the %fxy.z-hsum VOPs?
I might have slightly misunderstood you regarding the "other issue". I did use incf and decf, since I was not aware of the fXY-incf and fXY-decf functions :) I have corrected them in the other functions where they were used, but please note that those functions are executed only once, whereas the advance function is hot and executes f64.4-hsum 10 x n times.
Hi Marco,
I have written two replies on the GitHub issue thread. Most likely you've received them via email as well, but I have edited the first one considerably. I am not sure whether you get notified about message edits, hence this email.
Kindest Regards, Bela
Hi Marco,
Any chance of addressing my request above? :)
Kindest Regards, Bela
I just found (and fixed) the non-VEX ADDSD instruction.
Hi Marco,
It is still not quite right: it generates 10 instructions, 2 of which are still non-VEX MOVSD.
CL-USER> (disassemble 'sb-simd-avx:f64.4-hsum)
; disassembly for F64.4-HSUM
; Size: 111 bytes. Origin: #x53A1F732 ; F64.4-HSUM
; 32: C4E37D19C100 VEXTRACTF128 YMM1, YMM0, 0
; 38: C4E37D19C001 VEXTRACTF128 YMM0, YMM0, 1
; 3E: C5F158C8 VADDPD XMM1, XMM1, XMM0
; 42: C5FC28C1 VMOVAPS YMM0, YMM1
; 46: C5E957D2 VXORPD XMM2, XMM2, XMM2
; 4A: F20F10D0 MOVSD XMM2, XMM0 <-----------
; 4E: C4E37905C101 VPERMILPD XMM0, XMM1, 1
; 54: C5F157C9 VXORPD XMM1, XMM1, XMM1
; 58: F20F10C8 MOVSD XMM1, XMM0 <----------
; 5C: C5EB58C1 VADDSD XMM0, XMM2, XMM1
; 60: 4D896D28 MOV [R13+40], R13 ; thread.pseudo-atomic-bits
; 64: 498B5548 MOV RDX, [R13+72] ; thread.boxed-tlab
; 68: 4883C210 ADD RDX, 16
; 6C: 493B5550 CMP RDX, [R13+80]
; 70: 7722 JNBE L2
; 72: 49895548 MOV [R13+72], RDX ; thread.boxed-tlab
; 76: 4883C2FF ADD RDX, -1
; 7A: L0: 66C742F11D01 MOV WORD PTR [RDX-15], 285
; 80: 4D316D28 XOR [R13+40], R13 ; thread.pseudo-atomic-bits
; 84: 7401 JEQ L1
; 86: F1 ICEBP
; 87: L1: F20F1142F9 MOVSD [RDX-7], XMM0
; 8C: 488BE5 MOV RSP, RBP
; 8F: F8 CLC
; 90: 5D POP RBP
; 91: C3 RET
; 92: CC10 INT3 16 ; Invalid argument count trap
; 94: L2: 6A10 PUSH 16
; 96: E8D50CFEFE CALL #x52A00470 ; ALLOC-TRAMP
; 9B: 5A POP RDX
; 9C: 80CA0F OR DL, 15
; 9F: EBD9 JMP L0
The one in vector.lisp (see the generated assembly a few messages up) is the fastest possible code for Intel as well as AMD CPUs, with only 5 instructions.
Kindest Regards, Bela
Hi Marco,
Happy New Year to you and your loved ones.
Any progress on this?
Should I send a PR that comments out the f64.4-hsum defined in define-fake-vops.lisp and also remove those unwanted (inst vzeroupper)'s in vector.lisp?
I just pushed your version of f64.4-hsum in c9d90e89a455b41cdc0621150f4ec45d03639884.
I'd also like to make use of the vmovsd instruction that was just fixed in SBCL 2.2.0, which means sb-simd won't run on older versions of SBCL afterwards. Can you make sure that the programming language benchmarks use SBCL 2.2.0?
Hi Marco,
Thanks for the patch. Are you planning to use the same approach for the other fx.y-hsum functions as well?
I sent a PR for sbcl-2.2.0 to the benchmark site 6 days ago, but the site maintainer has not accepted it yet. I keep asking him, but he is not replying to me at all. I'll try again.
Kindest Regards, Bela
Looks like sbcl-2.2.0 is now used on the programming language benchmark site. I assume that means I can finally use vmovsd instructions. I'm closing this issue.
As for the other hsum functions, I haven't decided whether I'll have a dedicated vop for them or not. With the vmov instructions, a fake vop should also work.
Hi Marco,
The site does use sbcl-2.2.0, so anything goes.
However, there is a serious quarrel/issue with the site maintainer that I hope can be resolved soon. He has accused SBCL of doing something without any proof, and no matter how much evidence is provided by quite a few people, including the SBCL maintainer, that he is wrong, he does not accept it. His only argument is that SBCL is too fast, so it must be doing only half the heap allocation.
We'll see what's going to happen, but you can make the changes, since at least the existing code can take advantage of them.
Kindest Regards, Bela
Hi Marco,
I've been testing sb-simd with the nbody benchmark code (https://github.com/hanabi1224/Programming-Language-Benchmarks/blob/main/bench/algorithm/nbody/5.cl) and noticed that the f64.4-hsum function defined in define-fake-vops.lisp generates non-VEX instructions.
When I comment out the f64.4-hsum function in define-fake-vops.lisp, and therefore use the one defined in vector.lisp, the code runs about 5-8% faster.
The one defined via the %f64-hsum VOP generates VEX-only instructions and therefore runs quite a bit faster.
If you check the disassembly of the advance function, you can see that when the code from define-fake-vops.lisp is used, it injects quite a few non-VEX instructions (MOVSD, ADDSD, etc.) into the hot loop, while when the one from vector.lisp is used, only VEX instructions are generated.
f64.4-hsum in define-fake-vops.lisp used:
Would it be possible to re-code the f64.4-hsum function in define-fake-vops.lisp to use the same code as defined in vector.lisp, or to just use the one in vector.lisp?
Kindest Regards, Bela