Hi Bela,
I introduced one major change in SBCL regarding the handling of primops that could be the culprit. I will try to look into it this weekend.
Best regards, Marco
Thanks. Please note that sbcl 2.1.10 is about to be released mid next week, if I am not mistaken. It would be bad to end up in a situation where the SIMD code runs slower than the non-SIMD version, which runs in about 2 seconds. Kindest Regards, Bela
Hi Marco,
Any progress finding the root cause of the extreme slowdown?
Best regards, Bela
Hi Marco, I've just done a quick check and the commit 14b60aa20c07481b64756b4a250fe8e6eed307b2, on which my code still runs fast, was done on Oct 14, quite a few days after your commits on Oct 4. So the slowdown might be due to something completely different. Kindest Regards, Bela
Hi Marco, I've just found that the slowdown was introduced by Douglas' commit ad988fc61311cb81f53b538b1ae85ee73431eb0b. Kind Regards, Bela
Hi Bela,
I tried to narrow the problem down this weekend, but I got the same performance for SBCL 2.1.9 and the current HEAD. Then other things kept me too busy, unfortunately (I was on vacation for the last few days). But now I am back at work and have time to clean this up.
Have you already contacted Douglas about this? Or should I do that? And do you have any further information on why there is such a slowdown? Is it because of the code generated for eval-a-times-u, or is it because of how f64.4-vdot is translated?
Best regards, Marco
Hi Marco,
I have contacted only you since I feel it would be better if you discuss this with Douglas. The code generated for eval-A-times-u and spectralnorm is almost identical on the old (fast) and the new (slow) head, with only a 2-byte difference. At first I thought that I had messed something up on my laptop, but I am getting the same slow speed on GitHub Actions too with the current SBCL head on https://github.com/bpecsek/Programming-Language-Benchmarks/blob/main/bench/algorithm/spectral-norm/2.cl ; however, https://github.com/bpecsek/Programming-Language-Benchmarks/blob/main/bench/algorithm/spectral-norm/3.cl is running fast. I don't get what is going on, to be frank. Best Regards, Bela
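(For reference, this kind of comparison can be reproduced with the standard disassemble function and diffing its output on the two builds:)

;; Print the machine code SBCL generated for the SIMD kernel,
;; so the output of the two builds can be compared.
(disassemble #'eval-A-times-u)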
Just a note that I can't build the latest sb-simd on sbcl-2.1.9. I am getting this:

CL-USER> (ql:quickload :sb-simd)
To load "sb-simd":
  Load 1 ASDF system:
    sb-simd
; Loading "sb-simd"
[package sb-simd-internals].......................
[package sb-simd].................................
[package sb-simd-x86-64]..........................
[package sb-simd-sse].............................
[package sb-simd-sse2]............................
[package sb-simd-sse3]............................
[package sb-simd-ssse3]...........................
[package sb-simd-sse4.1]..........................
[package sb-simd-sse4.2]..........................
[package sb-simd-avx].............................
[package sb-simd-avx2]............................
..................................................
..................................................
..................................................
..................................................
..................................................
..................................................
......................

The value NIL is not of type INTEGER
  [Condition of type SB-KERNEL:CASE-FAILURE]

Restarts:
 0: [RETRY] Retry compiling #<CL-SOURCE-FILE "sb-simd" "define-vop-functions">.
 1: [ACCEPT] Continue, treating compiling #<CL-SOURCE-FILE "sb-simd" "define-vop-functions"> as having been successful.
 2: [RETRY] Retry ASDF operation.
 3: [CLEAR-CONFIGURATION-AND-RETRY] Retry ASDF operation after resetting the configuration.
 4: [RETRY] Retry ASDF operation.
 5: [CLEAR-CONFIGURATION-AND-RETRY] Retry ASDF operation after resetting the configuration.
--more--

Backtrace:
 0: (ASH NIL 2)
 1: (SB-X86-64-ASM::EMIT-THREE-BYTE-VEX #<SB-ASSEM:SEGMENT {101B056633}> 0 0 0 15 1 0 NIL NIL)
 2: (SB-X
The problem with EMIT-THREE-BYTE-VEX was fixed already (https://github.com/sbcl/sbcl/commit/df0fbe66d8141312b0f901434f8015c28061fa1d). I know this is problematic, but at the moment, while I fix about one SIMD bug per week, I decided to assume that all sb-simd users use the current HEAD.
I will contact Doug about your slowdown.
I’ve just tried this after you mentioned that you got the same performance using 2.1.9 and the current head. I wanted to try it, but I couldn’t build sb-simd on 2.1.9. I hope Doug can figure it out quickly.
I've just run a GitHub Action on https://programming-language-benchmarks.vercel.app/ with the old and the recent sbcl, and the speed difference is shown below. Even the 1.cl spectralnorm code that does not use sb-simd is way faster.
Hi Marco,
Sorry for bothering you again but have you had a chance to talk to Doug regarding the slowdown?
Kindest Regards, Bela
Hi Bela,
I started a discussion about this problem on sb-devel: https://sourceforge.net/p/sbcl/mailman/sbcl-devel/thread/048b2595b53a49c9af9540e7c0ecbb98%40fau.de/#msg37374543
There is definitely some problematic behavior, but nobody has managed to track it down so far. I think the developers would be grateful for any further input. (I haven't been able to devote much time to this because I am still working on automatic vectorization, which takes 110% of my mental capacity)
Best regards, Marco
Hi Marco,
I’ve just subscribed to the sbcl-devel mailing list, but since I have not yet received any email from that discussion thread I cannot comment yet.
I’ve just read the thread and there are some strange things going on. Stas’ speed is crazy slow; it is 20x slower than on my CPU: about 0.7 seconds for n=11000 on 4 threads on mine versus 14 seconds on his.
I hope I can comment when I get the first email.
Kindest Regards, Bela
Is there another way to comment/reply to the mailing list thread?
Yesterday, using the fast sbcl head, the 2.cl code was the fastest code on https://programming-language-benchmarks.vercel.app/problem/spectral-norm. Earlier today the site maintainer installed sbcl 2.1.10, and now the same code is by far the slowest multi-threaded code. It doesn’t look good. I hope the sbcl maintainers can fix it, though according to Stas it is not a regression and we were just lucky before. I disagree, but who am I to complain.
If I understand the mailing list discussion correctly, one problem is with scalar non-VEX-encoded loads and stores (e.g. MOVSD instead of VMOVSD), which incur an SSE<->AVX transition penalty. Is that correct?
If so, I can easily define all scalar loads and stores in sb-simd. The infrastructure is already there. Once they are defined, you just have to make sure to use sb-simd-avx:f64-aref instead of aref.
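A minimal sketch of what that would look like, assuming the scalar accessors exist as described above and reusing the f64/f64vec type names from the benchmark code:

;; Hypothetical sketch: every scalar double-float load and store goes
;; through sb-simd-avx:f64-aref, so the compiler emits the VEX-encoded
;; vmovsd instead of the legacy SSE movsd and avoids the SSE<->AVX
;; transition penalty.
(defun scale-into (dst src scale)
  (declare (type f64vec dst src)
           (type f64 scale))
  (loop for i below (length src)
        do (setf (sb-simd-avx:f64-aref dst i)
                 (* scale (sb-simd-avx:f64-aref src i)))))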
I have tried the commit Stas sent, which uses vmovsd instead of movsd for arefs, but it actually slowed the code down slightly, so I don’t know what to think anymore. Nevertheless, I feel it might be a good idea to implement scalar arefs in sb-simd regardless of whether the VEX/non-VEX issue is in play. What I can see is that a commit killed the speed, and Stas is stating that it is not a regression but that it was fast by luck. I don’t buy it. I hope there is a quick fix so we get back the amazing speed we used to have.
I am confident that we can get that speed back, soon.
By the way, I just finished my automatic vectorizer, and none of my benchmarks were affected negatively by the most recent release of SBCL. In fact, I just managed to squeeze a whopping 8.5 GFLOP/s out of a numerical benchmark where even the vectorizer of GCC only managed to obtain 9 GFLOP/s. So maybe the slowdown is somehow specifically related to what you do.
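(A rough sketch of how such a throughput number can be computed, though not necessarily how Marco measured it: time the kernel and divide a known flop count by the elapsed seconds.)

;; Rough sketch: run THUNK once and report its throughput in GFLOP/s,
;; given FLOP-COUNT, the number of floating-point operations it performs.
;; Only meaningful for kernels that run well above the timer resolution.
(defun gflop/s (flop-count thunk)
  (let ((start (get-internal-real-time)))
    (funcall thunk)
    (let ((seconds (/ (- (get-internal-real-time) start)
                      internal-time-units-per-second)))
      (/ flop-count seconds 1d9))))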
This is good news. I've seen that you've committed it, but I have yet to figure out how to use it. The throughput number is amazingly high. How did you measure it?
I can only hope too that you can figure out what went wrong and fix it soon. I was using the benchmark site to promote sbcl and now I feel a bit ashamed.
But just a sad reminder of how bad the speed drop is.
Speed with the fast sbcl head yesterday:
Speed with sbcl-2.1.10 now:
I am trying to figure out how to use the auto vectorizer. How would I convert this function?
(declaim (ftype (function (f64vec f64vec u32 u32 u32) null) eval-A-times-u))
(defun eval-A-times-u (src dst begin end length)
  (loop for i from begin below end by 4
        with src-0 of-type f64 = (aref src 0)
        do (let* ((ti  (f64.4+ i (make-f64.4 0 1 2 3)))
                  (eA  (eval-A ti (f64.4 0)))
                  (sum (f64.4/ src-0 eA)))
             (loop for j from 1 below length
                   do (let ((src-j (aref src j))
                            (idx   (f64.4+ eA ti j)))
                        (setf eA idx)
                        (f64.4-incf sum (f64.4/ src-j idx))))
             (setf (f64.4-aref dst i) sum))))
Hi Marco,
Could you please accept Stas’ PR to correct the speed issue?
Kindest Regards, Bela
I just merged the PR from Stas. Does that mean the speed issue is fixed?
I'm not sure whether your code is simple enough for automatic vectorization. So far, the vectorizer only works on the inner loop of a problem. I'll give it some thought.
The speed issue is indeed fixed. After the change the speed was back to what it used to be, and Stas was right again. It is shocking that such a small bug (though a tricky one to track down) can cause such a big speed drop. Stas is a magician at tracking down issues like this.
I am trying to set up the benchmark site to use stable and nightly sbcl builds to catch regressions.
I would appreciate any guidance on auto vectorization.
How about this code from the original spectralnorm without SIMD?
(deftype uint31 () '(unsigned-byte 31))
(deftype d+ () '(double-float 0d0))
(deftype array-d+ () '(simple-array d+))

(defmacro eval-A (i j)
  `(let* ((i+1   (1+ ,i))
          (i+j   (+ ,i ,j))
          (i+j+1 (+ i+1 ,j)))
     (declare (type uint31 i+1 i+j i+j+1))
     (/ (float (+ (ash (* i+j i+j+1) -1) i+1) 0d0))))

(declaim (ftype (function (array-d+ uint31 array-d+ uint31 uint31) null)
                eval-At-times-u eval-A-times-u))

(defun eval-At-times-u (u n Au start end)
  (loop for i from start below end do
    (setf (aref Au i)
          (loop for j below n
                summing (* (aref u j) (eval-A j i)) of-type d+))))
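If I read the eval-A macro correctly, it computes the spectral-norm matrix entry A(i,j) = 1/((i+j)(i+j+1)/2 + i + 1), with the (ash ... -1) doing the division by two. A quick check at the REPL:

(eval-A 0 0) ; => 1.0d0
(eval-A 0 1) ; => 0.5d0
(eval-A 1 0) ; => 0.3333333333333333d0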
OK, I will need to teach the vectorizer about reductions and about integer to float casts before you can use it for this benchmark. Give me a few more days :)
I'm closing this issue now.
Please take your time. No rush whatsoever from me.
Hi Marco,
I am experiencing very slow sb-simd speed using the latest development branch of sbcl that is about to be released as 2.1.10.
Have you noticed it as well, or am I doing something wrong?
I was on sbcl commit 14b60aa20c07481b64756b4a250fe8e6eed307b2, and when I pulled the latest 657c85db42a6f8cb6823131ba7c90cf71f693bc6 the speed of the avx spectralnorm code
went from:
to:
The slowdown is immense. The speed on the latest sbcl is about a quarter of what it was.
Could you please check it?
Kindest Regards,
Bela