marcoheisig / sb-simd

A convenient SIMD interface for SBCL.
MIT License
78 stars 5 forks source link

Slow sb-simd speed on latest sbcl development branch #11

Closed bpecsek closed 2 years ago

bpecsek commented 3 years ago

Hi Marco,

I am experiencing very slow sb-simd speed using the latest development branch of sbcl that is about to be released as 2.1.10.

Have you noticed it as well or I am doing something wrong?

I was on sbcl commit 14b60aa20c07481b64756b4a250fe8e6eed307b2 and when pulled the latest 657c85db42a6f8cb6823131ba7c90cf71f693bc6 the speed of the avx spectralnorm code

went from:

to:

The slowdown is immense. The speed on the latest sbcl is about 1/4th.

Could you please check it?

Kindest Regards,

Bela

marcoheisig commented 3 years ago

Hi Bela,

I introduced one major change in SBCL regarding the handling of primops that could be the culprit. I will try to look into it this weekend.

Best regards, Marco

bpecsek commented 3 years ago

Thanks. Please note that sbcl 2.1.10 is about to be released mid next week if I am not mistaken. It would be bad to end up with a situation that a simd code runs slower than the one without simd that runs in about 2 seconds. Kindest Regards, Bela

bpecsek commented 3 years ago

Hi Marco,

Any progress finding the root cause of the extreme slowdown?

Best regards, Bela

bpecsek commented 3 years ago

Hi Marco, I've just done a quick check and the commit 14b60aa20c07481b64756b4a250fe8e6eed307b2 on which my code is still running fast was done on Oct 14, quite a few day after your commits on Oct 4. So the slowdown might be due to something completely different. Kindest Regards, Bela

bpecsek commented 3 years ago

Hi Marco, I've just found that the slowdown was introduced by Douglas' commit ad988fc61311cb81f53b538b1ae85ee73431eb0b. Kind Regards, Bela

marcoheisig commented 3 years ago

Hi Bela,

I tried to narrow the problem down this weekend, but I go the same performance for SBCL 1.2.9 and the current HEAD. Then other things kept me too busy, unfortunately. (I was on vacation that for the last few days). But now I am back at work and have time to clean this up.

Have you already contacted Douglas about this? Or should I do that? And do you have any further information on why there is such a slowdown? Is it because of code generated for eval-a-times-u or is it because of how f64.4-vdot is translated?

Best regards, Marco

bpecsek commented 3 years ago

Hi Marco,

I have contacted only you since I feel it would be better if you discuss this with Douglas. The code generated for eval-a-times-u and spectralnorm are almost identical on old (fast) and new (slow) head with only 2 bytes difference. First I thought that I messed something up on my laptop but I am getting the same slow speed on GitHub Action too with the current SBCL head on https://github.com/bpecsek/Programming-Language-Benchmarks/blob/main/bench/algorithm/spectral-norm/2.cl , however, https://github.com/bpecsek/Programming-Language-Benchmarks/blob/main/bench/algorithm/spectral-norm/3.cl is running fast. I don't get what is going on to be frank. Best Regards, Bela

bpecsek commented 3 years ago

Just a note that I can't build the latest sb-simd on sbcl-2.1.9. I am getting this: CL-USER> (ql:quickload :sb-simd) To load "sb-simd": Load 1 ASDF system: sb-simd ; Loading "sb-simd" [package sb-simd-internals]....................... [package sb-simd]................................. [package sb-simd-x86-64].......................... [package sb-simd-sse]............................. [package sb-simd-sse2]............................ [package sb-simd-sse3]............................ [package sb-simd-ssse3]........................... [package sb-simd-sse4.1].......................... [package sb-simd-sse4.2].......................... [package sb-simd-avx]............................. [package sb-simd-avx2]............................ .................................................. .................................................. .................................................. .................................................. .................................................. .................................................. ......................

The value NIL is not of type INTEGER [Condition of type SB-KERNEL:CASE-FAILURE]

Restarts: 0: [RETRY] Retry compiling #<CL-SOURCE-FILE "sb-simd" "define-vop-functions">. 1: [ACCEPT] Continue, treating compiling #<CL-SOURCE-FILE "sb-simd" "define-vop-functions"> as having been successful. 2: [RETRY] Retry ASDF operation. 3: [CLEAR-CONFIGURATION-AND-RETRY] Retry ASDF operation after resetting the configuration. 4: [RETRY] Retry ASDF operation. 5: [CLEAR-CONFIGURATION-AND-RETRY] Retry ASDF operation after resetting the configuration. --more--

Backtrace: 0: (ASH NIL 2) 1: (SB-X86-64-ASM::EMIT-THREE-BYTE-VEX #<SB-ASSEM:SEGMENT {101B056633}> 0 0 0 15 1 0 NIL NIL) 2: (SB-X

marcoheisig commented 3 years ago

The problem with EMIT-THREE-BYTE-VEX was fixed already (https://github.com/sbcl/sbcl/commit/df0fbe66d8141312b0f901434f8015c28061fa1d). I know this is problematic, but at the moment, while I fix about one SIMD bug per week, I decided to assume that all sb-simd users use the current HEAD.

I will contact Doug about your slowdown.

bpecsek commented 3 years ago

I’ve just tried this after you mentioned that you’ve got the same performance using 2.1.9 and the current head. I wanted to try it but I couldn’t build sb-simd on 2.1.9. I hope Doug can figure it out quickly.

bpecsek commented 3 years ago

Just a note that I can't build the latest sb-simd on sbcl-2.1.9. I am getting this: CL-USER> (ql:quickload :sb-simd) To load "sb-simd": Load 1 ASDF system: sb-simd ; Loading "sb-simd" [package sb-simd-internals]....................... [package sb-simd]................................. [package sb-simd-x86-64].......................... [package sb-simd-sse]............................. [package sb-simd-sse2]............................ [package sb-simd-sse3]............................ [package sb-simd-ssse3]........................... [package sb-simd-sse4.1].......................... [package sb-simd-sse4.2].......................... [package sb-simd-avx]............................. [package sb-simd-avx2]............................ .................................................. .................................................. .................................................. .................................................. .................................................. .................................................. ......................

The value NIL is not of type INTEGER [Condition of type SB-KERNEL:CASE-FAILURE]

Restarts: 0: [RETRY] Retry compiling #<CL-SOURCE-FILE "sb-simd" "define-vop-functions">. 1: [ACCEPT] Continue, treating compiling #<CL-SOURCE-FILE "sb-simd" "define-vop-functions"> as having been successful. 2: [RETRY] Retry ASDF operation. 3: [CLEAR-CONFIGURATION-AND-RETRY] Retry ASDF operation after resetting the configuration. 4: [RETRY] Retry ASDF operation. 5: [CLEAR-CONFIGURATION-AND-RETRY] Retry ASDF operation after resetting the configuration. --more--

Backtrace: 0: (ASH NIL 2) 1: (SB-X86-64-ASM::EMIT-THREE-BYTE-VEX #<SB-ASSEM:SEGMENT {101B056633}> 0 0 0 15 1 0 NIL NIL) 2: (SB-X

bpecsek commented 3 years ago

I've just run a GitHub Action on https://programming-language-benchmarks.vercel.app/ with old and recent sbcl and the speed difference is shown bellow. Even the 1.cl spectralnorm code that is not using sb-simd is way faster. spectralnorm-fast spectralnorm-slow

bpecsek commented 3 years ago

Hi Marco,

Sorry for bothering you again but have you had a chance to talk to Doug regarding the slowdown?

Kindest Regards, Bela

marcoheisig commented 3 years ago

Hi Bela,

I started a discussion about this problem on sb-devel: https://sourceforge.net/p/sbcl/mailman/sbcl-devel/thread/048b2595b53a49c9af9540e7c0ecbb98%40fau.de/#msg37374543

There is definitely some problematic behavior, but nobody has managed to track it down so far. I think the developers would be grateful for any further input. (I haven't been able to devote much time to this because I am still working on automatic vectorization, which takes 110% of my mental capacity)

Best regards, Marco

bpecsek commented 3 years ago

Hi Marco,

I’ve just subscribed to the sbcl-devel mailing list but since I have not yet received email from that discussion thread I can not comment yet.

I’ve just read the thread and there are some strange things going on. Stas’ speed is creasy slow. It is 20x slower then on my cpu. About 0.7s for n=11000 on 4 threads on mine and 14 seconds on his.

I hope I can comment when I get the first email.

Kindest Regards, Bela

bpecsek commented 3 years ago

Is there other ways to comment/repply to the mailing list thread?

bpecsek commented 2 years ago

Yesterday using the fast sbcl head the 2.cl code was the fastest code on https://programming-language-benchmarks.vercel.app/problem/spectral-norm Earlier today the site maintainer installed sbcl 2.1.10 and now the same code is the slowest multi threading code by far. Doesn’t look good. I hope the sbcl maintainers can fix it. Though according to Stas it is not a regression but we were lucky before. I disagree but who am I to complain.

marcoheisig commented 2 years ago

If I understand the mailing list discussion correctly, one problem is with scalar non-VEX encoded loads and stores (e.g. MOVSD instead of VMOVSD), which incurs a SSE<->AVX transition penalty. Is that correct?

If so, I can easily define all scalar loads and stores in sb-simd. The infrastructure is already there. Once they are defined, you just have to make sure to use sb-simd-avx:f64-aref instead of aref.

bpecsek commented 2 years ago

I have tried Stas’ commit that he sent to use vmovsd for arefs instead of movsd but it actually slowed the code slightly down so I don’t know what to think anymore. Nevertheless, I feel it might be a good idea to implement scalar arefs in sb-simd regardless if VEX non-VEX is in play. What I can see is that a commit killed the speed and Stas is stating that it is not a regression but it was fast by luck. I don’t buy it. I hope there is a quick fix for it to get the amazing speed that we used to have back.

marcoheisig commented 2 years ago

I am confident that we can get that speed back, soon.

By the way, I just finished my automatic vectorizer, and none of my benchmarks were affected negatively by the most recent release of SBCL. In fact, I just managed to squeeze a whopping 8.5 GFLOP/s out of a numerical benchmark where even the vectorizer of GCC only managed to obtain 9 GFLOP/s. So maybe the slowdown is somehow specifically related to what you do.

bpecsek commented 2 years ago

This is good news. I've seen that you've committed but I am still yet to figure out how to use it. The throughput number is amazingly high. How did you measure it?

I can only hope too that you can figure out what went wrong and fix it soon. I was using the benchmark site to promote sbcl and now I feel a bit ashamed.

But just a sad reminder how bad the speed drop is.

Speed with sbcl fast head yesterday: Screenshot from 2021-11-03 Speed with sbcl-2.1.10 now: Screenshot from 2021-11-04

bpecsek commented 2 years ago

I am trying to figure out how to use the auto vectorizer. How would I convert this function?

(declaim (ftype (function (f64vec f64vec u32 u32 u32) null) eval-A-times-u))
(defun eval-A-times-u (src dst begin end length)
  (loop for i from begin below end by 4
        with src-0 of-type f64 = (aref src 0)
        do (let* ((ti    (f64.4+ i (make-f64.4 0 1 2 3)))
                  (eA    (eval-A ti (f64.4 0)))
          (sum   (f64.4/ src-0 eA)))
         (loop for j from 1 below length
           do (let ((src-j (aref src j))
                            (idx (f64.4+ eA ti j)))
            (setf eA idx)
            (f64.4-incf sum (f64.4/ src-j idx))))
         (setf (f64.4-aref dst i) sum))))
bpecsek commented 2 years ago

Hi Marco,

Could you please accept Stas’ PR to correct the speed issue?

Kindest Regards, Bela

marcoheisig commented 2 years ago

I just merged the PR from Stas. Does that mean the speed issue is fixed?

I'm not sure whether your code is simple enough for automatic vectorization. So far, the vectorizer only works on the inner loop of a problem. I'll give it some thought.

bpecsek commented 2 years ago

The speed issue is indeed fixed. After the change speed was back as it used to be and Stas was correct again. It is shocking that such a small bug (but tricky one to track down) can cause such a big speed drop. Stas is a magician to track down issues like this.

I am trying to set up the benchmark site to use stable and nightly sbcl builds to catch regressions.

I would appreciate any guidance on auto vectorization.

How about this code from the original spectralnorm without simd.

(deftype uint31 () '(unsigned-byte 31))
(deftype d+ () '(double-float 0d0))
(deftype array-d+ () '(simple-array d+))

(defmacro eval-A (i j)
  `(let* ((i+1   (1+ ,i))
          (i+j   (+ ,i ,j))
          (i+j+1 (+ i+1 ,j)))
     (declare (type uint31 i+1 i+j i+j+1))
     (/ (float (+ (ash (* i+j i+j+1) -1) i+1) 0d0))))

(declaim (ftype (function (array-d+ uint31 array-d+ uint31 uint31) null)
                eval-At-times-u eval-A-times-u))
(defun eval-At-times-u (u n Au start end)
  (loop for i from start below end do
    (setf (aref Au i)
          (loop for j below n
                summing (* (aref u j) (eval-A j i)) of-type d+))))
marcoheisig commented 2 years ago

OK, I will need to teach the vectorizer about reductions and about integer to float casts before you can use it for this benchmark. Give me a few more days :)

I'm closing this issue now.

bpecsek commented 2 years ago

Please take your time. No rush whatsoever from me.