Closed digikar99 closed 2 years ago
Hi @digikar99,
it is great to hear from you! Thanks for stating your requirements, it is good to have those cases in mind during development.
I agree, we'll need a plethora of non-hardware SIMD instructions eventually. But first I will try to finish the core library. There are still some known bugs, and there is a total lack of systematic testing.
Best regards, Marco
Oh yes, no hurry! If I do get the time, I could look into this at some point.
Getting at least the hardware instructions up and working with proper testing is definitely more important.
This (sb-simd) should also be useful later in numcl, since it has its einsum backend abstracted out.
Okay... Had a sudden inspiration to try this out; details might need to be worked out for the general case, but this works:
(cffi:load-foreign-library #P"sleef/build/lib/libsleef.so")
(in-package :sb-vm)
(defknown (my-sin)
    ((simd-pack-256 single-float))
    (simd-pack-256 single-float)
    (movable flushable always-translatable)
  :overwrite-fndb-silently t)
(define-vop (my-sin)
  (:translate my-sin)
  (:policy :fast-safe)
  (:args (a :scs (single-avx2-reg)))
  (:arg-types simd-pack-256-single)
  (:results (dest :scs (single-avx2-reg)))
  (:result-types simd-pack-256-single)
  (:generator 1
    (inst call (make-fixup "Sleef_sinf8_u10" :foreign))))
(defun my-sin-user (a)
  (declare (type (simd-pack-256 single-float) a)
           (optimize speed))
  (my-sin a))
And...
SB-VM> (mapcar #'sin '(1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0))
(0.84147096 0.9092974 0.14112 -0.7568025 -0.9589243 -0.2794155 0.6569866
0.98935825)
SB-VM> (%simd-pack-256-singles
(my-sin-user (%make-simd-pack-256-single 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0)))
0.84147096
0.9092974
0.14112
-0.7568025
-0.9589243
-0.2794155
0.6569866
0.98935825
So, yayayay!
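The comparison above was done by eye. A minimal sketch of an automated check, in portable Common Lisp, might look like the following. The names `max-relative-error` and `agrees-with-scalar-p` are hypothetical helpers, not part of sb-simd or Sleef, and the default tolerance of 1e-6 is an assumption, chosen to be loose relative to the 1.0 ULP bound that the `u10` suffix in Sleef names denotes.

```lisp
;; Hypothetical helpers for comparing a vectorized result against
;; the scalar CL function; not part of sb-simd or Sleef.
(defun max-relative-error (reference actual)
  "Return the largest relative error between two equal-length lists of reals.
Falls back to the absolute error where the reference value is zero."
  (assert (= (length reference) (length actual)))
  (reduce #'max
          (mapcar (lambda (r a)
                    (if (zerop r)
                        (abs a)
                        (abs (/ (- a r) r))))
                  reference actual)
          :initial-value 0))

(defun agrees-with-scalar-p (fn inputs actual &key (tolerance 1e-6))
  "True if ACTUAL matches (mapcar FN INPUTS) within TOLERANCE relative error."
  (<= (max-relative-error (mapcar fn inputs) actual) tolerance))
```

To use it, collect the eight values returned by `%simd-pack-256-singles` into a list (for example with `multiple-value-list`) and pass that list as `actual`, with `#'sin` and the eight inputs as the first two arguments.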
This is great indeed! It looks like an easy integration. What is the speed penalty for calling a foreign function?
Edit: I've just run a few tests and it seems to be quite fast.
Have you tried the same approach with the functions included in GNU libc? It wouldn't need an additional library to be installed.
It turns out that one of the crucial things here is backing up the relevant registers before CALL-ing. I also recently found some time to have a discussion on SBCL-Help[3], where Stas suggested using (:save-p t) in DEFINE-VOP, which seems to back up everything; while costly, it seems to work.
So, the actual penalty depends on how much you back up. If you back up only the right registers[1], there is no overhead compared to the CFFI approach that I used for the (WIP) rewrite of numericals. But if you back up using (:save-p t), the time taken[2] is about 0.07 seconds with CFFI vs 0.09 seconds with (:save-p t), for 100 runs over a random single-float array of 1,000,000 elements, using the same glibc _ZGVdN8v_sinf. For Sleef_tanf8_u10avx2, the difference was about 0.27 vs 0.30 seconds.
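For anyone wanting to reproduce this kind of measurement, here is a rough sketch of a timing harness in portable Common Lisp. It is not the code used for the numbers above; `make-random-single-float-array` and `time-runs` are hypothetical names, and the kernel under test is passed in as a thunk so that a CFFI-based call and a VOP-based call can be timed in the same way.

```lisp
;; Hypothetical timing helpers; not the benchmark code used in this thread.
(defun make-random-single-float-array (size)
  "Return a single-float vector of SIZE elements drawn uniformly from [0, 1)."
  (let ((array (make-array size :element-type 'single-float)))
    (dotimes (i size array)
      (setf (aref array i) (random 1.0)))))

(defun time-runs (thunk &key (runs 100))
  "Call THUNK RUNS times and return the elapsed wall-clock time in seconds."
  (let ((start (get-internal-real-time)))
    (loop repeat runs do (funcall thunk))
    (/ (- (get-internal-real-time) start)
       (float internal-time-units-per-second 1d0))))

;; Example: a scalar baseline over 1,000,000 elements, 100 runs.
;; (let* ((in  (make-random-single-float-array 1000000))
;;        (out (make-array (length in) :element-type 'single-float)))
;;   (time-runs (lambda () (map-into out #'sin in)) :runs 100))
```

Writing into a preallocated output array keeps allocation out of the measured time, so the difference between two kernels reflects the call overhead rather than garbage collection.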
Since my goal with this is reviving with-elementwise-operations, I'm hoping that avoiding the allocation of (and access to!) the intermediate temporary arrays will more than offset the cost of (:save-p t); though it will be a while before I implement this and then actually test and benchmark it across various functions.
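To illustrate what the temporary arrays cost, here is a small sketch in portable Common Lisp that computes sin(x) + cos(x) elementwise in two ways. Both functions are hypothetical and scalar; they are not the numericals API or the with-elementwise-operations macro, and only show the allocation pattern that a fused operation would avoid.

```lisp
;; Hypothetical illustration; not the numericals API.
(defun sin-plus-cos-unfused (x)
  "Compute sin(x) + cos(x) elementwise, allocating two temporary arrays."
  (let ((s (map '(simple-array single-float (*)) #'sin x))   ; temporary 1
        (c (map '(simple-array single-float (*)) #'cos x)))  ; temporary 2
    ;; A third pass over memory combines the temporaries.
    (map '(simple-array single-float (*)) #'+ s c)))

(defun sin-plus-cos-fused (x)
  "Compute the same result in a single pass with no intermediate arrays."
  (let ((out (make-array (length x) :element-type 'single-float)))
    (dotimes (i (length x) out)
      (let ((xi (aref x i)))
        (setf (aref out i) (+ (sin xi) (cos xi)))))))
```

The unfused version walks three arrays of the input's size, while the fused version reads each element once. With SIMD operands, the fused loop body is where per-call overhead such as (:save-p t) would be paid.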
> Have you tried this same approach with the functions included in GNU libc? It wouldn't need additional library to be installed.
Yes, but glibc provides only a few of the functions provided by sleef. For the sin and cos that glibc does provide, though, the glibc versions happen to be 2-3 times as fast as the sleef versions on my i7-8750H.
Another thing to worry about might be the handling of two-argument functions; I haven't done any checks with them yet.
Notes
[1] If I go by the x86 calling conventions, wouldn't it be enough if I back up only EAX, EDX, and ECX into some other registers (not even the stack)? A brief test suggested this works; but I'd be sure only once I get all the other functions to work without errors due to unintended register overwrites. (Basically, check for the presence or absence of MOV instructions before the relevant CALL instructions in the disassembly, if someone wants to look into this.)
[2] I currently run into "The value XMM0 is not of type SB-C:TN" while trying to load sb-simd with SBCL 2.1.8-WIP. I'm hoping to post the exact code once I rebuild SBCL at version 2.1.9 with Marco's fixes. But basically, one aspect is correctness, and another is performance. EDIT: Obtaining the RAX register can be done using (:temporary (:sc unsigned-reg :offset rax-offset :to :result) rax); similarly for RDX and RCX.
[3] I'm going to need to sort out my email client for cleaner emails.
Got it. Thanks
> If I go by the x86 calling conventions, wouldn't it be enough if I backup only EAX, EDX, and ECX into some other registers (not even the stack)?
Probably not. That is a 32-bit calling convention; scroll down a bit for the 64-bit calling conventions, which have more caller-saved registers. Probably you just got lucky and sleef doesn't use a lot of GPRs (wouldn't be surprising). If you really want to minimise the cost, you could look at their generated code and see what registers they actually use. I would rather have a pure-cl math library.
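To make the difference concrete, here is a small sketch that lists the caller-saved general-purpose registers and reports which ones a given backup set misses. It assumes the System V AMD64 ABI (Linux and most Unix-likes); the Windows x64 convention differs. The names `*sysv-amd64-caller-saved-gprs*` and `unsaved-caller-saved-gprs` are hypothetical. All vector registers are also caller-saved under this ABI, so any live SIMD values need saving as well.

```lisp
;; Assumes the System V AMD64 ABI; names are hypothetical.
(defparameter *sysv-amd64-caller-saved-gprs*
  '(:rax :rcx :rdx :rsi :rdi :r8 :r9 :r10 :r11)
  "General-purpose registers a callee may clobber under the System V AMD64 ABI.")

(defun unsaved-caller-saved-gprs (saved)
  "Return the caller-saved GPRs missing from SAVED, a list of register keywords.
The order of the returned list is unspecified."
  (set-difference *sysv-amd64-caller-saved-gprs* saved))
```

Calling `(unsaved-caller-saved-gprs '(:rax :rdx :rcx))` leaves RSI, RDI, and R8 through R11 unprotected, which is why backing up only the three 32-bit-convention registers is not enough in general.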
I'm closing this issue because sb-simd is now part of SBCL, and this repository is being archived.
Great work @bpecsek and @marcoheisig these recent weeks, thanks a lot!
I hope to incorporate sb-simd in numericals later once I find the time. I'm not sure when and how I moved away from the sb-simd project last year; but I eventually ended up getting tangled in a couple of other projects that were needed for numericals, and am currently relying on CFFI for better portability across architectures and CL implementations, plus non-hardware SIMD instructions via sleef. However, this approach currently renders things unusable for something like with-elementwise-operations, which is supposed to avoid repeatedly evicting the cache while using multiple operations, and for this, I realize I do need sb-simd (or cl-simd).
However, in the current state, even if I do use sb-simd, I lose out on the non-hardware instructions (see sleef above); so I was wondering if any of you or any passers-by have ideas for calling foreign functions like Sleef_sinf8_u10avx2 with SIMD operands, maybe inside a VOP itself. (This might even be automated using cl-autowrap, or maybe it too requires updates for working with SIMD operands once the method for doing it is figured out.) I suspect I don't have the time to look into this, but I'm just dropping this here in case it piques someone's interest.