SIMD Foreign Function (VOPs?)

digikar99 commented 3 years ago

Great work @bpecsek and @marcoheisig these recent weeks, thanks a lot!

I hope to incorporate sb-simd into numericals later, once I find the time. I'm not sure when or how I moved away from the sb-simd project last year, but I ended up tangled in a couple of other projects that numericals needed. I currently rely on CFFI, both for better portability across architectures and CL implementations and for access to non-hardware SIMD instructions via sleef. However, this approach makes something like with-elementwise-operations unusable: it is supposed to avoid repeatedly evicting the cache when chaining multiple operations, and for that I realize I do need sb-simd (or cl-simd).

However, even if I do use sb-simd in its current state, I lose the non-hardware instructions (see sleef above). So I was wondering whether any of you, or any passers-by, have ideas for calling foreign functions like Sleef_sinf8_u10avx2 with SIMD operands, maybe inside a VOP itself. (This might even be automated using cl-autowrap, or maybe it too would need updates to work with SIMD operands once the method is figured out.) I suspect I won't have the time to look into this, but I'm dropping it here in case it piques someone's interest.

marcoheisig commented 3 years ago

Hi @digikar99,

it is great to hear from you! Thanks for stating your requirements, it is good to have those cases in mind during development.

I agree, we'll need a plethora of non-hardware SIMD instructions eventually. But first I will try to finish the core library. There are still some known bugs, and there is a total lack of systematic testing.

Best regards, Marco

digikar99 commented 3 years ago

Oh yes, no hurry! If I do get the time, I could look into this myself.

Getting at least the hardware instructions working with proper testing is definitely more important.

This (sb-simd) should also be useful later in numcl, since it has its einsum backend abstracted out.

digikar99 commented 3 years ago

Okay... Had a sudden inspiration to try this out; details might need to be worked out for the general case, but this works:

(cffi:load-foreign-library #P"sleef/build/lib/libsleef.so")

(in-package :sb-vm)

(defknown (my-sin)
    ((simd-pack-256 single-float))
    (simd-pack-256 single-float)
    (movable flushable always-translatable)
  :overwrite-fndb-silently t)

(define-vop (my-sin)
  (:translate my-sin)
  (:policy :fast-safe)
  (:args (a :scs (single-avx2-reg)))
  (:arg-types simd-pack-256-single)
  (:results (dest :scs (single-avx2-reg)))
  (:result-types simd-pack-256-single)
  (:generator 1
    ;; Bare CALL: nothing here moves A into YMM0 or saves live registers.
    (inst call (make-fixup "Sleef_sinf8_u10" :foreign))))

(defun my-sin-user (a)
  (declare (type (simd-pack-256 single-float) a)
           (optimize speed))
  (my-sin a))

And...

SB-VM> (mapcar #'sin '(1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0))
(0.84147096 0.9092974 0.14112 -0.7568025 -0.9589243 -0.2794155 0.6569866
 0.98935825)
SB-VM> (%simd-pack-256-singles 
        (my-sin-user (%make-simd-pack-256-single 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0)))
0.84147096
0.9092974
0.14112
-0.7568025
-0.9589243
-0.2794155
0.6569866
0.98935825

So, yayayay!

bpecsek commented 3 years ago

This is great indeed! It looks like an easy integration. What is the speed penalty of calling a foreign function?

Edit: I've just run a few tests and it seems to be quite fast.

Have you tried this same approach with the functions included in GNU libc? It wouldn't need an additional library to be installed.

digikar99 commented 3 years ago

It turns out that one crucial part of this is backing up the relevant registers before the CALL. I also recently found some time for a discussion on SBCL-Help[3], where Stas suggested using (:save-p t) in DEFINE-VOP. It seems to back up everything, and while costly, it seems to work.

So the actual penalty depends on how much you back up. If you back up only the right registers[1], there is no overhead compared to the CFFI approach with which I rewrote numericals (WIP). If you instead back up everything using (:save-p t), the time taken[2] is about 0.07 seconds with CFFI versus 0.09 seconds with (:save-p t), for 100 runs over a random single-float array of 1,000,000 elements, both calling glibc's _ZGVdN8v_sinf. For Sleef_tanf8_u10avx2, the difference was about 0.27 versus 0.30 seconds.

Since my goal here is to revive with-elementwise-operations, I'm hoping the cost of (:save-p t) will be offset by no longer allocating (and accessing!) the temporary arrays in between. It will be a while before I can implement this and then actually test and benchmark it across various functions.
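For readers following along, here is a minimal sketch of what the (:save-p t) variant might look like. It is the VOP from the earlier comment with a single option added, not the exact code that was benchmarked, and it assumes the same defknown for my-sin and the same loaded sleef library as before.

```lisp
(in-package :sb-vm)

;; Sketch only: identical to the earlier MY-SIN VOP except for (:save-p t),
;; which asks the compiler to save live registers around the VOP.
;; The foreign symbol name is copied from the earlier comment.
(define-vop (my-sin)
  (:translate my-sin)
  (:policy :fast-safe)
  (:args (a :scs (single-avx2-reg)))
  (:arg-types simd-pack-256-single)
  (:results (dest :scs (single-avx2-reg)))
  (:result-types simd-pack-256-single)
  (:save-p t)
  (:generator 1
    (inst call (make-fixup "Sleef_sinf8_u10" :foreign))))
```

Functions that use my-sin, such as my-sin-user, would need to be recompiled after redefining the VOP for the change to take effect.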

> Have you tried this same approach with the functions included in GNU libc? It wouldn't need additional library to be installed.

Yes, but glibc provides only a few of the functions that sleef provides. For the ones it does provide, such as sin and cos, the glibc versions are 2-3 times as fast as the sleef versions on my i7-8750H.

The other thing to worry about might be the handling of two-argument functions; I haven't done any checks with them yet.

Notes

[1] If I go by the x86 calling conventions, wouldn't it be enough if I backup only EAX, EDX, and ECX into some other registers (not even the stack)? A brief test suggested this works, but I'll be sure only once I get all the other functions to work without errors caused by unintended register overwrites. (Basically, if someone wants to look into this, check for the presence or absence of MOV instructions before the relevant CALL instructions in the disassembly.)
[2] I currently run into "The value XMM0 is not of type SB-C:TN" while trying to load sb-simd with SBCL 2.1.8-WIP. I'm hoping to post the exact code once I rebuild SBCL at version 2.1.9 with Marco's fixes. But basically, one aspect is correctness and another is performance. EDIT: Obtaining the RAX register can be done using (:temporary (:sc unsigned-reg :offset rax-offset :to :result) rax); similarly for RDX and RCX.
[3] I'm going to need to sort out my email client for cleaner emails.
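The :temporary form from the EDIT above could be combined with the earlier VOP roughly as follows. This is a sketch under assumptions, not the code that was tested: it assumes rcx-offset and rdx-offset are available in SB-VM alongside rax-offset, and that reserving a register as a temporary is enough to keep live values out of it across the call. It covers only the three registers named in the first note.

```lisp
(in-package :sb-vm)

;; Sketch only: reserve RAX, RCX and RDX for the duration of the VOP so the
;; register allocator does not keep live values in them across the CALL.
(define-vop (my-sin)
  (:translate my-sin)
  (:policy :fast-safe)
  (:args (a :scs (single-avx2-reg)))
  (:arg-types simd-pack-256-single)
  (:results (dest :scs (single-avx2-reg)))
  (:result-types simd-pack-256-single)
  (:temporary (:sc unsigned-reg :offset rax-offset :to :result) rax)
  (:temporary (:sc unsigned-reg :offset rcx-offset :to :result) rcx)
  (:temporary (:sc unsigned-reg :offset rdx-offset :to :result) rdx)
  ;; The temporaries are never referenced in the generator body.
  (:ignore rax rcx rdx)
  (:generator 1
    (inst call (make-fixup "Sleef_sinf8_u10" :foreign))))

;; After recompiling MY-SIN-USER, inspect the code around the CALL
;; to see which moves the compiler actually emitted.
(disassemble 'my-sin-user)
```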

bpecsek commented 3 years ago

Got it. Thanks

moon-chilled commented 2 years ago

> If I go by the x86 calling conventions, wouldn't it be enough if I backup only EAX, EDX, and ECX into some other registers (not even the stack)?

Probably not. That is a 32-bit calling convention; scroll down a bit for the 64-bit calling conventions, which have more caller-saved registers. Probably you just got lucky and sleef doesn't use a lot of GPRs (wouldn't be surprising). If you really want to minimise the cost, you could look at their generated code and see what registers they actually use. I would rather have a pure-cl math library.
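To make the gap concrete, here is a small illustration in portable Common Lisp. It assumes the System V AMD64 convention used on Linux (Windows x64 differs), under which the general-purpose registers listed below are caller-saved, as are all XMM/YMM registers. The variable and function names are made up for this example.

```lisp
;; Caller-saved general-purpose registers under the System V AMD64 ABI.
;; All vector registers are caller-saved as well; they are not listed here.
(defparameter *sysv-amd64-caller-saved-gprs*
  '(:rax :rcx :rdx :rsi :rdi :r8 :r9 :r10 :r11))

(defun unprotected-gprs (saved)
  "Return the caller-saved GPRs that are not in SAVED."
  (set-difference *sysv-amd64-caller-saved-gprs* saved))

;; Saving only the 64-bit counterparts of EAX, ECX and EDX leaves
;; RSI, RDI, R8, R9, R10 and R11 unprotected (returned in some order).
(unprotected-gprs '(:rax :rcx :rdx))
```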

marcoheisig commented 2 years ago

I'm closing this issue because sb-simd is now part of SBCL, and this repository is being archived.

marcoheisig / sb-simd

SIMD Foreign Function (VOPs?) #9