sarah-quinones / gemm

MIT License

F16 intrinsics standalone #14

Closed Narsil closed 1 year ago

Narsil commented 1 year ago

This is a very dirty PR, more a POC than anything else at this point.

It seems to work and be correct. (It passes in every scenario I tried.)
It is faster than without.

half-rs is using a fork (https://github.com/starkat99/half-rs/pull/98) to get some currently nonexistent intrinsics for pure f16 computing.

I then hackishly added them into gemm:

Copy-pasted the code for f16 gemm (which does f16 -> f32 SIMD -> matmul -> f16) to do purely f16 -> f16.
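To make the two data paths concrete, here is a minimal scalar sketch. It is not the PR's code: it uses no SIMD and none of the half-rs intrinsics, and it emulates f16 by rounding f32 values to f16 precision. The helper names (`round_to_f16`, `dot_widened`, `dot_pure_f16`) are hypothetical, and the rounding helper assumes values stay inside f16's normal range.

```rust
/// Round an f32 to the nearest value representable in IEEE binary16
/// (round-to-nearest-even), returned as an f32.
/// Assumption: |x| is zero or lies in f16's normal range (about 6.1e-5 to 65504);
/// subnormals, overflow, and NaN are deliberately not handled.
fn round_to_f16(x: f32) -> f32 {
    let bits = x.to_bits();
    // f32 keeps 23 mantissa bits, f16 keeps 10, so the 13 low bits are dropped.
    let lsb = (bits >> 13) & 1;
    let rounded = bits.wrapping_add(0x0FFF + lsb) & !0x1FFF;
    f32::from_bits(rounded)
}

/// Widened path (f16 -> f32 -> matmul -> f16): products and the running sum
/// stay in f32, and only the final result is narrowed back to f16.
fn dot_widened(a: &[f32], b: &[f32]) -> f32 {
    let sum: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    round_to_f16(sum)
}

/// Pure f16 path (f16 -> f16): the accumulator is rounded to f16 precision
/// after every multiply-add, approximating arithmetic done directly in f16.
fn dot_pure_f16(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .fold(0.0_f32, |acc, (x, y)| round_to_f16(acc + x * y))
}
```

One property this sketch makes visible: with 4096 ones on each side, `dot_widened` returns 4096 while `dot_pure_f16` stops growing at 2048, because 2049 is not representable in f16. Whether that matters depends on the data and on how the real kernel accumulates, which this sketch does not model.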

The code requires black_box atm for the compiler to be happy. This is most likely an error of mine in the half-rs intrinsics implementation (I used the arm! macro but do not understand how that affects the compiler).
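For readers unfamiliar with it, `std::hint::black_box` is an identity function that asks the optimizer to treat its argument as opaque. The sketch below only shows what wrapping a value looks like; where the PR actually places `black_box` is not shown here, and `scale_in_place` is a hypothetical function for illustration.

```rust
use std::hint::black_box;

/// Multiply every element by `factor`.
/// `black_box` returns its argument unchanged, but the optimizer is asked not
/// to assume anything about the value, so it cannot constant-fold across it.
fn scale_in_place(xs: &mut [f32], factor: f32) {
    let factor = black_box(factor);
    for x in xs.iter_mut() {
        *x *= factor;
    }
}
```

The standard library documents `black_box` as best-effort, so relying on it for correctness (rather than for benchmarking) usually points at a problem elsewhere, which matches the suspicion voiced above.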

I didn't re-optimize this afterwards to make sure cache lines were adapted or anything of the sort.

Current results:

GGML WITHOUT ACCELERATE (f32xf16) -> f32: 220ms (1 thread) - 197ms (8 threads)

GEMM (f16xf16) -> f16: 134ms (1 thread) - 110ms (8 threads)

M, N, K: 4096 x 128 x 11108

For reference, Accelerate seems to do ~25ms for the same op, and threading seems to decrease performance on it (which I guess is because Accelerate already uses threading underneath).
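To put the timings on a common scale, they can be converted to throughput. This is a small sketch assuming the usual 2·M·N·K operation count for a matrix product (one multiply and one add per term); the function names are hypothetical.

```rust
/// Floating-point operations in an (M x K) by (K x N) matrix product:
/// one multiply and one add per (m, n, k) triple.
fn gemm_flops(m: u64, n: u64, k: u64) -> u64 {
    2 * m * n * k
}

/// Throughput in GFLOP/s for a run that took `millis` milliseconds.
fn gflops_per_sec(flops: u64, millis: f64) -> f64 {
    flops as f64 / (millis * 1e-3) / 1e9
}
```

For 4096 x 128 x 11108 this gives about 11.65 GFLOP per call, so 220ms is roughly 53 GFLOP/s, 134ms roughly 87 GFLOP/s, and ~25ms roughly 466 GFLOP/s.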