tairov / llama2.mojo

Inference Llama 2 in one file of pure 🔥
https://www.modular.com/blog/community-spotlight-how-i-built-llama2-by-aydyn-tairov
MIT License

improve speed with fused matrix multiplications #69

Closed · mikowals closed this 10 months ago

mikowals commented 10 months ago

There are a couple of sets of matrix multiplications where the A matrix can be shared. Fusing them removes some loops and reads.

On an M1 Pro MacBook, the tokens / sec improvement I saw was roughly:

I use -j 6 (6 of the 10 cores), which seems optimal on all the models.

This change is inspired by one of the Zig implementations. A single-worker run of stories15M improved from 470 -> 530 tokens / sec, so it closes some of the performance gap with Zig in that comparison.

It is slightly ugly to strip the types by passing A.data(), B.data(), and C.data() directly into the new batch_matmul function in transformer, but StaticTuple requires homogeneous types and the fusing sometimes mixes Tensor and TensorSlice.
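
For readers skimming the diff, the shape of the change looks roughly like the sketch below (written in the same older Mojo style as the snippet later in this thread, not copied from the PR). The point is that each chunk of the shared A vector is loaded once and reused against every B matrix, and the B/C buffers are passed as raw pointers packed into a StaticTuple so Tensor- and TensorSlice-backed weights can share one signature. `nelts`, `workers`, and `BufferPtrFloat32` are the existing llama2.mojo aliases; the exact StaticTuple/pointer APIs have shifted between Mojo versions, so treat the signatures as illustrative.

```mojo
# Illustrative sketch only -- not the PR's exact batch_matmul.
# Assumes llama2.mojo's aliases: nelts (SIMD width), workers (thread count),
# and BufferPtrFloat32 (a float32 data pointer).
fn batch_matmul[
    n: Int  # number of fused matmuls, known at compile time
](
    C: StaticTuple[n, BufferPtrFloat32],  # n output buffers
    A: BufferPtrFloat32,                  # the shared input vector
    B: StaticTuple[n, BufferPtrFloat32],  # n weight matrices (rows x cols)
    rows: Int,
    cols: Int,
):
    @parameter
    fn compute_row(i: Int):
        # one SIMD accumulator per fused output
        var tmp = StaticTuple[n, SIMD[DType.float32, nelts]]()
        for k in range(n):
            tmp[k] = SIMD[DType.float32, nelts](0)

        @parameter
        fn dot[_nelts: Int](j: Int):
            if _nelts < nelts:  # tail elements shorter than the SIMD width
                let a = A.simd_load[_nelts](j)  # shared A chunk loaded once...
                for k in range(n):              # ...and reused for every B
                    var acc = tmp[k]
                    acc[0] += (a * B[k].simd_load[_nelts](i * cols + j)).reduce_add()
                    tmp[k] = acc
            else:
                let a = A.simd_load[nelts](j)
                for k in range(n):
                    tmp[k] = tmp[k] + a * B[k].simd_load[nelts](i * cols + j)

        vectorize[nelts, dot](cols)
        for k in range(n):
            C[k].store(i, tmp[k].reduce_add())

    # The PR additionally unrolls the compile-time `range(n)` loops with Mojo's
    # unroll; plain loops keep this sketch version-agnostic.
    parallelize[compute_row](rows, workers)
```

A call site then strips the types by packing the output and weight pointers into StaticTuples and passing A.data() alongside them, which is the part described as "slightly ugly" above.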

tairov commented 10 months ago

Hi @mikowals, cool! As always you're ahead of everyone. unroll is a great find!

I think we're beating llama.cpp now 😄

comparison report: https://engiware.com/benchmark/hypertune-report-cf7a.html

[benchmark screenshot]

We're really close to 1000 tok/s

[benchmark screenshot]

mikowals commented 10 months ago

Thanks for the update. Good Stuff!

Not for this PR, but try running with 4 * simdwidthof[DType.float32](). On my M1 machine that improves performance a bit more. I had a stories15M run that hit 902 toks / sec.

But setting that properly probably requires autotune, since it is a compile-time value and the best choice is very OS / architecture dependent.
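
For concreteness, the knob being discussed is the compile-time SIMD-width alias in llama2.mojo (called `nelts` there); the suggested experiment is roughly the change below. This is a hypothetical tweak, not part of this PR.

```mojo
from sys.info import simdwidthof

# Widen the per-chunk SIMD working set. Because this is a compile-time alias,
# the best multiplier (1x, 2x, 4x, ...) is OS/architecture dependent,
# hence the autotune remark above.
alias nelts = 4 * simdwidthof[DType.float32]()
```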

tairov commented 10 months ago

Indeed, threads = 6 is fastest; it shows 1024 tok/s (!) at max

[benchmark screenshot]

tairov commented 10 months ago

A multiplier of 4 for the simd width works slightly better on Mac:

[benchmark screenshot]

tairov commented 10 months ago

BTW, have you tried vectorize_unroll?

mikowals commented 10 months ago

I think unrolling needs a lot of iterations to make a difference, and our matrices aren't very big. My understanding is that it just skips incrementing the counter and checking whether the end has been reached. Those aren't very expensive ops.

If you take the matmul example from the Modular repo and set the number of workers correctly (6 or 8 on M1), you can also see that the impact of tiling and unrolling looks negative. I'm not sure why those optimizations show a mild improvement when many workers are used, but setting the worker count correctly seems better.
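
(For anyone reproducing this with the Modular matmul example: the knob in question is just the optional worker-count argument to parallelize; the sketch below uses the hypothetical `compute_row` / `workers` names from this thread rather than the example's own.)

```mojo
# Let the runtime pick the degree of parallelism:
parallelize[compute_row](rows)

# Or pin it explicitly, e.g. workers = 6 on an M1 Pro (the -j 6 setting above):
parallelize[compute_row](rows, workers)
```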

tairov commented 10 months ago

Thank you!

tairov commented 10 months ago

The readability is just "brr.." now, though 😃

andresnowak commented 10 months ago

If I could ask @mikowals: I also tried fusing the matmul operations, but I was doing it manually. Instead of using the unroll for a batch of size 3, I wrote the 3 operations out by hand, while also skipping the extra memory reads of the A tensor. But the speeds were the same as the original version. I assumed that was because it is basically the same number of operations, minus the reads of the A matrix. So I wanted to ask: how is it that doing it this way we get a speedup? Is it because the StaticTuple is handled at compile time?

mikowals commented 10 months ago

@andresnowak, I expect manually writing out the loops to work exactly the same, since that is what I think 'unroll' does when the iteration count is set at compile time.

StaticTuple probably allows some optimizations over a DynamicVector or other generic array types. But any implementation with size and type known at compile time should get the same benefits. So passing known types as arguments should be as good or better.

So I would be curious to see your implementation and try to learn why it might be slower.
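
Schematically, the equivalence being claimed looks like the fragment below (illustration only: `@unroll` is just one spelling Mojo has offered for unrolling and differs from the `unroll` helper the PR uses; the `tmp_*` / `B_*` / `a` names come from the snippets in this thread). With the trip count known at compile time, both forms should lower to the same straight-line code.

```mojo
# Unrolled loop over a compile-time trip count...
@unroll
for k in range(2):
    tmp[k] = tmp[k] + a * B[k].simd_load[nelts](i * cols + j)

# ...versus writing the two fused accumulations out by hand:
tmp_1 = tmp_1 + a * B_1.simd_load[nelts](i * cols + j)
tmp_2 = tmp_2 + a * B_2.simd_load[nelts](i * cols + j)
```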

andresnowak commented 10 months ago

This is the code. I didn't remember that AnyType is the "traits" mechanism for now, so I was passing each matrix as a separate argument.


```mojo
@always_inline
fn matmul_parallelized(
    C_1: BufferPtrFloat32,
    C_2: BufferPtrFloat32,
    A_1: BufferPtrFloat32,
    B_1: BufferPtrFloat32,
    B_2: BufferPtrFloat32,
    rows: Int,
    cols: Int,
):
    @parameter
    fn compute_row(i: Int):
        var tmp_1 = SIMD[DType.float32, nelts](0)
        var tmp_2 = SIMD[DType.float32, nelts](0)

        @parameter
        fn dot[_nelts: Int](j: Int):
            if _nelts < nelts:  # take care of tail array elements with length < nelts
                let A = A_1.simd_load[_nelts](j)
                tmp_1[0] += (
                    A * B_1.simd_load[_nelts](i * cols + j)
                ).reduce_add()
                tmp_2[0] += (
                    A * B_2.simd_load[_nelts](i * cols + j)
                ).reduce_add()
            else:
                let A = A_1.simd_load[nelts](j)
                tmp_1 += A * B_1.simd_load[nelts](i * cols + j)
                tmp_2 += A * B_2.simd_load[nelts](i * cols + j)

        vectorize[nelts, dot](cols)
        C_1.store(i, tmp_1.reduce_add())
        C_2.store(i, tmp_2.reduce_add())

    parallelize[compute_row](rows, workers)
```

mikowals commented 10 months ago

@andresnowak, I ran some quick benchmarks and your version seems to perform identically to my version. And both outperform unfused matmul substantially. It may be that if you only tried fusing 2 matmuls, the impact got hidden in the performance noise of running the whole network.

These are the results. I label your version 'hardcoded' and my version 'unrolled'. I timed the 3 varieties (fused hardcoded, fused unrolled, and separate matmuls) when doing 2 matmuls, and then did additional runs of unrolled and separate matmuls when fusing 3 matmuls. Fusing more ops and loops together is a bigger win. The tests are on an M1 Pro with rows=288, cols=288, workers=6, and nelts=16, so 'stories15M' size.

fused 2 - hard coded:  13135 nanoseconds
fused 2 - unrolled:  13071 nanoseconds
2 separate matmuls:  20838 nanoseconds
fused 3 - unrolled:  15506 nanoseconds
3 separate matmuls:  32092 nanoseconds

The file to reproduce is attached as a .txt: fused_matmul.txt

andresnowak commented 10 months ago

Hmm, yeah, it's true, I'm seeing the same results on my computer. When I was testing, I was just using time.now() with, say, the last two matmuls separated and then fused, running the stories15M size and looking at the time output of those matmuls, and there I wasn't seeing a difference. But here with the benchmark approach I am seeing a difference. Or maybe I was reading the numbers incorrectly when I was doing it my way and just missed it.
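
(Schematically, the two measurement styles being contrasted look something like the sketch below, reusing hypothetical buffer names matching the snippet above. time.now() returns nanoseconds; a single pair around a microsecond-scale kernel inside a full generation run is easily swamped by noise, while repeating the call and keeping the best time gives a steadier number.)

```mojo
from time import now

# One-shot timing (what the paragraph above describes): easily lost in noise.
let start = now()
matmul_parallelized(c_1, c_2, a_1, b_1, b_2, rows, cols)
let one_shot_ns = now() - start

# Steadier: repeat many times and keep the best (or average) elapsed time.
var best_ns = one_shot_ns
for it in range(1000):
    let t0 = now()
    matmul_parallelized(c_1, c_2, a_1, b_1, b_2, rows, cols)
    let dt = now() - t0
    if dt < best_ns:
        best_ns = dt
```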

Thank you for your help @mikowals. At least I learned that fusing matmuls that have the same size and share at least one of the tensors (so as to reduce the number of memory reads, or reads and writes) does have an impact on performance.