Closed · mikowals closed this 11 months ago
A 25% speedup is really impressive. I was considering the same optimisation, since llama2.c
also contains a `#pragma omp`
on the same loop -- https://github.com/karpathy/llama2.c/blob/master/run.c#L290
But then my profiling of the Python code showed that most of the time is consumed in matmul.
It seems this forward pass is another hot code section.
I was able to reproduce the speedup, but I had to set `Runtime` threads to `num_cores() // 2`
for my target host:
| llama2.mojo (`num_cores()`) | llama2.mojo (`num_cores() // 2`) | llama2.c (OMP) |
| --- | --- | --- |
| 203.2 tok/s | 440.8 tok/s | 435 tok/s |
```
mojo --version
mojo 0.3.1 (a3eed7c8)
```
Only 3 cores/vcores (out of 6) are busy for llama2.mojo,
while all 6 cores are in use for llama2.c.
tinyllama performance: achieved tok/s: 6.84
We can proceed to merge this PR on its own. I think the commit history should be squashed before merging, but I will keep "replace Matrix with Tensor" and "vectorize and parallelize loops in transformer" as separate commits since they are unrelated changes.
@mikowals well, if you prefer to keep just the 2 commits, that would be better. Otherwise GH has a cool feature :)
I just did the rebasing manually so there will be two separate commits. I think it might be useful for future optimisation efforts to easily see those changes separately.
Have you tested `num_cores() // 1`? :)
Marked as draft because I built off of the PR #38 branch.
This gets about a 25% speedup by using Mojo's
`parallelize`
and `vectorize`
on the loops in the transformer forward pass. I also removed the division of `num_cores()`
by two, but the speedup figures come from apples-to-apples comparisons of both versions running with all cores. This was on an 8-core Codespace instance.

Looking at the last commit (to ignore the shift to Tensors), you can see it is just applying the standard for-loop to `parallelize` (or `vectorize`) conversion in some key places. The main impact comes from making the loops over attention heads parallel; those loops have other loops nested within them. So I believe the biggest impact comes from changing 4 lines of code. Whether or not the shift to Tensors is adopted, adding these optimisations is probably worth doing.