Closed · mikowals closed this 11 months ago
A 25% speedup is really impressive. I was considering the same optimisation, since llama2.c
also contains a `#pragma omp`
on the same loop -- https://github.com/karpathy/llama2.c/blob/master/run.c#L290
But then my profiling of the Python code showed that most of the time is consumed in matmul.
It seems this forward pass is another hot code section.
I was able to reproduce the speedup, but I had to set `Runtime` threads to `num_cores() // 2`
for my target host:
| llama2.mojo (`num_cores()`) | llama2.mojo (`num_cores() // 2`) | llama2.c (OMP) |
| --- | --- | --- |
| 203.2 tok/s | 440.8 tok/s | 435 tok/s |
```
mojo --version
mojo 0.3.1 (a3eed7c8)
```
Only 3 cores/vcores (out of 6) are busy for llama2.mojo,
while all 6 cores are in use for llama2.c.
tinyllama performance: achieved tok/s: 6.84
We can proceed to merge this PR on its own. I think the commit history should be squashed before merging, but I will keep "replace Matrix with Tensor" and "vectorize and parallelize loops in transformer" as separate commits since they are unrelated changes.
@mikowals well, if you prefer to keep just the 2 commits, that would be better. Otherwise GH has a cool feature :)
I just did the rebasing manually so there will be two separate commits. I think it might be useful for future optimisation efforts to easily see those changes separately.
Have you tested `num_cores() // 1`? :)
Marked as draft because I built off of the PR #38 branch.
This gets about a 25% speedup by using Mojo's
`parallelize`
and `vectorize`
on the loops in the transformer forward pass. I also removed the division of `num_cores()`
by two, but the speedup figures come from apples-to-apples comparisons of both versions running with all cores. This was on an 8-core Codespace instance.

Looking at the last commit (to ignore the shift to Tensors), you can see it is just applying the standard for-loop to `parallelize` (or `vectorize`) conversion in some key places. The main impact comes from making the loops over attention heads parallel; those loops have other loops nested within them. So I believe the biggest impact comes from changing 4 lines of code. Whether or not the shift to Tensors is adopted, adding these optimisations is probably worth doing.