certik opened 1 year ago
Thanks for taking a look. I changed it so the model can be specified on the command line now. With the 42M model I get about 67 tokens/sec on my thinkpad, vs 165 tokens/sec with the 15M. I'm going to try running the real LLaMA7B model.
@rbitr make sure to try the optimization options. On your hardware, you want to do `gfortran -O3 -march=native -ffast-math -funroll-loops llama2.f90 -o llm`. Even faster would be to add `-fexternal-blas` and link OpenBLAS with it, but since llama2.c uses a handwritten matmul (https://github.com/karpathy/llama2.c/blob/35deb5e0fa55f0a257040bcf1624ed8386e63dc7/run.c#L222), I think it's fair to just use what gfortran can do without OpenBLAS.
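For context, the handwritten routine linked above computes `out = W x` as d independent dot products over a row-major W, which is what makes it easy for the compiler to vectorize and for OpenMP to split across threads. A minimal C sketch of the same idea (an illustration, not the exact run.c code):

```c
#include <stddef.h>

/* out = W x, where W is (d, n) in row-major order.
 * Each out[i] is an independent dot product, so the outer
 * loop parallelizes cleanly with OpenMP. */
void matmul(float *out, const float *x, const float *w, int n, int d) {
    #pragma omp parallel for
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[(size_t)i * n + j] * x[j];
        }
        out[i] = val;
    }
}
```

Without `-fopenmp` the pragma is simply ignored and this is a plain double loop, which is roughly the baseline the intrinsic Fortran `matmul` has to beat.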
With `gfortran llama2.f90 -o llm` I get 166 tokens/sec, and with `gfortran -O3 -march=native -ffast-math -funroll-loops llama2.f90 -o llm` I get 69 tokens/sec. Not sure what's going on. This is on Ubuntu with GNU Fortran 9.4 (maybe it's outdated?). I'll take a closer look.
Edit: on my Mac (2018) it looks like I get a small increase, from ~120 tok/sec with no arguments to ~130 with `-O3 -march=native -ffast-math -funroll-loops`. I'll have to check against llama2.c later.
Current numbers (for the 110M model now, Ubuntu on my Thinkpad):

- llama2.f90 with `gfortran llama2.f90 -o llm -fexternal-blas -lblas`: 28 tok/s (this is the fastest I can get with the different options)
- llama2.c with `gcc -Ofast -fopenmp -march=native run.c -lm -o run`: 38 tok/s
I wonder if llama2.c's matmuls are getting parallelized better because they are all explicitly vector-matrix products?
OK, I added a handwritten matmul like in llama2.c. Now, unless I missed something, compiling with `gfortran -O3 -march=native -ffast-math -funroll-loops llama2.f90 -o llm` gives me 38 tok/s on the 110M model, i.e. the same as llama2.c. There is another branch with the custom matmul: https://github.com/rbitr/llama2.f90/tree/manual_matmul
Excellent. That's a great starting point for trying various options, such as the intrinsic matmul and various BLAS, and compiler options.
I've noticed in the past that sometimes the intrinsic matmul is slower than a handwritten one. However, you also have a very old version of GFortran; I think version 9 is from 2019. What version of gcc do you have to test llama2.c? Here are the releases together with dates: https://gcc.gnu.org/releases.html
I installed gfortran-10 and I get the same speed. Still going to investigate different compiler options and I will write a BLAS matmul. I had compiled llama2.c with gcc 9.4. That is the version that ships with Ubuntu 20.04.
Very good. Thanks @rbitr. If you need more help, you can also ask at https://fortran-lang.discourse.group/, a lot of knowledgeable people there.
@certik FYI I was able to get more speedup by writing element-wise functions that compute the q,k,v projections together at the beginning of the transformer and the MLP+nonlinearity part at the end and then parallelizing with OMP. I'm running a 3B parameter model at ~0.8 tok/s on my computer now, up from about 0.1 tok/s a week ago. It should be much faster than llama2.c at this point, but is still slower than llama.cpp, I'm still trying to understand all the optimizations he's using.
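To illustrate the kind of fusion described above (the function name and shapes here are hypothetical, not the ones in the repo): computing the q, k and v projections in a single loop means each element of x is loaded once and used three times, and the outer loop over output rows still splits cleanly across threads with OMP.

```c
#include <stddef.h>

/* Hypothetical sketch of a fused QKV projection: wq, wk, wv are
 * (d, n) row-major weight matrices applied to the same input x.
 * Fusing the three matvecs reuses each x[j] load three times. */
void qkv_fused(float *q, float *k, float *v, const float *x,
               const float *wq, const float *wk, const float *wv,
               int n, int d) {
    #pragma omp parallel for
    for (int i = 0; i < d; i++) {
        float sq = 0.0f, sk = 0.0f, sv = 0.0f;
        for (int j = 0; j < n; j++) {
            float xj = x[j];          /* loaded once, used three times */
            sq += wq[(size_t)i * n + j] * xj;
            sk += wk[(size_t)i * n + j] * xj;
            sv += wv[(size_t)i * n + j] * xj;
        }
        q[i] = sq; k[i] = sk; v[i] = sv;
    }
}
```

The same pattern applies to the two MLP projections that share an input at the end of the block.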
Re above, I hadn't parallelized everything I could, and now I get e.g. 2.48 tok/s vs 2.82 tok/s for llama.cpp on a 3B model (+/-; those are the numbers of the last runs I did). So it really is on par with llama.cpp, which is very heavily optimized.
Very good, great job! Yes, Fortran is capable of matching the speed of the most optimized libraries.
A couple notes:
I had previously been using a lookup table for the FP16 -> real(4) conversion, as it was faster than calling the C function I had been using. In my experiments, the conversion represents ~50% of the total time in single-threaded operation. I replaced the former FP16 conversion library with the code used in GGML, which (more importantly) compiles with the `-flto` option for link-time optimization, and got some speedup while also dropping the external library I'd used. See the `f16_convert` branch.
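For reference, a table-free FP16 -> float conversion of the kind GGML can inline under `-flto` is plain IEEE 754 bit manipulation. A scalar sketch (an illustration of the technique, not the repo's actual code):

```c
#include <stdint.h>
#include <string.h>

/* Convert one IEEE binary16 value (carried in a uint16_t) to float. */
float f16_to_f32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0x1Fu) {              /* inf or NaN */
        bits = sign | 0x7F800000u | (mant << 13);
    } else if (exp != 0) {           /* normal: rebias exponent 15 -> 127 */
        bits = sign | ((exp + 112u) << 23) | (mant << 13);
    } else if (mant == 0) {          /* signed zero */
        bits = sign;
    } else {                         /* subnormal: renormalize the mantissa */
        int shift = 0;
        while (!(mant & 0x400u)) { mant <<= 1; shift++; }
        bits = sign | ((uint32_t)(113 - shift) << 23) | ((mant & 0x3FFu) << 13);
    }

    float f;
    memcpy(&f, &bits, sizeof f);     /* bit-cast without aliasing issues */
    return f;
}
```

A 65536-entry table built from a function like this is the other classic option; which one wins in practice depends on cache pressure, which may explain why the single-threaded and multi-threaded results respond differently.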
With these changes, we have the following performance (indicative numbers on my machine):
| Program | 1 Thread | 12 Threads |
|---|---|---|
| llama2.f90 | 0.66 tok/s | 2.7 tok/s |
| llama.cpp | 2.38 tok/s | 2.82 tok/s |
Clearly there is a lot of room to improve single-thread performance, but I'm surprised at how little difference the additional threads make for llama.cpp. This gives us something to dig into, anyway.
In single-threaded operation, the breakdown of times is roughly as follows (the total should be roughly 1/(speed in tok/s), i.e. about 1500 ms):
| n | Description | Time |
|---|---|---|
| 1 | qkv projections | 376.0 ms |
| 2 | position embeddings | 0.0 ms |
| 3 | attention | 5.7 ms |
| 4 | up/down projections | 1092.7 ms |
| 5 | classifier head | 46.3 ms |
Excellent. I think llama.cpp gets its speedup from running in parallel; however, be careful that the 1-thread benchmark is truly 1 thread. A lot of time is spent in matmul, and OpenBLAS, for example, runs in parallel by default (I think). Overall this is already nicely competitive, and I think we'll be able to match the performance.
Can llama.cpp run GPT-2? If so, we can test against fastGPT, where I understand the performance quite well.
> Can llama.cpp run GPT-2?
llama.cpp doesn't appear to support GPT-2 directly. There is an old demo of using GPT-2 with the GGML library; see https://github.com/ggerganov/ggml/tree/239defe61dbe9dddc6304942e8a3d03d6a3c69ab#gpt-inference-example That demo is broken (I had to do `make gpt-2-batched`), and it uses the older file formats etc., so I don't know how fair a comparison it gives. I got it running; I'll try it vs fastGPT.
> however be careful that the 1 thread benchmark is truly 1 thread
I think it is: I only see one thread running in htop, and I got the same result compiling without BLAS. I was able to get some more speedup by replacing the FP16-to-float32 conversion with hard-coded SIMD instructions: https://github.com/rbitr/llama2.f90/blob/f16_convert/convert.c#L78 This brings the speed from 0.66 tok/s to 1 tok/s on a single thread. Notably, it doesn't speed up multi-threaded operation. I suspect that on my machine there is some other bottleneck limiting both the .f90 and .cpp versions to around 2.75 tok/s. I'm still looking at what else llama.cpp may be doing differently, but as you say, I think it's already very competitive.
For GPT-2 and f32 model most of the performance comes from matmul. For llama I would expect a similar result. So one way to go forward is to use f32 model and benchmark that (if any exists). The point is to get something where we get the same or better performance. Then we can add other features (like reduced accuracy) back in and ensure, one by one, that they run at top speed.
Yes, good idea. I did a comparison using all 32-bit with a 1.1B Llama model (due to memory size this is easier to work with than the 3B I was using for the other benchmarks, and the quality seems the same, as it's a newer model). After hacking together a 32-bit version, I get 3.2 tok/s for the 1.1B model vs 4.2 tok/s with ggml. I have some work to do to bring up the performance, and like you say, this is a good way to do a basic comparison without confounding factors like the fp16. (Although ggml runs f16 faster than f32, I think because it does custom vectorization.)
Very good. Let's do 32bit. It's now 3.2 vs 4.2 tok/s, so it's close, but not quite there yet. Let's get it to be exactly equal. On single core, the possible differences are:
Start with the first point, then we'll see. One way to go forward is to hack llama.cpp and simplify the algorithm to do some matmul but remove other operations (it will return non-sensical results of course) and do the same in llama2.f90, and do whatever it takes to get the same performance, possibly just matmul. Then keep adding back the other operations, one by one, in both codes and see what slows it down.
So, good news: I've matched the speed of llama.cpp. I get (for example) 4.18 tok/s with llama.cpp and 4.21 tok/s with Fortran on the 1.1B model. Other than cleaning up some unneeded copying, the main things responsible for the speed were replacing `matmul` with a loop over dot products (since it's all vector-matrix multiplication, I find there is a huge penalty for using `matmul` directly) and, where possible, ensuring better memory contiguity for the qkv and ffn weights. I also found a small speedup from hard-coding the matrix dimensions, as they are always known in advance.
For the comparison I didn't use BLAS (I compiled llama.cpp without BLAS as well) but in my experiments I don't see any material difference on my machine.
What I'm going to do now is clean it up and keep this pared-down fast version as `master` for the time being, as well as write a better readme and a post describing it.
Beautiful! Great job. This was the hardest.
Now that you have a version that runs as fast, you can start adding back features, one by one, always benchmarking to ensure the new feature doesn't slow things down. Then you can also investigate parallelism, again one feature at a time.
Current performance with 16-bit quantization is 7.3 tok/s on one thread, vs 7.4 tok/s with llama.cpp. This uses a SIMD routine to convert from 16-bit to f32 and take the dot product (as does llama.cpp), plus the `-fprefetch-loop-arrays` compiler option, which gives a small performance boost.
Thank you so much for writing this. We are now working on compiling it with LFortran, this is a great example.
Regarding performance on my Apple M1 Max with GFortran 11.3.0: I get about 240 tokens/s with the default gfortran options. With `-O3 -march=native -ffast-math -funroll-loops` I get about 277 tokens/s. Finally, with `gfortran -O3 -march=native -ffast-math -funroll-loops -fexternal-blas llama2.f90 -o llm -framework Accelerate`, which should be the fastest, I still only get about 270 tokens/s. I think this is too small of a model; one would have to try a larger version to take advantage of the accelerated linear algebra.