rbitr / llm.f90

LLM inference in Fortran
MIT License

Performance #3

Open certik opened 1 year ago

certik commented 1 year ago

Thank you so much for writing this. We are now working on compiling it with LFortran, this is a great example.

Regarding performance: on my Apple M1 Max with GFortran 11.3.0, I get about 240 tokens/s with the default gfortran options. With -O3 -march=native -ffast-math -funroll-loops I get about 277 tokens/s. Finally, with gfortran -O3 -march=native -ffast-math -funroll-loops -fexternal-blas llama2.f90 -o llm -framework Accelerate, which should be the fastest, I still only get about 270 tokens/s. I think this is too small a model; one would have to try a larger version to take advantage of the accelerated linear algebra.

rbitr commented 1 year ago

Thanks for taking a look. I changed it so the model can be specified on the command line now. With the 42M model I get about 67 tokens/sec on my thinkpad, vs 165 tokens/sec with the 15M. I'm going to try running the real LLaMA7B model.

certik commented 1 year ago

@rbitr make sure to try the optimization options. On your hardware, you want to do gfortran -O3 -march=native -ffast-math -funroll-loops llama2.f90 -o llm. Even faster would be to add -fexternal-blas and link OpenBLAS with it, but since llama2.c uses a handwritten matmul (https://github.com/karpathy/llama2.c/blob/35deb5e0fa55f0a257040bcf1624ed8386e63dc7/run.c#L222), I think it's fair to just use what gfortran can do without OpenBLAS.

rbitr commented 1 year ago

With gfortran llama2.f90 -o llm I get 166 tokens/sec, and with gfortran -O3 -march=native -ffast-math -funroll-loops llama2.f90 -o llm I get 69 tokens/sec. I'm not sure what's going on. This is with Ubuntu, GNU Fortran 9.4 (maybe it's outdated?). I'll take a closer look.

Edit: on my Mac (2018) it looks like I get a small increase, from ~120 tok/sec with no arguments to ~130 with -O3 -march=native -ffast-math -funroll-loops. I'll have to check against llama2.c later.

rbitr commented 1 year ago

Current numbers (for the 110M model now, Ubuntu on my Thinkpad):

llama2.f90 with gfortran llama2.f90 -o llm -fexternal-blas -lblas: 28 tok/s (this is the fastest I can get with the different options)
llama2.c with gcc -Ofast -fopenmp -march=native run.c -lm -o run: 38 tok/s

I wonder if llama2.c's matmuls are getting parallelized better because they are all explicitly vector-matrix products?

rbitr commented 1 year ago

OK, I added a handwritten matmul like in llama2.c. Now, unless I missed something, compiling with gfortran -O3 -march=native -ffast-math -funroll-loops llama2.f90 -o llm gives me 38 tok/s on the 110M model, i.e. the same as llama2.c. There is another branch with the custom matmul: https://github.com/rbitr/llama2.f90/tree/manual_matmul
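
In Fortran terms, a minimal sketch of such a handwritten vector-matrix product (names and the weight layout are illustrative, not the exact code in the manual_matmul branch) might look like:

```fortran
! Sketch only: a handwritten vector-matrix product in the spirit of llama2.c's
! matmul. Here w(:, i) is assumed to hold the weights for output element i, so
! the inner loop runs over contiguous memory (Fortran is column-major).
subroutine matvec(y, w, x, n, d)
  use iso_fortran_env, only: real32
  implicit none
  integer, intent(in) :: n, d
  real(real32), intent(in)  :: w(n, d), x(n)
  real(real32), intent(out) :: y(d)
  integer :: i, j
  real(real32) :: s
  do i = 1, d
     s = 0.0_real32
     do j = 1, n
        s = s + w(j, i) * x(j)
     end do
     y(i) = s
  end do
end subroutine matvec
```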

certik commented 1 year ago

Excellent. That's a great starting point for trying various options, such as the intrinsic matmul and various BLAS, and compiler options.

I've noticed in the past that sometimes the intrinsic matmul is slower than a handwritten one. However, you also have a very old version of GFortran; I think version 9 is from 2019. What version of gcc did you use to test llama2.c? Here are the releases together with dates: https://gcc.gnu.org/releases.html

rbitr commented 1 year ago

I installed gfortran-10 and I get the same speed. I'm still going to investigate different compiler options, and I will write a BLAS matmul. I had compiled llama2.c with gcc 9.4; that is the version that ships with Ubuntu 20.04.

certik commented 1 year ago

Very good. Thanks @rbitr. If you need more help, you can also ask at https://fortran-lang.discourse.group/, a lot of knowledgeable people there.

rbitr commented 1 year ago

@certik FYI, I was able to get more speedup by writing element-wise functions that compute the q, k, v projections together at the beginning of the transformer block and the MLP + nonlinearity part at the end, and then parallelizing with OMP. I'm running a 3B-parameter model at ~0.8 tok/s on my computer now, up from about 0.1 tok/s a week ago. It should be much faster than llama2.c at this point, but it is still slower than llama.cpp; I'm still trying to understand all the optimizations he's using.
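
Roughly the idea, as a self-contained sketch rather than the actual llama2.f90 code (dimensions and names are illustrative, and q, k and v are assumed to share the embedding dimension):

```fortran
program fused_qkv_sketch
  ! Hypothetical sketch: fuse the q, k, v projections into one OpenMP loop so
  ! each thread reuses the shared input x. Compile with e.g.
  !   gfortran -O3 -fopenmp fused_qkv_sketch.f90
  use iso_fortran_env, only: real32
  implicit none
  integer, parameter :: d = 768                  ! assumed embedding dimension
  real(real32) :: wq(d, d), wk(d, d), wv(d, d)   ! assumed square projections
  real(real32) :: x(d), q(d), k(d), v(d)
  integer :: i

  call random_number(wq); call random_number(wk); call random_number(wv)
  call random_number(x)

  ! Each output element is a dot product over a contiguous column
  ! (Fortran is column-major), and the three projections share one loop.
  !$omp parallel do
  do i = 1, d
     q(i) = dot_product(wq(:, i), x)
     k(i) = dot_product(wk(:, i), x)
     v(i) = dot_product(wv(:, i), x)
  end do
  !$omp end parallel do

  print *, q(1), k(1), v(1)
end program fused_qkv_sketch
```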

rbitr commented 1 year ago

Re the above: I hadn't parallelized everything I could, and now I get, for example, 2.48 tok/s vs 2.82 tok/s for llama.cpp on a 3B model (give or take; those are the numbers from the last runs I did). So it really is on par with llama.cpp, which is very heavily optimized.

certik commented 1 year ago

Very good, great job! Yes, Fortran is capable of matching the speed of the most optimized libraries.

rbitr commented 11 months ago

A couple of notes: I had previously been using a lookup table for FP16->real(4) conversion, as it was faster than calling the C function I had been using. In my experiments, the conversion represents ~50% of the total time in single-threaded operation. I replaced the former FP16 conversion library with the code used in GGML, which, more importantly, compiles with the -flto option for link-time optimization; this gave some speedup while also letting me drop the external library I'd used. See the f16_convert branch.
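
For illustration, a plain-Fortran sketch of the FP16 -> real(4) bit manipulation (this is not the GGML code or the routine in the f16_convert branch; it handles zero and normal values only, and a lookup table is just this function precomputed for all 65536 bit patterns):

```fortran
module f16_convert_sketch
  ! Sketch only: FP16 -> real(4) conversion for zero and normal values.
  ! Subnormals, Inf and NaN are omitted for brevity.
  use iso_fortran_env, only: int16, int32, real32
  implicit none
contains
  elemental function f16_to_f32(h) result(x)
    integer(int16), intent(in) :: h
    real(real32) :: x
    integer(int32) :: bits, expo, mant, out
    bits = iand(int(h, int32), int(z'FFFF', int32))   ! raw 16-bit pattern
    expo = iand(ishft(bits, -10), 31)                 ! 5-bit exponent
    mant = iand(bits, int(z'03FF', int32))            ! 10-bit mantissa
    if (expo == 0) then
       x = 0.0_real32                                 ! zero (subnormals flushed)
    else
       ! rebias exponent (15 -> 127) and widen mantissa (10 -> 23 bits)
       out = ior(ishft(expo + 112, 23), ishft(mant, 13))
       x = transfer(out, x)
    end if
    if (iand(bits, int(z'8000', int32)) /= 0) x = -x  ! apply the sign bit
  end function f16_to_f32
end module f16_convert_sketch
```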

With these changes, we have the following performance (indicative numbers on my machine):

Program      1 thread      12 threads
f90          0.66 tok/s    2.7 tok/s
llama.cpp    2.38 tok/s    2.82 tok/s

Clearly there is a lot of room to improve single-thread performance, but I'm surprised at how little difference the additional threads make for llama.cpp. This gives us something to dig into anyway.

In single-threaded operation, the breakdown of times is roughly as follows (the total should be roughly 1/(speed in tok/s), i.e. about 1500 ms):

n  Description           time (ms)
1  qkv projections         376.0
2  position embeddings       0.0
3  attention                 5.7
4  up/down projections    1092.7
5  classifier head          46.3

certik commented 11 months ago

Excellent. I think llama.cpp gets its speedup in parallel; however, be careful that the 1-thread benchmark is truly 1 thread. A lot of time is spent in matmul, and OpenBLAS, for example, runs in parallel by default (I think). Overall I think this is already nicely competitive and we'll be able to match the performance.

Can llama.cpp run GPT-2? If so, we can test against fastGPT, where I understand the performance quite well.

rbitr commented 10 months ago

> Can llama.cpp run GPT-2?

llama.cpp doesn't appear to support GPT-2 directly. There is an old demo of using GPT-2 with the GGML library; see https://github.com/ggerganov/ggml/tree/239defe61dbe9dddc6304942e8a3d03d6a3c69ab#gpt-inference-example The build was somewhat broken (I had to do make gpt-2-batched), and it uses the older file formats etc., so I don't know how fair a comparison it gives. I got it running; I'll try it against fastGPT.

> however be careful that the 1 thread benchmark is truly 1 thread

I think it is: I only see one thread running in htop, and I got the same result compiling without BLAS. I was able to get some more speedup by replacing the FP16 to float32 conversion with hard-coded SIMD instructions: https://github.com/rbitr/llama2.f90/blob/f16_convert/convert.c#L78 This brings the speed from 0.66 tok/s to 1 tok/s on a single thread. Notably, it doesn't speed up multi-threaded operation. I suspect that on my machine there is some other bottleneck causing both the .f90 version and the .cpp version to be limited to around 2.75 tok/s. I'm still looking at what else llama.cpp may be doing differently, but as you say I think it's already very competitive.

certik commented 10 months ago

For GPT-2 and an f32 model, most of the performance comes from matmul. For LLaMA I would expect a similar result. So one way to go forward is to use an f32 model and benchmark that (if one exists). The point is to get something where we achieve the same or better performance. Then we can add other features (like reduced accuracy) back in and ensure, one by one, that they run at top speed.

rbitr commented 10 months ago

Yes, good idea. I did an all-32-bit comparison with a 1.1B LLaMA model (due to memory size it is easier to work with than the 3B I was using for the other benchmarks, and the quality seems the same since it's a newer model). After hacking together a 32-bit version, I get 3.2 tok/s on the 1.1B-parameter model vs 4.2 tok/s with ggml. I have some things to work on to bring up the performance, and like you say this is a good way to do a basic comparison without confounding factors like FP16. (ggml actually runs f16 faster than f32, I think because it does custom vectorization.)

certik commented 10 months ago

Very good. Let's do 32-bit. It's now 3.2 vs 4.2 tok/s, so it's close, but not quite there yet. Let's get it to be exactly equal. On a single core, the possible differences are:

Start with the first point, then we'll see. One way to go forward is to hack llama.cpp and simplify the algorithm to do some matmul but remove the other operations (it will return nonsensical results, of course), do the same in llama2.f90, and do whatever it takes to get the same performance, possibly just matmul. Then keep adding back the other operations, one by one, in both codes and see what slows things down.

rbitr commented 10 months ago

So, good news: I've matched the speed of llama.cpp. I get (for example) 4.18 tok/s for llama.cpp and 4.21 tok/s with Fortran on the 1.1B model. Other than a bit of cleanup of some unneeded copying, the main things responsible for the speed were replacing matmul with a loop over dot products (since it's all vector-matrix multiplication, I find there is a huge penalty for using matmul directly) and, where possible, ensuring better memory contiguity for the qkv and ffn weights. I also found a small speedup from hard-coding the matrix dimensions, as they are always known in advance.
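
To illustrate the matmul vs. dot-product point, here is a hypothetical stand-alone micro-benchmark (not part of llama2.f90; the dimensions are made up) that contrasts the two forms:

```fortran
program matvec_bench_sketch
  ! Sketch only: compare the intrinsic matmul on a vector-matrix product
  ! against a loop of dot products over contiguous columns of w.
  use iso_fortran_env, only: real32, int64
  implicit none
  integer, parameter :: n = 2048, d = 2048, reps = 200
  real(real32) :: w(n, d), x(n), y(d)
  integer(int64) :: t0, t1, rate
  integer :: r, i

  call random_number(w); call random_number(x)

  call system_clock(t0, rate)
  do r = 1, reps
     y = matmul(x, w)                     ! intrinsic, vector times matrix
  end do
  call system_clock(t1)
  print *, 'intrinsic matmul: ', real(t1 - t0) / rate, ' s'

  call system_clock(t0)
  do r = 1, reps
     do i = 1, d
        y(i) = dot_product(w(:, i), x)    ! contiguous column per element
     end do
  end do
  call system_clock(t1)
  print *, 'dot_product loop: ', real(t1 - t0) / rate, ' s'

  print *, sum(y)                         ! keep the result live
end program matvec_bench_sketch
```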

For the comparison I didn't use BLAS (I compiled llama.cpp without BLAS as well), but in my experiments I don't see any material difference on my machine.

What I'm going to do now is clean it up and make this pared-down fast version master for the time being, as well as write a better README and a post describing it.

certik commented 10 months ago

Beautiful! Great job. This was the hardest.

Now that you have a version that runs as fast, you can start adding back features, one by one, and always benchmark to ensure the new feature doesn't slow things down. Then you can also investigate parallelism, always one feature at a time.

rbitr commented 10 months ago

Current performance with 16-bit quantization is 7.3 tok/s on one thread, vs 7.4 tok/s with llama.cpp. This uses a SIMD routine to convert from 16-bit to f32 and take the dot product (as llama.cpp does), plus the -fprefetch-loop-arrays compiler option, which gives a small performance boost.
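
Without explicit SIMD, the fused convert-and-dot idea looks roughly like the sketch below in plain Fortran, reusing the f16_to_f32 sketch from the earlier comment and leaving vectorization to the compiler; it is not the actual SIMD routine:

```fortran
! Sketch only: dot product over FP16 weights, converting on the fly and relying
! on compiler auto-vectorization (e.g. -O3 -march=native) rather than
! hand-written SIMD intrinsics.
pure function dot_f16(w16, x) result(s)
  use f16_convert_sketch, only: f16_to_f32   ! conversion sketch from above
  use iso_fortran_env, only: int16, real32
  implicit none
  integer(int16), intent(in) :: w16(:)       ! FP16 weights as raw bit patterns
  real(real32), intent(in)   :: x(:)
  real(real32) :: s
  integer :: j
  s = 0.0_real32
  do j = 1, size(w16)
     s = s + f16_to_f32(w16(j)) * x(j)
  end do
end function dot_f16
```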