tairov / llama2.mojo

Inference Llama 2 in one file of pure 🔥
https://www.modular.com/blog/community-spotlight-how-i-built-llama2-by-aydyn-tairov
MIT License
2.09k stars 140 forks

Update llama2.mojo from MHA to GQA #24

Closed magician-blue closed 12 months ago

magician-blue commented 12 months ago

change from MHA to GQA

magician-blue commented 12 months ago

This did not bring about a significant inference speed improvement. Maybe there isn't a memory wall on my CPU.

magician-blue commented 12 months ago
Oh, I know the reason why there isn't any speedup: n_heads == n_kv_heads for stories15M.bin and stories110M.bin.

| model | dim | n_layers | n_heads | n_kv_heads | max context length | parameters | val loss | download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 260K | 64 | 5 | 8 | 4 | 512 | 260K | 1.297 | stories260K |
| OG | 288 | 6 | 6 | 6 | 256 | 15M | 1.072 | stories15M.bin |
| 42M | 512 | 8 | 8 | 8 | 1024 | 42M | 0.847 | stories42M.bin |
| 110M | 768 | 12 | 12 | 12 | 1024 | 110M | 0.760 | stories110M.bin |
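
To make that concrete, here is a small Python sketch (the model names and head counts are taken from the table above; the helper itself is illustrative and not part of llama2.mojo) that computes how many query heads share each k/v head:

```python
# Illustrative sketch (not part of llama2.mojo): how many query heads
# share one k/v head (kv_mul) for the checkpoints in the table above.
models = {
    "stories260K": {"n_heads": 8, "n_kv_heads": 4},
    "stories15M":  {"n_heads": 6, "n_kv_heads": 6},
    "stories42M":  {"n_heads": 8, "n_kv_heads": 8},
    "stories110M": {"n_heads": 12, "n_kv_heads": 12},
}

for name, cfg in models.items():
    kv_mul = cfg["n_heads"] // cfg["n_kv_heads"]
    kind = "GQA (grouped)" if kv_mul > 1 else "same as MHA"
    print(f"{name}: kv_mul = {kv_mul} -> {kind}")
```

Only stories260K ends up with kv_mul > 1, so it is the only checkpoint where the GQA path actually behaves differently from MHA.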
tairov commented 12 months ago

Hi @magician-blue, thanks for sending the PR. Could you please share some details on where this GQA came from? As I understand it, instead of multi-head attention you're applying some other kind of attention? What are the benefits? Perhaps on bigger models it can show some performance boosts?

magician-blue commented 12 months ago

The GQA I implemented here is based on run.c, starting from line 244. GQA comes from this paper.

There are several kinds of attention: MHA (multi-head attention, which is implemented here), MQA (multi-query attention), and GQA (grouped-query attention, used by Llama 2). GQA is a trade-off between MHA and MQA. [image: diagram comparing MHA, GQA, and MQA]

In MHA, each query head has its own k, v head.

In MQA, in each layer all the query heads share the same single k, v head.

In GQA, in each layer the query heads are split into groups, and each group shares one k, v head.

Therefore, GQA reduces to MHA when n_kv_heads == n_heads and to MQA when n_kv_heads == 1.

In the case of stories15M/42M/110M.bin, n_kv_heads == n_heads, so GQA is exactly the same as MHA.
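
As a reference for what the kernel does, here is a minimal NumPy sketch of grouped-query attention for a single query position (shapes and names are illustrative, not the actual llama2.mojo or run.c code): each query head h attends using kv head h // kv_mul, where kv_mul = n_heads // n_kv_heads.

```python
import numpy as np

def gqa_attention(q, k_cache, v_cache):
    """Grouped-query attention for one token's query against the KV cache.

    q:        (n_heads, head_size)             query vectors for the current token
    k_cache:  (seq_len, n_kv_heads, head_size) cached key vectors
    v_cache:  (seq_len, n_kv_heads, head_size) cached value vectors
    returns:  (n_heads, head_size)
    """
    n_heads, head_size = q.shape
    n_kv_heads = k_cache.shape[1]
    kv_mul = n_heads // n_kv_heads   # query heads per kv head

    out = np.zeros_like(q)
    for h in range(n_heads):
        kv_h = h // kv_mul           # MHA: kv_h == h; MQA: kv_h == 0
        scores = k_cache[:, kv_h, :] @ q[h] / np.sqrt(head_size)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()     # softmax over past positions
        out[h] = weights @ v_cache[:, kv_h, :]
    return out
```

With n_kv_heads == n_heads every head gets its own kv head (plain MHA), and with n_kv_heads == 1 all heads read the same one (MQA), matching the description above.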

magician-blue commented 12 months ago

There is something wrong with the code. Don't merge!

tairov commented 12 months ago

@magician-blue would you mind sharing some thoughts on how you found out that something was wrong?

magician-blue commented 12 months ago

I have fixed the bug and tested it on stories260K, which uses GQA. It works fine. [screenshot of the test output]

magician-blue commented 12 months ago

I found there is a small difference between our encoder/decoder and that of llama2.c.

tic-top commented 12 months ago

Now I see the tokenizer can output raw bytes. I tested with `mojo llama2.mojo stories260K.bin -tk t260.bin -s 100 -n 256 -t 0 -i "Llama is an animal"` and `./run stories260K.bin -z t260.bin -i "Llama is an animal" -n 256 -t 0`.

Model name: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz
CPU(s): 12

| Model (tok/s, n=256, t=0) | 260K | 15M | 110M |
| --- | --- | --- | --- |
| llama2.c | 2500-2800 | 51-52 | 7-8 |
| llama2.c (runfast) | 5500-6000 | 118-119 | 18-20 |
| llama2.c (omp) | 5000-6000 | 204-210 | 32-35 |
| llama2.mojo | 2700-3000 | 190-205 | 31-32 |
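
On the raw-bytes point: llama2.c's tokenizer vocabulary includes byte-fallback tokens spelled like <0x0A>, which the decoder turns back into single raw bytes before printing. A rough Python sketch of that decoding rule (illustrative only, not the actual run.c or llama2.mojo code):

```python
import re

_BYTE_TOKEN = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")

def decode_piece(piece: str) -> bytes:
    """Decode one vocabulary piece: byte-fallback tokens like '<0x0A>'
    become a single raw byte, everything else is emitted as UTF-8 text."""
    m = _BYTE_TOKEN.match(piece)
    if m:
        return bytes([int(m.group(1), 16)])
    return piece.encode("utf-8")

# e.g. decode_piece("<0x0A>") == b"\n", decode_piece("Llama") == b"Llama"
```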
magician-blue commented 12 months ago

> @magician-blue would you mind sharing some thoughts on how you found out that something was wrong?

@tairov My output is different from that of llama2.c even though I set the temperature to 0.

Then I checked the intermediate variables of llama2.mojo against llama2.c line by line.
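
One way to do that kind of check (a sketch only, with made-up tensor names; not the actual debugging code used here) is to dump the same intermediate tensor from both implementations at each step and look at the maximum absolute difference:

```python
import numpy as np

def compare_activations(mojo_act: np.ndarray, c_act: np.ndarray,
                        name: str, atol: float = 1e-5) -> None:
    """Report the largest absolute difference between the two
    implementations' activations for the same layer/step."""
    diff = np.abs(mojo_act - c_act).max()
    status = "OK" if diff <= atol else "MISMATCH"
    print(f"{name}: max |diff| = {diff:.3e} [{status}]")

# e.g. compare_activations(xb_mojo, xb_c, "layer0.attention_out")
```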

magician-blue commented 12 months ago

> Now I see the tokenizer can output raw bytes. I tested with `mojo llama2.mojo stories260K.bin -tk t260.bin -s 100 -n 256 -t 0 -i "Llama is an animal"` and `./run stories260K.bin -z t260.bin -i "Llama is an animal" -n 256 -t 0`.
>
> Model name: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz
> CPU(s): 12
>
> | Model (tok/s, n=256, t=0) | 260K | 15M | 110M |
> | --- | --- | --- | --- |
> | llama2.c | 2500-2800 | 51-52 | 7-8 |
> | llama2.c (runfast) | 5500-6000 | 118-119 | 18-20 |
> | llama2.c (omp) | 5000-6000 | 204-210 | 32-35 |
> | llama2.mojo | 2700-3000 | 190-205 | 31-32 |

I'm not sure why the inference speed of my implementation on stories260K is slower than llama2.c (runfast).

tairov commented 12 months ago

@magician-blue could you please add the 260K tokenizer as well?

BTW, do you know why llama2.c works fine on stories260K without the tokenizer?

magician-blue commented 12 months ago

> @magician-blue could you please add the 260K tokenizer as well?
>
> BTW, do you know why llama2.c works fine on stories260K without the tokenizer?

Really? There is a segmentation fault if I don't add the tokenizer.

magician-blue commented 12 months ago

@tairov I have added t260.bin. You can find it on Hugging Face: 260k

tairov commented 12 months ago

Shouldn't we add it to this repo?

tairov commented 12 months ago

@magician-blue in your screenshot above, ./run is executing on stories260K without the custom tokenizer option.

magician-blue commented 12 months ago

> @magician-blue in your screenshot above, ./run is executing on stories260K without the custom tokenizer option.

At that time, I had set t260.bin as the default tokenizer to make testing easier.

magician-blue commented 12 months ago

I think we should add this tokenizer to the repo.

tairov commented 12 months ago

Overall I'm happy to merge. Let's just have this tokenizer in the repo, since there could be questions about where to get it. Thanks

tairov commented 12 months ago

For some reason it's still failing for me:

mojo llama2.mojo stories260K.bin -tk t260.bin -n 256 -t 0.0
num hardware threads:  6
SIMD vector width:  16
checkpoint size:  67056
[320417:320417:20230918,234008.132002:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq: No such file or directory (2)
[320417:320417:20230918,234008.132146:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: No such file or directory (2)
Please submit a bug report to https://github.com/modularml/mojo/issues and include the crash backtrace along with all the relevant source codes.
Stack dump:
0.  Program arguments: mojo llama2.mojo stories260K.bin -tk t260.bin -n 256 -t 0.0
#0 0x000056013c1d3957 (/home/user/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bb957)
#1 0x000056013c1d152e (/home/user/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5b952e)
#2 0x000056013c1d402f (/home/user/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bc02f)
#3 0x00007fb05b56f420 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x14420)
#4 0x00007fafe4003840
[1]    320415 segmentation fault (core dumped)  mojo llama2.mojo stories260K.bin -tk t260.bin -n 256 -t 0.0
tairov commented 12 months ago

Tried these options as well:

`mojo llama2.mojo stories260K.bin -tk t260.bin -s 100 -n 256 -t 0 -i "Llama is an animal"`
tairov commented 12 months ago

I see a difference in checkpoint sizes between your example and mine. UPD: my bad, it seems I put the wrong file as the model.

tairov commented 12 months ago

Merged. Thank you!