Closed. magician-blue closed this 12 months ago.
This did not bring about a significant inference speed improvement. Maybe there isn't a memory wall on my CPU.
Oh, I know the reason why there isn't any speedup: n_heads == n_kv_heads for stories15M.bin and stories110M.bin.

| model | dim | n_layers | n_heads | n_kv_heads | max context length | parameters | val loss | download |
|---|---|---|---|---|---|---|---|---|
| 260K | 64 | 5 | 8 | 4 | 512 | 260K | 1.297 | stories260K |
| OG | 288 | 6 | 6 | 6 | 256 | 15M | 1.072 | stories15M.bin |
| 42M | 512 | 8 | 8 | 8 | 1024 | 42M | 0.847 | stories42M.bin |
| 110M | 768 | 12 | 12 | 12 | 1024 | 110M | 0.760 | stories110M.bin |
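The reason shows up directly in the size of the k/v cache. Below is a rough back-of-the-envelope sketch (not part of the PR; it just plugs in the hyperparameters from the table above) showing that the cache only shrinks when n_kv_heads < n_heads:

```c
#include <stdio.h>

// Rough per-model KV-cache size for a float32 cache over the full context.
// Purely illustrative; numbers come from the model table above.
static long kv_cache_bytes(int dim, int n_layers, int n_heads,
                           int n_kv_heads, int seq_len) {
    int head_size = dim / n_heads;
    // key cache + value cache: one (n_kv_heads * head_size) vector
    // per layer per context position
    return 2L * n_layers * seq_len * n_kv_heads * head_size * (long)sizeof(float);
}

int main(void) {
    // stories260K: n_kv_heads (4) < n_heads (8) -> cache is half of the MHA size
    printf("260K  GQA: %ld bytes, MHA-equivalent: %ld bytes\n",
           kv_cache_bytes(64, 5, 8, 4, 512),
           kv_cache_bytes(64, 5, 8, 8, 512));
    // stories110M: n_kv_heads == n_heads -> no reduction, hence no speedup
    printf("110M  GQA: %ld bytes, MHA-equivalent: %ld bytes\n",
           kv_cache_bytes(768, 12, 12, 12, 1024),
           kv_cache_bytes(768, 12, 12, 12, 1024));
    return 0;
}
```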
Hi @magician-blue, thanks for sending the PR. Could you please share some details on where this GQA comes from? As I understand it, instead of multi-head attention you're applying some other kind of attention? What are the benefits? Perhaps on bigger models it can show some performance boost?
The GQA I implemented here is based on run.c, starting from line 244. GQA comes from this paper.
There are several kinds of attention: MHA (multi-head attention, which is what is implemented here), MQA (multi-query attention), and GQA (grouped-query attention, used in Llama 2). GQA is a trade-off between MHA and MQA.

In MHA, every query head has its own k, v head.
In MQA, in each layer all the query heads share one single k, v head.
In GQA, in each layer the query heads are split into groups, and each group shares one k, v head.

Therefore, MQA/GQA shrink the k, v cache and the memory traffic needed to read it during inference, which is where the speedup should come from.

In the case of stories15M/42M/110M.bin, n_kv_heads = n_heads, thus GQA is exactly the same as MHA.
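To make the distinction concrete, here is a minimal sketch in C of how query heads are mapped onto k/v heads, in the spirit of the attention loop in run.c (the variable names are illustrative, not copied from either codebase):

```c
#include <stdio.h>

// Sketch: mapping query heads to key/value heads.
// n_kv_heads == n_heads -> MHA, 1 < n_kv_heads < n_heads -> GQA, n_kv_heads == 1 -> MQA.
int main(void) {
    int n_heads = 8;
    int n_kv_heads = 4;                 // try 8 (MHA) or 1 (MQA) as well
    int kv_mul = n_heads / n_kv_heads;  // query heads per k/v head

    for (int h = 0; h < n_heads; h++) {
        int kv_head = h / kv_mul;  // which k/v head this query head attends with
        printf("query head %d -> kv head %d\n", h, kv_head);
    }
    return 0;
}
```

In the real attention kernel, kv_head selects which slice of the key/value cache a query head dots against; with n_kv_heads == n_heads, kv_mul is 1 and nothing is shared, so the kernel degenerates to plain MHA.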
There is something wrong in the code. Don't merge!
@magician-blue would you mind sharing some thoughts on how you found out something was wrong?
I have fixed the bug and tested it on stories260K, which uses GQA. It works fine.

I found there's a small difference between our encoder/decoder and that of llama2.c: its `str_lookup` handles raw-byte tokens such as `<0x01>` and `<0x0A>`, while our `print_str` can't. Now I see the tokenizer can output raw bytes.
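As an aside (not from the PR itself), those raw-byte tokens are stored in the vocab as the literal text `<0x01>`, `<0x0A>`, etc., so a decoder has to convert that text back into a single byte before printing. A minimal, hypothetical sketch of that conversion in C (the function name and error handling are mine, not the actual llama2.c/llama2.mojo code):

```c
#include <stdio.h>

// Sketch: turn a vocab piece like "<0x0A>" back into the raw byte it encodes.
// Returns 1 and writes the byte on success, 0 if the piece is a normal token.
static int decode_raw_byte(const char *piece, unsigned char *out) {
    unsigned int byte_val;
    if (sscanf(piece, "<0x%2X>", &byte_val) == 1) {
        *out = (unsigned char)byte_val;
        return 1;
    }
    return 0;
}

int main(void) {
    unsigned char b;
    if (decode_raw_byte("<0x0A>", &b)) {
        printf("raw byte value: %u (a newline)\n", b);  // prints 10
    }
    return 0;
}
```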
I test with:

```
mojo llama2.mojo stories260K.bin -tk t260.bin -s 100 -n 256 -t 0 -i "Llama is an animal"
./run stories260K.bin -z t260.bin -i "Llama is an animal" -n 256 -t 0
```
Model name: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz
CPU(s): 12
| Implementation (tok/s, n=256, t=0) | 260K | 15M | 110M |
|---|---|---|---|
| llama2.c | 2500-2800 | 51-52 | 7-8 |
| llama2.c (runfast) | 5500-6000 | 118-119 | 18-20 |
| llama2.c (omp) | 5000-6000 | 204-210 | 32-35 |
| llama2.mojo | 2700-3000 | 190-205 | 31-32 |
> @magician-blue would you mind sharing some thoughts on how you found out something was wrong?
@tairov My output is different from that of llama2.c even though I set the temperature to 0. Then I checked the intermediate variables of llama2.mojo against llama2.c line by line.
I'm not sure why the inference speed of my implementation on stories260K is slower than llama2.c (runfast) in the benchmark above.
@magician-blue could you please add the 260K tokenizer as well? BTW, do you know why llama2.c works fine on stories260K without the tokenizer?
> @magician-blue could you please add the 260K tokenizer as well? BTW, do you know why llama2.c works fine on stories260K without the tokenizer?
Really? I get a segmentation fault if I don't add the tokenizer.
@tairov I have added t260.bin. You can find it on Hugging Face: 260K.
Shouldn't we add it to this repo?
@magician-blue in your screenshot above, ./run is executing on stories260K without the custom tokenizer option.
> @magician-blue in your screenshot above, ./run is executing on stories260K without the custom tokenizer option.
At that time, I had set t260.bin as the default tokenizer to make testing easier.
I think we should add this tokenizer to the repo.
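For reference, this kind of behaviour usually comes from a hard-coded default tokenizer path that the command-line flag overrides. The snippet below is a purely hypothetical illustration of that pattern (the -z/-tk flags mirror the commands quoted in this thread; the real run.c and llama2.mojo argument handling may differ):

```c
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
    // Hypothetical default: used when no tokenizer flag is passed.
    // It only works if this file matches the model's vocabulary,
    // e.g. if t260.bin has been copied over the default path.
    const char *tokenizer_path = "tokenizer.bin";

    // argv[1] would be the model checkpoint; flags come in pairs after it.
    for (int i = 2; i + 1 < argc; i += 2) {
        if (strcmp(argv[i], "-z") == 0 || strcmp(argv[i], "-tk") == 0) {
            tokenizer_path = argv[i + 1];  // explicit override, e.g. t260.bin
        }
    }
    printf("using tokenizer: %s\n", tokenizer_path);
    return 0;
}
```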
Overall I'm happy to merge. Let's just have this tokenizer in the repo, since there could be questions about where to get it. Thanks.
For some reason it's still failing for me:
```
mojo llama2.mojo stories260K.bin -tk t260.bin -n 256 -t 0.0
num hardware threads: 6
SIMD vector width: 16
checkpoint size: 67056
[320417:320417:20230918,234008.132002:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq: No such file or directory (2)
[320417:320417:20230918,234008.132146:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: No such file or directory (2)
Please submit a bug report to https://github.com/modularml/mojo/issues and include the crash backtrace along with all the relevant source codes.
Stack dump:
0. Program arguments: mojo llama2.mojo stories260K.bin -tk t260.bin -n 256 -t 0.0
#0 0x000056013c1d3957 (/home/user/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bb957)
#1 0x000056013c1d152e (/home/user/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5b952e)
#2 0x000056013c1d402f (/home/user/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bc02f)
#3 0x00007fb05b56f420 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x14420)
#4 0x00007fafe4003840
[1] 320415 segmentation fault (core dumped) mojo llama2.mojo stories260K.bin -tk t260.bin -n 256 -t 0.0
```
Tried these options as well:

```
mojo llama2.mojo stories260K.bin -tk t260.bin -s 100 -n 256 -t 0 -i "Llama is an animal"
```
I see a difference in checkpoint sizes between your example and mine. UPD: my bad, it seems I put the wrong file in as the model.
Merged. Thank you!
change from MHA to GQA