Closed. magician-blue closed this 12 months ago.
This did not bring about a significant inference speed improvement. Maybe there isn't a memory wall on my CPU.
Oh, I know the reason why there isn't any speedup: n_heads == n_kv_heads for stories15M.bin and stories110M.bin.

| model | dim | n_layers | n_heads | n_kv_heads | max context length | parameters | val loss | download |
|---|---|---|---|---|---|---|---|---|
| 260K | 64 | 5 | 8 | 4 | 512 | 260K | 1.297 | stories260K |
| OG | 288 | 6 | 6 | 6 | 256 | 15M | 1.072 | stories15M.bin |
| 42M | 512 | 8 | 8 | 8 | 1024 | 42M | 0.847 | stories42M.bin |
| 110M | 768 | 12 | 12 | 12 | 1024 | 110M | 0.760 | stories110M.bin |
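The reason shows up directly in the size of the k/v cache. Below is a rough back-of-the-envelope sketch (not part of the PR; it just plugs in the hyperparameters from the table above) showing that the cache only shrinks when n_kv_heads < n_heads:

```c
#include <stdio.h>

// Rough per-model KV-cache size for a float32 cache over the full context.
// Purely illustrative; numbers come from the model table above.
static long kv_cache_bytes(int dim, int n_layers, int n_heads,
                           int n_kv_heads, int seq_len) {
    int head_size = dim / n_heads;
    // key cache + value cache: one (n_kv_heads * head_size) vector
    // per layer per context position
    return 2L * n_layers * seq_len * n_kv_heads * head_size * (long)sizeof(float);
}

int main(void) {
    // stories260K: n_kv_heads (4) < n_heads (8) -> cache is half of the MHA size
    printf("260K  GQA: %ld bytes, MHA-equivalent: %ld bytes\n",
           kv_cache_bytes(64, 5, 8, 4, 512),
           kv_cache_bytes(64, 5, 8, 8, 512));
    // stories110M: n_kv_heads == n_heads -> no reduction, hence no speedup
    printf("110M  GQA: %ld bytes, MHA-equivalent: %ld bytes\n",
           kv_cache_bytes(768, 12, 12, 12, 1024),
           kv_cache_bytes(768, 12, 12, 12, 1024));
    return 0;
}
```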
Hi @magician-blue, thanks for sending the PR. Could you please share some details on where this GQA comes from? As I understand it, instead of multi-head attention you're applying some other kind of attention? What are the benefits? Perhaps on bigger models it can show some performance boost?
The GQA I implemented here is based on run.c, starting from line 244. GQA comes from this paper.
There are several kinds of attention: MHA (multi-head attention, which is what is implemented here), MQA (multi-query attention), and GQA (grouped-query attention, used in Llama 2). GQA is a trade-off between MHA and MQA.

In MHA, every query head has its own k, v head.
In MQA, in each layer all the query heads share one single k, v head.
In GQA, in each layer the query heads are split into groups, and each group shares one k, v head.

Therefore, MQA/GQA shrink the k, v cache and the memory traffic needed to read it during inference, which is where the speedup should come from.

In the case of stories15M/42M/110M.bin, n_kv_heads = n_heads, thus GQA is exactly the same as MHA.
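To make the distinction concrete, here is a minimal sketch in C of how query heads are mapped onto k/v heads, in the spirit of the attention loop in run.c (the variable names are illustrative, not copied from either codebase):

```c
#include <stdio.h>

// Sketch: mapping query heads to key/value heads.
// n_kv_heads == n_heads -> MHA, 1 < n_kv_heads < n_heads -> GQA, n_kv_heads == 1 -> MQA.
int main(void) {
    int n_heads = 8;
    int n_kv_heads = 4;                 // try 8 (MHA) or 1 (MQA) as well
    int kv_mul = n_heads / n_kv_heads;  // query heads per k/v head

    for (int h = 0; h < n_heads; h++) {
        int kv_head = h / kv_mul;  // which k/v head this query head attends with
        printf("query head %d -> kv head %d\n", h, kv_head);
    }
    return 0;
}
```

In the real attention kernel, kv_head selects which slice of the key/value cache a query head dots against; with n_kv_heads == n_heads, kv_mul is 1 and nothing is shared, so the kernel degenerates to plain MHA.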
There is something wrong in the code. Don't merge!
@magician-blue would you mind sharing some thoughts on how you found out something was wrong?
I have fixed the bug and tested it on stories260K, which uses GQA. It works fine.

I found there's a small difference between our encoder/decoder and that of llama2.c: its `str_lookup` handles raw-byte tokens such as `<0x01>` and `<0x0A>`, while our `print_str` can't. Now I see the tokenizer can output raw bytes.
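As an aside (not from the PR itself), those raw-byte tokens are stored in the vocab as the literal text `<0x01>`, `<0x0A>`, etc., so a decoder has to convert that text back into a single byte before printing. A minimal, hypothetical sketch of that conversion in C (the function name and error handling are mine, not the actual llama2.c/llama2.mojo code):

```c
#include <stdio.h>

// Sketch: turn a vocab piece like "<0x0A>" back into the raw byte it encodes.
// Returns 1 and writes the byte on success, 0 if the piece is a normal token.
static int decode_raw_byte(const char *piece, unsigned char *out) {
    unsigned int byte_val;
    if (sscanf(piece, "<0x%2X>", &byte_val) == 1) {
        *out = (unsigned char)byte_val;
        return 1;
    }
    return 0;
}

int main(void) {
    unsigned char b;
    if (decode_raw_byte("<0x0A>", &b)) {
        printf("raw byte value: %u (a newline)\n", b);  // prints 10
    }
    return 0;
}
```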
I test with:

```
mojo llama2.mojo stories260K.bin -tk t260.bin -s 100 -n 256 -t 0 -i "Llama is an animal"
./run stories260K.bin -z t260.bin -i "Llama is an animal" -n 256 -t 0
```
Model name: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz
CPU(s): 12
| Implementation (tok/s, n=256, t=0) | 260K | 15M | 110M |
|---|---|---|---|
| llama2.c | 2500-2800 | 51-52 | 7-8 |
| llama2.c (runfast) | 5500-6000 | 118-119 | 18-20 |
| llama2.c (omp) | 5000-6000 | 204-210 | 32-35 |
| llama2.mojo | 2700-3000 | 190-205 | 31-32 |
> @magician-blue would you mind sharing some thoughts on how you found out something was wrong?
@tairov My output is different from that of llama2.c even though I set the temperature to 0. Then I checked the intermediate variables of llama2.mojo against llama2.c line by line.
I'm not sure why the inference speed of my implementation on stories260K is slower than llama2.c (runfast) in the benchmark above.
@magician-blue could you please add the 260K tokenizer as well? BTW, do you know why llama2.c works fine on stories260K without the tokenizer?
> @magician-blue could you please add the 260K tokenizer as well? BTW, do you know why llama2.c works fine on stories260K without the tokenizer?
Really? I get a segmentation fault if I don't add the tokenizer.
@tairov I have added t260.bin. You can find it on Hugging Face: 260K.
Shouldn't we add it to this repo?
@magician-blue in your screenshot above, ./run is executing on stories260K without the custom tokenizer option.
> @magician-blue in your screenshot above, ./run is executing on stories260K without the custom tokenizer option.
At that time, I had set t260.bin as the default tokenizer to make testing easier.
I think we should add this tokenizer to the repo.
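For reference, this kind of behaviour usually comes from a hard-coded default tokenizer path that the command-line flag overrides. The snippet below is a purely hypothetical illustration of that pattern (the -z/-tk flags mirror the commands quoted in this thread; the real run.c and llama2.mojo argument handling may differ):

```c
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
    // Hypothetical default: used when no tokenizer flag is passed.
    // It only works if this file matches the model's vocabulary,
    // e.g. if t260.bin has been copied over the default path.
    const char *tokenizer_path = "tokenizer.bin";

    // argv[1] would be the model checkpoint; flags come in pairs after it.
    for (int i = 2; i + 1 < argc; i += 2) {
        if (strcmp(argv[i], "-z") == 0 || strcmp(argv[i], "-tk") == 0) {
            tokenizer_path = argv[i + 1];  // explicit override, e.g. t260.bin
        }
    }
    printf("using tokenizer: %s\n", tokenizer_path);
    return 0;
}
```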
Overall I'm happy to merge. Let's just have this tokenizer in the repo, since there could be questions about where to get it. Thanks.
For some reason it's still failing for me:
```
mojo llama2.mojo stories260K.bin -tk t260.bin -n 256 -t 0.0
num hardware threads: 6
SIMD vector width: 16
checkpoint size: 67056
[320417:320417:20230918,234008.132002:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq: No such file or directory (2)
[320417:320417:20230918,234008.132146:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: No such file or directory (2)
Please submit a bug report to https://github.com/modularml/mojo/issues and include the crash backtrace along with all the relevant source codes.
Stack dump:
0. Program arguments: mojo llama2.mojo stories260K.bin -tk t260.bin -n 256 -t 0.0
#0 0x000056013c1d3957 (/home/user/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bb957)
#1 0x000056013c1d152e (/home/user/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5b952e)
#2 0x000056013c1d402f (/home/user/.modular/pkg/packages.modular.com_mojo/bin/mojo+0x5bc02f)
#3 0x00007fb05b56f420 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x14420)
#4 0x00007fafe4003840
[1] 320415 segmentation fault (core dumped) mojo llama2.mojo stories260K.bin -tk t260.bin -n 256 -t 0.0
```
Tried these options as well:

```
mojo llama2.mojo stories260K.bin -tk t260.bin -s 100 -n 256 -t 0 -i "Llama is an animal"
```
I see a difference in checkpoint sizes between your example and mine. UPD: my bad, it seems I put the wrong file in as the model.
Merged. Thank you!
change from MHA to GQA