turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.

Speculative decoding? #218

Open bryanhpchiang opened 11 months ago

bryanhpchiang commented 11 months ago

https://github.com/dust-tt/llama-ssp

Any plans to implement speculative decoding? Would probably improve latency by at least 2x and seems not too difficult to implement.
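
For reference, the basic scheme: a small draft model proposes a few tokens, and the large model verifies them all in a single forward pass, keeping the longest agreeing prefix. A rough greedy sketch (hypothetical `draft_model`/`target_model` callables returning HF-style `.logits`; this is not exllama's API, and a real implementation would reuse KV caches and do proper rejection sampling for non-greedy decoding):

```python
import torch

@torch.inference_mode()
def speculative_decode(target_model, draft_model, input_ids, k=4, max_new_tokens=128):
    ids = input_ids
    while ids.shape[1] - input_ids.shape[1] < max_new_tokens:
        # 1) Draft k tokens autoregressively with the small model.
        draft = ids
        for _ in range(k):
            logits = draft_model(draft).logits                      # (1, seq, vocab)
            draft = torch.cat([draft, logits[:, -1:].argmax(-1)], dim=1)

        # 2) Score prompt + draft with the big model in a single pass.
        tgt_logits = target_model(draft).logits
        tgt_pred = tgt_logits[:, ids.shape[1] - 1 : -1].argmax(-1)  # big model's choice at each drafted position
        proposed = draft[:, ids.shape[1]:]

        # 3) Keep the longest prefix where both models agree.
        n_accept = int((tgt_pred == proposed).long().cumprod(dim=1).sum())
        if n_accept == k:
            bonus = tgt_logits[:, -1:].argmax(-1)                   # free extra token after a full accept
            ids = torch.cat([ids, proposed, bonus], dim=1)
        else:
            ids = torch.cat([ids, proposed[:, :n_accept],
                             tgt_pred[:, n_accept:n_accept + 1]], dim=1)
    return ids
```

Every emitted token is exactly what greedy decoding with the big model would have produced, so the win is purely in how few big-model forward passes are needed.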

SinanAkkoyun commented 11 months ago

If I may answer for turboderp: speculative decoding is planned for exllama v2 at some point. I'm also interested and would really like to implement it myself if turboderp has lots of other stuff to do :)

reference: https://github.com/turboderp/exllama/issues/149#issuecomment-1652408059

bryanhpchiang commented 11 months ago

Thanks for linking! I'm excited.

The main concern I have with speculative decoding is that the latency improvement is bounded by the size of the draft model. Since exllama only seems to support Llama-style architectures, I wonder if there are any ~1B Llama models out there that could be used.

SinanAkkoyun commented 11 months ago

@bryanhpchiang That's what the 3B is for :) In the end, if a 1B model with much worse performance (meaning quality, not speed) makes the big model reject the speculation all the time, average speed will be much worse.
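
To put rough numbers on that tradeoff (back-of-the-envelope, using the expected-accepted-tokens formula from the speculative decoding papers; `cost_ratio`, the draft model's per-token cost relative to the big model, is an assumed figure):

```python
def expected_speedup(alpha, k, cost_ratio):
    """Rough speculative-decoding speedup estimate.

    alpha:      probability the big model accepts a drafted token
    k:          tokens drafted per verification pass
    cost_ratio: draft-model cost per token relative to the big model
    """
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)  # tokens emitted per big-model pass
    cost_per_round = 1 + k * cost_ratio                     # one big pass + k draft passes
    return expected_tokens / cost_per_round

for alpha in (0.9, 0.7, 0.4):
    print(f"acceptance {alpha:.0%}: ~{expected_speedup(alpha, k=4, cost_ratio=0.1):.1f}x")
# acceptance 90%: ~2.9x, 70%: ~2.0x, 40%: ~1.2x --
# a draft model that gets rejected often eats most of the gain.
```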

bryanhpchiang commented 11 months ago

Makes sense! I think that'd be worth benchmarking: specifically, if you really care about latency, I think it's possible to finetune a 1B on a specific usecase to improve the error rate.

SinanAkkoyun commented 11 months ago

I totally agree; I'm also looking for even smaller models for some custom stuff I'm working on. Do you find the roughly 220 tokens/second of the 3B model limiting? In the end, the big model seems to make the most difference. Exllama v2 might also bring a significant speed increase, including for the 3B model.

(My only concern with 1B is that there is no pretrained Llama model at that size, iirc.)

bryanhpchiang commented 11 months ago

Great to hear that the v2 is an improvement. For my usecase, the main metric I care about is time to first token. What does that look like for 3B?

For the last point, I think that’s why non-Llama models like OPT via other libraries like CT2 might make sense.


turboderp commented 11 months ago

For my usecase, the main metric I care about is time to first token. What does that look like for 3B?

Well, on the 4090 I'm getting about 16,500 tokens/second for 3B. So that's about 120 ms for a 2000-token prompt.

Of course, in speculative sampling you'd also have to do inference on the prompt with the full-scale model.
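
A back-of-the-envelope illustration of that last point (the big-model prompt speed here is a placeholder; substitute the actual number from the README benchmarks):

```python
prompt_tokens = 2000
draft_prompt_speed = 16_500   # tok/s for the 3B on a 4090, per the figure above
target_prompt_speed = 5_000   # placeholder -- the full-scale model's prompt speed

ttft = prompt_tokens / draft_prompt_speed + prompt_tokens / target_prompt_speed
print(f"{ttft * 1000:.0f} ms")   # ~121 ms for the 3B pass alone, ~521 ms total with this placeholder
```

In other words, the full-scale model's prompt pass, not the draft model, tends to dominate time to first token.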

bryanhpchiang commented 11 months ago

Just to confirm: 16.5K tok/s for processing the prompt, not sampling?

My use case ideally requires < 50 ms until a usable chunk is produced, which is why smaller models are appealing. I'll run some benchmarks with some other frameworks and let you know how that goes.


SinanAkkoyun commented 11 months ago

https://github.com/turboderp/exllama/blob/master/README.md

There are benchmarks for all the models; you can see new-token generation and prompt processing speeds.

SolsticeProjekt commented 11 months ago

Great to hear that the v2 is an improvement. For my usecase, the main metric I care about is time to first token. What does that look like for 3B?

I haven't found a good 3B model for ExLlama yet. There's open_llama_3b_v2-8k-GPTQ, but it's not actually good, at least not compared to orca-mini. 3B GGML models are rare, and 3B GPTQ models for ExLlama seem to be even rarer. I've successfully used "orca-mini-3b.ggmlv3.q4_1.bin" with llamacpp, in case it helps: 70+ tokens per second inference on my notebook's 3060 with 6 GB (fully offloaded to the GPU), CPU set to one thread.

I can look up the prompt's t/sec if you want to, but reaction time is fast.

turboderp commented 11 months ago

Here's one. It's the one the results in the readme are based on. Seems to work alright.

SolsticeProjekt commented 11 months ago

Here's one. It's the one the results in the readme are based on. Seems to work alright.

Thanks. This is the result of test_benchmark_inference using "-p -ppl":

notebook, 5900HS, 3060 6gigs:

First Pass:
Time, Inference: 0.68 seconds
Speed: 2806.31 tokens/second
-- Generating 128 tokens, 1920 token prompt...
Speed: 49.32 tokens/second
-- Generating 128 tokens, 4 token prompt...
Speed: 72.90 tokens/second
VRAM, Inference: [cuda:0] 522.08 MB
VRAM, Total: [cuda:0] 3,128.43 MB
-- Loading dataset...
-- Testing 100 chunks..........
** Perplexity: 7.8114
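
(For reference, that output comes from the repo's benchmark script; the invocation was presumably something along the lines of `python test_benchmark_inference.py -d <model_dir> -p -ppl`.)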

I don't know what all the passes are for, but 72.9 t/sec is around what I get with llamacpp using orca-mini 3B. This one performs a lot better in terms of perplexity at 7.81, compared to open_llama_3b_v2-8k-GPTQ at 8.2. Sadly there's no orca-mini 3B GPTQ, except for one called "badtest" on HF, which I won't try for obvious reasons.

Thanks!

(Edit: These openllama models pale in comparison to orcamini, or my prompts are all wrong.)

SinanAkkoyun commented 11 months ago

@SolsticeProjekt

https://huggingface.co/SinanAkkoyun/orca_mini_3b_gptq_badtest :)

This is for actual chatting, not a base model. I quantized it myself; that's why it's called badtest, although it performs wonderfully, and in some niche tasks, including following system prompts, it has even impressed me more than the 7B chat.

SolsticeProjekt commented 11 months ago

@SolsticeProjekt

https://huggingface.co/SinanAkkoyun/orca_mini_3b_gptq_badtest :)

This is for actual chatting, not a base model. I quantized it myself; that's why it's called badtest, although it performs wonderfully, and in some niche tasks, including following system prompts, it has even impressed me more than the 7B chat.

Thanks, I'll give it a go. I'm trying to figure out how to quantize models myself, but this is going really off-topic now ... so thank you, I'll see what it can do. :D

SinanAkkoyun commented 11 months ago

I'm trying to figure out how to quantize models myself

Basically, install AutoGPTQ and look at my model README; you can quantize models with the other dataset too if you want, which might be easier.
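
Roughly, the AutoGPTQ flow looks like this (a minimal sketch based on AutoGPTQ's documented usage, not the exact recipe from the model README; the base model name, output directory, and calibration text are placeholders, and a real run would use a proper calibration dataset):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_model = "psmathur/orca_mini_3b"   # placeholder: the fp16 model to quantize
out_dir = "orca_mini_3b_gptq"

tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast=True)

# A single calibration sample for illustration; a real run would use a few
# hundred chunks of e.g. wikitext or c4.
examples = [tokenizer("Explain speculative decoding in one paragraph.", return_tensors="pt")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(base_model, quantize_config)
model.quantize(examples)        # runs GPTQ layer by layer on the calibration data
model.save_quantized(out_dir)   # writes GPTQ weights that ExLlama should be able to load
tokenizer.save_pretrained(out_dir)
```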

SolsticeProjekt commented 11 months ago

I'm trying to figure out how to quantize models myself

Basically, install AutoGPTQ and look at my model README; you can quantize models with the other dataset too if you want, which might be easier.

Tried that already. It ended up not working, with no output or error message. It looked like it failed to load checkpoints that apparently weren't there, but it should have worked anyway, because someone else used the exact same data.

Yours worked fine, except that it makes the same mistakes as all the others I've tested with exllama. I've learned to use the GGML Orca-mini 3B I have as "the bar", because its results were really good and precise. I'm beginning to think the issue lies with exllama and has nothing to do with the models, but I can't cross-compare models in llamacpp, so... there's that, I guess.

Anyhow. vOv

SinanAkkoyun commented 11 months ago

@SolsticeProjekt Very interesting. Could you tell me in detail what exact issue that is?