Open bryanhpchiang opened 11 months ago
If I may answer for turboderp: speculative decoding is planned for exllama v2 at some point. I'm also interested and would really like to implement it if turboderp has lots of other things to do :)
reference: https://github.com/turboderp/exllama/issues/149#issuecomment-1652408059
Thanks for linking! I'm excited.
The main concern I have with speculative decoding is that the latency improvement is bounded by the size of the draft model. Since exllama only seems to support Llama-style architectures, I wonder if there are any ~1B Llama models out there that could be used.
@bryanhpchiang That's what the 3B is for :) In the end, if a 1B model with much worse performance (meaning quality, not speed) causes the big model to reject the speculation all the time, average speed will be much worse.
Makes sense! I think that'd be worth benchmarking: specifically, if you really care about latency, it should be possible to finetune a 1B on a specific use case to improve the rejection rate.
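To make the acceptance-rate tradeoff concrete, here's a quick sketch using the simplified i.i.d. model from the speculative sampling literature: if each drafted token is accepted with probability `alpha` and the draft proposes `k` tokens per step, the expected number of tokens emitted per big-model forward pass is `(1 - alpha^(k+1)) / (1 - alpha)`. The `alpha` values below are hypothetical, just to show why a poorly matched draft model can erase the speedup:

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model forward pass when the
    draft proposes k tokens, each accepted with probability alpha
    (simplified i.i.d. acceptance model)."""
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# A well-matched draft model amortizes several tokens per big-model pass;
# a draft the big model mostly rejects barely beats plain decoding.
good = expected_tokens_per_pass(0.8, 4)  # ~3.36 tokens per pass
bad = expected_tokens_per_pass(0.3, 4)   # ~1.43 tokens per pass
print(good, bad)
```

So halving the acceptance rate costs far more than half the speedup, which is why draft quality matters as much as draft speed.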
I totally agree; I am also looking for even smaller models for some custom stuff I am working on. Do you find the roughly 220 tokens/second of the 3B model limiting? In the end, the big model seems to make the most difference. ExLlama v2 might also bring a significant speed increase, including for the 3B model.
(My only concern with 1B is that there is no pretrained Llama model at that size, iirc.)
Great to hear that the v2 is an improvement. For my use case, the main metric I care about is time to first token. What does that look like for 3B?
For the last point, I think that’s why non-Llama models like OPT via other libraries like CT2 might make sense.
For my use case, the main metric I care about is time to first token. What does that look like for 3B?
Well, on the 4090 I'm getting about 16,500 tokens/second for 3B. So that's about 120 ms for a 2000-token prompt.
Of course, in speculative sampling you'd also have to do inference on the prompt with the full-scale model.
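As a back-of-the-envelope check on those figures (using the prompt length and throughput quoted above):

```python
# Time-to-first-token estimate: prompt processing time at a given
# prompt-eval throughput (figures from the discussion above).
prompt_tokens = 2000
prompt_speed = 16500  # tokens/second for the 3B on a 4090

ttft_ms = prompt_tokens / prompt_speed * 1000
print(f"{ttft_ms:.0f} ms")  # ~121 ms, i.e. "about 120 ms"
```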
Just to confirm: 16.5K tok/s for processing the prompt, not sampling?
My use case ideally requires < 50 ms until a usable chunk is produced, which is why smaller models are appealing. I'll run some benchmarks with other frameworks and let you know how that goes.
https://github.com/turboderp/exllama/blob/master/README.md
There is a benchmark for all models; you can see new-token generation and prompt-processing speeds.
I haven't found a good 3B model for ExLlama yet. There's open_llama_3b_v2-8k-GPTQ, but it's not actually good, at least not compared to orca-mini. 3B GGML models are rare, and 3B GPTQ models for ExLlama seem to be even rarer. I've successfully used "orca-mini-3b.ggmlv3.q4_1.bin" with llama.cpp, in case it helps: 70+ tokens per second inference on my notebook's 3060 with 6 GB (fully offloaded to the GPU), CPU set to one thread.
I can look up the prompt's t/sec if you want, but reaction time is fast.
Here's one. It's the one the results in the readme are based on. Seems to work alright.
Thanks. This is the result of test_benchmark_inference using "-p -ppl":
notebook, 5900HS, 3060 6 GB:
First Pass: Time, Inference: 0.68 seconds
Speed: 2806.31 tokens/second
-- Generating 128 tokens, 1920 token prompt...
Speed: 49.32 tokens/second
-- Generating 128 tokens, 4 token prompt...
Speed: 72.90 tokens/second
VRAM, Inference: [cuda:0] 522.08 MB
VRAM, Total: [cuda:0] 3,128.43 MB
-- Loading dataset...
-- Testing 100 chunks..........
** Perplexity: 7.8114
I don't know what all the passes are there for, but 72.9 t/sec is around what I get with llama.cpp using orca-mini 3B. This one performs a lot better in terms of perplexity at 7.81, compared to open_llama_3b_v2-8k-GPTQ at 8.2. Sadly there's no orca-mini 3B GPTQ, except for one called "badtest" on HF, which I won't try for obvious reasons.
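For anyone comparing those perplexity numbers: perplexity is just the exponential of the mean negative log-likelihood over the test chunks, so lower means the model assigns higher probability to the text. A tiny illustration with made-up per-token log-probabilities:

```python
import math

def perplexity(log_probs):
    """Perplexity = exp of the mean negative log-likelihood (natural log)."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Hypothetical per-token log-probabilities from two models on the same text;
# the model that assigns higher likelihood gets the lower perplexity.
better = [-2.0, -2.1, -2.0, -2.1]
worse  = [-2.1, -2.2, -2.1, -2.2]
print(perplexity(better), perplexity(worse))
```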
Thanks!
(Edit: These OpenLLaMA models pale in comparison to orca-mini, or my prompts are all wrong.)
@SolsticeProjekt
https://huggingface.co/SinanAkkoyun/orca_mini_3b_gptq_badtest :)
This is for actual chatting and not a base model. I quantized it myself; that's why it's called badtest, although it performs wonderfully, and in some niche tasks, including listening to system prompts, it even impressed me more than the 7B chat in some cases.
Thanks, I'll give it a go. I'm trying to figure out how to quantize models myself, but this is going really off-topic now ... so thank you, I'll see what it can do. :D
I'm trying to figure out how to quantize models myself
Basically, install AutoGPTQ and look at my model README. You could also quantize with the other dataset if you wanted to; it might be easier.
Tried that already. It ended up not working, with no output or error message. It looked like it failed loading checkpoints that apparently weren't there, but it should have worked anyway, because someone else used the exact same data.
Yours worked fine, except that it makes the same mistakes as all the others I've tested with ExLlama. I've learned to use the orca-mini 3B I have as GGML as "the bar", because its results were really good and precise. I'm beginning to think the issue comes from ExLlama and has nothing to do with the models, but I can't cross-compare models in llama.cpp, so... there's that, I guess.
Anyhow. vOv
@SolsticeProjekt Very interesting, please tell me more about what exact issue that is, in detail?
https://github.com/dust-tt/llama-ssp
Any plans to implement speculative decoding? Would probably improve latency by at least 2x and seems not too difficult to implement.