turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Added draft token count as parameter to chat.py #635

Closed · SinanAkkoyun closed this 1 month ago

SinanAkkoyun commented 1 month ago

Added a -dn parameter to examples/chat.py to make it easier to change the number of draft tokens.
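For context, a minimal sketch of what wiring such a flag might look like, assuming an argparse-based script like chat.py; the flag name `--draft_num_tokens` and the generator attribute `num_speculative_tokens` are assumptions here, not the exact contents of this PR's diff:

```python
# Hedged sketch: adding a draft-token-count flag to a chat.py-style script.
# The attribute name num_speculative_tokens is an assumption about the
# streaming generator's API and may differ in the actual codebase.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-dn", "--draft_num_tokens", type=int, default=5,
                    help="number of tokens drafted per speculative decoding step")
args = parser.parse_args()

# ... later, after the streaming generator has been constructed elsewhere:
# generator.num_speculative_tokens = args.draft_num_tokens
```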

turboderp commented 1 month ago

This seems fine, but do note that the chat example uses the deprecated streaming generator, which will be removed at some point (or replaced with a wrapper). Either way, speculative decoding performance is better in the dynamic generator, so I don't think it makes much sense to fine-tune it in the old generator.
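For readers unfamiliar with the dynamic generator mentioned above, here is a hedged sketch of how a draft model and its token budget are supplied to it, loosely modeled on exllamav2's speculative inference example; the constructor arguments (`draft_model`, `draft_cache`, `num_draft_tokens`) and the placeholder model paths should be verified against the installed release:

```python
# Sketch of speculative decoding with the dynamic generator, assuming
# the constructor accepts draft_model / draft_cache / num_draft_tokens.
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

def load(model_dir):
    # Load a model and allocate its cache, splitting across available GPUs
    config = ExLlamaV2Config(model_dir)
    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy = True)
    model.load_autosplit(cache)
    return model, cache, config

model, cache, config = load("/path/to/main-model")
draft_model, draft_cache, _ = load("/path/to/draft-model")
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(
    model = model,
    cache = cache,
    draft_model = draft_model,
    draft_cache = draft_cache,
    tokenizer = tokenizer,
    num_draft_tokens = 5,  # the dynamic generator's equivalent of the -dn knob
)
print(generator.generate(prompt = "Hello,", max_new_tokens = 64))
```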