princeton-nlp / LLM-Shearing

[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
https://arxiv.org/abs/2310.06694
MIT License

KV head count on princeton-nlp/Sheared-LLaMA-1.3B-ShareGPT ? #29

Closed: SinanAkkoyun closed this issue 9 months ago

SinanAkkoyun commented 9 months ago

Hi! While testing the model with vLLM (https://github.com/vllm-project/vllm/issues/1913), I noticed that the KV head count looked strange: why is it 32 instead of 16, like the base Sheared-LLaMA?

Is it safe for me to just change that value to 16 and use the model that way?

Thanks for the great work! I've long dreamed of simply pruning LLMs for speculative decoding instead of training a separate draft model! :)
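For anyone hitting the same mismatch, a minimal sketch of what the question amounts to, assuming the field at issue is `num_key_value_heads` in the Hugging Face `config.json` (the override pattern below is standard `transformers` usage, not something confirmed in this thread):

```python
# Sketch: inspect the KV head count, then override it locally without editing config.json.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "princeton-nlp/Sheared-LLaMA-1.3B-ShareGPT"

config = AutoConfig.from_pretrained(model_id)
print(config.num_attention_heads)   # query heads
print(config.num_key_value_heads)   # KV heads: 32 in the config as reported above

# Assumption: the checkpoint weights actually use 16 KV heads, so loading with a
# corrected config should match the stored tensor shapes.
config.num_key_value_heads = 16
model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
```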

xiamengzhou commented 9 months ago

Hi! It seems to be a model config error :) I have updated the config files of the sheared-llama-sharegpt models. Sorry for the confusion!
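A quick way to confirm the updated config, assuming the corrected value is 16 (forcing a fresh download in case an older `config.json` is still cached):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "princeton-nlp/Sheared-LLaMA-1.3B-ShareGPT",
    force_download=True,  # bypass a stale cached copy of config.json
)
assert config.num_key_value_heads == 16, config.num_key_value_heads
```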

SinanAkkoyun commented 9 months ago

Thanks! 😊