turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Which Llama model do you use? Could you give a download link? #219

Closed sleepwalker2017 closed 11 months ago

sleepwalker2017 commented 11 months ago

I tried several Llama-13B models, but couldn't get any of them to run.

It seems this project is strict about the model format. It doesn't support the .bin file format? Weights split across multiple files also don't seem to be supported. I merged several safetensors files into one, but it fails with a key error.

Could you share a download link for the model you use?

Ideally a Llama 13B model.

Or could you tell me how to generate the file format it needs?

Thank you!

sleepwalker2017 commented 11 months ago

I got it running with the model from this repo: Wizard-Vicuna-13B-Uncensored-GPTQ.

Thank you! The performance is much better than fp16!
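In case it helps anyone else, this is roughly how I load it, adapted from example_basic.py in this repo (the model directory is just wherever the GPTQ safetensors, config.json and tokenizer.model were downloaded to):

```python
import os, glob

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

# Directory holding config.json, tokenizer.model and a single 4-bit GPTQ .safetensors file
model_directory = "/models/Wizard-Vicuna-13B-Uncensored-GPTQ/"   # placeholder path

tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")
model_path = glob.glob(os.path.join(model_directory, "*.safetensors"))[0]

config = ExLlamaConfig(model_config_path)   # read the HF config.json
config.model_path = model_path              # point at the quantized weights
model = ExLlama(config)
tokenizer = ExLlamaTokenizer(tokenizer_path)
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)

print(generator.generate_simple("Once upon a time,", max_new_tokens = 64))
```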

I suggest listing the required models, or the format they need to be in, in the README.

Another question: I see there is a tuning.h file in the code.

Does it support tuning? Or does it already use tuning for better performance?

turboderp commented 11 months ago

> I suggest listing the required models, or the format they need to be in, in the README.

The README does say it's an implementation for 4-bit GPTQ weights. There's also a list of models it's been tested with. Plenty to choose from, but any other 4-bit GPTQ model should work.

For 13B specifically, I'd recommend looking for something based on Llama-2, since it's a much better base model.
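If you'd rather produce your own weights than download them, any GPTQ tool that writes 4-bit safetensors should do. A rough sketch with the third-party AutoGPTQ library (model id, calibration text and output directory are placeholders, and a real run wants a proper calibration set, not one sentence):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

base_model = "meta-llama/Llama-2-13b-hf"     # placeholder HF model id
out_dir = "llama2-13b-4bit-128g"             # placeholder output directory

quantize_config = BaseQuantizeConfig(bits = 4, group_size = 128, desc_act = False)

tokenizer = AutoTokenizer.from_pretrained(base_model, use_fast = True)
model = AutoGPTQForCausalLM.from_pretrained(base_model, quantize_config)

# Calibration examples drive the quantization; use a few hundred real samples in
# practice, this single line is only to show the shape of the call
examples = [tokenizer("exllama loads 4-bit GPTQ weights stored as safetensors.", return_tensors = "pt")]
model.quantize(examples)

# use_safetensors=True writes a .safetensors file, which is the format exllama expects
model.save_quantized(out_dir, use_safetensors = True)
tokenizer.save_pretrained(out_dir)
```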

> Does it support tuning? Or does it already use tuning for better performance?

There are tuning options you can set, like thresholds for when the fused operations and custom kernels kick in and out, but they're not something you'd normally need to fiddle with unless it's for compatibility reasons.
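Roughly, they're attributes on ExLlamaConfig; the exact names and defaults are in model.py. For example:

```python
from model import ExLlama, ExLlamaCache, ExLlamaConfig

config = ExLlamaConfig("/models/llama-13b-4bit-128g/config.json")     # placeholder path
config.model_path = "/models/llama-13b-4bit-128g/model.safetensors"   # placeholder path

# Thresholds (in rows) that decide when the custom kernels and fused ops are used;
# the stock values are normally fine, the overrides here are only illustrative
config.matmul_recons_thd = 8   # above this, reconstruct fp16 and use cuBLAS instead of the quant matmul kernel
config.fused_mlp_thd = 2       # below this, use the fused MLP kernel
config.sdp_thd = 8             # above this, fall back to torch scaled_dot_product_attention

# Compatibility switches for GPUs that misbehave with half2 arithmetic
config.rmsnorm_no_half2 = False
config.matmul_no_half2 = False

model = ExLlama(config)        # build the model after setting the options
cache = ExLlamaCache(model)
```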

sleepwalker2017 commented 11 months ago

Thank you! I found the link in the repo README.

https://github.com/turboderp/exllama/blob/master/doc/model_compatibility.md