turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Using HF Safetensors #252

Closed cdreetz closed 9 months ago

cdreetz commented 9 months ago

Does model_dir work if the dir is just a copied HF repo with the safetensor files? Specifically one of turboderp's HF model repos.

If not, how do I convert the safetensors to a format that works with exllamav2?

turboderp commented 9 months ago

The model_dir should be a directory containing a complete model in HF format. Just dump all the model's files (config.json, *.safetensors, etc.) in one folder. You can use huggingface-cli to grab all the files from a particular branch like so:

huggingface-cli download turboderp/Llama2-7B-chat-exl2 --revision 4.0bpw --local-dir-use-symlinks False --local-dir my_model_dir

Models converted to EXL2 format still store the quantized weights in .safetensors files, but each .weight tensor is split into several smaller tensors labeled .q_weight, .q_perm, .q_scale etc. As long as the tensors are in either FP16, GPTQ or EXL2 format the loader should automatically handle all that. I guess BF16 and FP32 would work as well, although they'll be cast to FP16 as they're loaded.
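Once the files are in one folder, loading is just a matter of pointing the config at that directory. Here's a minimal sketch along the lines of the library's example scripts (the directory name, prompt, and sampler settings are placeholders, so check the repo's examples for the current API):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Point the config at the folder containing config.json and the *.safetensors files
config = ExLlamaV2Config()
config.model_dir = "my_model_dir"
config.prepare()

# Load the model; the loader handles FP16 / GPTQ / EXL2 tensors automatically
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy = True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

# Placeholder sampling settings for a quick smoke test
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

print(generator.generate_simple("Once upon a time,", settings, num_tokens = 100))
```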

cdreetz commented 9 months ago

Awesome, that worked. Any way to load the model straight onto the GPU without staging it through the CPU first? I'm using a GPU VM that has 30 GB VRAM but only 16 GB RAM, so I end up running out of room when downloading dependencies for ExLlama if I have the model downloaded as well. Using the Mixtral 3.0bpw, so it looks like...

DocShotgun commented 9 months ago

It sounds like you might be having a disk space issue and not a RAM issue if you are running out of room while downloading files.

cdreetz commented 9 months ago

@DocShotgun yeah, you're right. I just had to allocate more disk on the VM.

cdreetz commented 9 months ago

@turboderp so it looks like I got it all working. Just curious, is there a secret to the Mixtral-instruct clip you posted on X? I copied the code you had for generating and downloaded turboderp/Mixtral-8x7B-exl2 --revision 3.0bpw --local-dir-use-symlinks False --local-dir my_model_dir, expecting similar behavior, but it performs vastly differently for me. As in, just inputting Hi or Hello returns completely random responses, and a lot of the time it doesn't stop.

https://github.com/cdreetz/exllama/blob/master/main.py is the code i'm using to run it for reference

turboderp commented 9 months ago

That script uses Llama style prompt formatting as used by Mixtral-instruct. You seem to be using the base Mixtral model which doesn't know how to interpret the [INST] tags. If you download the instruct-tuned version instead you should see better results.

(You can make a chatbot out of the base model, it just takes a different prompting style and it becomes much less reliable.)
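For reference, the Llama/Mixtral-instruct convention is just to wrap each user turn in [INST] tags, which the instruct-tuned model was trained to follow. A rough sketch of the idea (format_instruct is a hypothetical helper for illustration, not part of exllamav2):

```python
def format_instruct(user_message: str) -> str:
    # Mixtral-instruct expects the user turn wrapped in [INST] ... [/INST];
    # the model then generates the assistant reply after the closing tag.
    return f"[INST] {user_message} [/INST]"

prompt = format_instruct("Hi")
# A base (non-instruct) model has never been tuned on these tags, so it tends
# to treat them as arbitrary text and ramble or continue without stopping.
```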

cdreetz commented 9 months ago

@turboderp thank you!! got it working perfectly now