cdreetz closed this issue 9 months ago
The model_dir should be a directory containing a complete model in HF format. Just dump all the model's files (config.json, *.safetensors, etc.) into one folder. You can use huggingface-cli to grab all the files from a particular branch like so:
huggingface-cli download turboderp/Llama2-7B-chat-exl2 --revision 4.0bpw --local-dir-use-symlinks False --local-dir my_model_dir
Models converted to EXL2 format still store the quantized weights in .safetensors files, but each .weight tensor is split into several smaller tensors labeled .q_weight, .q_perm, .q_scale, etc. As long as the tensors are in FP16, GPTQ or EXL2 format, the loader should handle all of that automatically. I guess BF16 and FP32 would work as well, although they'll be cast to FP16 as they're loaded.
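To make the naming concrete, here's an illustrative sketch that guesses the storage format of a checkpoint from its tensor names alone (the function name and return values are mine; this is not how the exllamav2 loader actually dispatches):

```python
def tensor_storage_format(keys):
    """Guess how a checkpoint stores its weights from the tensor names
    inside its .safetensors files. EXL2 splits each .weight tensor into
    .q_weight / .q_perm / .q_scale (and friends), while a plain FP16
    checkpoint just has .weight tensors."""
    if any(k.endswith(".q_weight") for k in keys):
        return "exl2"
    if any(k.endswith(".weight") for k in keys):
        return "fp16"
    return "unknown"
```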
Awesome, that worked. Any way to put the model straight onto the GPU without saving to CPU lol? Using a GPU VM that has 30GB VRAM but only 16GB RAM, so I end up running out of room when downloading dependencies for Exllama if I also have the model downloaded. Using the mixtral 3.0bpw so looks like
It sounds like you might be having a disk space issue and not a RAM issue if you are running out of room while downloading files.
@DocShotgun yeah you're right, I just had to set more disk on the VM
@turboderp so looks like I got it all working. Just curious, is there a secret to the Mixtral-instruct clip you posted on X? I copied the code you had for generating and downloaded turboderp/Mixtral-8x7B-exl2 --revision 3.0bpw --local-dir-use-symlinks False --local-dir my_model_dir
expecting to get similar behavior, but it performs vastly differently for me: just inputting Hi or Hello returns completely random responses, and a lot of the time it doesn't stop.
https://github.com/cdreetz/exllama/blob/master/main.py is the code I'm using to run it, for reference.
That script uses Llama-style prompt formatting as used by Mixtral-instruct. You seem to be using the base Mixtral model, which doesn't know how to interpret the [INST] tags. If you download the instruct-tuned version instead you should see better results.
(You can make a chatbot out of the base model, it just takes a different prompting style and it becomes much less reliable.)
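For reference, the instruct formatting in question boils down to wrapping each user turn in [INST] tags, roughly like this sketch (function name is mine; the real prompt also gets a BOS token prepended by the tokenizer):

```python
def format_instruct_prompt(user_message: str) -> str:
    """Wrap a user turn in the Llama-style [INST] tags that
    Mixtral-instruct was trained on. The base model was never trained
    on these tags, which is why it rambles when they appear verbatim
    in the prompt."""
    return f"[INST] {user_message} [/INST]"
```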
@turboderp thank you!! got it working perfectly now
Does model_dir work if the dir is just a copied HF repo with the .safetensors files? Specifically one of turboderp's HF model repos.
If not how do I convert the safetensors to a format that works with exllamav2?