[ ] Guide to choosing quants and engines : r/LocalLLaMA

Guide to choosing quants and engines : r/LocalLLaMA

DESCRIPTION:
Ever wonder which type of quant to download for the same model, GPTQ or GGUF or exl2? And what app/runtime/inference engine you should use for this quant? Here's my guide.

TLDR:

If you have multiple gpus of the same type (3090x2, not 3090+3060), and the model can fit in your vram: Choose AWQ+Aphrodite (4 bit only) > GPTQ+Aphrodite > GGUF+Aphrodite;
If you have a single gpu and the model can fit in your vram, or multiple gpus with different vram sizes: Choose exl2+exllamav2 ≈ GPTQ+exllamav2 (4 bit only);
If you need to do offloading or your gpu does not support Aprodite/exllamav2, GGUF+llama.cpp is your only choice.

You want to use a model but cannot fit it in your vram in fp16, so you have to use quantization. When talking about quantization, there are two concept, First is the format, how the model is quantized, the math behind the method to compress the model in a lossy way; Second is the engine, how to run such a quantized model. Generally speaking, quantization of the same format at the same bitrate should have the exactly same quality, but when run on different engines the speed and memory consumption can differ dramatically.

Please note that I primarily use 4-8 bit quants on Linux and never go below 4, so my take on extremely tight quants of <=3 bit might be completely off.

Part I: review of quantization formats.

There are currently 4 most popular quant formats:

GPTQ: The old and good one. It is the first "smart" quantization method. It ultilizes a calibration dataset to improve quality at the same bitrate. Takes a lot time and vram+ram to make a GPTQ quant. Usually comes at 3, 4, or 8 bits. It is widely adapted to almost all kinds of model and can be run on may engines.
AWQ: An even "smarter" format than GPTQ. In theory it delivers better quality than GPTQ of the same bitrate. Usually comes at 4 bits. The recommended quantization format by vLLM and other mass serving engines.
GGUF: A simple quant format that doesn't require calibration, so it's basically round-to-nearest argumented with grouping. Fast and easy to quant but not the "smart" type. Recently imatrix was added to GGUF, which also ultilizes a calibration dataset to make it smarter like GPTQ. GGUFs with imatrix ususally has the "IQ" in name: like "name-IQ3_XS" vs the original "name-Q3_XS". However imatrix is usually applied to tight quants <= 3 and I don't see many larger GGUF quants made with imatrix.
EXL2: The quantization format used by exllamav2. EXL2 is based on the same optimization method as GPTQ. The major advantage of exl2 is that it allows mixing quantization levels within a model to achieve any average bitrate between 2 and 8 bits per weight. So you can tailor the bitrate to your vram: You can fit a 34B model in a single 4090 in 4.65 bpw at 4k context, improving a bit of quality over 4 bit. But if you want longer ctx you can lower the bpw to 4.35 or even 3.5.

So in terms of quality of the same bitrate, AWQ > GPTQ = EXL2 > GGUF. I don't know where should GGUF imatrix be put, I suppose it's at the same level as GPTQ.

Besides, the choice of calibration dataset has subtle effect on the quality of quants. Quants at lower bitrates have the tendency to overfit on the style of the calibration dataset. Early GPTQs used wikitext, making them slightly more "formal, dispassionate, machine-like". The default calibration dataset of exl2 is carefully picked by its author to contain a broad mix of different types of data. There are often also "-rpcal" flavours of exl2 calibrated on roleplay datasets to enhance RP experience.

Part II: review of runtime engines.

Different engines support different formats. I tried to make a table:

Comparison of quant formats and engines Pre-allocation: The engine pre-allocate the vram needed by activation and kv cache, effectively reducing vram usage and improving speed because pytorch handles vram allocation badly. However, pre-allocation means the engine need to take as much vram as your model's max ctx length requires at the start, even if you are not using it.

VRAM optimization: Efficient attention implementation like FlashAttention or PagedAttention to reduce memory usage, especially at long context.

One notable player here is the Aphrodite-engine (Aphrodite-engine). At first glance it looks like a replica of vLLM, which sounds less attractive for in-home usage when there are no concurrent requests. However after GGUF is supported and exl2 on the way, it could be a game changer. It supports tensor-parallel out of the box, that means if you have 2 or more gpus, you can run your (even quantized) model in parallel, and that is much faster than all the other engines where you can only use your gpus sequentially. I achieved 3x speed over llama.cpp running miqu using 4 2080 Ti!

Some personal notes:

If you are loading a 4 bit GPTQ model in hugginface transformer or AutoGPTQ, unless you specify otherwise, you will be using the exllama kernel, but not the other optimizations from exllama.
4 bit GPTQ over exllamav2 is the single fastest method without tensor parallel, even slightly faster than exl2 4.0bpw.
vLLM only supports 4 bit GPTQ but Aphrodite supports 2,3,4,8 bit GPTQ.
Lacking FlashAttention at the moment, llama.cpp is inefficient with prompt preprocessing when context is large, often taking several seconds or even minutes before it can start generation. The actual generation speed is not bad compared to exllamav2.
Even with one gpu, GGUF over Aphrodite can ultilize PagedAttention, possibly offering faster preprocessing speed than llama.cpp.

Update: shing3232 kindly pointed out that you can convert a AWQ model to GGUF and run it in llama.cpp. I never tried that so I cannot comment on the effectiveness of this approach.

URL: Guide to choosing quants and engines

Suggested labels

{'label-name': 'model-quantization-guide', 'label-description': 'Information on choosing quantization formats for machine learning models.', 'confidence': 65.61}

Related issues

304: GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes : r/Oobabooga

### Details

Similarity score: 0.93 - [ ] [GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes : r/Oobabooga](https://www.reddit.com/r/Oobabooga/comments/178yqmg/gptq_vs_exl2_vs_awq_vs_q4_k_m_model_sizes/ GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes : r/Oobabooga) GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes Mod Post Size (mb) Model 16560 Phind_Phind-CodeLlama-34B-v2-EXL2-4.000b 17053 Phind_Phind-CodeLlama-34B-v2-EXL2-4.125b 17463 Phind-CodeLlama-34B-v2-AWQ-4bit-128g 17480 Phind-CodeLlama-34B-v2-GPTQ-4bit-128g-actorder 17548 Phind_Phind-CodeLlama-34B-v2-EXL2-4.250b 18143 Phind_Phind-CodeLlama-34B-v2-EXL2-4.400b 19133 Phind_Phind-CodeLlama-34B-v2-EXL2-4.650b 19284 phind-codellama-34b-v2.Q4_K_M.gguf 19320 Phind-CodeLlama-34B-v2-AWQ-4bit-32g 19337 Phind-CodeLlama-34B-v2-GPTQ-4bit-32g-actorder I created all these EXL2 quants to compare them to GPTQ and AWQ. The preliminary result is that EXL2 4.4b seems to outperform GPTQ-4bit-32g while EXL2 4.125b seems to outperform GPTQ-4bit-128g while using less VRAM in both cases. I couldn't test AWQ yet because my quantization ended up broken, possibly due to this particular model using NTK scaling, so I'll probably have to go through the fun of burning my GPU for 16 hours again to quantize and evaluate another model so that a conclusion can be reached. Also no idea if Phind-CodeLlama is actually good. WizardCoder-Python might be better. #### Suggested labels #### "LLM-Quantization"

389: AWQ Quantization support - New generic converter for all HF llama-like models - Tutorials - OpenNMT

### Details

Similarity score: 0.89 - [ ] [AWQ Quantization support - New generic converter for all HF llama-like models - Tutorials - OpenNMT](https://forum.opennmt.net/t/awq-quantization-support-new-generic-converter-for-all-hf-llama-like-models/5569) **Quantization and Acceleration** We have added support for already quantized models, and revamped the converter for all llama-like models, whether they are quantized or not. Here's an example of the syntax: ```bash python tools/convert_HF_llamalike.py --model_dir "TheBloke/Nous-Hermes-Llama2-AWQ" --output "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" --format safetensors ``` * `TheBloke/Nous-Hermes-Llama2-AWQ`: The name of the repository/model on the Hugging Face Hub. * `output`: Specifies the target directory and model name you want to save. * `format`: Optionally, you can save as safetensors. For llama-like models, we download the `tokenizer.model` and generate a vocab file during the process. If the model is a AWQ quantized model, we will convert it to an OpenNMT-py AWQ quantized model. After converting, you will need a config file to run `translate.py` or `run_mmlu_opnenmt.py`. Here's an example of the config: ```yaml transforms: [sentencepiece] #### Subword src_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model" tgt_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model" # Model info model: "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" # Inference # ... ``` When considering your priority: - For small model files to fit VRAM of your GPU, try AWQ, but it will be slow for large batch sizes. - AWQ models are faster than FP16 for batch size 1. Please read more here: [GitHub - casper-hansen/AutoAWQ](https://github.com/casper-hansen/AutoAWQ) **Important Note:** - There are two AWQ toolkits (llm-awq and AutoAWQ) and AutoAWQ supports two flavors: GEMM / GEMV. - The original llm-awq from MIT is not maintained periodically, so we default to AutoAWQ. - If a model is tagged llm-awq on the HF hub, we use AutoAWQ/GEMV, which is compatible. **Offline Quantizer Script:** - We will provide an offline quantizer script for OpenNMT-py generic models. However, for small NMT models, AWQ may make things slower, so it might not be relevant for NMT. Enjoy! --- **VS**: Fast Inference with vLLM Recently, Mistral reported 100 tokens/second for Mistral-7B at batch size 1 and 1250 tokens/sec for a batch of 60 prompts using vLLM. When using Mistral-instruct-v0.2-onmt-awq, the performance was as follows: - Batch size 1: 80.5 tokens/second - Batch size 60: 98 tokens/second, with GEMV being 20-25% faster. This was with a GEMM model. To make a fair comparison, adjust the throughput for the step0 (prompt prefill) time. #### Suggested labels #### { "key": "llm-quantization", "value": "Discussions and tools for handling quantized large language models" }

431: awq llama quantization

### Details

Similarity score: 0.89 - [ ] [awq llama quantization](huggingface.co) Quantization and Acceleration ---------------------------- We have added support for already quantized models, and revamped the converter for all llama-like models, whether they are quantized or not. ### Model Conversion Here's an example of the syntax for converting a model: ```python tools/convert_HF_llamalike.py --model_dir "TheBloke/Nous-Hermes-Llama2-AWQ" --output "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" --format safetensors ``` - `TheBloke/Nous-Hermes-Llama2-AWQ`: The name of the repository/model on the Hugging Face Hub. - `output`: Specifies the target directory and model name you want to save. - `format`: Optionally, you can save as safetensors. For llama-like models, we download the tokenizer.model and generate a vocab file during the process. If the model is a AWQ quantized model, we will convert it to an OpenNMT-py AWQ quantized model. ### Config File After converting, you will need a config file to run `translate.py` or `run_mmlu_opnenmt.py`. Here's an example of the config: ```yaml transforms: [sentencepiece] Subword: src_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model" tgt_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model" Model info: model: "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" Inference: # ... ``` ### Priority When considering your priority: - For small model files to fit VRAM of your GPU, try AWQ, but it will be slow for large batch sizes. - AWQ models are faster than FP16 for batch size 1. - Read more: [GitHub - casper-hansen/AutoAWQ](https://github.com/casper-hansen/AutoAWQ) ### Important Note - There are two AWQ toolkits (llm-awq and AutoAWQ) and AutoAWQ supports two flavors: GEMM / GEMV. - The original llm-awq from MIT is not maintained periodically, so we default to AutoAWQ. - If a model is tagged llm-awq on the HF hub, we use AutoAWQ/GEMV, which is compatible. ### Offline Quantizer Script We will provide an offline quantizer script for OpenNMT-py generic models. However, for small NMT models, AWQ may make things slower, so it might not be relevant for NMT. ### vLLM Performance Recently, Mistral reported 100 tokens/second for Mistral-7B at batch size 1 and 1250 tokens/sec for a batch of 60 prompts using vLLM. When using Mistral-instruct-v0.2-onmt-awq, the performance was as follows: - Batch size 1: 80.5 tokens/second - Batch size 60: 98 tokens/second, with GEMV being 20-25% faster. - This was with a GEMM model. To make a fair comparison, adjust the throughput for the step0 (prompt prefill) time. #### Suggested labels #### null

504: AutoAWQ 4bit quantization

### Details

Similarity score: 0.87 - [ ] [Code search results](https://github.com/casper-hansen/AutoAWQ) ## **CONTENT**: ## TITLE: Code search results DESCRIPTION: - Add file - Folders and files - Name - Latest commit - casper-hansen - Fix n_samples (#326) - ebe8fc3 - History - .github/workflows - AMD ROCM Support (#315) - awq - Fix n_samples (#326) - examples - Marlin symmetric quantization and inference (#320) - scripts - Exclude download of CUDA wheels (#159) - tests - Torch only inference + any-device quantization (#319) - .gitignore - first commit - LICENSE - add LICENSE - README.md - AMD ROCM Support (#315) - setup.py - AMD ROCM Support (#315) Repository files navigation README MIT license AutoAWQ | Roadmap | Examples | Issues: Help Wanted | AutoAWQ is an easy-to-use package for 4-bit quantized models. AutoAWQ speeds up models by 3x and reduces memory requirements by 3x compared to FP16. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. AutoAWQ was created and improved upon from the original work from MIT. Latest News 🔥 [2023/12] Mixtral, LLaVa, QWen, Baichuan model support. [2023/11] AutoAWQ inference has been integrated into 🤗 transformers. Now includes CUDA 12.1 wheels. [2023/10] Mistral (Fused Modules), Bigcode, Turing support, Memory Bug Fix (Saves 2GB VRAM) [2023/09] 1.6x-2.5x speed boost on fused models (now including MPT and Falcon). [2023/09] Multi-GPU support, bug fixes, and better benchmark scripts available [2023/08] PyPi package released and AutoModel class available Install Prerequisites NVIDIA: Your NVIDIA GPU(s) must be of Compute Capability 7.5. Turing and later architectures are supported. Your CUDA version must be CUDA 11.8 or later. AMD: Your ROCm version must be ROCm 5.6 or later. Install from PyPi To install the newest AutoAWQ from PyPi, you need CUDA 12.1 installed. pip install autoawq Build from source For CUDA 11.8, ROCm 5.6, and ROCm 5.7, you can install wheels from the release page: pip install autoawq@https://github.com/casper-hansen/AutoAWQ/releases/download/v0.2.0/autoawq-0.2.0+cu118-cp310-cp310-linux_x86_64.whl Or from the main branch directly: pip install autoawq@https://github.com/casper-hansen/AutoAWQ.git Or by cloning the repository and installing from source: git clone https://github.com/casper-hansen/AutoAWQ cd AutoAWQ pip install -e . All three methods will install the latest and correct kernels for your system from AutoAWQ_Kernels. If your system is not supported (i.e. not on the release page), you can build the kernels yourself by following the instructions in AutoAWQ_Kernels and then install AutoAWQ from source. Supported models The detailed support list: Models Sizes LLaMA-2 7B/13B/70B LLaMA 7B/13B/30B/65B Mistral 7B Vicuna 7B/13B MPT 7B/30B Falcon 7B/40B OPT 125m/1.3B/2.7B/6.7B/13B/30B Bloom 560m/3B/7B/ GPTJ 6.7B Aquila 7B Aquila2 7B/34B Yi 6B/34B Qwen 1.8B/7B/14B/72B BigCode 1B/7B/15B GPT NeoX 20B GPT-J 6B LLaVa 7B/13B Mixtral 8x7B Baichuan 7B/13B QWen 1.8B/7B/14/72B Usage Under examples, you can find examples of how to quantize, run inference, and benchmark AutoAWQ models. INT4 GEMM vs INT4 GEMV vs FP16 There are two versions of AWQ: GEMM #### Suggested labels #### null

457: I keep running out of memory. What's the biggest model, and most context, I can run on 3060 12gb? With decent speed? : r/LocalLLaMA

### Details

Similarity score: 0.87 - [ ] [I keep running out of memory. What's the biggest model, and most context, I can run on 3060 12gb? With decent speed? : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/1abihou/i_keep_running_out_of_memory_whats_the_biggest/) Here's the reformatted text in Markdown format: ```markdown # GPU and Model Recommendations **GPU Only:** - You can use 7B Models at 8 bpw with 8K context, or maybe up to 12k context. - If you wish to use 13B models, then you have to use 4bpw and limit yourself to 2K Context. **GPU + CPU:** - Use `.gguf` files to offload part of the model to VRAM. - Check the disk usage when inferencing in the activity monitor app (or whatever it is called in your OS). If the disk usage is 100% (disk is swapping), then it is impossible to fit the model in RAM + VRAM and tokens per second will be very low. - In that case, reduce context size and reduce bpw. The best models you can probably run now are: - OpenChat 3.5 7B at 8bpw (Use the latest version) - at 4bpw and 4K context. - If you want to run Nous-Capybara-34b, switch to the 3bpw version and try to offload 35 layers to GPU. If you want to run bigger models, upgrade RAM to 64GB. **Tip from /u/Working-Flatworm-531:** - Just do not load kv in VRAM, you can use `ooba` to disable it. - Also try lower quants, for example Q4_K_S is still good. You still wouldn't be able to run 34B models with good speed, but at least it's something. - You can also check your BIOS and maybe increase RAM frequency. After that, you'd be able to run ~20B models at ~2t/s at 8k ~ 12k context. **Recommended List from /u/Working-Flatworm-531:** - Use Linux. - Overclock RAM (if possible). - Overclock CPU (if possible). - Overclock GPU. - Don't load kv cache in VRAM, instead load more layers to the VRAM. - Use smaller quants. - Use fast interface (didn't try Kobold, use Ooba). - Check RAM (should be dual channel). ``` #### Suggested labels #### { "label-name": "memory-optimization", "description": "Strategies for optimizing memory usage when running models on 3060 12gb GPU.", "confidence": 94.88 }

391: Speculative Decoding in Exllama v2 and llama.cpp comparison : r/LocalLLaMA

### Details

Similarity score: 0.86 - [ ] [Speculative Decoding in Exllama v2 and llama.cpp comparison : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/17h4rqz/speculative_decoding_in_exllama_v2_and_llamacpp/) Speculative Decoding in Exllama v2 and llama.cpp Comparison ============================================================= Discussion ----------- We discussed speculative decoding (SD) in a previous thread. For those who are not aware of this feature, it allows LLM loaders to use a smaller "draft" model to help predict tokens for a larger model. In that thread, someone asked for tests of speculative decoding for both Exllama v2 and llama.cpp. Although I generally only run models in GPTQ, AWQ, or exl2 formats, I was interested in doing the exl2 vs. llama.cpp comparison. Test Setup ----------- The tests were run on a 2x 4090, 13900K, DDR5 system. The screen captures of the terminal output of both are available below. If someone has experience with making llama.cpp speculative decoding work better, please share. Exllama v2 Results ------------------ **Model:** Xwin-LM-70B-V0.1-4.0bpw-h6-exl2 **Draft Model:** TinyLlama-1.1B-1T-OpenOrca-GPTQ Performance can be highly variable, but it goes from ~20 t/s without SD to 40-50 t/s with SD. ### No SD ```bash Prompt processed in 0.02 seconds, 4 tokens, 200.61 tokens/second Response generated in 10.80 seconds, 250 tokens, 23.15 tokens/second ``` ### With SD ```bash Prompt processed in 0.03 seconds, 4 tokens, 138.80 tokens/second Response generated in 5.10 seconds, 250 tokens, 49.05 tokens/second ``` #### Suggested labels #### { "key": "speculative-decoding", "value": "Technique for using a smaller 'draft' model to help predict tokens for a larger model" }

irthomasthomas / undecidability