irthomasthomas / undecidability


Guide to choosing quants and engines : r/LocalLLaMA #641

Open irthomasthomas opened 6 months ago

irthomasthomas commented 6 months ago

Guide to choosing quants and engines : r/LocalLLaMA

DESCRIPTION:
Ever wonder which type of quant to download for the same model, GPTQ or GGUF or exl2? And what app/runtime/inference engine you should use for this quant? Here's my guide.

TLDR:

You want to use a model but cannot fit it in your VRAM in fp16, so you have to use quantization. When talking about quantization, there are two concepts. The first is the format: how the model is quantized, i.e. the math behind the method that compresses the model in a lossy way. The second is the engine: how such a quantized model is run. Generally speaking, quants of the same format at the same bitrate should have exactly the same quality, but when run on different engines the speed and memory consumption can differ dramatically.
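To make the distinction concrete, here is a minimal sketch (my addition, not part of the original post) that loads a GGUF quant with the `llama-cpp-python` bindings. The quantization format lives in the file itself; the same file could equally be served by llama.cpp's CLI or by another GGUF-capable engine, and only the speed and VRAM behaviour would change. The model path is a placeholder.

```python
# Minimal sketch, assuming the `llama-cpp-python` package and a local GGUF file.
# The model path below is hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,        # context window to allocate
    n_gpu_layers=-1,   # offload all layers to the GPU if they fit
)

out = llm("Explain the difference between a quant format and an engine.", max_tokens=128)
print(out["choices"][0]["text"])
```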

Please note that I primarily use 4-8 bit quants on Linux and never go below 4, so my take on extremely tight quants of <=3 bit might be completely off.

Part I: review of quantization formats.

There are currently four popular quant formats: GPTQ, AWQ, EXL2, and GGUF.

So in terms of quality at the same bitrate, AWQ > GPTQ = EXL2 > GGUF. I don't know where GGUF imatrix quants should be placed; I suppose they are at about the same level as GPTQ.

Besides, the choice of calibration dataset has a subtle effect on the quality of quants. Quants at lower bitrates tend to overfit to the style of the calibration dataset. Early GPTQs used wikitext, making them slightly more "formal, dispassionate, machine-like". The default calibration dataset of exl2 is carefully picked by its author to contain a broad mix of different types of data. There are often also "-rpcal" flavours of exl2 calibrated on roleplay datasets to enhance the RP experience.
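As an illustration of calibration-dataset choice (my addition, not the guide author's workflow), here is a hedged AutoAWQ sketch that quantizes with custom calibration texts instead of the default dataset. The model path, output directory, and the assumption that `calib_data` accepts a list of strings in your AutoAWQ version should all be checked against your installed release.

```python
# Hedged sketch with AutoAWQ: supply your own calibration texts so the quant
# doesn't overfit to wikitext-style prose. Paths are placeholders; `calib_data`
# as a list of strings is an assumption about recent AutoAWQ versions.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-v0.1"      # hypothetical base model
quant_path = "./mistral-7b-awq-rpcal"         # hypothetical output dir

calib_texts = [
    "A long roleplay-style conversation sample ...",
    "Another sample representative of the text you actually generate ...",
]

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(
    tokenizer,
    quant_config={"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"},
    calib_data=calib_texts,   # assumption: custom calibration data as a list of strings
)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

The "-rpcal" exl2 quants mentioned above are made in the same spirit: the default calibration data is swapped for roleplay transcripts before quantizing.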

Part II: review of runtime engines.

Different engines support different formats. I tried to make a table:

Comparison of quant formats and engines

Pre-allocation: the engine pre-allocates the VRAM needed by activations and the KV cache, effectively reducing VRAM usage and improving speed, because PyTorch handles VRAM allocation badly. However, pre-allocation means the engine needs to claim as much VRAM as your model's max context length requires right at the start, even if you are not using it.

VRAM optimization: efficient attention implementations such as FlashAttention or PagedAttention reduce memory usage, especially at long context.
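A sketch (my addition) of how these two knobs surface in a pre-allocating engine, using vLLM's Python API; the repo name and numbers are placeholders to adapt to your own hardware.

```python
# Sketch of sizing a pre-allocating engine (vLLM shown here).
# max_model_len bounds the KV cache reserved up front;
# gpu_memory_utilization caps the total VRAM the engine claims.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # placeholder AWQ repo
    quantization="awq",
    max_model_len=8192,            # don't reserve KV cache for the full context window
    gpu_memory_utilization=0.90,   # fraction of VRAM the engine may pre-allocate
)

print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```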

One notable player here is Aphrodite-engine. At first glance it looks like a replica of vLLM, which sounds less attractive for in-home usage where there are no concurrent requests. However, now that GGUF is supported and exl2 is on the way, it could be a game changer. It supports tensor parallelism out of the box: if you have 2 or more GPUs, you can run your (even quantized) model across them in parallel, which is much faster than all the other engines, where your GPUs can only be used sequentially. I achieved 3x the speed of llama.cpp running miqu on four 2080 Tis!
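A hedged sketch of what tensor parallelism looks like from the Python side. It uses vLLM's `tensor_parallel_size` argument; Aphrodite-engine is a vLLM fork and, as far as I know, exposes an equivalent setting, but check its docs for the exact flag. The model name is a placeholder.

```python
# Sketch of tensor parallelism across 2 GPUs with a vLLM-style API.
# Every layer is split across both cards instead of running them sequentially.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Some-70B-GPTQ",   # hypothetical quantized repo name
    quantization="gptq",
    tensor_parallel_size=2,           # shard the model across two GPUs
)

print(llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```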

Some personal notes:

Update: shing3232 kindly pointed out that you can convert an AWQ model to GGUF and run it in llama.cpp. I have never tried that, so I cannot comment on the effectiveness of this approach.

URL: Guide to choosing quants and engines

Suggested labels

{'label-name': 'model-quantization-guide', 'label-description': 'Information on choosing quantization formats for machine learning models.', 'confidence': 65.61}

irthomasthomas commented 6 months ago

Related issues

304: GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes : r/Oobabooga

### Details

Similarity score: 0.93

- [ ] [GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes : r/Oobabooga](https://www.reddit.com/r/Oobabooga/comments/178yqmg/gptq_vs_exl2_vs_awq_vs_q4_k_m_model_sizes/)

GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes (Mod Post)

| Size (MB) | Model |
| --- | --- |
| 16560 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.000b |
| 17053 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.125b |
| 17463 | Phind-CodeLlama-34B-v2-AWQ-4bit-128g |
| 17480 | Phind-CodeLlama-34B-v2-GPTQ-4bit-128g-actorder |
| 17548 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.250b |
| 18143 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.400b |
| 19133 | Phind_Phind-CodeLlama-34B-v2-EXL2-4.650b |
| 19284 | phind-codellama-34b-v2.Q4_K_M.gguf |
| 19320 | Phind-CodeLlama-34B-v2-AWQ-4bit-32g |
| 19337 | Phind-CodeLlama-34B-v2-GPTQ-4bit-32g-actorder |

I created all these EXL2 quants to compare them to GPTQ and AWQ. The preliminary result is that EXL2 4.4b seems to outperform GPTQ-4bit-32g, while EXL2 4.125b seems to outperform GPTQ-4bit-128g, while using less VRAM in both cases. I couldn't test AWQ yet because my quantization ended up broken, possibly due to this particular model using NTK scaling, so I'll probably have to go through the fun of burning my GPU for 16 hours again to quantize and evaluate another model so that a conclusion can be reached. Also no idea if Phind-CodeLlama is actually good. WizardCoder-Python might be better.

#### Suggested labels

"LLM-Quantization"
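A quick, approximate way to read the table above is to convert file size into bits per weight; the sketch below (my addition) assumes roughly 33.7B parameters for CodeLlama-34B, so treat the results as ballpark figures only.

```python
# Rough sanity check: file size (from the table above) -> approximate bits per weight.
# The 33.7e9 parameter count for CodeLlama-34B is an approximation.
N_PARAMS = 33.7e9

sizes_mb = {
    "EXL2-4.000b": 16560,
    "EXL2-4.125b": 17053,
    "AWQ-4bit-128g": 17463,
    "GPTQ-4bit-128g": 17480,
    "Q4_K_M.gguf": 19284,
}

for name, mb in sizes_mb.items():
    bits = mb * 1024**2 * 8
    print(f"{name:>16}: ~{bits / N_PARAMS:.2f} bits/weight")
```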

389: AWQ Quantization support - New generic converter for all HF llama-like models - Tutorials - OpenNMT

### Details

Similarity score: 0.89

- [ ] [AWQ Quantization support - New generic converter for all HF llama-like models - Tutorials - OpenNMT](https://forum.opennmt.net/t/awq-quantization-support-new-generic-converter-for-all-hf-llama-like-models/5569)

**Quantization and Acceleration**

We have added support for already quantized models, and revamped the converter for all llama-like models, whether they are quantized or not. Here's an example of the syntax:

```bash
python tools/convert_HF_llamalike.py --model_dir "TheBloke/Nous-Hermes-Llama2-AWQ" --output "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" --format safetensors
```

* `TheBloke/Nous-Hermes-Llama2-AWQ`: the name of the repository/model on the Hugging Face Hub.
* `output`: specifies the target directory and model name you want to save.
* `format`: optionally, you can save as safetensors.

For llama-like models, we download the `tokenizer.model` and generate a vocab file during the process. If the model is an AWQ quantized model, we will convert it to an OpenNMT-py AWQ quantized model. After converting, you will need a config file to run `translate.py` or `run_mmlu_opnenmt.py`. Here's an example of the config:

```yaml
transforms: [sentencepiece]

#### Subword
src_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"
tgt_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"

# Model info
model: "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt"

# Inference
# ...
```

When considering your priority:

- For small model files to fit the VRAM of your GPU, try AWQ, but it will be slow for large batch sizes.
- AWQ models are faster than FP16 for batch size 1. Please read more here: [GitHub - casper-hansen/AutoAWQ](https://github.com/casper-hansen/AutoAWQ)

**Important Note:**

- There are two AWQ toolkits (llm-awq and AutoAWQ) and AutoAWQ supports two flavors: GEMM / GEMV.
- The original llm-awq from MIT is not maintained periodically, so we default to AutoAWQ.
- If a model is tagged llm-awq on the HF hub, we use AutoAWQ/GEMV, which is compatible.

**Offline Quantizer Script:**

- We will provide an offline quantizer script for OpenNMT-py generic models. However, for small NMT models, AWQ may make things slower, so it might not be relevant for NMT.

Enjoy!

---

**VS**: Fast Inference with vLLM

Recently, Mistral reported 100 tokens/second for Mistral-7B at batch size 1 and 1250 tokens/sec for a batch of 60 prompts using vLLM. When using Mistral-instruct-v0.2-onmt-awq, the performance was as follows:

- Batch size 1: 80.5 tokens/second
- Batch size 60: 98 tokens/second, with GEMV being 20-25% faster. This was with a GEMM model.

To make a fair comparison, adjust the throughput for the step0 (prompt prefill) time.

#### Suggested labels

{ "key": "llm-quantization", "value": "Discussions and tools for handling quantized large language models" }

431: awq llama quantization

### Details

Similarity score: 0.89

- [ ] [awq llama quantization](huggingface.co)

Quantization and Acceleration
-----------------------------

We have added support for already quantized models, and revamped the converter for all llama-like models, whether they are quantized or not.

### Model Conversion

Here's an example of the syntax for converting a model:

```bash
python tools/convert_HF_llamalike.py --model_dir "TheBloke/Nous-Hermes-Llama2-AWQ" --output "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" --format safetensors
```

- `TheBloke/Nous-Hermes-Llama2-AWQ`: the name of the repository/model on the Hugging Face Hub.
- `output`: specifies the target directory and model name you want to save.
- `format`: optionally, you can save as safetensors.

For llama-like models, we download the tokenizer.model and generate a vocab file during the process. If the model is an AWQ quantized model, we will convert it to an OpenNMT-py AWQ quantized model.

### Config File

After converting, you will need a config file to run `translate.py` or `run_mmlu_opnenmt.py`. Here's an example of the config:

```yaml
transforms: [sentencepiece]

# Subword
src_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"
tgt_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"

# Model info
model: "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt"

# Inference
# ...
```

### Priority

When considering your priority:

- For small model files to fit the VRAM of your GPU, try AWQ, but it will be slow for large batch sizes.
- AWQ models are faster than FP16 for batch size 1.
- Read more: [GitHub - casper-hansen/AutoAWQ](https://github.com/casper-hansen/AutoAWQ)

### Important Note

- There are two AWQ toolkits (llm-awq and AutoAWQ) and AutoAWQ supports two flavors: GEMM / GEMV.
- The original llm-awq from MIT is not maintained periodically, so we default to AutoAWQ.
- If a model is tagged llm-awq on the HF hub, we use AutoAWQ/GEMV, which is compatible.

### Offline Quantizer Script

We will provide an offline quantizer script for OpenNMT-py generic models. However, for small NMT models, AWQ may make things slower, so it might not be relevant for NMT.

### vLLM Performance

Recently, Mistral reported 100 tokens/second for Mistral-7B at batch size 1 and 1250 tokens/sec for a batch of 60 prompts using vLLM. When using Mistral-instruct-v0.2-onmt-awq, the performance was as follows:

- Batch size 1: 80.5 tokens/second
- Batch size 60: 98 tokens/second, with GEMV being 20-25% faster.
- This was with a GEMM model.

To make a fair comparison, adjust the throughput for the step0 (prompt prefill) time.

#### Suggested labels

null

504: AutoAWQ 4bit quantization

### Details

Similarity score: 0.87

- [ ] [Code search results](https://github.com/casper-hansen/AutoAWQ)

AutoAWQ is an easy-to-use package for 4-bit quantized models. AutoAWQ speeds up models by 3x and reduces memory requirements by 3x compared to FP16. AutoAWQ implements the Activation-aware Weight Quantization (AWQ) algorithm for quantizing LLMs. AutoAWQ was created and improved upon from the original work from MIT.

Latest News 🔥

- [2023/12] Mixtral, LLaVa, QWen, Baichuan model support.
- [2023/11] AutoAWQ inference has been integrated into 🤗 transformers. Now includes CUDA 12.1 wheels.
- [2023/10] Mistral (Fused Modules), Bigcode, Turing support, Memory Bug Fix (Saves 2GB VRAM)
- [2023/09] 1.6x-2.5x speed boost on fused models (now including MPT and Falcon).
- [2023/09] Multi-GPU support, bug fixes, and better benchmark scripts available
- [2023/08] PyPi package released and AutoModel class available

Install

Prerequisites

- NVIDIA: your NVIDIA GPU(s) must be of Compute Capability 7.5. Turing and later architectures are supported. Your CUDA version must be CUDA 11.8 or later.
- AMD: your ROCm version must be ROCm 5.6 or later.

Install from PyPi

To install the newest AutoAWQ from PyPi, you need CUDA 12.1 installed.

```bash
pip install autoawq
```

Build from source

For CUDA 11.8, ROCm 5.6, and ROCm 5.7, you can install wheels from the release page:

```bash
pip install autoawq@https://github.com/casper-hansen/AutoAWQ/releases/download/v0.2.0/autoawq-0.2.0+cu118-cp310-cp310-linux_x86_64.whl
```

Or from the main branch directly:

```bash
pip install autoawq@https://github.com/casper-hansen/AutoAWQ.git
```

Or by cloning the repository and installing from source:

```bash
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip install -e .
```

All three methods will install the latest and correct kernels for your system from AutoAWQ_Kernels. If your system is not supported (i.e. not on the release page), you can build the kernels yourself by following the instructions in AutoAWQ_Kernels and then install AutoAWQ from source.

Supported models

| Models | Sizes |
| --- | --- |
| LLaMA-2 | 7B/13B/70B |
| LLaMA | 7B/13B/30B/65B |
| Mistral | 7B |
| Vicuna | 7B/13B |
| MPT | 7B/30B |
| Falcon | 7B/40B |
| OPT | 125m/1.3B/2.7B/6.7B/13B/30B |
| Bloom | 560m/3B/7B |
| GPTJ | 6.7B |
| Aquila | 7B |
| Aquila2 | 7B/34B |
| Yi | 6B/34B |
| Qwen | 1.8B/7B/14B/72B |
| BigCode | 1B/7B/15B |
| GPT NeoX | 20B |
| GPT-J | 6B |
| LLaVa | 7B/13B |
| Mixtral | 8x7B |
| Baichuan | 7B/13B |

Usage

Under examples, you can find examples of how to quantize, run inference, and benchmark AutoAWQ models.

INT4 GEMM vs INT4 GEMV vs FP16

There are two versions of AWQ: GEMM

#### Suggested labels

null
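For context, a minimal inference sketch against an AWQ quant with AutoAWQ (my addition); the repo name is a placeholder, and the exact speedup from `fuse_layers=True` depends on the model and GPU.

```python
# Minimal inference sketch for a 4-bit AWQ quant with AutoAWQ.
# The repo name below is a placeholder.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"   # placeholder AWQ repo

model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)  # fused modules for speed
tokenizer = AutoTokenizer.from_pretrained(quant_path)

inputs = tokenizer("What does AWQ stand for?", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```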

457: I keep running out of memory. What's the biggest model, and most context, I can run on 3060 12gb? With decent speed? : r/LocalLLaMA

### Details

Similarity score: 0.87

- [ ] [I keep running out of memory. What's the biggest model, and most context, I can run on 3060 12gb? With decent speed? : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/1abihou/i_keep_running_out_of_memory_whats_the_biggest/)

# GPU and Model Recommendations

**GPU Only:**

- You can use 7B models at 8 bpw with 8K context, or maybe up to 12K context.
- If you wish to use 13B models, then you have to use 4 bpw and limit yourself to 2K context.

**GPU + CPU:**

- Use `.gguf` files to offload part of the model to VRAM.
- Check the disk usage when inferencing in the activity monitor app (or whatever it is called in your OS). If the disk usage is 100% (the disk is swapping), then it is impossible to fit the model in RAM + VRAM and tokens per second will be very low. In that case, reduce context size and reduce bpw.

The best models you can probably run now are:

- OpenChat 3.5 7B at 8 bpw (use the latest version) - at 4 bpw and 4K context.
- If you want to run Nous-Capybara-34b, switch to the 3 bpw version and try to offload 35 layers to GPU. If you want to run bigger models, upgrade RAM to 64GB.

**Tip from /u/Working-Flatworm-531:**

- Just do not load the KV cache in VRAM; you can use `ooba` to disable it.
- Also try lower quants, for example Q4_K_S is still good. You still wouldn't be able to run 34B models with good speed, but at least it's something.
- You can also check your BIOS and maybe increase RAM frequency. After that, you'd be able to run ~20B models at ~2 t/s at 8K ~ 12K context.

**Recommended list from /u/Working-Flatworm-531:**

- Use Linux.
- Overclock RAM (if possible).
- Overclock CPU (if possible).
- Overclock GPU.
- Don't load the KV cache in VRAM; instead load more layers to the VRAM.
- Use smaller quants.
- Use a fast interface (didn't try Kobold, use Ooba).
- Check RAM (should be dual channel).

#### Suggested labels

{ "label-name": "memory-optimization", "description": "Strategies for optimizing memory usage when running models on 3060 12gb GPU.", "confidence": 94.88 }
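The GPU + CPU split described above looks roughly like this with `llama-cpp-python` (a hedged sketch, my addition; the file name and layer count are placeholders to tune against your own 12 GB card):

```python
# Partial offload of a GGUF model: keep some layers in VRAM, the rest in system RAM.
# Increase n_gpu_layers until VRAM is nearly full without the OS starting to swap.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/nous-capybara-34b.Q3_K_S.gguf",  # placeholder file
    n_gpu_layers=35,   # ~35 layers on the GPU, the rest on the CPU
    n_ctx=4096,        # smaller context = smaller KV cache
)

print(llm("Hello", max_tokens=32)["choices"][0]["text"])
```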

391: Speculative Decoding in Exllama v2 and llama.cpp comparison : r/LocalLLaMA

### Details

Similarity score: 0.86

- [ ] [Speculative Decoding in Exllama v2 and llama.cpp comparison : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/17h4rqz/speculative_decoding_in_exllama_v2_and_llamacpp/)

Speculative Decoding in Exllama v2 and llama.cpp Comparison
===========================================================

Discussion
----------

We discussed speculative decoding (SD) in a previous thread. For those who are not aware of this feature, it allows LLM loaders to use a smaller "draft" model to help predict tokens for a larger model. In that thread, someone asked for tests of speculative decoding for both Exllama v2 and llama.cpp. Although I generally only run models in GPTQ, AWQ, or exl2 formats, I was interested in doing the exl2 vs. llama.cpp comparison.

Test Setup
----------

The tests were run on a 2x 4090, 13900K, DDR5 system. The screen captures of the terminal output of both are available below. If someone has experience with making llama.cpp speculative decoding work better, please share.

Exllama v2 Results
------------------

**Model:** Xwin-LM-70B-V0.1-4.0bpw-h6-exl2
**Draft Model:** TinyLlama-1.1B-1T-OpenOrca-GPTQ

Performance can be highly variable, but it goes from ~20 t/s without SD to 40-50 t/s with SD.

### No SD

```bash
Prompt processed in 0.02 seconds, 4 tokens, 200.61 tokens/second
Response generated in 10.80 seconds, 250 tokens, 23.15 tokens/second
```

### With SD

```bash
Prompt processed in 0.03 seconds, 4 tokens, 138.80 tokens/second
Response generated in 5.10 seconds, 250 tokens, 49.05 tokens/second
```

#### Suggested labels

{ "key": "speculative-decoding", "value": "Technique for using a smaller 'draft' model to help predict tokens for a larger model" }