Qwen-1.5-8x7B : r/LocalLLaMA

Related issues

389: AWQ Quantization support - New generic converter for all HF llama-like models - Tutorials - OpenNMT

### Details

Similarity score: 0.88 - [ ] [AWQ Quantization support - New generic converter for all HF llama-like models - Tutorials - OpenNMT](https://forum.opennmt.net/t/awq-quantization-support-new-generic-converter-for-all-hf-llama-like-models/5569) **Quantization and Acceleration** We have added support for already quantized models, and revamped the converter for all llama-like models, whether they are quantized or not. Here's an example of the syntax: ```bash python tools/convert_HF_llamalike.py --model_dir "TheBloke/Nous-Hermes-Llama2-AWQ" --output "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" --format safetensors ``` * `TheBloke/Nous-Hermes-Llama2-AWQ`: The name of the repository/model on the Hugging Face Hub. * `output`: Specifies the target directory and model name you want to save. * `format`: Optionally, you can save as safetensors. For llama-like models, we download the `tokenizer.model` and generate a vocab file during the process. If the model is a AWQ quantized model, we will convert it to an OpenNMT-py AWQ quantized model. After converting, you will need a config file to run `translate.py` or `run_mmlu_opnenmt.py`. Here's an example of the config: ```yaml transforms: [sentencepiece] #### Subword src_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model" tgt_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model" # Model info model: "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" # Inference # ... ``` When considering your priority: - For small model files to fit VRAM of your GPU, try AWQ, but it will be slow for large batch sizes. - AWQ models are faster than FP16 for batch size 1. Please read more here: [GitHub - casper-hansen/AutoAWQ](https://github.com/casper-hansen/AutoAWQ) **Important Note:** - There are two AWQ toolkits (llm-awq and AutoAWQ) and AutoAWQ supports two flavors: GEMM / GEMV. - The original llm-awq from MIT is not maintained periodically, so we default to AutoAWQ. - If a model is tagged llm-awq on the HF hub, we use AutoAWQ/GEMV, which is compatible. **Offline Quantizer Script:** - We will provide an offline quantizer script for OpenNMT-py generic models. However, for small NMT models, AWQ may make things slower, so it might not be relevant for NMT. Enjoy! --- **VS**: Fast Inference with vLLM Recently, Mistral reported 100 tokens/second for Mistral-7B at batch size 1 and 1250 tokens/sec for a batch of 60 prompts using vLLM. When using Mistral-instruct-v0.2-onmt-awq, the performance was as follows: - Batch size 1: 80.5 tokens/second - Batch size 60: 98 tokens/second, with GEMV being 20-25% faster. This was with a GEMM model. To make a fair comparison, adjust the throughput for the step0 (prompt prefill) time. #### Suggested labels #### { "key": "llm-quantization", "value": "Discussions and tools for handling quantized large language models" }

431: awq llama quantization

### Details

Similarity score: 0.88 - [ ] [awq llama quantization](huggingface.co) Quantization and Acceleration ---------------------------- We have added support for already quantized models, and revamped the converter for all llama-like models, whether they are quantized or not. ### Model Conversion Here's an example of the syntax for converting a model: ```python tools/convert_HF_llamalike.py --model_dir "TheBloke/Nous-Hermes-Llama2-AWQ" --output "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" --format safetensors ``` - `TheBloke/Nous-Hermes-Llama2-AWQ`: The name of the repository/model on the Hugging Face Hub. - `output`: Specifies the target directory and model name you want to save. - `format`: Optionally, you can save as safetensors. For llama-like models, we download the tokenizer.model and generate a vocab file during the process. If the model is a AWQ quantized model, we will convert it to an OpenNMT-py AWQ quantized model. ### Config File After converting, you will need a config file to run `translate.py` or `run_mmlu_opnenmt.py`. Here's an example of the config: ```yaml transforms: [sentencepiece] Subword: src_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model" tgt_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model" Model info: model: "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" Inference: # ... ``` ### Priority When considering your priority: - For small model files to fit VRAM of your GPU, try AWQ, but it will be slow for large batch sizes. - AWQ models are faster than FP16 for batch size 1. - Read more: [GitHub - casper-hansen/AutoAWQ](https://github.com/casper-hansen/AutoAWQ) ### Important Note - There are two AWQ toolkits (llm-awq and AutoAWQ) and AutoAWQ supports two flavors: GEMM / GEMV. - The original llm-awq from MIT is not maintained periodically, so we default to AutoAWQ. - If a model is tagged llm-awq on the HF hub, we use AutoAWQ/GEMV, which is compatible. ### Offline Quantizer Script We will provide an offline quantizer script for OpenNMT-py generic models. However, for small NMT models, AWQ may make things slower, so it might not be relevant for NMT. ### vLLM Performance Recently, Mistral reported 100 tokens/second for Mistral-7B at batch size 1 and 1250 tokens/sec for a batch of 60 prompts using vLLM. When using Mistral-instruct-v0.2-onmt-awq, the performance was as follows: - Batch size 1: 80.5 tokens/second - Batch size 60: 98 tokens/second, with GEMV being 20-25% faster. - This was with a GEMM model. To make a fair comparison, adjust the throughput for the step0 (prompt prefill) time. #### Suggested labels #### null

456: Baseline benchmark for 17 coding models : r/LocalLLaMA

### Details

Similarity score: 0.87 - [ ] [Baseline benchmark for 17 coding models : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/19fc4uf/baseline_benchmark_for_17_coding_models/) Baseline Benchmark for 17 Coding Models ========================================= Discussion ---------- I am currently working on implementing some ideas for coding models inference strategies (prompting, control, context exploration, CoT, ToT, etc) and I needed a baseline benchmark on a bunch of models. Since I work on a 3060 12GB, I was limited in what I can test so I went for every model that is 7/13B and has an AWQ quant available, since that is what the inference library that I use supports. I thought I'd share some numbers. **Notes:** * This is a benchmark for getting a local baseline. I'm interested in improvement from here, so the absolute values are less important for me. Don't take the absolute values too seriously. (well, maybe except deepseek-coder-1.3b, that is a bit suspect). * I used the HumanEval dataset. This is superseded by HumanEval+ and other more recent benchmarks. I chose this because it was the first one I tried. Again, with my tests I'm looking for improvements over the baseline, so this is mostly fine. * AWQ quant is not the best out there, but all my tests will be done with this quant, so for me it is OK. * Temp tests were done in only one generation. In general you'd want to average the score over many generations at a given temp. * Each model was prompted according to the model card template. Here's an example for the codellama series - ```python f"""You are a helpful and respectful assistant. Answer the following question: {question}""" ``` Results ------- I've plotted the results (with horrendous contrasting colors, but alas) to look for any interesting patterns in problem solving. You can find the plots [here](https://imgur.com/a/autpnfK). | Model | Temp | Correct / 164 | Percentage | | --- | --- | --- | --- | | TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.0 | 67 | 0.40853658536585363 | | TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.1 | 63 | 0.38414634146341464 | | TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.2 | 68 | 0.4146341463414634 | | TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.3 | 61 | 0.3719512195121951 | | TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.4 | 61 | 0.3719512195121951 | | TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.5 | 63 | 0.38414634146341464 | | TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.6 | 54 | 0.32926829268292684 | | TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.7 | 61 | 0.3719512195121951 | | TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.8 | 60 | 0.36585365853658536 | | TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.9 | 59 | 0.3597560975609756 | | TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 1.0 | 65 | 0.39634146341463417 | #### Suggested labels #### { "label-name": "coding-models", "description": "Discussion and benchmark of coding models implementation strategies.", "confidence": 96.82 }

324: bigcode/tiny_starcoder_py · Hugging Face

### Details
Similarity score: 0.87 > **Note:** > > [bigcode/tiny_starcoder_py · Hugging Face](https://huggingface.co/bigcode/tiny_starcoder_py) > > TinyStarCoderPy > > This is a 164M parameters model with the same architecture as StarCoder (8k context length, MQA & FIM). It was trained on the Python data from StarCoderData for ~6 epochs which amounts to 100B tokens. > > Use > > Intended use > > The model was trained on GitHub code, to assist with some tasks like Assisted Generation. For pure code completion, we advise using our 15B models StarCoder or StarCoderBase. > > Generation > > ```python > # pip install -q transformers > from transformers import AutoModelForCausalLM, AutoTokenizer > > checkpoint = "bigcode/tiny_starcoder_py" > device = "cuda" # for GPU usage or "cpu" for CPU usage > > tokenizer = AutoTokenizer.from_pretrained(checkpoint) > model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device) > > inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device) > outputs = model.generate(inputs) > print(tokenizer.decode(outputs[0])) > ``` > > Fill-in-the-middle > > Fill-in-the-middle uses special tokens to identify the prefix/middle/suffix part of the input and output: > > ```python > input_text = "def print_one_two_three():\n print('one')\n \n print('three')" > inputs = tokenizer.encode(input_text, return_tensors="pt").to(device) > outputs = model.generate(inputs) > print(tokenizer.decode(outputs[0])) > ``` > > Training > > Model > > - Architecture: GPT-2 model with multi-query attention and Fill-in-the-Middle objective > - Pretraining steps: 50k > - Pretraining tokens: 100 billion > - Precision: bfloat16 > > Hardware > > - GPUs: 32 Tesla A100 > - Training time: 18 hours > > Software > > - Orchestration: Megatron-LM > - Neural networks: PyTorch > - BP16 if applicable: apex > > License > > The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement [here](https://huggingface.co/bigcode/tiny_starcoder_py/blob/main/LICENSE). > > #### Suggested labels > > - { "key": "llm-pretraining", "value": "Information related to the pretraining process of Large Language Models" }

150: Mixture of Experts Explained

### Details
Similarity score: 0.86 - [ ] [Mixture of Experts Explained](https://huggingface.co/blog/moe) TL;DR MoEs: Are pretrained much faster vs. dense models Have faster inference compared to a model with the same number of parameters Require high VRAM as all experts are loaded in memory Face many challenges in fine-tuning, but recent work with MoE instruction-tuning is promising Let’s dive in! What is a Mixture of Experts (MoE)? The scale of a model is one of the most important axes for better model quality. Given a fixed computing budget, training a larger model for fewer steps is better than training a smaller model for more steps. Mixture of Experts enable models to be pretrained with far less compute, which means you can dramatically scale up the model or dataset size with the same compute budget as a dense model. In particular, a MoE model should achieve the same quality as its dense counterpart much faster during pretraining. So, what exactly is a MoE? In the context of transformer models, a MoE consists of two main elements: Sparse MoE layers are used instead of dense feed-forward network (FFN) layers. MoE layers have a certain number of “experts” (e.g. 8), where each expert is a neural network. In practice, the experts are FFNs, but they can also be more complex networks or even a MoE itself, leading to hierarchical MoEs! A gate network or router, that determines which tokens are sent to which expert. For example, in the image below, the token “More” is sent to the second expert, and the token "Parameters” is sent to the first network. As we’ll explore later, we can send a token to more than one expert. How to route a token to an expert is one of the big decisions when working with MoEs - the router is composed of learned parameters and is pretrained at the same time as the rest of the network.

304: GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes : r/Oobabooga

### Details
Similarity score: 0.86 - [ ] [GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes : r/Oobabooga](https://www.reddit.com/r/Oobabooga/comments/178yqmg/gptq_vs_exl2_vs_awq_vs_q4_k_m_model_sizes/ GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes : r/Oobabooga) GPTQ vs EXL2 vs AWQ vs Q4_K_M model sizes Mod Post Size (mb) Model 16560 Phind_Phind-CodeLlama-34B-v2-EXL2-4.000b 17053 Phind_Phind-CodeLlama-34B-v2-EXL2-4.125b 17463 Phind-CodeLlama-34B-v2-AWQ-4bit-128g 17480 Phind-CodeLlama-34B-v2-GPTQ-4bit-128g-actorder 17548 Phind_Phind-CodeLlama-34B-v2-EXL2-4.250b 18143 Phind_Phind-CodeLlama-34B-v2-EXL2-4.400b 19133 Phind_Phind-CodeLlama-34B-v2-EXL2-4.650b 19284 phind-codellama-34b-v2.Q4_K_M.gguf 19320 Phind-CodeLlama-34B-v2-AWQ-4bit-32g 19337 Phind-CodeLlama-34B-v2-GPTQ-4bit-32g-actorder I created all these EXL2 quants to compare them to GPTQ and AWQ. The preliminary result is that EXL2 4.4b seems to outperform GPTQ-4bit-32g while EXL2 4.125b seems to outperform GPTQ-4bit-128g while using less VRAM in both cases. I couldn't test AWQ yet because my quantization ended up broken, possibly due to this particular model using NTK scaling, so I'll probably have to go through the fun of burning my GPU for 16 hours again to quantize and evaluate another model so that a conclusion can be reached. Also no idea if Phind-CodeLlama is actually good. WizardCoder-Python might be better. #### Suggested labels #### "LLM-Quantization"

irthomasthomas / undecidability