Representation engineering:

Related issues

625: unsloth/README.md at main · unslothai/unsloth

### Details

Similarity score: 0.84 - [ ] [unsloth/README.md at main · unslothai/unsloth](https://github.com/unslothai/unsloth/blob/main/README.md?plain=1) # unsloth/README.md at main · unslothai/unsloth

### Finetune Mistral, Gemma, Llama 2-5x faster with 70% less memory! ![](https://i.ibb.co/sJ7RhGG/image-41.png)

## ✨ Finetune for Free All notebooks are **beginner friendly**! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face. | Unsloth supports | Free Notebooks | Performance | Memory use | |-----------------|--------------------------------------------------------------------------------------------------------------------------|-------------|----------| | **Gemma 7b** | [▶️ Start on Colab](https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing) | 2.4x faster | 58% less | | **Mistral 7b** | [▶️ Start on Colab](https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing) | 2.2x faster | 62% less | | **Llama-2 7b** | [▶️ Start on Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing) | 2.2x faster | 43% less | | **TinyLlama** | [▶️ Start on Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing) | 3.9x faster | 74% less | | **CodeLlama 34b** A100 | [▶️ Start on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing) | 1.9x faster | 27% less | | **Mistral 7b** 1xT4 | [▶️ Start on Kaggle](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook) | 5x faster\* | 62% less | | **DPO - Zephyr** | [▶️ Start on Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing) | 1.9x faster | 19% less | - This [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing) is useful for ShareGPT ChatML / Vicuna templates. - This [text completion notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing) is for raw text. This [DPO notebook](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing) replicates Zephyr. - \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster. ## 🦥 Unsloth.ai News - 📣 [Gemma 7b](https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing) on 6T tokens now works. And [Gemma 2b notebook](https://colab.research.google.com/drive/15gGm7x_jTm017_Ic8e317tdIpDG53Mtu?usp=sharing) - 📣 Added [conversational notebooks](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing) and [raw text notebooks](https://colab.research.google.com/drive/1bMOKOBzxQWUIGZBs_B0zm8pimuEnZdfM?usp=sharing) - 📣 [2x faster inference](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing) added for all our models - 📣 [DPO support](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing) is now included. [More info](#DPO) on DPO - 📣 We did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗Hugging Face and are in their official docs! Check out the [SFT docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth) and [DPO docs](https://huggingface.co/docs/trl/main/en/dpo_trainer#accelerate-dpo-fine-tuning-using-unsloth) - 📣 [Download models 4x faster](https://huggingface.co/collections/unsloth/) from 🤗Hugging Face. Eg: `unsloth/mistral-7b-bnb-4bit` ## 🔗 Links and Resources | Type | Links | | ------------------------------- | --------------------------------------- | | 📚 **Wiki & FAQ** | [Read Our Wiki](https://github.com/unslothai/unsloth/wiki) | | 📜 **Documentation** | [Read The Doc](https://github.com/unslothai/unsloth/tree/main#-documentation) | | 💾 **Installation** | [unsloth/README.md](https://github.com/unslothai/unsloth/tree/main#installation-instructions)| |

**Twitter (aka X)** | [Follow us on X](https://twitter.com/unslothai)| | 🥇 **Benchmarking** | [Performance Tables](https://github.com/unslothai/unsloth/tree/main#-performance-benchmarking) | 🌐 **Released Models** | [Unsloth Releases](https://huggingface.co/unsloth)| | ✍️ **Blog** | [Read our Blogs](https://unsloth.ai/blog)| ## ⭐ Key Features - All kernels written in [OpenAI's Triton](https://openai.com/research/triton) language. **Manual backprop engine**. - **0% loss in accuracy** - no approximation methods - all exact. - No change of hardware. Supports NVIDIA GPUs since 2018+. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40 etc) [Check your GPU!](https://developer.nvidia.com/cuda-gpus) GTX 1070, 1080 works, but is slow. - Works on **Linux** and **Windows** via WSL. - Supports 4bit and 16bit QLoRA / LoRA finetuning via [bitsandbytes](https://github.com/TimDettmers/bitsandbytes). - Open source trains 5x faster - see [Unsloth Pro](https://unsloth.ai/) for **30x faster training**! - If you trained a model with 🦥Unsloth, you can use this cool sticker!

## 🥇 Performance Benchmarking - For the full list of **reproducable** benchmarking tables, [go to our website](https://unsloth.ai/blog/mistral-benchmark#Benchmark%20tables) | 1 A100 40GB | 🤗Hugging Face | Flash Attention | 🦥Unsloth Open Source | 🦥[Unsloth Pro](https://unsloth.ai/pricing) | |--------------|--------------|-----------------|---------------------|-----------------| | Alpaca | 1x | 1.04x | 1.98x | **15.64x** | | LAION Chip2 | 1x | 0.92x | 1.61x | **20.73x** | | OASST | 1x | 1.19x | 2.17x | **14.83x** | | Slim Orca | 1x | 1.18x | 2.22x | **14.82x** | - Benchmarking table below was conducted by [🤗Hugging Face](https://huggingface.co/blog/unsloth-trl). | Free Colab T4 | Dataset | 🤗Hugging Face | Pytorch 2.1.1 | 🦥Unsloth | 🦥 VRAM reduction | | --- | --- | --- | --- | --- | --- | | Llama-2 7b | OASST | 1x | 1.19x | 1.95x | -43.3% | | Mistral 7b | Alpaca | 1x | 1.07x | 1.56x | -13.7% | | Tiny Llama 1.1b | Alpaca | 1x | 2.06x | 3.87x | -73.8% | | DPO with Zephyr | Ultra Chat | 1x | 1.09x | 1.55x | -18.6% | ![](https://i.ibb.co/sJ7RhGG/image-41.png) [View on GitHub](https://github.com/unslothai/unsloth/blob/main/README.md?plain=1) #### Suggested labels ####

494: Awesome-Efficient-LLM: A curated list for Efficient Large Language Models

### Details

Similarity score: 0.84 - [ ] [horseee/Awesome-Efficient-LLM: A curated list for Efficient Large Language Models](https://github.com/horseee/Awesome-Efficient-LLM#inference-acceleration) # Awesome-Efficient-LLM A curated list for [Efficient Large Language Models](https://github.com/horseee/Awesome-Efficient-LLM): - [Knowledge Distillation](#knowledge-distillation) - [Network Pruning](#network-pruning) - [Quantization](#quantization) - [Inference Acceleration](#inference-acceleration) - [Efficient MOE](#efficient-moe) - [Text Compression](#text-compression) - [Low-Rank Decomposition](#low-rank-decomposition) - [Hardware/System Tuning](#hardwareSystem-tuning) - [Survey](#survey) - [Leaderboard](#leaderboard) - [🚀 Updates](#updates) - [Contributing](#contributing) --- ## Inference Acceleration - … - [Add your paper here](https://github.com/horseee/Awesome-Efficient-LLM/blob/main/generate_item.py), [generate the required format](https://github.com/horseee/Awesome-Efficient-LLM#decontributing), and submit a pull request. --- ## Updates - **Sep 27, 2023:** Add tag for papers accepted at NeurIPS'23. - **Sep 6, 2023:** Add a new subdirectory `project/` to organize those projects designed for developing a lightweight LLM. - **July 11, 2023:** Create a new subdirectory `efficient_plm/` for papers applicable to PLMs (such as BERT, BART) but have yet to be verified for their effectiveness on LLMs. --- ## Contributing If you'd like to include your paper or need to update any details, please feel free to submit a pull request. You can generate the required markdown format for each paper by filling in the information in `generate_item.py` and execute `python generate_item.py`. We warmly appreciate your contributions to this list. Alternatively, you can email me with the links to your paper and code, and I would add your paper to the list at my earliest convenience. - URL: [https://github.com/horseee/Awesome-Efficient-LLM#inference-acceleration](https://github.com/horseee/Awesome-Efficient-LLM#inference-acceleration) #### Suggested labels #### { "label-name": "efficient-llm-acceleration", "description": "Inference acceleration techniques for efficient large language models.", "repo": "horseee/Awesome-Efficient-LLM", "confidence": 70.8 }

326: Assisted Generation: a new direction toward low-latency text generation

### Details

Similarity score: 0.84 > **Assisted Generation: a new direction toward low-latency text generation** Greedy decoding with assisted generation Assisted generation is a balancing act. You want the assistant to quickly generate a candidate sequence while being as accurate as possible. If the assistant has poor quality, your get the cost of using the assistant model with little to no benefits. On the other hand, optimizing the quality of the candidate sequences may imply the use of slow assistants, resulting in a net slowdown. While we can't automate the selection of the assistant model for you, we’ve included an additional requirement and a heuristic to ensure the time spent with the assistant stays in check. First, the requirement – the assistant must have the exact same tokenizer as your model. If this requirement was not in place, expensive token decoding and re-encoding steps would have to be added. Furthermore, these additional steps would have to happen on the CPU, which in turn may need slow inter-device data transfers. Fast usage of the assistant is critical for the benefits of assisted generation to show up. Finally, the heuristic. By this point, you have probably noticed the similarities between the movie Inception and assisted generation – you are, after all, running text generation inside text generation. There will be one assistant model forward pass per candidate token, and we know that forward passes are expensive. While you can’t know in advance the number of tokens that the assistant model will get right, you can keep track of this information and use it to limit the number of candidate tokens requested to the assistant – some sections of the output are easier to anticipate than others. Wrapping all up, here’s our original implementation of the assisted generation loop (code): 1. Use greedy decoding to generate a certain number of candidate tokens with the assistant model, producing candidates. The number of produced candidate tokens is initialized to 5 the first time assisted generation is called. 2. Using our model, do a forward pass with candidates, obtaining logits. 3. Use the token selection method (.argmax() for greedy search or .multinomial() for sampling) to get the next_tokens from logits. 4. Compare next_tokens to candidates and get the number of matching tokens. Remember that this comparison has to be done with left-to-right causality: after the first mismatch, all candidates are invalidated. 5. Use the number of matches to slice things up and discard variables related to unconfirmed candidate tokens. In essence, in next_tokens, keep the matching tokens plus the first divergent token (which our model generates from a valid candidate subsequence). 6. Adjust the number of candidate tokens to be produced in the next iteration — our original heuristic increases it by 2 if ALL tokens match and decreases it by 1 otherwise. We’ve designed the API in 🤗 Transformers such that this process is hassle-free for you. All you need to do is to pass the assistant model under the new `assistant_model` keyword argument and reap the latency gains! At the time of the release of this blog post, assisted generation is limited to a batch size of 1. ```python from transformers import AutoModelForCausalLM, AutoTokenizer import torch prompt = "Alice and Bob" checkpoint = "EleutherAI/pythia-1.4b-deduped" assistant_checkpoint = "EleutherAI/pythia-160m-deduped" device = "cuda" if torch.cuda.is_available() else "cpu" tokenizer = AutoTokenizer.from_pretrained(checkpoint) inputs = tokenizer(prompt, return_tensors="pt").to(device) model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device) assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint).to(device) outputs = model.generate(**inputs, assistant_model=assistant_model) print(tokenizer.batch_decode(outputs, skip_special_tokens=True)) # ['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a'] ``` Is the additional internal complexity worth it? Let’s have a look at the latency numbers for the greedy decoding case (results for sampling are in the next section), considering a batch size of 1. These results were pulled directly out of 🤗 Transformers without any additional optimizations, so you should be able to reproduce them in your setup. **Assisted Generation Benchmark** | OPT: Open | OPT: Summ | Whisper: ARS | CodeGen: Code | Flan-T5: Summ | |------------|------------|---------------|----------------|----------------| | GPU | | | | | | Omit cases with memory offload? | Yes | No | Image | | Assistant Model | facebook/opt-125m | Model Names: | 1.3B: facebook/opt-1.3b | 6.7B: facebook/opt-6.7b | | 30B: facebook/opt-30b | 66B: facebook/opt-66b | Dataset used as input prompt: | C4 (en, validation set) | joaogante/assisted_generation_benchmarks | | built with Gradio. | Hosted on Spaces | | | | Glancing at the collected numbers, we see that assisted generation can deliver significant latency reductions in diverse settings, but it is not a silver bullet – you should benchmark it before applying it to your use case. We can conclude that assisted generation: - 🤏 Requires access to an assistant model that is at least an order of magnitude smaller than your model (the bigger the difference, the better); - 🚀 Gets up to 3x speedups in the presence of INT8 and up to 2x otherwise, when the model fits in the GPU memory; - 🤯 If you’re playing with models that do not fit in your GPU and are relying on memory offloading, you can see up to 10x speedups; - 📄 Shines in input-grounded tasks, like automatic speech recognition or summarization. **Sample with assisted generation** Greedy decoding is suited for input-grounded tasks (automatic speech recognition, translation, summarization, ...) or factual knowledge-seeking. Open-ended tasks requiring large levels of creativity, such as most uses of a language model as a chatbot, should use sampling instead. Assisted generation is naturally designed for greedy decoding, but that doesn’t mean that you can’t use assisted generation with multinomial sampling! Drawing samples from a probability distribution for the next token will cause our greedy assistant to fail more often, reducing its latency benefits. However, we can control how sharp the probability distribution for the next tokens is, using the temperature coefficient that’s present in most sampling-based applications. At one extreme, with temperatures close to 0, sampling will approximate greedy decoding, favoring the most likely token. At the other extreme, with the temperature set to values much larger than 1, sampling will be chaotic, drawing from a uniform distribution. Low temperatures are, therefore, more favorable to your assistant model, retaining most of the latency benefits from assisted generation, as we can see below. #### Suggested labels #### { "key": "assisted-generation", "value": "Text generation with the use of an assistant model for latency reduction" }

456: Baseline benchmark for 17 coding models : r/LocalLLaMA

### Details

Similarity score: 0.84 - [ ] [Baseline benchmark for 17 coding models : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/19fc4uf/baseline_benchmark_for_17_coding_models/) Baseline Benchmark for 17 Coding Models ========================================= Discussion ---------- I am currently working on implementing some ideas for coding models inference strategies (prompting, control, context exploration, CoT, ToT, etc) and I needed a baseline benchmark on a bunch of models. Since I work on a 3060 12GB, I was limited in what I can test so I went for every model that is 7/13B and has an AWQ quant available, since that is what the inference library that I use supports. I thought I'd share some numbers. **Notes:** * This is a benchmark for getting a local baseline. I'm interested in improvement from here, so the absolute values are less important for me. Don't take the absolute values too seriously. (well, maybe except deepseek-coder-1.3b, that is a bit suspect). * I used the HumanEval dataset. This is superseded by HumanEval+ and other more recent benchmarks. I chose this because it was the first one I tried. Again, with my tests I'm looking for improvements over the baseline, so this is mostly fine. * AWQ quant is not the best out there, but all my tests will be done with this quant, so for me it is OK. * Temp tests were done in only one generation. In general you'd want to average the score over many generations at a given temp. * Each model was prompted according to the model card template. Here's an example for the codellama series - ```python f"""You are a helpful and respectful assistant. Answer the following question: {question}""" ``` Results ------- I've plotted the results (with horrendous contrasting colors, but alas) to look for any interesting patterns in problem solving. You can find the plots [here](https://imgur.com/a/autpnfK). | Model | Temp | Correct / 164 | Percentage | | --- | --- | --- | --- | | TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.0 | 67 | 0.40853658536585363 | | TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.1 | 63 | 0.38414634146341464 | | TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.2 | 68 | 0.4146341463414634 | | TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.3 | 61 | 0.3719512195121951 | | TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.4 | 61 | 0.3719512195121951 | | TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.5 | 63 | 0.38414634146341464 | | TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.6 | 54 | 0.32926829268292684 | | TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.7 | 61 | 0.3719512195121951 | | TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.8 | 60 | 0.36585365853658536 | | TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.9 | 59 | 0.3597560975609756 | | TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 1.0 | 65 | 0.39634146341463417 | #### Suggested labels #### { "label-name": "coding-models", "description": "Discussion and benchmark of coding models implementation strategies.", "confidence": 96.82 }

333: Paper Digest: NeurIPS-2023 Highlights (Full List)

### Details
Similarity score: 0.83 - [ ] [Paper Digest: NeurIPS-2023 Highlights (Full List)](https://www.paperdigest.org/data/neurips-2023-full.html) Paper Digest: NeurIPS 2023 Highlights https://www.paperdigest.org 1, Toolformer: Language Models Can Teach Themselves to Use Tools Timo Schick; Jane Dwivedi-Yu; Roberto Dessi; Roberta Raileanu; Maria Lomeli; Eric Hambro; Luke Zettlemoyer; Nicola Cancedda; Thomas Scialom; Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we show that LMs can teach themselves to *use external tools* via simple APIs and achieve the best of both worlds. 2, Self-Refine: Iterative Refinement with Self-Feedback Aman Madaan; Niket Tandon; Prakhar Gupta; Skyler Hallinan; Luyu Gao; Sarah Wiegreffe; Uri Alon; Nouha Dziri; Shrimai Prabhumoye; Yiming Yang; Shashank Gupta; Bodhisattwa Prasad Majumder; Katherine Hermann; Sean Welleck; Amir Yazdanbakhsh; Peter Clark; Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Motivated by how humans refine their written text, we introduce Self-Refine, an approach for improving initial outputs from LLMs through iterative feedback and refinement. 3, Vicuna Evaluation: Exploring LLM-as-a-Judge and Chatbot Arena Lianmin Zheng; Wei-Lin Chiang; Ying Sheng; Siyuan Zhuang; Zhanghao Wu; Yonghao Zhuang; Zi Lin; Zhuohan Li; Dacheng Li; Eric Xing; Hao Zhang; Joseph Gonzalez; Ion Stoica; Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. #### Suggested labels #### { "key": "LLM-Applications", "value": "Topics related to practical applications of Large Language Models in various fields" }

6: Man Page to Embeddings Pipeline

### Details
Similarity score: 0.83 I want to create a human and LLM readable and searchable man database. I will create multiple formats and summaries of the man page and a vector embedding of the strip-tagged version. Rough idea from gpt3.5: 1. Repository Structure: - Create a Github Repository, say `manpages-formatted`. - Each man page would exist as a separate folder named based on convention like `cmd-section` (e.g. `ls-1`). - In each directory, different format files would exist as `original.txt`, `original.md`, `summary.md`, `embeddings.txt` etc. 2. Using GitHub Issues: - Create a new issue for each man page. - The body of issue could be the plain text man page. - Each different format (like MD, Embedded, Summary) could be a unique comment on that issue. - Every issue could be labeled for easier searching. 3. GitHub Actions: - GitHub actions could be used to automate the process. - When a new man page is added (as a new issue), the action would run a script that converts the issue body into different formats, add these as comments and also update the repo by creating new files. 4. Deploy a CLI tool: - You could develop a CLI tool for searching these man pages from GitHub. - The CLI would communicate with the GitHub API to perform searches based on user-inputted keywords. - The CLI can retrieve the formatted text from the comments of the respective issue, or directly from the files in the repository. However, remember to take into account GitHub's rate limiting and/or potential costs for heavy usage. Storing such data and numerous requests would be subject to these limitations, so it would be important to build the system considering these factors. Here is a script that takes a terminal command and outputs the command string and the output, as well as a token count of the output and formats it to markdown for inclusion in a github issue etc. ```bash cmd_and_output_to_markdown () { cmd="$1" title=$(echo $cmd | cut -d' ' -f2) output=$(eval "$cmd") token_count=$(ttok "$output") formatted_output="## Command \`\`\` ${cmd} \`\`\` [Link to Command](#command) ## Token Count \`\`\` ${token_count} \`\`\` [Link to Token Count](#token_count) ## Output \`\`\` ${output} \`\`\` [Link to Output](#output) " echo -e "$formatted_output" } ```

irthomasthomas / undecidability