Best way to add knowledge to a llm : r/LocalLLaMA

Related issues

643: I finally got perfect labels (classification task) via prompting : r/LocalLLaMA

### Details

Similarity score: 0.89 - [ ] [I finally got perfect labels (classification task) via prompting : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/1amvfua/i_finally_got_perfect_labels_classification_task/) # TITLE I finally got perfect labels (classification task) via prompting : r/LocalLLaMA # DESCRIPTION "I finally got perfect labels (classification task) via prompting Tutorial | Guide It took me weeks of trial and error, but here are my biggest lessons: - Alpaca works REALLY well, even for Mistral/Mixtral instructs - Mixtral8x7b-instruct is the best (in my experience) at in-context learning For some reason, fine-tuned/DPO models tend to HELLA over-fit for false positives and improvements were marginal compared to Mixtral Split your prompt into 3 sections: 1. Instructions: Explains the task 2. Hint: Explains likely mislabeling reasons 3. Few-shot: Examples w/ reasoning Below is the plug-n-play template I finalized/am using Below is an instruction that describes a task, paired with an input that provides further context. Write response that appropriately completes the request. ### Instruction: Label the text based on this question: \"{task}\" Below are example labeled comments, w/ the reason behind their labels as context. Learn from the examples and think step by step before responding. Start your response by printing a \"Yes/No\" statement first as the label. (Hint: {common mistakes you see after trial and error}) Text: {few-shot example} Reason for Label: {explanation} Label: {correct label} ### Input: Text: {Text for it to label} Label (Print Yes/No Only): ### Response: For experimentation, I found that discrepancies are your best friend. My setup was: - Create baseline labels, you don't care at this point how accurate they are - I think few-shot w/ 5 examples and no hints are the way to go here because you want the model to fail - If you use Mixtral8x7b using the prompt format above, you will 100% get Yes/No labels + it's justification, so you can just quickly sample 10 outputs to see how it did and make notes of common mistakes to make your hint - Run the model again, include a hint to your prompt, and then look specifically at the discrepancies -- you should be able to instantly tell if the baseline is overfitting for false positives or false negatives, that's kind of your goal - As you iterate through your instruction, hints, and few-shot examples, you want to continue to look at the discrepancies, your goal should be to get it to decrease little by little, so that by the time you done, your prompt will correct all the mislabels. - Adding MORE few-shot examples will exaggerate the overfitting, you want to do this so you can quickly see if your model leans towards false positives or negatives - I wrote a script that output something like this: Comparison between M8x7b-t0-s1000.csv and M8x7b-t1-s1000.csv: Same: 900, Different: 100 Number of times M8x7b-t0 said \"Yes\" and M8x7b-t1 said \"No\": 100 Number of times M8x7b-t0 said \"No\" and M8x7b-t1 said \"Yes\": 0 That was actually the result of my first test, where I increased the number of few-shot examples from 5 to 19. Looking at this, I could tell that the update made lead to more negative labels. After checking, there were some correct labels but mostly just false negatives. This was super helpful because its more feasible to examine 100 outputs than 1000... or 1 million... Eventually I got it down to this: Comparison between M8x7b-t1-s1000.csv and M8x7b-t2-s1000.csv: Same: 972, Different: 28 Number of times M8x7b-t1 said \"Yes\" and M8x7b-t2 said \"No\": 2 Number of times M8x7b-t1 said \"No\" and M8x7b-t2 said \"Yes\": 26 When I reviewed the output, filtering for these cases, it turns out that the second round of testing corrected all of the mislabels. Now is this perfect? After sampling instances where they agreed, it seems to be in order. I think there is something really special about this approach - by forcing overfitting, we can turn that into a feature instead of bug. Working with the flaws of a model is a lot easier than trying to blindly iterate. At least here, we have a way to measure outputs against each other. [monday.com](https://www.monday.com) [Sign Up](https://www.monday.com) **Sort by:** - Add a Comment **aichiusagi** • 19d ago • Edited 18d ago For some reason, fine-tuned/DPO models tend to HELLA over-fit for false positives and improvements were marginal compared to Mixtral I ran into this too. When fine-tuning, what you need to do is provide some subset of training data where you explicitly return nothing for false positives. In my data, I set this to about ~10% of the total and the problem disappeared. **Reply** **GeeBrain** • 19d ago Oh very interesting, what did this look like exactly? Could you give me an example? I’m thinking about fine-tuning BERT for classification after this round, since using Mixtral takes forever and is unrealistic when I want to process millions of data points **reply** Can you please provide an example of an actual prompt? **GeeBrain** • 18d ago It's literally the template + whatever you want in the {}. But here ya go... Below is an instruction that describes a task, paired with an input that provides further context. Write response that appropriately completes the request. ### Instruction: Label the comment based on this question: \"Does this comment share personal details, like how friends might talk to each other, and share from little to big things in their lives?\" Below are example labeled comments, w/ the reason behind their labels as context. Learn from the examples and think step by step before responding. Start your response by printing a \"Yes/No\" statement first as the label. (Hint: If a comment merely expresses an opinion or admiration without any personal context or experience, label it as ‘No’. But if the comment shares additional context about the commenter’s life, it should be labeled as ‘Yes’. The level of detail matters!) Comment: Wow, you are so beautiful. Reason for Label: Sharing simple statements admiration or opinions, does not count as disclosing personal details, they need to express something about their personal life, habits, or experiences. Label: No .... (More examples) ### Input: Comment: \"When he comes up?\" Label (Print Yes/No Only): ### Response: **Reply** **reply** **trapping_rainwater** • 18d ago What's your production use case for something like this? **Reply** **reply** **GeeBrain** • 18d ago My project is around building an ML model that measures trust — kinda like a fandom score. But in general, this type of setup I can see being really helpful when you have a lot of unlabeled data and wanna get really close with it. Even though I’ll likely end up fine-tuning BERT models in the future for production, this has helped me understand so much about data space. Pretty fun" [URL](https://www.reddit.com/r/LocalLLaMA/comments/1amvfua/i_finally_got_perfect_labels_classification_task/) #### Suggested labels ####

660: Qwen - supervised finetuning script and guide for SFT.

### Details

Similarity score: 0.87 - [ ] [Example - Qwen](https://qwen.readthedocs.io/en/latest/training/SFT/example.html) # Example - Qwen **DESCRIPTION:** Here we provide a very simple script for supervised finetuning, which is revised from the training script in [`Fastchat`](https://github.com/lm-sys/FastChat). The script is used to finetune Qwen with Hugging Face Trainer. You can check the script [here](https://qwen.readthedocs.io/en/latest/training/SFT/example.html). This script for supervised finetuning (SFT) has the following features: - Support single-GPU and multi-GPU training; - Support full-parameter tuning, LoRA, and Q-LoRA. In the following, we introduce more details about the usage of the script. **Installation** Before you start, make sure you have installed the following packages: ``` pip install peft deepspeed optimum accelerate ``` **Data Preparation** For data preparation, we advise you to organize the data in a jsonl file, where each line is a dictionary as demonstrated below: ```json { "type": "chatml", "messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "Tell me something about large language models." }, { "role": "assistant", "content": "Large language models are a type of language model that is trained on a large corpus of text data. They are capable of generating human-like text and are used in a variety of natural language processing tasks..." } ], "source": "unknown" } { "type": "chatml", "messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "What is your name?" }, { "role": "assistant", "content": "My name is Qwen." } ], "source": "self-made" } ``` Above are two examples of each data sample in the dataset. Each sample is a JSON object with the following fields: type, messages, and source. messages is required while the others are optional for you to label your data format and data source. The messages field is a list of JSON objects, each of which has two fields: role and content. role can be system, user, or assistant. content is the text of the message. source is the source of the data, which can be self-made, alpaca, open-hermes, or any other string. To make the jsonl file, you can use json to save a list of dictionaries to the jsonl file: ```python import json with open('data.jsonl', 'w') as f: for sample in samples: f.write(json.dumps(sample) + '\n') ``` **Quickstart** For you to start finetuning quickly, we directly provide a shell script for you to run without paying attention to details. You need different hyperparameters for different types of training, e.g., single-GPU / multi-GPU training, full-parameter tuning, LoRA, or Q-LoRA. ```bash cd examples/sft bash finetune.sh -m -d --deepspeed [--use_lora True] [--q_lora True] ``` Specify the `` for your model, `` for your data, and `` for your deepspeed configuration. If you use LoRA or Q-LoRA, just add `--use_lora True` or `--q_lora True` based on your requirements. This is the simplest way to start finetuning. If you want to change more hyperparameters, you can dive into the script and modify those parameters. **Advanced Usages** In this section, we introduce the details of the scripts, including the core python script as well as the corresponding shell script. **Shell Script** Before we introduce the python code, we provide a brief introduction to the shell script with commands. We provide some guidance inside the shell script and here we take `finetune.sh` as an example. To set up the environment variables for distributed training (or single-GPU training), specify the following variables: `GPUS_PER_NODE`, `NNODES`, `NODE_RANK`, `MASTER_ADDR`, and `MASTER_PORT`. No need to worry too much about them as we provide the default settings for you. In the command, you can pass in the argument `-m` and `-d` to specify the model path and data path, respectively. You can also pass in the argument `--deepspeed` to specify the deepspeed configuration file. We provide two configuration files for ZeRO2 and ZeRO3, and you can choose one based on your requirements. In most cases, we recommend using ZeRO3 for multi-GPU training except for Q-LoRA, where we recommend using ZeRO2. There are a series of hyperparameters to tune. Passing in `--bf16` or `--fp16` to specify the precision for mixed precision training. The other significant hyperparameters include: - `--output_dir`: the path of your output models or adapters. - `--num_train_epochs`: the number of training epochs. - `--gradient_accumulation_steps`: the number of gradient accumulation steps. - `--per_device_train_batch_size`: the batch size per GPU for training, and the total batch size is equal to `per_device_train_batch_size * number_of_gpus * gradient_accumulation_steps`. - `--learning_rate`: the learning rate. - `--warmup_steps`: the number of warmup steps. - `--lr_scheduler_type`: the type of learning rate scheduler. - `--weight_decay`: the value of weight decay. - `--adam_beta2`: the value of in Adam. - `--model_max_length`: the maximum sequence length. - `--use_lora`: whether to use LoRA. Adding `--q_lora` can enable Q-LoRA. - `--gradient_checkpointing`: whether to use gradient checkpointing. **URL:** [https://qwen.readthedocs.io/en/latest/training/SFT/example.html](https://qwen.readthedocs.io/en/latest/training/SFT/example.html) #### Suggested labels ####

315: A Cheat Sheet and Some Recipes For Building Advanced RAG | by Andrei | Jan, 2024 | LlamaIndex Blog

### Details

Similarity score: 0.87 - [ ] [A Cheat Sheet and Some Recipes For Building Advanced RAG | by Andrei | Jan, 2024 | LlamaIndex Blog](https://blog.llamaindex.ai/a-cheat-sheet-and-some-recipes-for-building-advanced-rag-803a9d94c41b) A comprehensive RAG Cheat Sheet detailing motivations for RAG as well as techniques and strategies for progressing beyond Basic or Naive RAG builds. (high-resolution version) It’s the start of a new year and perhaps you’re looking to break into the RAG scene by building your very first RAG system. Or, maybe you’ve built Basic RAG systems and are now looking to enhance them to something more advanced in order to better handle your user’s queries and data structures. In either case, knowing where or how to begin may be a challenge in and of itself! If that’s true, then hopefully this blog post points you in the right direction for your next steps, and moreover, provides for you a mental model for you to anchor your decisions when building advanced RAG systems. The RAG cheat sheet shared above was greatly inspired by a recent RAG survey paper (“Retrieval-Augmented Generation for Large Language Models: A Survey” Gao, Yunfan, et al. 2023). Basic RAG Mainstream RAG as defined today involves retrieving documents from an external knowledge database and passing these along with the user’s query to an LLM for response generation. In other words, RAG involves a Retrieval component, an External Knowledge database and a Generation component. LlamaIndex Basic RAG Recipe: from llama_index import SimpleDirectoryReader, VectorStoreIndex # load data documents = SimpleDirectoryReader(input_dir="...").load_data() # build VectorStoreIndex that takes care of chunking documents # and encoding chunks to embeddings for future retrieval index = VectorStoreIndex.from_documents(documents=documents) # The QueryEngine class is equipped with the generator # and facilitates the retrieval and generation steps query_engine = index.as_query_engine() # Use your Default RAG response = query_engine.query("A user's query") #### Suggested labels #### { "key": "RAG-Building", "value": "Techniques and strategies for building advanced Retrieval Augmented Generation systems for language models" }

647: Qwen-1.5-8x7B : r/LocalLLaMA

### Details

Similarity score: 0.87 - [ ] [Qwen-1.5-8x7B : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/1atw4ud/qwen158x7b/) # TITLE: Qwen-1.5-8x7B : r/LocalLLaMA **DESCRIPTION:** "Qwen-1.5-8x7B New Model Someone created a sparse MoE Qwen model by merging and finetuning Qwen1.5-7B **Model:** [Link to Model](https://huggingface.co/Crystalcareai/Qwen1.5-8x7b) **Dataset:** [Link to Dataset](https://huggingface.co/datasets/Crystalcareai/MoD) **Thread:** I'm excited to release a project I've been working on the last couple of weeks. **Qwen1.5-8x7b:** [Link to Model](http://huggingface.co/Crystalcareai/Qwen1.5-8x7b) And the accompanying dataset created with the intention of encouraging MoE models to organically develop their own experts: [Link to Dataset](http://huggingface.co/datasets/Crystalcareai/MoD) The purpose and intention behind this project is better detailed in the model/dataset card, but basically: I curated a diverse dataset from the highest quality conversations I could find. It's actually great. All sources are included in the dataset card. I then trained Qwen1.5-7b on a 100k subset over 4 epochs. Took that and made a MoE using @maximelabonne 's lazymergekit, utilizing a random gate and no base model. Trained that on another 351,000 pairs. I had planned on doing 4 full epochs, but @runpod_io had cuda errors in my machine 3x, expending the rest of my budget for the project after only 0.45/4 epochs. **Good news:** Model is surprisingly awesome even at such a (comparatively) small training set size. Reasoning compares with Mixtral in my (very basic) tests. Will benchmark it properly once runpod situation gets sorted, and plan to finish the rest of the training. Thank you to @Teknium1 , @jon_durbin , @erhartford , Maxime Labonne, and @chargoddard for their contributions to open source AI and making these processes accessible and transparent. And of course thank you to @MistralAI for inspiring this work and @alibaba_cloud for releasing the weights of the Qwen1.5 family. Teknium and Eric Hartford have been especially helpful, answering questions with humility and generosity. We're just getting started." **URL:** [Link to Reddit Post](https://www.reddit.com/r/LocalLLaMA/comments/1atw4ud/qwen158x7b/) #### Suggested labels #### {'label-name': 'MoE-model', 'label-description': 'Refers to a Mixture of Experts model created by merging and finetuning Qwen1.5-7B.', 'gh-repo': 'llm', 'confidence': 52.49}

625: unsloth/README.md at main · unslothai/unsloth

### Details

Similarity score: 0.87 - [ ] [unsloth/README.md at main · unslothai/unsloth](https://github.com/unslothai/unsloth/blob/main/README.md?plain=1) # unsloth/README.md at main · unslothai/unsloth

### Finetune Mistral, Gemma, Llama 2-5x faster with 70% less memory! ![](https://i.ibb.co/sJ7RhGG/image-41.png)

## ✨ Finetune for Free All notebooks are **beginner friendly**! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face. | Unsloth supports | Free Notebooks | Performance | Memory use | |-----------------|--------------------------------------------------------------------------------------------------------------------------|-------------|----------| | **Gemma 7b** | [▶️ Start on Colab](https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing) | 2.4x faster | 58% less | | **Mistral 7b** | [▶️ Start on Colab](https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing) | 2.2x faster | 62% less | | **Llama-2 7b** | [▶️ Start on Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing) | 2.2x faster | 43% less | | **TinyLlama** | [▶️ Start on Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing) | 3.9x faster | 74% less | | **CodeLlama 34b** A100 | [▶️ Start on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing) | 1.9x faster | 27% less | | **Mistral 7b** 1xT4 | [▶️ Start on Kaggle](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook) | 5x faster\* | 62% less | | **DPO - Zephyr** | [▶️ Start on Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing) | 1.9x faster | 19% less | - This [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing) is useful for ShareGPT ChatML / Vicuna templates. - This [text completion notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing) is for raw text. This [DPO notebook](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing) replicates Zephyr. - \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster. ## 🦥 Unsloth.ai News - 📣 [Gemma 7b](https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing) on 6T tokens now works. And [Gemma 2b notebook](https://colab.research.google.com/drive/15gGm7x_jTm017_Ic8e317tdIpDG53Mtu?usp=sharing) - 📣 Added [conversational notebooks](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing) and [raw text notebooks](https://colab.research.google.com/drive/1bMOKOBzxQWUIGZBs_B0zm8pimuEnZdfM?usp=sharing) - 📣 [2x faster inference](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing) added for all our models - 📣 [DPO support](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing) is now included. [More info](#DPO) on DPO - 📣 We did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗Hugging Face and are in their official docs! Check out the [SFT docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth) and [DPO docs](https://huggingface.co/docs/trl/main/en/dpo_trainer#accelerate-dpo-fine-tuning-using-unsloth) - 📣 [Download models 4x faster](https://huggingface.co/collections/unsloth/) from 🤗Hugging Face. Eg: `unsloth/mistral-7b-bnb-4bit` ## 🔗 Links and Resources | Type | Links | | ------------------------------- | --------------------------------------- | | 📚 **Wiki & FAQ** | [Read Our Wiki](https://github.com/unslothai/unsloth/wiki) | | 📜 **Documentation** | [Read The Doc](https://github.com/unslothai/unsloth/tree/main#-documentation) | | 💾 **Installation** | [unsloth/README.md](https://github.com/unslothai/unsloth/tree/main#installation-instructions)| |

**Twitter (aka X)** | [Follow us on X](https://twitter.com/unslothai)| | 🥇 **Benchmarking** | [Performance Tables](https://github.com/unslothai/unsloth/tree/main#-performance-benchmarking) | 🌐 **Released Models** | [Unsloth Releases](https://huggingface.co/unsloth)| | ✍️ **Blog** | [Read our Blogs](https://unsloth.ai/blog)| ## ⭐ Key Features - All kernels written in [OpenAI's Triton](https://openai.com/research/triton) language. **Manual backprop engine**. - **0% loss in accuracy** - no approximation methods - all exact. - No change of hardware. Supports NVIDIA GPUs since 2018+. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40 etc) [Check your GPU!](https://developer.nvidia.com/cuda-gpus) GTX 1070, 1080 works, but is slow. - Works on **Linux** and **Windows** via WSL. - Supports 4bit and 16bit QLoRA / LoRA finetuning via [bitsandbytes](https://github.com/TimDettmers/bitsandbytes). - Open source trains 5x faster - see [Unsloth Pro](https://unsloth.ai/) for **30x faster training**! - If you trained a model with 🦥Unsloth, you can use this cool sticker!

## 🥇 Performance Benchmarking - For the full list of **reproducable** benchmarking tables, [go to our website](https://unsloth.ai/blog/mistral-benchmark#Benchmark%20tables) | 1 A100 40GB | 🤗Hugging Face | Flash Attention | 🦥Unsloth Open Source | 🦥[Unsloth Pro](https://unsloth.ai/pricing) | |--------------|--------------|-----------------|---------------------|-----------------| | Alpaca | 1x | 1.04x | 1.98x | **15.64x** | | LAION Chip2 | 1x | 0.92x | 1.61x | **20.73x** | | OASST | 1x | 1.19x | 2.17x | **14.83x** | | Slim Orca | 1x | 1.18x | 2.22x | **14.82x** | - Benchmarking table below was conducted by [🤗Hugging Face](https://huggingface.co/blog/unsloth-trl). | Free Colab T4 | Dataset | 🤗Hugging Face | Pytorch 2.1.1 | 🦥Unsloth | 🦥 VRAM reduction | | --- | --- | --- | --- | --- | --- | | Llama-2 7b | OASST | 1x | 1.19x | 1.95x | -43.3% | | Mistral 7b | Alpaca | 1x | 1.07x | 1.56x | -13.7% | | Tiny Llama 1.1b | Alpaca | 1x | 2.06x | 3.87x | -73.8% | | DPO with Zephyr | Ultra Chat | 1x | 1.09x | 1.55x | -18.6% | ![](https://i.ibb.co/sJ7RhGG/image-41.png) [View on GitHub](https://github.com/unslothai/unsloth/blob/main/README.md?plain=1) #### Suggested labels ####

317: Streaming-llm: Efficient Streaming Language Models with Attention Sinks

### Details

Similarity score: 0.87 - [ ] [mit-han-lab/streaming-llm: Efficient Streaming Language Models with Attention Sinks](https://github.com/mit-han-lab/streaming-llm) # Efficient Streaming Language Models with Attention Sinks [[paper](http://arxiv.org/abs/2309.17453)] [[slides](assets/StreamingLLM.pdf)][[video](https://youtu.be/hvJsEzP34o8)] ![schemes](figures/schemes.png) https://github.com/mit-han-lab/streaming-llm/assets/40906949/2bd1cda4-a0bd-47d1-a023-fbf7779b8358 ## TL;DR We deploy LLMs for infinite-length inputs without sacrificing efficiency and performance. ## News - [2024/01] [SwiftInfer](https://github.com/hpcaitech/SwiftInfer), a TensorRT-based implementation makes StreamingLLM more production-grade. - [2024/01] StreamingLLM is integrated into NVIDIA [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#run-llama-with-streamingllm)! - [2023/12] StreamingLLM enables endless and efficient LLM generation on [iPhone](https://x.com/davidpissarra/status/1735761373261427189?s=20)! - [2023/12] StreamingLLM is integrated by HuggingFace Transformers' [main branch](https://github.com/huggingface/transformers/pull/26681). - [2023/10] StreamingLLM is integrated into [Intel Extension for Transformers](https://github.com/intel/intel-extension-for-transformers). - [2023/10] Check out [Attention Sinks](https://github.com/tomaarsen/attention_sinks), a third-party implementation to enable StreamingLLM on more Huggingface LLMs. ## Abstract Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach --- but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a ``sink'' even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence length without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup. ## Usage ### Environment Setup ```bash conda create -yn streaming python=3.8 conda activate streaming pip install torch torchvision torchaudio pip install transformers==4.33.0 accelerate datasets evaluate wandb scikit-learn scipy sentencepiece python setup.py develop ``` ### Run Streaming Llama Chatbot ```bash CUDA_VISIBLE_DEVICES=0 python examples/run_streaming_llama.py --enable_streaming ``` ## FAQ 1. **What does "working on infinite-length inputs" imply for LLMs?** Handling infinite-length text with LLMs presents challenges. Notably, storing all previous Key and Value (KV) states demands significant memory, and models might struggle to generate text beyond their training sequence length. StreamingLLM addresses this by retaining only the most recent tokens and attention sinks, discarding intermediate tokens. This enables the model to generate coherent text from recent tokens without a cache reset — a capability not seen in earlier methods. 2. **Is the context window of LLMs expanded?** No. The context window remains unchanged. Only the most recent tokens and attention sinks are retained, discarding middle tokens. This means the model can only process the latest tokens. The context window remains constrained by its initial pre-training. For instance, if Llama-2 is pre-trained with a context window of 4096 tokens, then the maximum cache size for StreamingLLM on Llama-2 remains 4096. 3. **Can I input an extensive text, like a book, into StreamingLLM for summarization?** While you can input a lengthy text, the model will only recognize the latest tokens. Thus, if a book is an input, StreamingLLM might only summarize the concluding paragraphs, which might not be very insightful. As emphasized earlier, we neither expand the LLMs' context window nor enhance their long-term memory. StreamingLLM's strength lies in generating fluent text from recent tokens without needing a cache refresh. 4. **What is the ideal use case for StreamingLLM?** StreamingLLM is optimized for streaming applications, such as multi-round dialogues. It's ideal for scenarios where a model needs to operate continually without requiring extensive memory or dependency on past data. An example is a daily assistant based on LLMs. StreamingLLM would let the model function continuously, basing its responses on recent conversations without needing to refresh its cache. Earlier methods would either need a cache reset when the conversation length exceeded the training length (losing recent context) or recompute KV states from recent text history, which can be time-consuming. 5. **How does StreamingLLM relate to recent works on context extension?** StreamingLLM is orthogonal to recent context extension methods and can be integrated with them. In StreamingLLM's context, "context extension" refers to the possibility of using a larger cache size to store more recent tokens. For a practical demonstration, refer to Figure 9 in our paper, where we implement StreamingLLM with models like LongChat-7B-v1.5-32K and Llama-2-7B-32K-Instruct. ## TODOs We will release the code and data in the following order, please stay tuned! - [x] Release core code of StreamingLLM, including Llama-2, MPT, Falcon, and Pythia. - [x] Release perplexity evaluation code - [x] Release Streaming Llama Chatbot demo. - [ ] Release StreamEval dataset and evaluation code. ## Citation If you find StreamingLLM useful or relevant to your project and research, please kindly cite our paper: ```bibtex @article{xiao2023streamingllm, title={Efficient Streaming Language Models with Attention Sinks}, author={Xiao, Guangxuan and Tian, Yuandong and Chen, Beidi and Han, Song and Lewis, Mike}, journal={arXiv}, year={2023} } ```

irthomasthomas / undecidability