Gemma llm context window extended to 90k using LongLM

Related issues

625: unsloth/README.md at main · unslothai/unsloth

### Details

Similarity score: 0.87 - [ ] [unsloth/README.md at main · unslothai/unsloth](https://github.com/unslothai/unsloth/blob/main/README.md?plain=1) # unsloth/README.md at main · unslothai/unsloth

### Finetune Mistral, Gemma, Llama 2-5x faster with 70% less memory! ![](https://i.ibb.co/sJ7RhGG/image-41.png)

## ✨ Finetune for Free All notebooks are **beginner friendly**! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face. | Unsloth supports | Free Notebooks | Performance | Memory use | |-----------------|--------------------------------------------------------------------------------------------------------------------------|-------------|----------| | **Gemma 7b** | [▶️ Start on Colab](https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing) | 2.4x faster | 58% less | | **Mistral 7b** | [▶️ Start on Colab](https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing) | 2.2x faster | 62% less | | **Llama-2 7b** | [▶️ Start on Colab](https://colab.research.google.com/drive/1lBzz5KeZJKXjvivbYvmGarix9Ao6Wxe5?usp=sharing) | 2.2x faster | 43% less | | **TinyLlama** | [▶️ Start on Colab](https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing) | 3.9x faster | 74% less | | **CodeLlama 34b** A100 | [▶️ Start on Colab](https://colab.research.google.com/drive/1y7A0AxE3y8gdj4AVkl2aZX47Xu3P1wJT?usp=sharing) | 1.9x faster | 27% less | | **Mistral 7b** 1xT4 | [▶️ Start on Kaggle](https://www.kaggle.com/code/danielhanchen/kaggle-mistral-7b-unsloth-notebook) | 5x faster\* | 62% less | | **DPO - Zephyr** | [▶️ Start on Colab](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing) | 1.9x faster | 19% less | - This [conversational notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing) is useful for ShareGPT ChatML / Vicuna templates. - This [text completion notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing) is for raw text. This [DPO notebook](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing) replicates Zephyr. - \* Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster. ## 🦥 Unsloth.ai News - 📣 [Gemma 7b](https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing) on 6T tokens now works. And [Gemma 2b notebook](https://colab.research.google.com/drive/15gGm7x_jTm017_Ic8e317tdIpDG53Mtu?usp=sharing) - 📣 Added [conversational notebooks](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing) and [raw text notebooks](https://colab.research.google.com/drive/1bMOKOBzxQWUIGZBs_B0zm8pimuEnZdfM?usp=sharing) - 📣 [2x faster inference](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing) added for all our models - 📣 [DPO support](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing) is now included. [More info](#DPO) on DPO - 📣 We did a [blog](https://huggingface.co/blog/unsloth-trl) with 🤗Hugging Face and are in their official docs! Check out the [SFT docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth) and [DPO docs](https://huggingface.co/docs/trl/main/en/dpo_trainer#accelerate-dpo-fine-tuning-using-unsloth) - 📣 [Download models 4x faster](https://huggingface.co/collections/unsloth/) from 🤗Hugging Face. Eg: `unsloth/mistral-7b-bnb-4bit` ## 🔗 Links and Resources | Type | Links | | ------------------------------- | --------------------------------------- | | 📚 **Wiki & FAQ** | [Read Our Wiki](https://github.com/unslothai/unsloth/wiki) | | 📜 **Documentation** | [Read The Doc](https://github.com/unslothai/unsloth/tree/main#-documentation) | | 💾 **Installation** | [unsloth/README.md](https://github.com/unslothai/unsloth/tree/main#installation-instructions)| |

**Twitter (aka X)** | [Follow us on X](https://twitter.com/unslothai)| | 🥇 **Benchmarking** | [Performance Tables](https://github.com/unslothai/unsloth/tree/main#-performance-benchmarking) | 🌐 **Released Models** | [Unsloth Releases](https://huggingface.co/unsloth)| | ✍️ **Blog** | [Read our Blogs](https://unsloth.ai/blog)| ## ⭐ Key Features - All kernels written in [OpenAI's Triton](https://openai.com/research/triton) language. **Manual backprop engine**. - **0% loss in accuracy** - no approximation methods - all exact. - No change of hardware. Supports NVIDIA GPUs since 2018+. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40 etc) [Check your GPU!](https://developer.nvidia.com/cuda-gpus) GTX 1070, 1080 works, but is slow. - Works on **Linux** and **Windows** via WSL. - Supports 4bit and 16bit QLoRA / LoRA finetuning via [bitsandbytes](https://github.com/TimDettmers/bitsandbytes). - Open source trains 5x faster - see [Unsloth Pro](https://unsloth.ai/) for **30x faster training**! - If you trained a model with 🦥Unsloth, you can use this cool sticker!

## 🥇 Performance Benchmarking - For the full list of **reproducable** benchmarking tables, [go to our website](https://unsloth.ai/blog/mistral-benchmark#Benchmark%20tables) | 1 A100 40GB | 🤗Hugging Face | Flash Attention | 🦥Unsloth Open Source | 🦥[Unsloth Pro](https://unsloth.ai/pricing) | |--------------|--------------|-----------------|---------------------|-----------------| | Alpaca | 1x | 1.04x | 1.98x | **15.64x** | | LAION Chip2 | 1x | 0.92x | 1.61x | **20.73x** | | OASST | 1x | 1.19x | 2.17x | **14.83x** | | Slim Orca | 1x | 1.18x | 2.22x | **14.82x** | - Benchmarking table below was conducted by [🤗Hugging Face](https://huggingface.co/blog/unsloth-trl). | Free Colab T4 | Dataset | 🤗Hugging Face | Pytorch 2.1.1 | 🦥Unsloth | 🦥 VRAM reduction | | --- | --- | --- | --- | --- | --- | | Llama-2 7b | OASST | 1x | 1.19x | 1.95x | -43.3% | | Mistral 7b | Alpaca | 1x | 1.07x | 1.56x | -13.7% | | Tiny Llama 1.1b | Alpaca | 1x | 2.06x | 3.87x | -73.8% | | DPO with Zephyr | Ultra Chat | 1x | 1.09x | 1.55x | -18.6% | ![](https://i.ibb.co/sJ7RhGG/image-41.png) [View on GitHub](https://github.com/unslothai/unsloth/blob/main/README.md?plain=1) #### Suggested labels ####

647: Qwen-1.5-8x7B : r/LocalLLaMA

### Details

Similarity score: 0.85 - [ ] [Qwen-1.5-8x7B : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/1atw4ud/qwen158x7b/) # TITLE: Qwen-1.5-8x7B : r/LocalLLaMA **DESCRIPTION:** "Qwen-1.5-8x7B New Model Someone created a sparse MoE Qwen model by merging and finetuning Qwen1.5-7B **Model:** [Link to Model](https://huggingface.co/Crystalcareai/Qwen1.5-8x7b) **Dataset:** [Link to Dataset](https://huggingface.co/datasets/Crystalcareai/MoD) **Thread:** I'm excited to release a project I've been working on the last couple of weeks. **Qwen1.5-8x7b:** [Link to Model](http://huggingface.co/Crystalcareai/Qwen1.5-8x7b) And the accompanying dataset created with the intention of encouraging MoE models to organically develop their own experts: [Link to Dataset](http://huggingface.co/datasets/Crystalcareai/MoD) The purpose and intention behind this project is better detailed in the model/dataset card, but basically: I curated a diverse dataset from the highest quality conversations I could find. It's actually great. All sources are included in the dataset card. I then trained Qwen1.5-7b on a 100k subset over 4 epochs. Took that and made a MoE using @maximelabonne 's lazymergekit, utilizing a random gate and no base model. Trained that on another 351,000 pairs. I had planned on doing 4 full epochs, but @runpod_io had cuda errors in my machine 3x, expending the rest of my budget for the project after only 0.45/4 epochs. **Good news:** Model is surprisingly awesome even at such a (comparatively) small training set size. Reasoning compares with Mixtral in my (very basic) tests. Will benchmark it properly once runpod situation gets sorted, and plan to finish the rest of the training. Thank you to @Teknium1 , @jon_durbin , @erhartford , Maxime Labonne, and @chargoddard for their contributions to open source AI and making these processes accessible and transparent. And of course thank you to @MistralAI for inspiring this work and @alibaba_cloud for releasing the weights of the Qwen1.5 family. Teknium and Eric Hartford have been especially helpful, answering questions with humility and generosity. We're just getting started." **URL:** [Link to Reddit Post](https://www.reddit.com/r/LocalLLaMA/comments/1atw4ud/qwen158x7b/) #### Suggested labels #### {'label-name': 'MoE-model', 'label-description': 'Refers to a Mixture of Experts model created by merging and finetuning Qwen1.5-7B.', 'gh-repo': 'llm', 'confidence': 52.49}

499: marella/ctransformers: Python bindings for the Transformer models implemented in C/C++ using GGML library.

### Details

Similarity score: 0.85 - [ ] [marella/ctransformers: Python bindings for the Transformer models implemented in C/C++ using GGML library.](https://github.com/marella/ctransformers?tab=readme-ov-file#gptq) # CTransformers [![PyPI version](https://badge.fury.io/py/ctransformers.svg)](https://badge.fury.io/py/ctransformers) [![Documentation](https://readthedocs.org/images/button/readthedocs-ci.svg)](https://ctransformers.readthedocs.io/) [![Build and Test](https://github.com/ marella / ctransformers / actions / workflows / build.yml / badge.svg)](https://github.com/marella/ctransformers/actions/workflows/build.yml) [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) Python bindings for the Transformer models implemented in C/C++ using GGML library. Also see [ChatDocs](https://github.com/marella/chatdocs) ## Supported Models | Model | Model Type | CUDA | Metal | | ------ | --------- | :--: | :--: | | GPT-2 | gpt2 | | | | GPT-J, GPT4All-J | gptj | | | | GPT-NeoX, StableLM | gpt_neox | | | | Falcon | falcon | ✅ | | | LLaMA, LLaMA 2 | llamai | ✅ | ✅ | | MPT | mpt | ✅ | | | StarCoder, StarChat | gpt_bigcode | ✅ | | | Dolly V2 | dolly-v2 | | | | Replit | replit | | | ## Installation To install via `pip`, simply run: ``` pip install ctransformers ``` ## Usage It provides a unified interface for all models: ```python from ctransformers import AutoModelForCausalLM llm = AutoModelForCausalLM.from_pretrained("/path/to/ggml-model.bin", model_type="gpt2") print(llm("AI is going to")) ``` Run in Google Colab To stream the output: ```python for text in llm("AI is going to", stream=True): print(text, end="", flush=True) ``` You can load models from Hugging Face Hub directly: ```python llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml") ``` If a model repo has multiple model files (`.bin` or `.gguf` files), specify a model file using: ```python llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", model_file="ggml-model.bin") ``` ### 🤗 Transformers Note: This is an experimental feature and may change in the future. To use with 🤗 Transformers, create the model and tokenizer using: ```python from ctransformers import AutoModelForCausalLM, AutoTokenizer model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True) tokenizer = AutoTokenizer.from_pretrained(model) ``` Run in Google Colab You can use 🤗 Transformers text generation pipeline: ```python from transformers import pipeline pipe = pipeline("text-generation", model=model, tokenizer=tokenizer) print(pipe("AI is going to", max_new_tokens=256)) ``` You can use 🤗 Transformers generation parameters: ```python pipe("AI is going to", max_new_tokens=256, do_sample=True, temperature=0.8, repetition_penalty=1.1) ``` You can use 🤗 Transformers tokenizers: ```python from ctransformers import AutoModelForCausalLM from transformers import AutoTokenizer model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True) # Load model from GGML model repo. tokenizer = AutoTokenizer.from_pretrained("gpt2") # Load tokenizer from original model repo. ``` ### LangChain It is integrated into LangChain. See LangChain [docs](https://github.com/LangChainAI/langchain#using-ctransformers-backed-models). ### GPU To run some of the model layers on GPU, set the `gpu_layers` parameter: ```python llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GGML", gpu_layers=50) ``` Run in Google Colab #### CUDA Install CUDA libraries using: ```bash pip install ctransformers[cuda] ``` #### ROCm To enable ROCm support, install the `ctransformers` package using: ```bash CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers ``` #### Metal To enable Metal support, install the `ctransformers` package using: ```bash CT_METAL=1 pip install ctransformers --no-binary ctransformers ``` ### GPTQ Note: This is an experimental feature and only LLaMA models are supported using [ExLlama](https ://github.com/TheLastBen/exllama). Install additional dependencies using: ```bash pip install ctransformers[gptq] ``` Load a GPTQ model using: ```python llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GPTQ") ``` Run in Google Colab If the model name or path doesn't contain the word `gptq`, specify `model_type="gptq"`. It can also be used with LangChain. Low-level APIs are not fully supported. ## Documentation Find the documentation on [Read the Docs](https://ctransformers.readthedocs.io/). #### Config | Parameter | Type | Description | Default | | --------- | ------ | ------------------------------------------------------------ | ------- | | `top_k` | `int` | The top-k value to use for sampling | `40` | | `top_p` | `float` | The top-p value to use for sampling | `0.95` | | `temperature` | `float` | The temperature to use for sampling | `0.8` | | `repetition_penalty` | `float` | The repetition penalty to use for sampling | `1.1` | | `last_n_tokens` | `int` | The number of last tokens to use for repetition penalty | `64` | | `seed` | `int` | The seed value to use for sampling tokens | `-1` | | `max_new_tokens` | `int` | The maximum number of new tokens to generate | `256` | | `stop` | `List` | A list of sequences to stop generation when encountered | `None` | | `stream` | `bool` | Whether to stream the generated text | `False` | | `reset` | `bool` | Whether to reset the model state before generating text | `True` | | `batch_size` | `int` | The batch size to use for evaluating tokens in a single prompt | `8` | | `threads` | `int` | The number of threads to use for evaluating tokens | `-1` | | `context_length` | `int` | The maximum context length to use | `-1` | | `gpu_layers` | `int` | The number of layers to run on GPU | `0` | Find the URL for the model card for GPTQ [here](https://github.com/marella/ctransformers?tab=readme-ov-file#gptq). --- Made with ❤️ by [marella](https://github.com/marella) #### Suggested labels #### null

383: deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face

### Details

Similarity score: 0.84 - [ ] [deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face](https://huggingface.co/deepseek-ai/deepseek-coder-5.7bmqa-base) Deepseek Coder Introduction ---------------------------- Deepseek Coder is a series of code language models, each trained from scratch on 2T tokens with a composition of 87% code and 13% natural language in both English and Chinese. We provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on a project-level code corpus with a window size of 16K and an extra fill-in-the-blank task, supporting project-level code completion and infilling. Deepseek Coder achieves state-of-the-art performance among open-source code models on multiple programming languages and various benchmarks. ### Key Features - **Massive Training Data:** Trained from scratch on 2T tokens, including 87% code and 13% linguistic data in both English and Chinese languages. - **Highly Flexible & Scalable:** Offered in model sizes of 1.3B, 5.7B, 6.7B, and 33B, enabling users to choose the setup most suitable for their requirements. - **Superior Model Performance:** State-of-the-art performance among publicly available code models on HumanEval, MultiPL-E, MBPP, DS-1000, and APPS benchmarks. - **Advanced Code Completion Capabilities:** A window size of 16K and a fill-in-the-blank task, supporting project-level code completion and infilling tasks. ### Model Summary - **deepseek-coder-5.7bmqa-base:** A 5.7B parameter model with Multi Query Attention, trained on 2 trillion tokens. - **Home Page:** [DeepSeek](http://deepseek.com) - **Repository:** [deepseek-ai/deepseek-coder](https://github.com/deepseek-ai/deepseek-coder) - **Chat With DeepSeek Coder:** [DeepSeek-Coder](https://github.com/deepseek-ai/deepseek-coder/discussions) ### How to Use This section provides examples of how to use the Deepseek Coder model for code completion, code insertion, and repository-level code completion tasks. #### Code Completion ```python from transformers import AutoTokenizer, AutoModelForCausalLM import torch tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True) model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda() input_text = "#write a quick sort algorithm" inputs = tokenizer(input_text, return_tensors="pt").to(model.device) outputs = model.generate(**inputs, max_length=128) print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ``` #### Code Insertion ```python from transformers import AutoTokenizer, AutoModelForCausalLM import torch tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True) model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda() input_text = """<|begin|>def quick_sort(arr): if len(arr) <= 1: return arr pivot = arr[0] left = [] right = [] <|hole|> if arr[i] < pivot: left.append(arr[i]) else: right.append(arr[i]) return quick_sort(left) + [pivot] + quick_sort(right)<|end|>""" inputs = tokenizer(input_text, return_tensors="pt").to(model.device) outputs = model.generate(**inputs, max_length=128) print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(input_text):]) ``` #### Repository Level Code Completion ```python from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True) model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda() input_text = """#utils.py import torch from sklearn import datasets from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler from sklearn.metrics import accuracy_score def load_data(): iris = datasets.load_iris() X = iris.data y = iris.target # Standardize the data scaler = StandardScaler() X = scaler.fit_transform(X) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) # Convert numpy data to PyTorch tensors X_train = torch.tensor(X_train, dtype=torch.float32) X_test = torch.tensor(X_test, dtype=torch.float32) y_train = torch.tensor(y_train, dtype=torch.int64) y_test = torch.tensor(y_test, dtype=torch.int64) return X_train, X_test, y_train, y_test def evaluate_predictions(y_test, y_pred): return accuracy_score(y_test, y_pred) #model.py import torch import torch.nn as nn import torch.optim as optim from torch.utils.data import DataLoader, TensorDataset class IrisClassifier(nn.Module): def __init__(self): super(IrisClassifier, self).__init__() self.fc = nn.Sequential( nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3) ) def forward(self, x): return self.fc(x) def train_model(self, X_train, y_train, epochs, lr, batch_size): criterion = nn.CrossEntropyLoss() optimizer = optim.Adam(self.parameters(), lr=lr) # Create DataLoader for batches dataset = TensorDataset(X_train, y_train) dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True) for epoch in range(epochs): for batch_X, batch_y in dataloader: optimizer.zero_grad() outputs = self(batch_X) loss = criterion(outputs, batch_y) loss.backward() optimizer.step() def predict(self, X_test): with torch.no_grad(): outputs = self(X_test) _, predicted = outputs.max(1) return predicted.numpy() #main.py from utils import load_data, evaluate_predictions from model import IrisClassifier as Classifier def main(): # Model training and evaluation """ inputs = tokenizer(input_text, return_tensors="pt").to(model.device) outputs = model.generate(**inputs, max_new_tokens=140) print(tokenizer.decode(outputs[0])) ``` License ------- This code repository is licensed under the MIT License. The use of Deepseek Coder models is subject to the Model License. DeepSeek Coder supports commercial use. See the [LICENSE-MODEL](https://github.com/deepseek-ai/deepseek-coder/blob/main/LICENSE-MODEL) for more details. Contact ------- If you have any questions, please raise an issue or contact us at [agi\_code@deepseek.com](mailto:agi_code@deepseek.com). #### Suggested labels #### { "key": "llm-experiments", "value": "Experiments and results related to Large Language Models" } { "key": "AI-Chatbots", "value": "Topics related to advanced chatbot platforms integrating multiple AI models" }

326: Assisted Generation: a new direction toward low-latency text generation

### Details

Similarity score: 0.84 > **Assisted Generation: a new direction toward low-latency text generation** Greedy decoding with assisted generation Assisted generation is a balancing act. You want the assistant to quickly generate a candidate sequence while being as accurate as possible. If the assistant has poor quality, your get the cost of using the assistant model with little to no benefits. On the other hand, optimizing the quality of the candidate sequences may imply the use of slow assistants, resulting in a net slowdown. While we can't automate the selection of the assistant model for you, we’ve included an additional requirement and a heuristic to ensure the time spent with the assistant stays in check. First, the requirement – the assistant must have the exact same tokenizer as your model. If this requirement was not in place, expensive token decoding and re-encoding steps would have to be added. Furthermore, these additional steps would have to happen on the CPU, which in turn may need slow inter-device data transfers. Fast usage of the assistant is critical for the benefits of assisted generation to show up. Finally, the heuristic. By this point, you have probably noticed the similarities between the movie Inception and assisted generation – you are, after all, running text generation inside text generation. There will be one assistant model forward pass per candidate token, and we know that forward passes are expensive. While you can’t know in advance the number of tokens that the assistant model will get right, you can keep track of this information and use it to limit the number of candidate tokens requested to the assistant – some sections of the output are easier to anticipate than others. Wrapping all up, here’s our original implementation of the assisted generation loop (code): 1. Use greedy decoding to generate a certain number of candidate tokens with the assistant model, producing candidates. The number of produced candidate tokens is initialized to 5 the first time assisted generation is called. 2. Using our model, do a forward pass with candidates, obtaining logits. 3. Use the token selection method (.argmax() for greedy search or .multinomial() for sampling) to get the next_tokens from logits. 4. Compare next_tokens to candidates and get the number of matching tokens. Remember that this comparison has to be done with left-to-right causality: after the first mismatch, all candidates are invalidated. 5. Use the number of matches to slice things up and discard variables related to unconfirmed candidate tokens. In essence, in next_tokens, keep the matching tokens plus the first divergent token (which our model generates from a valid candidate subsequence). 6. Adjust the number of candidate tokens to be produced in the next iteration — our original heuristic increases it by 2 if ALL tokens match and decreases it by 1 otherwise. We’ve designed the API in 🤗 Transformers such that this process is hassle-free for you. All you need to do is to pass the assistant model under the new `assistant_model` keyword argument and reap the latency gains! At the time of the release of this blog post, assisted generation is limited to a batch size of 1. ```python from transformers import AutoModelForCausalLM, AutoTokenizer import torch prompt = "Alice and Bob" checkpoint = "EleutherAI/pythia-1.4b-deduped" assistant_checkpoint = "EleutherAI/pythia-160m-deduped" device = "cuda" if torch.cuda.is_available() else "cpu" tokenizer = AutoTokenizer.from_pretrained(checkpoint) inputs = tokenizer(prompt, return_tensors="pt").to(device) model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device) assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint).to(device) outputs = model.generate(**inputs, assistant_model=assistant_model) print(tokenizer.batch_decode(outputs, skip_special_tokens=True)) # ['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a'] ``` Is the additional internal complexity worth it? Let’s have a look at the latency numbers for the greedy decoding case (results for sampling are in the next section), considering a batch size of 1. These results were pulled directly out of 🤗 Transformers without any additional optimizations, so you should be able to reproduce them in your setup. **Assisted Generation Benchmark** | OPT: Open | OPT: Summ | Whisper: ARS | CodeGen: Code | Flan-T5: Summ | |------------|------------|---------------|----------------|----------------| | GPU | | | | | | Omit cases with memory offload? | Yes | No | Image | | Assistant Model | facebook/opt-125m | Model Names: | 1.3B: facebook/opt-1.3b | 6.7B: facebook/opt-6.7b | | 30B: facebook/opt-30b | 66B: facebook/opt-66b | Dataset used as input prompt: | C4 (en, validation set) | joaogante/assisted_generation_benchmarks | | built with Gradio. | Hosted on Spaces | | | | Glancing at the collected numbers, we see that assisted generation can deliver significant latency reductions in diverse settings, but it is not a silver bullet – you should benchmark it before applying it to your use case. We can conclude that assisted generation: - 🤏 Requires access to an assistant model that is at least an order of magnitude smaller than your model (the bigger the difference, the better); - 🚀 Gets up to 3x speedups in the presence of INT8 and up to 2x otherwise, when the model fits in the GPU memory; - 🤯 If you’re playing with models that do not fit in your GPU and are relying on memory offloading, you can see up to 10x speedups; - 📄 Shines in input-grounded tasks, like automatic speech recognition or summarization. **Sample with assisted generation** Greedy decoding is suited for input-grounded tasks (automatic speech recognition, translation, summarization, ...) or factual knowledge-seeking. Open-ended tasks requiring large levels of creativity, such as most uses of a language model as a chatbot, should use sampling instead. Assisted generation is naturally designed for greedy decoding, but that doesn’t mean that you can’t use assisted generation with multinomial sampling! Drawing samples from a probability distribution for the next token will cause our greedy assistant to fail more often, reducing its latency benefits. However, we can control how sharp the probability distribution for the next tokens is, using the temperature coefficient that’s present in most sampling-based applications. At one extreme, with temperatures close to 0, sampling will approximate greedy decoding, favoring the most likely token. At the other extreme, with the temperature set to values much larger than 1, sampling will be chaotic, drawing from a uniform distribution. Low temperatures are, therefore, more favorable to your assistant model, retaining most of the latency benefits from assisted generation, as we can see below. #### Suggested labels #### { "key": "assisted-generation", "value": "Text generation with the use of an assistant model for latency reduction" }

363: LongLM: Self-Extend LLM Context Window Without Tuning

### Details

Similarity score: 0.83 - [ ] [datamllab/LongLM: LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning](https://github.com/datamllab/LongLM) LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning Implementation of the proposed SelfExtend in LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning. If you find our method useful, please kindly cite our paper. @misc{jin2024llm, title={LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning}, author={Hongye Jin and Xiaotian Han and Jingfeng Yang and Zhimeng Jiang and Zirui Liu and Chia-Yuan Chang and Huiyuan Chen and Xia Hu}, year={2024}, eprint={2401.01325}, archivePrefix={arXiv}, primaryClass={cs.CL} } Updates: [01/11/2024]: We've tested the implementation for phi-2. It works. You may find some results on this Reddit post [01/08/2024]: Add third-party implementations section [01/07/2024]: Add Implementation for Mistral [01/05/2024]: Our proposed method is discussed on this Reddit post 1. Overview This work elicits LLMs' inherent ability to handle long contexts without fine-tuning. The limited length of the training sequence during training may limit the application of Large Language Models (LLMs) on long input sequences for inference. In this work, we argue that existing LLMs themselves have inherent capabilities for handling long contexts. Based on this argument, we suggest extending LLMs' context window by themselves to fully utilize the inherent ability. We propose Self-Extend to stimulate LLMs' long context handling potential. The basic idea is to construct bi-level attention information: the group level and the neighbor level. The two levels are computed by the original model's self-attention, which means the proposed does not require any training. #### Suggested labels #### { "key": "llm-self-extension", "value": "Discussing LLM's inherent capability to handle long contexts without fine-tuning and proposing SelfExtend to stimulate LLM's long context handling potential" }

irthomasthomas / undecidability