self-speculative-decoding/README.md at main · dilab-zju/self-speculative-decoding

Related issues

495: Paper page - Accelerating LLM Inference with Staged Speculative Decoding

### Details

Similarity score: 0.89 - [ ] [Paper page - Accelerating LLM Inference with Staged Speculative Decoding](https://huggingface.co/papers/2308.04623) # Paper Page - Accelerating LLM Inference with Staged Speculative Decoding Published on Aug 9, 2023 | Featured in Daily Papers on Aug 10, 2023 **Authors:** [Benjamin Spector](https://huggingface.co/benjamin-spector), [Chris Re](https://huggingface.co/chris-re) --- **Abstract** Recent advances with large language models (LLM) have highlighted their diverse capabilities. This paper proposes a novel algorithm, staged speculative decoding, to accelerate LLM inference in small-batch, on-device scenarios. We address the low arithmetic intensity of small-batch inference by improving upon previous work in speculative decoding. The algorithm restructures the speculative batch as a tree, reducing generation costs and increasing the expected tokens per batch. Additionally, it introduces a second stage of speculative decoding, further decreasing single-batch decoding latency by 3.16x with a 762M parameter GPT-2-L model, all while perfectly preserving output quality. **[Read the Paper »](https://huggingface.co/papers/2308.04623)** ---

New Paper Card

#### Suggested labels #### { "label-name": "Algorithm", "description": "Staged speculative decoding algorithm for LLM inference acceleration", "confidence": 91.15 }

391: Speculative Decoding in Exllama v2 and llama.cpp comparison : r/LocalLLaMA

### Details

Similarity score: 0.89 - [ ] [Speculative Decoding in Exllama v2 and llama.cpp comparison : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/17h4rqz/speculative_decoding_in_exllama_v2_and_llamacpp/) Speculative Decoding in Exllama v2 and llama.cpp Comparison ============================================================= Discussion ----------- We discussed speculative decoding (SD) in a previous thread. For those who are not aware of this feature, it allows LLM loaders to use a smaller "draft" model to help predict tokens for a larger model. In that thread, someone asked for tests of speculative decoding for both Exllama v2 and llama.cpp. Although I generally only run models in GPTQ, AWQ, or exl2 formats, I was interested in doing the exl2 vs. llama.cpp comparison. Test Setup ----------- The tests were run on a 2x 4090, 13900K, DDR5 system. The screen captures of the terminal output of both are available below. If someone has experience with making llama.cpp speculative decoding work better, please share. Exllama v2 Results ------------------ **Model:** Xwin-LM-70B-V0.1-4.0bpw-h6-exl2 **Draft Model:** TinyLlama-1.1B-1T-OpenOrca-GPTQ Performance can be highly variable, but it goes from ~20 t/s without SD to 40-50 t/s with SD. ### No SD ```bash Prompt processed in 0.02 seconds, 4 tokens, 200.61 tokens/second Response generated in 10.80 seconds, 250 tokens, 23.15 tokens/second ``` ### With SD ```bash Prompt processed in 0.03 seconds, 4 tokens, 138.80 tokens/second Response generated in 5.10 seconds, 250 tokens, 49.05 tokens/second ``` #### Suggested labels #### { "key": "speculative-decoding", "value": "Technique for using a smaller 'draft' model to help predict tokens for a larger model" }

492: speculative decoding in llama.cpp : PoC for speeding-up inference via speculative sampling by ggerganov · Pull Request #2926 · ggerganov/llama.cpp

### Details

Similarity score: 0.88 - [ ] [speculative : PoC for speeding-up inference via speculative sampling by ggerganov · Pull Request #2926 · ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp/pull/2926) # Title: speculative : PoC for speeding-up inference via speculative sampling #292 #### Suggested labels #### { "label-name": "LLM-speed-optimization", "description": "Optimizing LLama model inference speed", "confidence": 80.85 }

383: deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face

### Details

Similarity score: 0.88 - [ ] [deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face](https://huggingface.co/deepseek-ai/deepseek-coder-5.7bmqa-base) Deepseek Coder Introduction ---------------------------- Deepseek Coder is a series of code language models, each trained from scratch on 2T tokens with a composition of 87% code and 13% natural language in both English and Chinese. We provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on a project-level code corpus with a window size of 16K and an extra fill-in-the-blank task, supporting project-level code completion and infilling. Deepseek Coder achieves state-of-the-art performance among open-source code models on multiple programming languages and various benchmarks. ### Key Features - **Massive Training Data:** Trained from scratch on 2T tokens, including 87% code and 13% linguistic data in both English and Chinese languages. - **Highly Flexible & Scalable:** Offered in model sizes of 1.3B, 5.7B, 6.7B, and 33B, enabling users to choose the setup most suitable for their requirements. - **Superior Model Performance:** State-of-the-art performance among publicly available code models on HumanEval, MultiPL-E, MBPP, DS-1000, and APPS benchmarks. - **Advanced Code Completion Capabilities:** A window size of 16K and a fill-in-the-blank task, supporting project-level code completion and infilling tasks. ### Model Summary - **deepseek-coder-5.7bmqa-base:** A 5.7B parameter model with Multi Query Attention, trained on 2 trillion tokens. - **Home Page:** [DeepSeek](http://deepseek.com) - **Repository:** [deepseek-ai/deepseek-coder](https://github.com/deepseek-ai/deepseek-coder) - **Chat With DeepSeek Coder:** [DeepSeek-Coder](https://github.com/deepseek-ai/deepseek-coder/discussions) ### How to Use This section provides examples of how to use the Deepseek Coder model for code completion, code insertion, and repository-level code completion tasks. #### Code Completion ```python from transformers import AutoTokenizer, AutoModelForCausalLM import torch tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True) model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda() input_text = "#write a quick sort algorithm" inputs = tokenizer(input_text, return_tensors="pt").to(model.device) outputs = model.generate(**inputs, max_length=128) print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ``` #### Code Insertion ```python from transformers import AutoTokenizer, AutoModelForCausalLM import torch tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True) model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda() input_text = """<|begin|>def quick_sort(arr): if len(arr) <= 1: return arr pivot = arr[0] left = [] right = [] <|hole|> if arr[i] < pivot: left.append(arr[i]) else: right.append(arr[i]) return quick_sort(left) + [pivot] + quick_sort(right)<|end|>""" inputs = tokenizer(input_text, return_tensors="pt").to(model.device) outputs = model.generate(**inputs, max_length=128) print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(input_text):]) ``` #### Repository Level Code Completion ```python from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True) model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda() input_text = """#utils.py import torch from sklearn import datasets from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler from sklearn.metrics import accuracy_score def load_data(): iris = datasets.load_iris() X = iris.data y = iris.target # Standardize the data scaler = StandardScaler() X = scaler.fit_transform(X) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) # Convert numpy data to PyTorch tensors X_train = torch.tensor(X_train, dtype=torch.float32) X_test = torch.tensor(X_test, dtype=torch.float32) y_train = torch.tensor(y_train, dtype=torch.int64) y_test = torch.tensor(y_test, dtype=torch.int64) return X_train, X_test, y_train, y_test def evaluate_predictions(y_test, y_pred): return accuracy_score(y_test, y_pred) #model.py import torch import torch.nn as nn import torch.optim as optim from torch.utils.data import DataLoader, TensorDataset class IrisClassifier(nn.Module): def __init__(self): super(IrisClassifier, self).__init__() self.fc = nn.Sequential( nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3) ) def forward(self, x): return self.fc(x) def train_model(self, X_train, y_train, epochs, lr, batch_size): criterion = nn.CrossEntropyLoss() optimizer = optim.Adam(self.parameters(), lr=lr) # Create DataLoader for batches dataset = TensorDataset(X_train, y_train) dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True) for epoch in range(epochs): for batch_X, batch_y in dataloader: optimizer.zero_grad() outputs = self(batch_X) loss = criterion(outputs, batch_y) loss.backward() optimizer.step() def predict(self, X_test): with torch.no_grad(): outputs = self(X_test) _, predicted = outputs.max(1) return predicted.numpy() #main.py from utils import load_data, evaluate_predictions from model import IrisClassifier as Classifier def main(): # Model training and evaluation """ inputs = tokenizer(input_text, return_tensors="pt").to(model.device) outputs = model.generate(**inputs, max_new_tokens=140) print(tokenizer.decode(outputs[0])) ``` License ------- This code repository is licensed under the MIT License. The use of Deepseek Coder models is subject to the Model License. DeepSeek Coder supports commercial use. See the [LICENSE-MODEL](https://github.com/deepseek-ai/deepseek-coder/blob/main/LICENSE-MODEL) for more details. Contact ------- If you have any questions, please raise an issue or contact us at [agi\_code@deepseek.com](mailto:agi_code@deepseek.com). #### Suggested labels #### { "key": "llm-experiments", "value": "Experiments and results related to Large Language Models" } { "key": "AI-Chatbots", "value": "Topics related to advanced chatbot platforms integrating multiple AI models" }

494: Awesome-Efficient-LLM: A curated list for Efficient Large Language Models

### Details

Similarity score: 0.88 - [ ] [horseee/Awesome-Efficient-LLM: A curated list for Efficient Large Language Models](https://github.com/horseee/Awesome-Efficient-LLM#inference-acceleration) # Awesome-Efficient-LLM A curated list for [Efficient Large Language Models](https://github.com/horseee/Awesome-Efficient-LLM): - [Knowledge Distillation](#knowledge-distillation) - [Network Pruning](#network-pruning) - [Quantization](#quantization) - [Inference Acceleration](#inference-acceleration) - [Efficient MOE](#efficient-moe) - [Text Compression](#text-compression) - [Low-Rank Decomposition](#low-rank-decomposition) - [Hardware/System Tuning](#hardwareSystem-tuning) - [Survey](#survey) - [Leaderboard](#leaderboard) - [🚀 Updates](#updates) - [Contributing](#contributing) --- ## Inference Acceleration - … - [Add your paper here](https://github.com/horseee/Awesome-Efficient-LLM/blob/main/generate_item.py), [generate the required format](https://github.com/horseee/Awesome-Efficient-LLM#decontributing), and submit a pull request. --- ## Updates - **Sep 27, 2023:** Add tag for papers accepted at NeurIPS'23. - **Sep 6, 2023:** Add a new subdirectory `project/` to organize those projects designed for developing a lightweight LLM. - **July 11, 2023:** Create a new subdirectory `efficient_plm/` for papers applicable to PLMs (such as BERT, BART) but have yet to be verified for their effectiveness on LLMs. --- ## Contributing If you'd like to include your paper or need to update any details, please feel free to submit a pull request. You can generate the required markdown format for each paper by filling in the information in `generate_item.py` and execute `python generate_item.py`. We warmly appreciate your contributions to this list. Alternatively, you can email me with the links to your paper and code, and I would add your paper to the list at my earliest convenience. - URL: [https://github.com/horseee/Awesome-Efficient-LLM#inference-acceleration](https://github.com/horseee/Awesome-Efficient-LLM#inference-acceleration) #### Suggested labels #### { "label-name": "efficient-llm-acceleration", "description": "Inference acceleration techniques for efficient large language models.", "repo": "horseee/Awesome-Efficient-LLM", "confidence": 70.8 }

irthomasthomas / undecidability