Self-Speculative Decoding is a novel inference scheme for accelerating Large Language Models (LLMs) without additional neural network training or extra memory footprint. It maintains consistent output quality and remains compatible with the original model, making it a plug-and-play, cost-effective solution for LLM inference acceleration.
Self-Speculative Decoding involves a two-stage process:
Drafting stage: Generates draft tokens by selectively skipping certain intermediate layers.
Verification stage: Employs the original LLM to validate draft tokens in one forward pass.
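The two stages can be illustrated with a toy, self-contained sketch. Everything here is an assumption made for illustration: the integer "layers", the `next_token` helper, and the fixed draft length `k` are hypothetical and are not taken from decoding.py or modeling_llama.py. The verification loop calls the full model once per position for clarity, whereas the scheme described above checks all draft tokens in a single forward pass.

```python
from typing import FrozenSet, List, Sequence

VOCAB = 16
NUM_LAYERS = 8
NO_SKIP: FrozenSet[int] = frozenset()


def layer(idx: int, h: int) -> int:
    # Hypothetical toy layer: even layers always change the state,
    # odd layers only change it occasionally.
    if idx % 2 == 0:
        return (h * 5 + idx + 1) % 1009
    return h + 1 if h % 4 == 0 else h


def next_token(tokens: Sequence[int], skip: FrozenSet[int]) -> int:
    # Greedy next token of the toy model; layers listed in `skip` are bypassed.
    h = sum((pos + 1) * tok for pos, tok in enumerate(tokens))
    for idx in range(NUM_LAYERS):
        if idx in skip:
            continue
        h = layer(idx, h)
    return h % VOCAB


def generate_baseline(prompt: Sequence[int], n_new: int) -> List[int]:
    # Plain autoregressive greedy decoding with the full model.
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(next_token(tokens, NO_SKIP))
    return tokens


def generate_self_speculative(prompt: Sequence[int], n_new: int,
                              skip: FrozenSet[int], k: int = 4) -> List[int]:
    tokens = list(prompt)
    target_len = len(tokens) + n_new
    while len(tokens) < target_len:
        # Drafting stage: the same model, with some layers skipped, proposes k tokens.
        draft: List[int] = []
        for _ in range(k):
            draft.append(next_token(tokens + draft, skip))
        # Verification stage: the full model checks each draft position.
        for i, guess in enumerate(draft):
            expected = next_token(tokens + draft[:i], NO_SKIP)
            if expected != guess:
                # Keep the accepted prefix, then the full model's own token.
                tokens.extend(draft[:i])
                tokens.append(expected)
                break
        else:
            tokens.extend(draft)  # every draft token was accepted
    return tokens[:target_len]


if __name__ == "__main__":
    skip_layers = frozenset({1, 3, 5, 7})
    print(generate_self_speculative([1, 2, 3], 20, skip_layers))
    print(generate_baseline([1, 2, 3], 20))
```

Because every token that is kept equals the full model's greedy choice for its prefix, the two functions return the same sequence; the skipped layers only affect how many draft tokens are accepted per round.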
Cite Our Paper
If you find this code and paper useful in your research, please consider citing:
@article{zhang2023draft,
  title={Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding},
  author={Jun Zhang and Jue Wang and Huan Li and Lidan Shou and Ke Chen and Gang Chen and Sharad Mehrotra},
  year={2023},
  eprint={2309.08168},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Requirements
PyTorch
Transformers
NumPy
More in ssd.yml
Files
searching.py: Selection of skipped layers by Bayesian optimization
decoding.py: Core process of self-speculative decoding
modeling_llama.py: Model structure with self-speculative decoding
search.ipynb: Main script that searches for skipped layers
evaluate_sum.ipynb: Main script that evaluates self-speculative decoding on the text generation (summarization) task
evaluate_code.ipynb: Main script that evaluates self-speculative decoding on the code generation task
skip_layers.json: Layers skipped by draft models corresponding to different base models
ssd.yml: Relevant environment
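To give a feel for what a layer-selection search does, here is a minimal sketch that uses plain random search as a stand-in for the Bayesian optimization mentioned above. It is not the code in searching.py: the function name `search_skip_layers`, the `objective` callback, and the toy scoring rule are all hypothetical. In practice the objective would be a measured quantity such as decoding speed with the candidate draft model.

```python
import random
from typing import Callable, FrozenSet, Tuple


def search_skip_layers(num_layers: int,
                       objective: Callable[[FrozenSet[int]], float],
                       n_trials: int = 50,
                       seed: int = 0) -> Tuple[FrozenSet[int], float]:
    # Random search (a stand-in for Bayesian optimization): sample candidate
    # skip sets and keep the one with the highest objective value.
    rng = random.Random(seed)
    best_skip: FrozenSet[int] = frozenset()
    best_score = objective(best_skip)
    for _ in range(n_trials):
        candidate = frozenset(i for i in range(num_layers) if rng.random() < 0.5)
        score = objective(candidate)
        if score > best_score:
            best_skip, best_score = candidate, score
    return best_skip, best_score


def toy_objective(skip: FrozenSet[int]) -> float:
    # Hypothetical score: skipping layers is rewarded, but skipping the
    # "important" first and last layers is heavily penalized.
    important = {0, 7}
    return len(skip) - 10 * len(skip & important)


if __name__ == "__main__":
    best, score = search_skip_layers(8, toy_objective, n_trials=200)
    print(sorted(best), score)
```

A Bayesian optimizer would replace the uniform sampling with a model of the objective that proposes promising candidates, which matters when each evaluation requires running the LLM.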
Usage
Configure the relevant environment according to ssd.yml;
Execute search.ipynb to get skipped layers to generate a draft model;
Execute evaluate_sum.ipynb to evaluate self-speculative decoding on summarization;
Execute evaluate_code.ipynb to evaluate self-speculative decoding on code generation.
{'label-name': 'Inference-Scheme', 'label-description': 'Describes a novel approach for accelerating Large Language Models without additional training or memory footprint.', 'confidence': 71.69}
495: Paper page - Accelerating LLM Inference with Staged Speculative Decoding
### Details
Similarity score: 0.89
- [ ] [Paper page - Accelerating LLM Inference with Staged Speculative Decoding](https://huggingface.co/papers/2308.04623)
# Paper Page - Accelerating LLM Inference with Staged Speculative Decoding
Published on Aug 9, 2023 | Featured in Daily Papers on Aug 10, 2023
**Authors:** [Benjamin Spector](https://huggingface.co/benjamin-spector), [Chris Re](https://huggingface.co/chris-re)
---
**Abstract**
Recent advances with large language models (LLMs) have highlighted their diverse capabilities. This paper proposes a novel algorithm, staged speculative decoding, to accelerate LLM inference in small-batch, on-device scenarios. We address the low arithmetic intensity of small-batch inference by improving upon previous work in speculative decoding. The algorithm restructures the speculative batch as a tree, reducing generation costs and increasing the expected tokens per batch. Additionally, it introduces a second stage of speculative decoding. Taken together, these changes reduce single-batch decoding latency by 3.16x with a 762M parameter GPT-2-L model, all while perfectly preserving output quality.
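The idea of a tree-structured speculative batch can be sketched as follows. This is an illustrative toy, not the paper's implementation: the nested-dict tree, the `verify_tree` function, and the `target_next` callback are hypothetical names, and the target model is queried one position at a time here, whereas a real system would score all tree nodes in one batched pass.

```python
from typing import Callable, Dict, List

# A draft tree maps each candidate token to the subtree of candidates that follow it.
Tree = Dict[int, "Tree"]


def verify_tree(prefix: List[int], tree: Tree,
                target_next: Callable[[List[int]], int]) -> List[int]:
    # Walk the tree along the target model's own greedy choices.
    accepted: List[int] = []
    node = tree
    while True:
        want = target_next(prefix + accepted)
        accepted.append(want)  # the target's token is always kept
        if want not in node:
            # The draft did not contain this token (or we reached a leaf): stop.
            return accepted
        node = node[want]


if __name__ == "__main__":
    # Toy target model: the next token is the previous token plus one, modulo 10.
    target = lambda seq: (seq[-1] + 1) % 10
    draft_tree: Tree = {1: {2: {}, 5: {}}, 3: {}}
    print(verify_tree([0], draft_tree, target))
```

With several branches per position, one wrong guess at a given depth does not discard the whole draft, which is why a tree can raise the expected number of accepted tokens per batch compared with a single linear draft.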
**[Read the Paper »](https://huggingface.co/papers/2308.04623)**
---
391: Speculative Decoding in Exllama v2 and llama.cpp comparison : r/LocalLLaMA
### Details
Similarity score: 0.89
- [ ] [Speculative Decoding in Exllama v2 and llama.cpp comparison : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/17h4rqz/speculative_decoding_in_exllama_v2_and_llamacpp/)
Speculative Decoding in Exllama v2 and llama.cpp Comparison
=============================================================
Discussion
-----------
We discussed speculative decoding (SD) in a previous thread. For those who are not aware of this feature, it allows LLM loaders to use a smaller "draft" model to help predict tokens for a larger model. In that thread, someone asked for tests of speculative decoding for both Exllama v2 and llama.cpp. Although I generally only run models in GPTQ, AWQ, or exl2 formats, I was interested in doing the exl2 vs. llama.cpp comparison.
Test Setup
-----------
The tests were run on a 2x 4090, 13900K, DDR5 system. The screen captures of the terminal output of both are available below. If someone has experience with making llama.cpp speculative decoding work better, please share.
Exllama v2 Results
------------------
**Model:** Xwin-LM-70B-V0.1-4.0bpw-h6-exl2
**Draft Model:** TinyLlama-1.1B-1T-OpenOrca-GPTQ
Performance can be highly variable, but it goes from ~20 t/s without SD to 40-50 t/s with SD.
### No SD
```text
Prompt processed in 0.02 seconds, 4 tokens, 200.61 tokens/second
Response generated in 10.80 seconds, 250 tokens, 23.15 tokens/second
```
### With SD
```text
Prompt processed in 0.03 seconds, 4 tokens, 138.80 tokens/second
Response generated in 5.10 seconds, 250 tokens, 49.05 tokens/second
```
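As a quick check on the numbers above, the speedup can be computed from the two logs in a few lines. The variable names are ours; the values are copied from the terminal output.

```python
# Values taken from the "No SD" and "With SD" logs above.
baseline_tps = 23.15   # tokens/second without speculative decoding
sd_tps = 49.05         # tokens/second with speculative decoding
baseline_s = 10.80     # seconds to generate 250 tokens without SD
sd_s = 5.10            # seconds to generate 250 tokens with SD

throughput_speedup = sd_tps / baseline_tps
time_speedup = baseline_s / sd_s

print(f"throughput speedup: {throughput_speedup:.2f}x")
print(f"wall-clock speedup: {time_speedup:.2f}x")
```

Both ratios come out at roughly 2.1x for this single run, consistent with the "~20 t/s to 40-50 t/s" range quoted above.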
#### Suggested labels
#### { "key": "speculative-decoding", "value": "Technique for using a smaller 'draft' model to help predict tokens for a larger model" }
492: speculative decoding in llama.cpp : PoC for speeding-up inference via speculative sampling by ggerganov · Pull Request #2926 · ggerganov/llama.cpp
### Details
Similarity score: 0.88
- [ ] [speculative : PoC for speeding-up inference via speculative sampling by ggerganov · Pull Request #2926 · ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp/pull/2926)
# Title: speculative : PoC for speeding-up inference via speculative sampling #292
#### Suggested labels
#### { "label-name": "LLM-speed-optimization", "description": "Optimizing LLama model inference speed", "confidence": 80.85 }
383: deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face
### Details
Similarity score: 0.88
- [ ] [deepseek-ai/deepseek-coder-5.7bmqa-base · Hugging Face](https://huggingface.co/deepseek-ai/deepseek-coder-5.7bmqa-base)
Deepseek Coder Introduction
----------------------------
Deepseek Coder is a series of code language models, each trained from scratch on 2T tokens with a composition of 87% code and 13% natural language in both English and Chinese. We provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on a project-level code corpus with a window size of 16K and an extra fill-in-the-blank task, supporting project-level code completion and infilling. Deepseek Coder achieves state-of-the-art performance among open-source code models on multiple programming languages and various benchmarks.
### Key Features
- **Massive Training Data:** Trained from scratch on 2T tokens, including 87% code and 13% linguistic data in both English and Chinese languages.
- **Highly Flexible & Scalable:** Offered in model sizes of 1.3B, 5.7B, 6.7B, and 33B, enabling users to choose the setup most suitable for their requirements.
- **Superior Model Performance:** State-of-the-art performance among publicly available code models on HumanEval, MultiPL-E, MBPP, DS-1000, and APPS benchmarks.
- **Advanced Code Completion Capabilities:** A window size of 16K and a fill-in-the-blank task, supporting project-level code completion and infilling tasks.
### Model Summary
- **deepseek-coder-5.7bmqa-base:** A 5.7B parameter model with Multi Query Attention, trained on 2 trillion tokens.
- **Home Page:** [DeepSeek](http://deepseek.com)
- **Repository:** [deepseek-ai/deepseek-coder](https://github.com/deepseek-ai/deepseek-coder)
- **Chat With DeepSeek Coder:** [DeepSeek-Coder](https://github.com/deepseek-ai/deepseek-coder/discussions)
### How to Use
This section provides examples of how to use the Deepseek Coder model for code completion, code insertion, and repository-level code completion tasks.
#### Code Completion
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()
input_text = "#write a quick sort algorithm"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
#### Code Insertion
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()
input_text = """<|begin|>def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    left = []
    right = []
<|hole|>
        if arr[i] < pivot:
            left.append(arr[i])
        else:
            right.append(arr[i])
    return quick_sort(left) + [pivot] + quick_sort(right)<|end|>"""
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(input_text):])
```
#### Repository Level Code Completion
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-5.7bmqa-base", trust_remote_code=True).cuda()
input_text = """#utils.py
import torch
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

def load_data():
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    # Standardize the data
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Convert numpy data to PyTorch tensors
    X_train = torch.tensor(X_train, dtype=torch.float32)
    X_test = torch.tensor(X_test, dtype=torch.float32)
    y_train = torch.tensor(y_train, dtype=torch.int64)
    y_test = torch.tensor(y_test, dtype=torch.int64)

    return X_train, X_test, y_train, y_test

def evaluate_predictions(y_test, y_pred):
    return accuracy_score(y_test, y_pred)
#model.py
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class IrisClassifier(nn.Module):
    def __init__(self):
        super(IrisClassifier, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(4, 16),
            nn.ReLU(),
            nn.Linear(16, 3)
        )

    def forward(self, x):
        return self.fc(x)

    def train_model(self, X_train, y_train, epochs, lr, batch_size):
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.Adam(self.parameters(), lr=lr)

        # Create DataLoader for batches
        dataset = TensorDataset(X_train, y_train)
        dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

        for epoch in range(epochs):
            for batch_X, batch_y in dataloader:
                optimizer.zero_grad()
                outputs = self(batch_X)
                loss = criterion(outputs, batch_y)
                loss.backward()
                optimizer.step()

    def predict(self, X_test):
        with torch.no_grad():
            outputs = self(X_test)
            _, predicted = outputs.max(1)
        return predicted.numpy()
#main.py
from utils import load_data, evaluate_predictions
from model import IrisClassifier as Classifier

def main():
    # Model training and evaluation
"""
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=140)
print(tokenizer.decode(outputs[0]))
```
License
-------
This code repository is licensed under the MIT License. The use of Deepseek Coder models is subject to the Model License. DeepSeek Coder supports commercial use.
See the [LICENSE-MODEL](https://github.com/deepseek-ai/deepseek-coder/blob/main/LICENSE-MODEL) for more details.
Contact
-------
If you have any questions, please raise an issue or contact us at [agi\_code@deepseek.com](mailto:agi_code@deepseek.com).
#### Suggested labels
#### { "key": "llm-experiments", "value": "Experiments and results related to Large Language Models" } { "key": "AI-Chatbots", "value": "Topics related to advanced chatbot platforms integrating multiple AI models" }
494: Awesome-Efficient-LLM: A curated list for Efficient Large Language Models
### Details
Similarity score: 0.88
- [ ] [horseee/Awesome-Efficient-LLM: A curated list for Efficient Large Language Models](https://github.com/horseee/Awesome-Efficient-LLM#inference-acceleration)
# Awesome-Efficient-LLM
A curated list for [Efficient Large Language Models](https://github.com/horseee/Awesome-Efficient-LLM):
- [Knowledge Distillation](#knowledge-distillation)
- [Network Pruning](#network-pruning)
- [Quantization](#quantization)
- [Inference Acceleration](#inference-acceleration)
- [Efficient MOE](#efficient-moe)
- [Text Compression](#text-compression)
- [Low-Rank Decomposition](#low-rank-decomposition)
- [Hardware/System Tuning](#hardwareSystem-tuning)
- [Survey](#survey)
- [Leaderboard](#leaderboard)
- [🚀 Updates](#updates)
- [Contributing](#contributing)
---
## Inference Acceleration
- …
- [Add your paper here](https://github.com/horseee/Awesome-Efficient-LLM/blob/main/generate_item.py), [generate the required format](https://github.com/horseee/Awesome-Efficient-LLM#decontributing), and submit a pull request.
---
## Updates
- **Sep 27, 2023:** Add tag for papers accepted at NeurIPS'23.
- **Sep 6, 2023:** Add a new subdirectory `project/` to organize those projects designed for developing a lightweight LLM.
- **July 11, 2023:** Create a new subdirectory `efficient_plm/` for papers that are applicable to PLMs (such as BERT, BART) but have yet to be verified for their effectiveness on LLMs.
---
## Contributing
If you'd like to include your paper or need to update any details, please feel free to submit a pull request. You can generate the required markdown format for each paper by filling in the information in `generate_item.py` and executing `python generate_item.py`. We warmly appreciate your contributions to this list. Alternatively, you can email me with the links to your paper and code, and I will add your paper to the list at my earliest convenience.
- URL: [https://github.com/horseee/Awesome-Efficient-LLM#inference-acceleration](https://github.com/horseee/Awesome-Efficient-LLM#inference-acceleration)
#### Suggested labels
#### { "label-name": "efficient-llm-acceleration", "description": "Inference acceleration techniques for efficient large language models.", "repo": "horseee/Awesome-Efficient-LLM", "confidence": 70.8 }