All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, Ollama, vLLM or uploaded to Hugging Face.
Unsloth supports | Free Notebooks | Performance | Memory use |
---|---|---|---|
Llama 3.2 (3B) | ▶️ Start for free | 2x faster | 60% less |
Llama 3.2 Vision (11B) | ▶️ Start for free | 2x faster | 40% less |
Llama 3.1 (8B) | ▶️ Start for free | 2x faster | 60% less |
Phi-3.5 (mini) | ▶️ Start for free | 2x faster | 50% less |
Gemma 2 (9B) | ▶️ Start for free | 2x faster | 63% less |
Qwen 2.5 (7B) | ▶️ Start for free | 2x faster | 63% less |
Mistral v0.3 (7B) | ▶️ Start for free | 2.2x faster | 73% less |
Ollama | ▶️ Start for free | 1.9x faster | 43% less |
ORPO | ▶️ Start for free | 1.9x faster | 43% less |
DPO Zephyr | ▶️ Start for free | 1.9x faster | 43% less |
📣 NEW! Vision models now supported! Llama 3.2 Vision (11B), Qwen 2.5 VL (7B) and Pixtral (12B) 2409
📣 NEW! Qwen-2.5 including Coder models are now supported with bugfixes. 14b fits in a Colab GPU! [Qwen 2.5 conversational notebook]
📣 NEW! We found and helped fix a gradient accumulation bug! Please update Unsloth and transformers.
📣 NEW! Mistral Small 22b notebook finetuning fits in under 16GB of VRAM!
📣 Try out Chat interface!
📣 NEW! Llama 3.1 8b, 70b & Mistral Nemo-12b both Base and Instruct are now supported
📣 NEW! pip install unsloth
now works! Head over to pypi to check it out! This allows non git pull installs. Use pip install unsloth[colab-new]
for non dependency installs.
📣 NEW! Continued Pretraining notebook for other languages like Korean!
📣 2x faster inference added for all our models
📣 We cut memory usage by a further 30% and now support 4x longer context windows!
Type | Links |
---|---|
📚 Documentation & Wiki | Read Our Docs |
Twitter (aka X) | Follow us on X |
💾 Installation | unsloth/README.md |
🥇 Benchmarking | Performance Tables |
🌐 Released Models | Unsloth Releases |
✍️ Blog | Read our Blogs |
1 A100 40GB | 🤗Hugging Face | Flash Attention | 🦥Unsloth Open Source | 🦥Unsloth Pro |
---|---|---|---|---|
Alpaca | 1x | 1.04x | 1.98x | 15.64x |
LAION Chip2 | 1x | 0.92x | 1.61x | 20.73x |
OASST | 1x | 1.19x | 2.17x | 14.83x |
Slim Orca | 1x | 1.18x | 2.22x | 14.82x |
Free Colab T4 | Dataset | 🤗Hugging Face | Pytorch 2.1.1 | 🦥Unsloth | 🦥 VRAM reduction |
---|---|---|---|---|---|
Llama-2 7b | OASST | 1x | 1.19x | 1.95x | -43.3% |
Mistral 7b | Alpaca | 1x | 1.07x | 1.56x | -13.7% |
Tiny Llama 1.1b | Alpaca | 1x | 2.06x | 3.87x | -73.8% |
DPO with Zephyr | Ultra Chat | 1x | 1.09x | 1.55x | -18.6% |
For stable releases, use pip install unsloth
. We recommend pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
for most installations though.
⚠️Only use Conda if you have it. If not, use Pip
. Select either pytorch-cuda=11.8,12.1
for CUDA 11.8 or CUDA 12.1. We support python=3.10,3.11,3.12
.
conda create --name unsloth_env \
python=3.11 \
pytorch-cuda=12.1 \
pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers \
-y
conda activate unsloth_env
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
⚠️Do **NOT** use this if you have Conda.
Pip is a bit more complex since there are dependency issues. The pip command is different for torch 2.2,2.3,2.4,2.5
and CUDA versions.
For other torch versions, we support torch211
, torch212
, torch220
, torch230
, torch240
and for CUDA versions, we support cu118
and cu121
and cu124
. For Ampere devices (A100, H100, RTX3090) and above, use cu118-ampere
or cu121-ampere
or cu124-ampere
.
For example, if you have torch 2.4
and CUDA 12.1
, use:
pip install --upgrade pip
pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"
Another example, if you have torch 2.5
and CUDA 12.4
, use:
pip install --upgrade pip
pip install "unsloth[cu124-torch250] @ git+https://github.com/unslothai/unsloth.git"
And other examples:
pip install "unsloth[cu121-ampere-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-ampere-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu118-torch240] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch230] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-ampere-torch230] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu121-torch250] @ git+https://github.com/unslothai/unsloth.git"
pip install "unsloth[cu124-ampere-torch250] @ git+https://github.com/unslothai/unsloth.git"
Or, run the below in a terminal to get the optimal pip installation command:
wget -qO- https://raw.githubusercontent.com/unslothai/unsloth/main/unsloth/_auto_install.py | python -
Or, run the below manually in a Python REPL:
try: import torch
except: raise ImportError('Install torch via `pip install torch`')
from packaging.version import Version as V
v = V(torch.__version__)
cuda = str(torch.version.cuda)
is_ampere = torch.cuda.get_device_capability()[0] >= 8
if cuda != "12.1" and cuda != "11.8" and cuda != "12.4": raise RuntimeError(f"CUDA = {cuda} not supported!")
if v <= V('2.1.0'): raise RuntimeError(f"Torch = {v} too old!")
elif v <= V('2.1.1'): x = 'cu{}{}-torch211'
elif v <= V('2.1.2'): x = 'cu{}{}-torch212'
elif v < V('2.3.0'): x = 'cu{}{}-torch220'
elif v < V('2.4.0'): x = 'cu{}{}-torch230'
elif v < V('2.5.0'): x = 'cu{}{}-torch240'
elif v < V('2.6.0'): x = 'cu{}{}-torch250'
else: raise RuntimeError(f"Torch = {v} too new!")
x = x.format(cuda.replace(".", ""), "-ampere" if is_ampere else "")
print(f'pip install --upgrade pip && pip install "unsloth[{x}] @ git+https://github.com/unslothai/unsloth.git"')
To run Unsloth directly on Windows:
dataset_num_proc=1
to avoid a crashing issue:
trainer = SFTTrainer(
dataset_num_proc=1,
...
)
For advanced installation instructions or if you see weird errors during installations:
torch
and triton
. Go to https://pytorch.org to install it. For example pip install torch torchvision torchaudio triton
nvcc
. If that fails, you need to install cudatoolkit
or CUDA drivers.xformers
manually. You can try installing vllm
and seeing if vllm
succeeds. Check if xformers
succeeded with python -m xformers.info
Go to https://github.com/facebookresearch/xformers. Another option is to install flash-attn
for Ampere GPUs.bitsandbytes
and check it with python -m bitsandbytes
from unsloth import FastLanguageModel
from unsloth import is_bfloat16_supported
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
max_seq_length = 2048 # Supports RoPE Scaling interally, so choose any!
# Get LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files = {"train" : url}, split = "train")
# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
"unsloth/mistral-7b-v0.3-bnb-4bit", # New Mistral v3 2x faster!
"unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
"unsloth/llama-3-8b-bnb-4bit", # Llama-3 15 trillion tokens model 2x faster!
"unsloth/llama-3-8b-Instruct-bnb-4bit",
"unsloth/llama-3-70b-bnb-4bit",
"unsloth/Phi-3-mini-4k-instruct", # Phi-3 2x faster!
"unsloth/Phi-3-medium-4k-instruct",
"unsloth/mistral-7b-bnb-4bit",
"unsloth/gemma-7b-bnb-4bit", # Gemma 2.2x faster!
] # More models at https://huggingface.co/unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/llama-3-8b-bnb-4bit",
max_seq_length = max_seq_length,
dtype = None,
load_in_4bit = True,
)
# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
model,
r = 16,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha = 16,
lora_dropout = 0, # Supports any, but = 0 is optimized
bias = "none", # Supports any, but = "none" is optimized
# [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
random_state = 3407,
max_seq_length = max_seq_length,
use_rslora = False, # We support rank stabilized LoRA
loftq_config = None, # And LoftQ
)
trainer = SFTTrainer(
model = model,
train_dataset = dataset,
dataset_text_field = "text",
max_seq_length = max_seq_length,
tokenizer = tokenizer,
args = TrainingArguments(
per_device_train_batch_size = 2,
gradient_accumulation_steps = 4,
warmup_steps = 10,
max_steps = 60,
fp16 = not is_bfloat16_supported(),
bf16 = is_bfloat16_supported(),
logging_steps = 1,
output_dir = "outputs",
optim = "adamw_8bit",
seed = 3407,
),
)
trainer.train()
# Go to https://github.com/unslothai/unsloth/wiki for advanced tips like
# (1) Saving to GGUF / merging to 16bit for vLLM
# (2) Continued training from a saved LoRA adapter
# (3) Adding an evaluation loop / OOMs
# (4) Customized chat templates
DPO (Direct Preference Optimization), PPO, Reward Modelling all seem to work as per 3rd party independent testing from Llama-Factory. We have a preliminary Google Colab notebook for reproducing Zephyr on Tesla T4 here: notebook.
We're in 🤗Hugging Face's official docs! We're on the SFT docs and the DPO docs!
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0" # Optional set GPU device ID
from unsloth import FastLanguageModel, PatchDPOTrainer
from unsloth import is_bfloat16_supported
PatchDPOTrainer()
import torch
from transformers import TrainingArguments
from trl import DPOTrainer
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/zephyr-sft-bnb-4bit",
max_seq_length = max_seq_length,
dtype = None,
load_in_4bit = True,
)
# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
model,
r = 64,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",],
lora_alpha = 64,
lora_dropout = 0, # Supports any, but = 0 is optimized
bias = "none", # Supports any, but = "none" is optimized
# [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
random_state = 3407,
max_seq_length = max_seq_length,
)
dpo_trainer = DPOTrainer(
model = model,
ref_model = None,
args = TrainingArguments(
per_device_train_batch_size = 4,
gradient_accumulation_steps = 8,
warmup_ratio = 0.1,
num_train_epochs = 3,
fp16 = not is_bfloat16_supported(),
bf16 = is_bfloat16_supported(),
logging_steps = 1,
optim = "adamw_8bit",
seed = 42,
output_dir = "outputs",
),
beta = 0.1,
train_dataset = YOUR_DATASET_HERE,
# eval_dataset = YOUR_DATASET_HERE,
tokenizer = tokenizer,
max_length = 1024,
max_prompt_length = 512,
)
dpo_trainer.train()
1 A100 40GB | 🤗Hugging Face | Flash Attention 2 | 🦥Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max | |
---|---|---|---|---|---|---|---|
Alpaca | 1x | 1.04x | 1.98x | 2.48x | 5.32x | 15.64x | |
code | Code | Code | Code | Code | |||
seconds | 1040 | 1001 | 525 | 419 | 196 | 67 | |
memory MB | 18235 | 15365 | 9631 | 8525 | |||
% saved | 15.74 | 47.18 | 53.25 |
Method | Bits | TGS | GRAM | Speed |
---|---|---|---|---|
HF | 16 | 2392 | 18GB | 100% |
HF+FA2 | 16 | 2954 | 17GB | 123% |
Unsloth+FA2 | 16 | 4007 | 16GB | 168% |
HF | 4 | 2415 | 9GB | 101% |
Unsloth+FA2 | 4 | 3726 | 7GB | 160% |