unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Why does Unsloth think I'm doing multi-GPU optimization when I'm not? #1240

Open brando90 opened 2 weeks ago

brando90 commented 2 weeks ago

Code:

'''
conda activate beyond_scale_2_unsloth
'''
import torch
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from unsloth import FastLanguageModel
from transformers import TrainingArguments
from pathlib import Path

from pdb import set_trace as st

opt_args = {
    'batch_size': 8,
    'learning_rate': 5e-2,
    'epochs': 1,
    'adam_epsilon': 1e-8,
    'weight_decay': 1e-4,
    'num_workers': 0,
    'break_early': False
}
hf_args = {'max_seq_length': 256, 'dataset_text_field': "text"}

# Set data type and device
torch_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float32
device = torch.device(f"cuda:{0}" if torch.cuda.is_available() else "cpu")
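# Note: torch.device("cuda:0") only picks the default device for .to(); it does not hide
# the node's other GPUs from the CUDA runtime, so a library that counts visible devices
# (as Unsloth appears to do) will still see all of them unless CUDA_VISIBLE_DEVICES is set.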

# Load model and tokenizer using Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
    # model_name="unsloth/Qwen2-1.5B",
    model_name="Qwen/Qwen2.5-Math-1.5B-Instruct",
    max_seq_length=hf_args['max_seq_length'],
    dtype=None,  # Auto-detection for Float16/BFloat16
    load_in_4bit=False,  # Set False if not using 4-bit precision
)

model = model.to(device)
tok = tokenizer
tok.pad_token = tok.eos_token if tok.pad_token_id is None else tok.pad_token

# Add LoRA adapters, targeting only `lm_head` for fine-tuning
st()  # pdb breakpoint left in while debugging
model = FastLanguageModel.get_peft_model(
    model=model,
    r=16,  # LoRA rank
    target_modules=["lm_head"],  # Only optimize `lm_head`
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

# Load dataset
dataset = load_dataset("stanfordnlp/imdb", split="train")

# Define training configuration
training_args = TrainingArguments(
    per_device_train_batch_size=opt_args['batch_size'],
    gradient_accumulation_steps=4,
    num_train_epochs=opt_args['epochs'],
    learning_rate=opt_args['learning_rate'],
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=1,
    optim="paged_adamw_32bit",
    weight_decay=opt_args['weight_decay'],
    output_dir="./tmp",
    report_to='none'
)

# Initialize the Trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field=hf_args['dataset_text_field'],
    max_seq_length=hf_args['max_seq_length'],
    args=training_args,
)

# Print norms before training to check only lm_head will change
print(f'{model.model.embed_tokens.weight.norm(2)=}')
print(f'{model.model.layers[14].self_attn.v_proj.weight.norm(2)=}')
print(f'{model.model.layers[14].mlp.down_proj.weight.norm(2)=}')
print(f'{model.lm_head.weight.norm(2)=}')

# Start training
trainer.train()

# Print norms after training to verify only lm_head changed
print(f'{model.model.embed_tokens.weight.norm(2)=}')
print(f'{model.model.layers[14].self_attn.v_proj.weight.norm(2)=}')
print(f'{model.model.layers[14].mlp.down_proj.weight.norm(2)=}')
print(f'{model.lm_head.weight.norm(2)=}')

print("Done!\a")

But I'm only using a single GPU (one A100)...

(beyond_scale_2_unsloth) brando9@ampere1~/beyond-scale-2-alignment-coeff $ python /lfs/ampere1/0/brando9/beyond-scale-2-alignment-coeff/experiments/bm/2024/11_november/week_4_8/train_unsloth_head_qwen2.py
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.10.7: Fast Qwen2 patching. Transformers = 4.46.1.
   \\   /|    GPU: NVIDIA A100-SXM4-80GB. Max memory: 79.138 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.1+cu124. CUDA = 8.0. CUDA Toolkit = 12.4.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = True]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Traceback (most recent call last):
  File "/lfs/ampere1/0/brando9/beyond-scale-2-alignment-coeff/experiments/bm/2024/11_november/week_4_8/train_unsloth_head_qwen2.py", line 29, in <module>
    model, tokenizer = FastLanguageModel.from_pretrained(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale_2_unsloth/lib/python3.11/site-packages/unsloth/models/loader.py", line 332, in from_pretrained
    model, tokenizer = dispatch_model.from_pretrained(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale_2_unsloth/lib/python3.11/site-packages/unsloth/models/qwen2.py", line 87, in from_pretrained
    return FastLlamaModel.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale_2_unsloth/lib/python3.11/site-packages/unsloth/models/llama.py", line 1645, in from_pretrained
    raise RuntimeError('Unsloth currently does not support multi GPU setups - but we are working on it!')
RuntimeError: Unsloth currently does not support multi GPU setups - but we are working on it!
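
For reference, the check behind this error appears to be about how many CUDA devices the process can see, not about which device the script actually uses. A minimal way to inspect that with plain PyTorch (a sketch, not part of the original script):

import torch

# On a shared 8-GPU node, all devices are visible unless CUDA_VISIBLE_DEVICES restricts them.
print(torch.cuda.device_count())    # 8 on this machine; presumably Unsloth expects 1
print(torch.cuda.current_device())  # index of the default device the script would use
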
danielhanchen commented 2 weeks ago

Hm, that is very weird - is this a machine with multiple cards? Could you try nvidia-smi?

brando90 commented 2 weeks ago

(beyond_scale_2) @.***~/beyond-scale-2-alignment-coeff $ nvidia-smi
Tue Nov 5 08:57:13 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:07:00.0 Off |                    0 |
| N/A   54C    P0            223W /  400W |   75448MiB /  81920MiB |     97%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000000:0A:00.0 Off |                    0 |
| N/A   43C    P0             89W /  400W |   31490MiB /  81920MiB |     88%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  |   00000000:44:00.0 Off |                    0 |
| N/A   31C    P0             68W /  400W |    1031MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  |   00000000:4A:00.0 Off |                    0 |
| N/A   60C    P0            297W /  400W |   31514MiB /  81920MiB |     84%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          On  |   00000000:84:00.0 Off |                    0 |
| N/A   38C    P0             97W /  400W |   23790MiB /  81920MiB |     31%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          On  |   00000000:8A:00.0 Off |                    0 |
| N/A   37C    P0            105W /  400W |   71724MiB /  81920MiB |     96%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          On  |   00000000:C0:00.0 Off |                    0 |
| N/A   52C    P0            269W /  400W |   31518MiB /  81920MiB |     85%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          On  |   00000000:C3:00.0 Off |                    0 |
| N/A   55C    P0            237W /  400W |   60673MiB /  81920MiB |     88%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     412907      C   python                                      1018MiB |
|    0   N/A  N/A    2531149      C   python                                     74416MiB |
|    1   N/A  N/A       3611      C   ...nqduc/miniconda3/envs/lf/bin/python     30540MiB |
|    1   N/A  N/A    1534976      C   python                                       908MiB |
|    2   N/A  N/A    4165148      C   python                                      2482MiB |
|    3   N/A  N/A    2201035      C   python                                       848MiB |
|    3   N/A  N/A    4140397      C   ...nqduc/miniconda3/envs/lf/bin/python     30624MiB |
|    4   N/A  N/A    2174832      C   ...iconda3/envs/ampere1-env/bin/python      9328MiB |
|    4   N/A  N/A    2737509      C   python                                     14412MiB |
|    5   N/A  N/A     119688      C   python                                     43242MiB |
|    5   N/A  N/A     124733      C   python                                     28468MiB |
|    6   N/A  N/A     111759      C   ...nqduc/miniconda3/envs/lf/bin/python     30548MiB |
|    6   N/A  N/A    1488814      C   python                                       928MiB |
|    7   N/A  N/A    3185003      C   python                                     60650MiB |
+-----------------------------------------------------------------------------------------+

The error was also non-deterministic: I changed nothing in my code and it went away (at least for one run). I didn't try again afterwards, since lm_head wasn't LoRA-able, but it was definitely non-deterministic. Let me know how I can help. I think I attached the code.

Peter-Fy commented 1 week ago

I encountered the same issue on a single machine with multiple GPUs. I set os.environ["CUDA_VISIBLE_DEVICES"] = "1" at the beginning of the script to restrict the run to a single GPU, but it sometimes throws the following error:

RuntimeError: Unsloth currently does not support multi GPU setups - but we are working on it!

Without changing any code, rerunning it sometimes succeeds and sometimes fails. I believe this issue is the same as #983, and I hope it can be fixed as soon as possible.
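
One possible explanation, offered as a sketch under assumptions rather than a confirmed fix: CUDA_VISIBLE_DEVICES only takes effect if it is set before the CUDA runtime is initialized, so the safest ordering is to set it at the very top of the script, before importing torch, unsloth, or trl (or to export it in the shell before launching Python):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # must run before anything initializes CUDA

import torch  # imported only after the environment variable is set
from unsloth import FastLanguageModel  # unsloth likewise sees only the restricted device

assert torch.cuda.device_count() == 1  # only physical GPU 1 is visible, exposed as cuda:0

If the variable is already being set first and the failures persist, the intermittent behaviour may indeed be the same as reported in #983.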