unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Why does Unsloth think I'm doing multi-GPU optimization when I'm not? #1240

Open brando90 opened 2 weeks ago

brando90 commented 2 weeks ago

Code:

'''
conda activate beyond_scale_2_unsloth
'''
import torch
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from unsloth import FastLanguageModel
from transformers import TrainingArguments
from pathlib import Path

from pdb import set_trace as st

opt_args = {
    'batch_size': 8,
    'learning_rate': 5e-2,
    'epochs': 1,
    'adam_epsilon': 1e-8,
    'weight_decay': 1e-4,
    'num_workers': 0,
    'break_early': False
}
hf_args = {'max_seq_length': 256, 'dataset_text_field': "text"}

# Set data type and device
torch_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float32
device = torch.device(f"cuda:{0}" if torch.cuda.is_available() else "cpu")
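# Note: torch.device("cuda:0") only picks the default device for .to(); it does not hide
# the node's other GPUs from the CUDA runtime, so a library that counts visible devices
# (as Unsloth appears to do) will still see all of them unless CUDA_VISIBLE_DEVICES is set.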

# Load model and tokenizer using Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
    # model_name="unsloth/Qwen2-1.5B",
    model_name="Qwen/Qwen2.5-Math-1.5B-Instruct",
    max_seq_length=hf_args['max_seq_length'],
    dtype=None,  # Auto-detection for Float16/BFloat16
    load_in_4bit=False,  # Set False if not using 4-bit precision
)

model = model.to(device)
tok = tokenizer
tok.pad_token = tok.eos_token if tok.pad_token_id is None else tok.pad_token

# Add LoRA adapters, targeting only `lm_head` for fine-tuning
st()  # pdb breakpoint left in while debugging
model = FastLanguageModel.get_peft_model(
    model=model,
    r=16,  # LoRA rank
    target_modules=["lm_head"],  # Only optimize `lm_head`
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

# Load dataset
dataset = load_dataset("stanfordnlp/imdb", split="train")

# Define training configuration
training_args = TrainingArguments(
    per_device_train_batch_size=opt_args['batch_size'],
    gradient_accumulation_steps=4,
    num_train_epochs=opt_args['epochs'],
    learning_rate=opt_args['learning_rate'],
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=1,
    optim="paged_adamw_32bit",
    weight_decay=opt_args['weight_decay'],
    output_dir="./tmp",
    report_to='none'
)

# Initialize the Trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field=hf_args['dataset_text_field'],
    max_seq_length=hf_args['max_seq_length'],
    args=training_args,
)

# Print norms before training to check only lm_head will change
print(f'{model.model.embed_tokens.weight.norm(2)=}')
print(f'{model.model.layers[14].self_attn.v_proj.weight.norm(2)=}')
print(f'{model.model.layers[14].mlp.down_proj.weight.norm(2)=}')
print(f'{model.lm_head.weight.norm(2)=}')

# Start training
trainer.train()

# Print norms after training to verify only lm_head changed
print(f'{model.model.embed_tokens.weight.norm(2)=}')
print(f'{model.model.layers[14].self_attn.v_proj.weight.norm(2)=}')
print(f'{model.model.layers[14].mlp.down_proj.weight.norm(2)=}')
print(f'{model.lm_head.weight.norm(2)=}')

print("Done!\a")

But I'm only using a single GPU (one A100)...

(beyond_scale_2_unsloth) brando9@ampere1~/beyond-scale-2-alignment-coeff $ python /lfs/ampere1/0/brando9/beyond-scale-2-alignment-coeff/experiments/bm/2024/11_november/week_4_8/train_unsloth_head_qwen2.py
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.10.7: Fast Qwen2 patching. Transformers = 4.46.1.
   \\   /|    GPU: NVIDIA A100-SXM4-80GB. Max memory: 79.138 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.5.1+cu124. CUDA = 8.0. CUDA Toolkit = 12.4.
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = True]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Traceback (most recent call last):
  File "/lfs/ampere1/0/brando9/beyond-scale-2-alignment-coeff/experiments/bm/2024/11_november/week_4_8/train_unsloth_head_qwen2.py", line 29, in <module>
    model, tokenizer = FastLanguageModel.from_pretrained(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale_2_unsloth/lib/python3.11/site-packages/unsloth/models/loader.py", line 332, in from_pretrained
    model, tokenizer = dispatch_model.from_pretrained(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale_2_unsloth/lib/python3.11/site-packages/unsloth/models/qwen2.py", line 87, in from_pretrained
    return FastLlamaModel.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lfs/ampere1/0/brando9/miniconda/envs/beyond_scale_2_unsloth/lib/python3.11/site-packages/unsloth/models/llama.py", line 1645, in from_pretrained
    raise RuntimeError('Unsloth currently does not support multi GPU setups - but we are working on it!')
RuntimeError: Unsloth currently does not support multi GPU setups - but we are working on it!
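
For reference, the check behind this error appears to be about how many CUDA devices the process can see, not about which device the script actually uses. A minimal way to inspect that with plain PyTorch (a sketch, not part of the original script):

import torch

# On a shared 8-GPU node, all devices are visible unless CUDA_VISIBLE_DEVICES restricts them.
print(torch.cuda.device_count())    # 8 on this machine; presumably Unsloth expects 1
print(torch.cuda.current_device())  # index of the default device the script would use
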
danielhanchen commented 2 weeks ago

Hm, that is very weird - is this a machine with multiple cards? Could you try nvidia-smi?

brando90 commented 2 weeks ago

(beyond_scale_2) @.***~/beyond-scale-2-alignment-coeff $ nvidia-smi
Tue Nov 5 08:57:13 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:07:00.0 Off |                    0 |
| N/A   54C    P0            223W /  400W |   75448MiB /  81920MiB |     97%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000000:0A:00.0 Off |                    0 |
| N/A   43C    P0             89W /  400W |   31490MiB /  81920MiB |     88%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  |   00000000:44:00.0 Off |                    0 |
| N/A   31C    P0             68W /  400W |    1031MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  |   00000000:4A:00.0 Off |                    0 |
| N/A   60C    P0            297W /  400W |   31514MiB /  81920MiB |     84%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          On  |   00000000:84:00.0 Off |                    0 |
| N/A   38C    P0             97W /  400W |   23790MiB /  81920MiB |     31%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          On  |   00000000:8A:00.0 Off |                    0 |
| N/A   37C    P0            105W /  400W |   71724MiB /  81920MiB |     96%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          On  |   00000000:C0:00.0 Off |                    0 |
| N/A   52C    P0            269W /  400W |   31518MiB /  81920MiB |     85%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          On  |   00000000:C3:00.0 Off |                    0 |
| N/A   55C    P0            237W /  400W |   60673MiB /  81920MiB |     88%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     412907      C   python                                      1018MiB |
|    0   N/A  N/A    2531149      C   python                                     74416MiB |
|    1   N/A  N/A       3611      C   ...nqduc/miniconda3/envs/lf/bin/python     30540MiB |
|    1   N/A  N/A    1534976      C   python                                       908MiB |
|    2   N/A  N/A    4165148      C   python                                      2482MiB |
|    3   N/A  N/A    2201035      C   python                                       848MiB |
|    3   N/A  N/A    4140397      C   ...nqduc/miniconda3/envs/lf/bin/python     30624MiB |
|    4   N/A  N/A    2174832      C   ...iconda3/envs/ampere1-env/bin/python      9328MiB |
|    4   N/A  N/A    2737509      C   python                                     14412MiB |
|    5   N/A  N/A     119688      C   python                                     43242MiB |
|    5   N/A  N/A     124733      C   python                                     28468MiB |
|    6   N/A  N/A     111759      C   ...nqduc/miniconda3/envs/lf/bin/python     30548MiB |
|    6   N/A  N/A    1488814      C   python                                       928MiB |
|    7   N/A  N/A    3185003      C   python                                     60650MiB |
+-----------------------------------------------------------------------------------------+

The error was also non-deterministic: I changed nothing in my code and it went away (at least for one run). I didn't try again afterwards, since lm_head wasn't LoRA-able, but it was definitely non-deterministic. Let me know how I can help. I think I attached the code.

Peter-Fy commented 1 week ago

I encountered the same issue on a single machine with multiple GPUs. I set os.environ["CUDA_VISIBLE_DEVICES"] = "1" at the beginning of the script to restrict the run to a single GPU, but it sometimes throws the following error:

RuntimeError: Unsloth currently does not support multi GPU setups - but we are working on it!

Without changing any code, rerunning it sometimes succeeds and sometimes fails. I believe this issue is the same as #983, and I hope it can be fixed as soon as possible.
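
One possible explanation, offered as a sketch under assumptions rather than a confirmed fix: CUDA_VISIBLE_DEVICES only takes effect if it is set before the CUDA runtime is initialized, so the safest ordering is to set it at the very top of the script, before importing torch, unsloth, or trl (or to export it in the shell before launching Python):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # must run before anything initializes CUDA

import torch  # imported only after the environment variable is set
from unsloth import FastLanguageModel  # unsloth likewise sees only the restricted device

assert torch.cuda.device_count() == 1  # only physical GPU 1 is visible, exposed as cuda:0

If the variable is already being set first and the failures persist, the intermittent behaviour may indeed be the same as reported in #983.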