unslothai / unsloth

Finetune Llama 3, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

OutOfMemoryError for ORPO and DPO with llama-3-8b-bnb-4bit #682

Open ArvindSharma18 opened 1 month ago

ArvindSharma18 commented 1 month ago

I have followed the sample Colab with my custom dataset (< 100 samples). With the same configs as the sample Colab (loading the model in 4-bit, dtype as None, and the same PEFT and trainer configs), I hit an OutOfMemoryError. Even with a batch size of 1 and config changes like reducing the target modules, the same issue persists.

Environment: Google Colab T4 GPU
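
Model Loading (a sketch of the Colab's 4-bit setup; the exact notebook values below are assumptions, not copied from the issue):

from unsloth import FastLanguageModel

# Sketch of the Colab's loading step (values assumed for illustration).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = 2048,   # matches max_length used in the DPO trainer below
    dtype = None,            # auto-detected (float16 on a T4)
    load_in_4bit = True,     # 4-bit quantized weights via bitsandbytes
)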

Peft Config:

model = FastLanguageModel.get_peft_model(
    model,
    r = 8, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

DPO Config:

from trl import DPOTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    args = TrainingArguments(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        warmup_ratio = 0.1,
        num_train_epochs = 3,
        learning_rate = 2e-3,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.0,
        lr_scheduler_type = "linear",
        seed = 42,
        output_dir = "outputs",
    ),
    beta = 0.1,
    train_dataset = dataset,
    # eval_dataset = raw_datasets["test"],
    tokenizer = tokenizer,
    max_length = 2048,
    max_prompt_length = 1024,
)

Error Message for DPO:

OutOfMemoryError                          Traceback (most recent call last)
<ipython-input-10-864e6a7adbc3> in <cell line: 1>()
----> 1 dpo_trainer.train()

6 frames
/usr/local/lib/python3.10/dist-packages/trl/trainer/dpo_trainer.py in get_batch_logps(logits, labels, average_log_prob, label_pad_token_id, is_encoder_decoder)
    952         labels[labels == label_pad_token_id] = 0
    953 
--> 954         per_token_logps = torch.gather(logits.log_softmax(-1), dim=2, index=labels.unsqueeze(2)).squeeze(2)
    955 
    956         if average_log_prob:

OutOfMemoryError: CUDA out of memory. Tried to allocate 1.96 GiB. GPU 

The same OOM error was observed for ORPO.
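
As a back-of-envelope check (an assumption about the internals, not something stated in the thread), the 1.96 GiB allocation is consistent with materializing fp32 log-probabilities over Llama-3's ~128K-token vocabulary for the concatenated chosen + rejected sequences at max_length = 2048:

# Rough memory estimate for the logits.log_softmax(-1) tensor in the traceback.
# Assumptions: fp32 logits, Llama-3 vocab size of 128256, and DPO concatenating
# the chosen + rejected sequences, i.e. 2 sequences for per_device_train_batch_size = 1.
vocab_size = 128256
seq_len    = 2048   # max_length from the trainer config above
n_seqs     = 2      # chosen + rejected
bytes_fp32 = 4

print(f"{n_seqs * seq_len * vocab_size * bytes_fp32 / 2**30:.2f} GiB")  # -> 1.96 GiB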

danielhanchen commented 1 month ago

I'll check this out! So sorry about the issue!

ArvindSharma18 commented 4 weeks ago

Thanks for such a quick response, appreciate it!

masc-it commented 3 weeks ago

I am having the same issue on my local RTX A4000 rig, just trying PEFT on a 0.5B Qwen model... CUDA out of memory even though it's only using 3 GB of the 16 GB available.

Never mind, my issue is related to this.

ArvindSharma18 commented 3 weeks ago

Hello, any updates on this? I am very keen to try different alignment techniques using Unsloth!

danielhanchen commented 3 weeks ago

Much apologies, my bro and I relocated to SF, so I'm only just getting back to GitHub issues! Llama-3 in general has a much larger vocab size, so it might be OOMing for DPO / ORPO compared to Mistral. I could try to reduce VRAM usage further, but in the meantime I would advise reducing max_length = 2048 to something smaller, and similarly lowering max_prompt_length = 1024.
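
For example, against the trainer config posted above, the change would look roughly like this (the smaller values are illustrative, not prescriptive, and training_args just stands in for the TrainingArguments block posted earlier):

dpo_trainer = DPOTrainer(
    model = model,
    ref_model = None,
    args = training_args,       # same TrainingArguments as posted above
    beta = 0.1,
    train_dataset = dataset,
    tokenizer = tokenizer,
    max_length = 1024,          # was 2048 - shrinks the (seq_len x vocab) logits buffers
    max_prompt_length = 512,    # was 1024
)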

ArvindSharma18 commented 2 days ago

Hi, I have the same issue with max_length < 1000 and max_prompt_length = 512. I have also tried Gemma 2 (a bigger model), but again I was unable to run DPO or ORPO with minimal configs. I am really interested in running Llama 3 or Gemma with DPO and ORPO. Any guidance?

danielhanchen commented 2 days ago

Yes, I can reproduce it in a free Colab - it seems like there really is a lot of VRAM usage, hmm.