Closed: brainchen2020 closed this issue 1 month ago
Oh no, stalling is very bad - it probably means something in the GPU itself is going haywire. Does this happen often or rarely?
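If it hangs again, it might help to see where the process is actually stuck. Here is a minimal sketch (my assumption is that you can drop it near the top of your training script; nothing Unsloth-specific, just the standard library) that dumps Python stack traces while the step is stalled:

import faulthandler, sys

# Assumption: placed near the top of the training script, before trainer.train().
# If a step stalls, this prints every thread's Python stack trace to stderr every
# 5 minutes, which shows which call the process is actually stuck in.
faulthandler.dump_traceback_later(300, repeat = True, file = sys.stderr)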
This is the case every time, and each time it gets stuck at step 11.
Here is the test code: LLama_3_1.py.txt
Strange - I replaced the above test code with the following code and it worked!
from unsloth import FastLanguageModel
from unsloth import is_bfloat16_supported
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!

# Get LAION dataset
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files = {"train" : url}, split = "train")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    max_seq_length = max_seq_length,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 60,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_8bit",
        seed = 3407,
    ),
)
trainer.train()

# Go to https://github.com/unslothai/unsloth/wiki for advanced tips like
# (1) Saving to GGUF / merging to 16bit for vLLM
# (2) Continued training from a saved LoRA adapter
# (3) Adding an evaluation loop / OOMs
# (4) Customized chat templates
@brainchen2020 Sorry for the delay! OK, weird, hmm - it could be a dataset tokenization issue, maybe something going out of bounds.
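If you want to rule that out, a rough sketch like the one below could show whether any example tokenizes past the configured limit. It assumes the same `tokenizer` and `dataset` objects as the working script above, the same max_seq_length = 2048, and that the text lives in the "text" field (matching dataset_text_field):

# Rough length check (assumes `tokenizer` and `dataset` from the script above).
max_seq_length = 2048
lengths = [
    len(tokenizer(example["text"], add_special_tokens = True)["input_ids"])
    for example in dataset
]
print("longest example:", max(lengths), "tokens")
print("examples over max_seq_length:", sum(l > max_seq_length for l in lengths))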
It doesn't happen again after changing the code, so I'll close this now.
As shown in the image, at the 11th step of training, CUDA is inactive. I've tried several times, and it always gets stuck like this. Even after waiting for more than ten minutes, there is no progress.
code base: Llama 3.1 (8B) env: Windows 11 WSL2 2080 Ti
unsloth 2024.8 xformers 0.0.24 transformers 4.44.2 triton 2.2.0 torch 2.2.0 accelerate 0.34.2 bitsandbytes 0.43.3 peft 0.12.0
Hi~ I had the same problem, and my GPU is a 2080 Ti 22G too. I tried your new code, but it doesn't seem to work. Did you finally find out why?
And strangely enough, I'm stuck at step 11 too🤣