unslothai / unsloth

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Unexpected train_batch_size in saved checkpoint file, causing training resume to fail #973

Open Decentblast opened 2 weeks ago

Decentblast commented 2 weeks ago

In my training script, I set per_device_train_batch_size = 4 in the TrainingArguments, but the train_batch_size in the trainer_state.json of each checkpoint is 2. When I try to resume from a checkpoint, it raises an error saying the batch sizes are not aligned, and the resume fails.

Warning: The following arguments do not match the ones in the trainer_state.json within the checkpoint directory:
2024-08-30T16:27:21.433929414Z per_device_train_batch_size: 4 (from args) != 2 (from trainer_state.json)
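For reference, trainer_state.json is plain JSON and stores a train_batch_size field, so the mismatch can be inspected directly. A minimal sketch, using the checkpoint path from my resume call below:

import json, os

checkpoint_dir = "./model_output/checkpoint-100"  # checkpoint I try to resume from
with open(os.path.join(checkpoint_dir, "trainer_state.json")) as f:
    state = json.load(f)

# Prints 2 here, even though per_device_train_batch_size=4 was configured.
print("train_batch_size in checkpoint:", state["train_batch_size"])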

Here is the key part of the training script. I also use 4 GPUs with accelerate, so the launch command is: accelerate launch --mixed_precision fp16 finetune_script.py

from unsloth import FastLanguageModel  # import Unsloth before transformers/TRL so its patches apply
import torch
from accelerate import PartialState
from transformers import TrainingArguments
from trl import SFTTrainer

# Each process launched by accelerate pins the model to its own GPU for DDP.
device_string = PartialState().process_index

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-Instruct-bnb-4bit",
    max_seq_length = 1024,
    dtype = torch.float16,
    load_in_4bit = True,
    device_map = {"": device_string},
)

model = FastLanguageModel.get_peft_model(
    model,
    r = 64, 
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, 
    bias = "none",  
    use_gradient_checkpointing = "unsloth", 
    random_state = 3407,
    max_seq_length = 1024,
    use_rslora = False, 
    loftq_config = None,
)
training_arguments = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
    save_steps=20,
    logging_steps=5,
    ...
    gradient_checkpointing = False,
    gradient_checkpointing_kwargs = {"use_reentrant": False},
    ddp_find_unused_parameters=False,
    optim = optim,
    fp16 = True,
)

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = training_arguments,
    packing=True
)
trainer.train()

The log also shows the wrong batch size per device (2 instead of the configured 4, so the total batch size is 8 = 2 × 4 GPUs instead of 16):

2024-08-30T17:03:12.208635032Z ==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 4
2024-08-30T17:03:12.208657727Z    \\   /|    Num examples = 10,787 | Num Epochs = 16
2024-08-30T17:03:12.208661952Z O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 1
2024-08-30T17:03:12.208665465Z \        /    Total batch size = 8 | Total steps = 21,574
2024-08-30T17:03:12.208668659Z  "-____-"     Number of trainable parameters = 167,772,160

Since train_batch_size: 2 is saved in trainer_state.json, I cannot run the following with the rest of the script kept the same: trainer.train(resume_from_checkpoint="./model_output/checkpoint-100")
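As a stopgap I could patch the saved value back to the configured one before resuming. This is only a workaround sketch and assumes nothing else in the checkpoint (optimizer/scheduler state, step counts) depends on the smaller batch size:

import json, os

state_path = os.path.join("./model_output/checkpoint-100", "trainer_state.json")
with open(state_path) as f:
    state = json.load(f)

# Overwrite the mismatched field with the value set in TrainingArguments.
state["train_batch_size"] = 4
with open(state_path, "w") as f:
    json.dump(state, f, indent=2)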

danielhanchen commented 2 weeks ago

I'll have to investigate this!

Decentblast commented 1 week ago

One more observation to add here: I switched the model to the Hugging Face Meta one loaded with AutoModelForCausalLM (plus extra code for 4-bit quantization and a LoRA config), still launching with accelerate on 4 GPUs. The logging then reflects the configured batch size per device correctly, and resuming also works.

import torch
from accelerate import PartialState
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

device_string = PartialState().process_index  # For DDP device_map

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map=device_string,
    torch_dtype=torch.float16,
)

model.config.use_cache = False

peft_config = LoraConfig(
    lora_alpha=8,
    lora_dropout=0,
    r=4,
    bias="none",
    target_modules=["q_proj"],
    task_type="CAUSAL_LM",
)
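The rest of that comparison script mirrors the Unsloth one. Roughly (a hypothetical reconstruction, reusing the same dataset and TrainingArguments as above), the LoRA adapter is applied by passing peft_config to SFTTrainer instead of calling FastLanguageModel.get_peft_model:

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = training_arguments,
    peft_config = peft_config,  # LoRA applied by TRL in this variant
    packing = True,
)
trainer.train()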

Decentblast commented 1 week ago

Could it be because of this part? https://github.com/unslothai/unsloth/blob/main/unsloth/models/llama.py#L1632-L1643

       check_batches = """train_dataloader = self.get_train_dataloader()
        ga  = args.gradient_accumulation_steps
        bsz = self._train_batch_size
        total_batches = bsz * ga * args.world_size
        n_total_devices = total_batches // ga // bsz
        if n_total_devices > 1:
            logger.warning_once('Unsloth currently does not support multi GPU setups - but we are working on it!')
            divisor = n_total_devices / 1
            bsz = self._train_batch_size = max(int(bsz / divisor), 1)
            if total_batches // ga // bsz > 1:
                divisor = n_total_devices / 1
                ga = args.gradient_accumulation_steps = max(int(ga / divisor), 1)"""
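Tracing that quoted logic standalone with this run's settings (per_device_train_batch_size=4, gradient_accumulation_steps=1, 4 GPUs; plain-Python stand-ins for the trainer attributes) illustrates how the configured per-device batch size gets overwritten:

# Standalone trace of the quoted check_batches logic with this run's settings.
bsz, ga, world_size = 4, 1, 4                  # batch size per device, grad accum, num GPUs
total_batches = bsz * ga * world_size          # 16
n_total_devices = total_batches // ga // bsz   # 4
if n_total_devices > 1:
    divisor = n_total_devices / 1              # 4.0
    bsz = max(int(bsz / divisor), 1)           # per-device batch size shrunk from 4 to 1
    if total_batches // ga // bsz > 1:         # 16 > 1, so grad accum is rescaled too
        divisor = n_total_devices / 1
        ga = max(int(ga / divisor), 1)         # stays at 1
print(bsz, ga)

Whatever the exact divisor in the installed version (the log shows 2 rather than 1), this branch rewrites self._train_batch_size, which would explain both the wrong value in the startup banner and the train_batch_size: 2 saved into trainer_state.json that later breaks the resume check.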