Open · Decentblast opened this issue 2 months ago
I'll have to investigate this!
More observations to add here:
I simply swapped the model for the Hugging Face path, loading the Meta model with AutoModelForCausalLM (plus extra code for 4-bit quantization and the LoRA config), and still launched with accelerate on 4 GPUs. With that setup the logging correctly reflects the per-device batch size I set, and resuming from a checkpoint also works.
```python
import torch
from accelerate import PartialState
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

device_string = PartialState().process_index  # for the DDP device_map

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map=device_string,   # pin the whole model to this process's GPU
    torch_dtype=torch.float16,
)
model.config.use_cache = False  # disable the KV cache during training

peft_config = LoraConfig(
    lora_alpha=8,
    lora_dropout=0,
    r=4,
    bias="none",
    target_modules=["q_proj"],
    task_type="CAUSAL_LM",
)
```
Could it be because of this part? https://github.com/unslothai/unsloth/blob/main/unsloth/models/llama.py#L1632-L1643
```python
check_batches = """train_dataloader = self.get_train_dataloader()
ga = args.gradient_accumulation_steps
bsz = self._train_batch_size
total_batches = bsz * ga * args.world_size
n_total_devices = total_batches // ga // bsz
if n_total_devices > 1:
    logger.warning_once('Unsloth currently does not support multi GPU setups - but we are working on it!')
    divisor = n_total_devices / 1
    bsz = self._train_batch_size = max(int(bsz / divisor), 1)
    if total_batches // ga // bsz > 1:
        divisor = n_total_devices / 1
        ga = args.gradient_accumulation_steps = max(int(ga / divisor), 1)"""
```
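For reference, that arithmetic can be replayed outside the trainer; the helper name and the example values below are purely illustrative:

```python
# Replays the logic of the check_batches snippet above as a standalone function.
# The function name and the example numbers are illustrative only.
def shrink_batch_settings(bsz, ga, world_size):
    total_batches = bsz * ga * world_size
    n_total_devices = total_batches // ga // bsz  # effectively the number of processes
    if n_total_devices > 1:
        divisor = n_total_devices / 1
        bsz = max(int(bsz / divisor), 1)
        if total_batches // ga // bsz > 1:
            divisor = n_total_devices / 1
            ga = max(int(ga / divisor), 1)
    return bsz, ga

# per_device_train_batch_size=4, gradient_accumulation_steps=1:
print(shrink_batch_settings(4, 1, world_size=2))  # (2, 1) -- bsz halved
print(shrink_batch_settings(4, 1, world_size=4))  # (1, 1) -- bsz quartered
```

So whenever more than one process is detected, the value written into `self._train_batch_size` (and from there into `trainer_state.json`) ends up smaller than the configured `per_device_train_batch_size`, which would explain the mismatch hit when resuming.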
In my training script, I set per_device_train_batch_size = 4 in the TrainingArguments, but the train_batch_size saved in the trainer_state.json of each checkpoint is 2. When I try to resume from a checkpoint, it raises an error saying the batch size does not match, and the resume fails.
Here is the key part of the training script. I also use 4 GPUs with accelerate, so the launch command is:

```bash
accelerate launch --mixed_precision fp16 finetune_script.py
```
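Roughly, the trainer setup looks like the sketch below. This is not the exact script: the dataset handling, the use of trl's SFTTrainer, and several argument values are assumptions for illustration, but it shows where per_device_train_batch_size = 4 is set.

```python
# Sketch only -- not the exact script. Assumes trl's SFTTrainer with the
# 4-bit `model` and `peft_config` from the snippet earlier in the thread.
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

train_dataset = load_dataset("json", data_files="train.jsonl", split="train")  # placeholder dataset

training_args = TrainingArguments(
    output_dir="./model_output",       # matches the checkpoint path used for resuming
    per_device_train_batch_size=4,     # configured value; trainer_state.json later records 2
    gradient_accumulation_steps=1,     # assumption, not stated above
    save_steps=100,
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,                       # quantized model loaded above
    args=training_args,
    train_dataset=train_dataset,
    peft_config=peft_config,           # LoRA config from above
    dataset_text_field="text",         # assumes a "text" column (older trl API)
)
trainer.train()
```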
In the log, it also shows a wrong batch size per device:
Since train_batch_size: 2 is saved in trainer_state.json, I cannot run the following with the rest of the script kept the same:

```python
trainer.train(resume_from_checkpoint="./model_output/checkpoint-100")
```
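The saved value can be confirmed by reading the checkpoint's trainer_state.json directly (path taken from the resume call above):

```python
import json

# What the checkpoint recorded; this is the value the resume step compares
# against the configured per_device_train_batch_size.
with open("./model_output/checkpoint-100/trainer_state.json") as f:
    state = json.load(f)

print(state["train_batch_size"])  # prints 2 here instead of the configured 4
```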