@skmanzg Yes, packing = True essentially combines short and long sequences into one example, hence it decreases the reported number of examples.
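For intuition, here is a simplified sketch of the general idea (not Unsloth's or TRL's exact implementation): every example is tokenized, the token streams are concatenated, and the result is cut into blocks of max_seq_length, so several short examples end up inside one packed example and the example count shrinks.

def pack_examples(tokenized_examples, max_seq_length):
    # Concatenate all token lists into one stream, then slice it into
    # fixed-size blocks; each block becomes one "example" for the trainer.
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids)
    return [stream[i:i + max_seq_length]
            for i in range(0, len(stream), max_seq_length)]

# Made-up token IDs, just to show the effect on the example count.
short_seqs = [[1, 2], [3, 4, 5], [6], [7, 8, 9, 10]]
print(pack_examples(short_seqs, max_seq_length=4))
# [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]] -> 4 raw examples become 3 packed ones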
Would it be OK to say it trained on 109955 examples then? One more question: can you link the source or explain how packing works in detail?
I would turn it off to see if the results are better
@danielhanchen This is the result without packing.
I had to reduce the LoRA size and change parameters to avoid a loss that only oscillates. Although it may look less stable than the packed run, at least it used all of the data for each... What do you think of this?
Yes looks fine to me!
probs increase grad accumulation steps to smooth out the loss
Increasing gradient accumulation might smooth out the loss? OK, thank you.
Hmm, probs not - I would just increase grad accum.
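For context, each optimizer update averages the loss over per_device_train_batch_size × gradient_accumulation_steps × num_gpus examples, so raising gradient_accumulation_steps smooths the logged loss curve without needing more GPU memory. A rough sketch with illustrative values:

# Illustrative values only; raise gradient_accumulation_steps to average each
# optimizer step (and the logged loss) over more examples.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4      # e.g. try 8 or 16
num_gpus = 1

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch)  # 8 with these values; the larger this is, the smoother the loss curve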
I am using packing = False but am still getting a much smaller Num examples:
Map (num_proc=15): 100%|████████████████████████████████████████████████| 198460/198460 [05:22<00:00, 615.72 examples/s]
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 210 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 78
 "-____-"     Number of trainable parameters = 20,766,720
from unsloth import is_bfloat16_supported
from transformers import TrainingArguments
from trl import SFTTrainer

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    formatting_func = format_instruction,
    max_seq_length = max_seq_length,
    dataset_num_proc = 15,
    packing = False,  # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        num_train_epochs = 3,  # Set this for 1 full training run.
        # num_train_epochs = 5,
        save_strategy = "steps",
        save_steps = 0.05,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        # bf16 = is_bfloat16_supported(),
        bf16 = True,
        warmup_steps = 10,
        logging_steps = 20,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "/clearml_agent_cache/storage_manager/bhupendra_workdir/gemma-2-2b-fintune-dir/checkpoints_gemma2b-2-050824/",
    ),
)
Hey, sorry, I fixed it. The problem was with my formatting function; it used to work with batch_size = 1 with SFTTrainer directly in TRL.
New formatting function:
def formatting_prompts_func(examples):
    # EOS_TOKEN is assumed to be tokenizer.eos_token, as in the Unsloth examples.
    texts = []
    prompts = examples["prompt"]
    outputs = examples["selected_response"]
    for prompt, output in zip(prompts, outputs):
        text = f"""{prompt}\n\n{output}""" + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
Problematic old function:
def format_instruction(sample):
    return [f"""{sample['prompt']}\n\n{sample['selected_response']}"""]
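To make the difference concrete, here is a hypothetical two-row batch in the batched-columns form that SFTTrainer passes to formatting_func (the row values are made up, and the demo reuses the two functions above):

# Each column is a list with one entry per row.
batch = {"prompt": ["p1", "p2"], "selected_response": ["r1", "r2"]}

print(format_instruction(batch))
# ["['p1', 'p2']\n\n['r1', 'r2']"]  -> one string for the whole batch, so every
#                                      batch collapses into a single training example

print(formatting_prompts_func(batch)["text"])
# ['p1\n\nr1<eos>', 'p2\n\nr2<eos>']  -> one string per row ("<eos>" stands for EOS_TOKEN)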
This is my trial for corpus training in Unsloth. The model load is the same as in the Unsloth example code.
I then changed r and alpha from the default 16 to 64 and added dropout (0.1).
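For reference, a sketch of what that change looks like on top of the standard Unsloth call (the target modules and remaining arguments are assumed from the Unsloth example notebook; only r, lora_alpha, and lora_dropout differ from the defaults):

from unsloth import FastLanguageModel

model = FastLanguageModel.get_peft_model(
    model,
    r = 64,                       # raised from the default 16
    lora_alpha = 64,              # raised from the default 16
    lora_dropout = 0.1,           # dropout added
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)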
The dataset (named combined_dataset) consists of a bunch of sentences, as you can see:
print("Dataset structure:", combined_dataset)
and I used the same code from the Unsloth example accordingly (train_dataset, dataset_text_field).
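Roughly, that implies a trainer set up like the following sketch (based on the Unsloth example notebook; the actual hyperparameters of this run are not shown here, and the "text" column name is an assumption):

from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = combined_dataset,
    dataset_text_field = "text",      # assumed column name
    max_seq_length = max_seq_length,
    packing = True,                   # presumably enabled, given the decreased Num examples
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        learning_rate = 2e-4,
        output_dir = "outputs",
        seed = 3407,
    ),
)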
And then when I train with
trainer_stats = trainer.train()
it shows that Num examples has decreased, but I did not notice this and waited for the result.
This is the wandb result you might need.
I cannot clearly say the model is well-trained when I try inference as I intended. As soon as I noticed that num_examples had decreased, I re-ran all the code just in case. However, it shows the same decreased number (4862). Now I am not sure if I did something wrong or if it is a bug.