unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Num examples of SFTTrainer decreased to 4862 from 109955 (original data) #524

Closed · skmanzg closed 5 months ago

skmanzg commented 5 months ago

This is my trial of corpus training in Unsloth. The model loading is the same as in the Unsloth example code.

[screenshot: model loading code, same as the Unsloth example]

I then changed r and alpha from the default 16 to 64, and added dropout (0.1).

model = FastLanguageModel.get_peft_model(
    model,
    r = 64,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0.1, 
    bias = "none",   
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,  
    loftq_config = None, 
)
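
For reference, with standard LoRA (use_rslora = False) the adapter update is scaled by lora_alpha / r, so 64/64 keeps the same effective scaling as the default 16/16 (rsLoRA would use lora_alpha / sqrt(r) instead):

scaling = lora_alpha / r   # 64 / 64 = 1.0, same ratio as the default 16 / 16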

The dataset (named combined_dataset) consists of a bunch of sentences, as shown by print("Dataset structure:", combined_dataset):

[screenshot: dataset structure output]

and I used the same code as the Unsloth example accordingly (train_dataset, dataset_text_field):

EOS_TOKEN = tokenizer.eos_token  # EOS marks sentence boundaries, which matters when packing

def formatting_func(example):
    # Return one training string per example, terminated by EOS.
    return example["sentence"] + EOS_TOKEN

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    train_dataset = combined_dataset,
    dataset_text_field = "sentence",
    tokenizer = tokenizer,
    max_seq_length = max_seq_length,
    packing = True, 
    formatting_func = formatting_func,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_ratio = 0.03,
        max_grad_norm = 1.0,
        num_train_epochs = 1,
        learning_rate = 2e-5,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.1,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "outputs",
    ),
)

and then when I run trainer_stats = trainer.train(), it shows that Num examples decreased.

[screenshot: trainer log showing Num examples = 4862]

but I did not notice this at first and waited for the result.

8552.5999 seconds used for training.
142.54 minutes used for training.
Peak reserved memory = 11.16 GB.
Peak reserved memory for training = 4.801 GB.
Peak reserved memory % of max memory = 23.477 %.
Peak reserved memory for training % of max memory = 10.1 %.

These are the wandb results, in case you need them.

[screenshots: wandb training metrics]

I cannot clearly say the model is well trained when I run inference as intended. As soon as I noticed the decreased num_examples, I re-ran all the code just in case. However, it shows the same decreased number (4862). Now I am not sure whether I did something wrong or whether it is a bug.

danielhanchen commented 5 months ago

@skmanzg Yes, packing = True essentially combines short and long sequences into one example, hence the count decreases.
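
If the packed count is 4862, the expected step count follows directly (a back-of-the-envelope sketch, assuming a single GPU and 1 epoch):

packed_examples = 4862          # what the trainer reports after packing
effective_batch = 2 * 4         # per_device_train_batch_size * gradient_accumulation_steps
steps = -(-packed_examples // effective_batch)   # ceil division -> 608 optimizer steps

All 109955 sentences are still consumed; they are just concatenated into 4862 fixed-length blocks first.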

skmanzg commented 5 months ago

> Yes, packing = True essentially combines short and long sequences into one example, hence the count decreases.

Would it be OK to say it still trained on all 109955 examples, then? One more question: can you link a source or explain in detail how packing works?

danielhanchen commented 5 months ago

@skmanzg https://huggingface.co/docs/trl/en/sft_trainer#packing-dataset--constantlengthdataset-
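
Conceptually, packing concatenates the tokenized texts and re-chunks them into fixed-length blocks. A minimal sketch of the idea (not TRL's actual ConstantLengthDataset, which also handles shuffling, buffering, and infinite iteration):

def pack(tokenized_texts, block_size):
    # Concatenate all token ids, then slice into equal-length training blocks.
    buffer, blocks = [], []
    for ids in tokenized_texts:
        buffer.extend(ids)
        while len(buffer) >= block_size:
            blocks.append(buffer[:block_size])
            buffer = buffer[block_size:]
    return blocks  # many short sentences -> far fewer max_seq_length-sized examples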

danielhanchen commented 5 months ago

I would turn it off to see if the results are better

skmanzg commented 5 months ago

@danielhanchen This is the result without packing.

[screenshots (2024-05-28): training loss curves without packing]

I had to reduce the LoRA size and change some parameters to keep the loss from merely oscillating. Although it may look less stable than the packed run, at least it used all the data... What do you think of this?

danielhanchen commented 5 months ago

Yes looks fine to me!

danielhanchen commented 5 months ago

probs increase grad accumulation steps to smooth out the loss
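
With this config, raising gradient_accumulation_steps means each optimizer step (and each logged loss) averages over more examples, which is why the curve smooths out. A rough sketch of the effect, assuming the settings above:

per_device_train_batch_size = 2
gradient_accumulation_steps = 16   # e.g. raised from 4
effective_batch = per_device_train_batch_size * gradient_accumulation_steps   # 32
# Each update now averages gradients over 32 examples instead of 8;
# total compute per epoch is unchanged, only the update granularity.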

skmanzg commented 5 months ago

Increasing grad accumulation might smooth out the loss? OK, thank you.

danielhanchen commented 5 months ago

Hmm, probs not - I would just increase grad accum

bhupendrathore commented 3 months ago

I am using packing = False but am still getting far fewer Num examples:

Map (num_proc=15): 100%|██████████| 198460/198460 [05:22<00:00, 615.72 examples/s]
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 210 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 78
 "-____-"     Number of trainable parameters = 20,766,720


trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    formatting_func = format_instruction,
    max_seq_length = max_seq_length,
    dataset_num_proc = 15,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        num_train_epochs = 3, # Set this for 1 full training run.
        # num_train_epochs = 5
        save_strategy = "steps",
        save_steps = 0.05,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        # bf16 = is_bfloat16_supported(),
        bf16 = True,
        warmup_steps = 10,
        logging_steps = 20,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "/clearml_agent_cache/storage_manager/bhupendra_workdir/gemma-2-2b-fintune-dir/checkpoints_gemma2b-2-050824/",
    ),
)

bhupendrathore commented 3 months ago

Hey, sorry, I fixed it. The problem was with my formatting function. The old one worked with batch_size = 1 when using TRL's SFTTrainer directly.

New formatting function:


def formatting_prompts_func(examples):
    # Batched mapping: build one formatted string per example.
    texts = []
    prompts = examples["prompt"]
    outputs = examples["selected_response"]

    for prompt, output in zip(prompts, outputs):
        text = f"""{prompt}\n\n{output}""" + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
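
Presumably (the wiring is not shown here) this is applied in the usual Unsloth notebook style, mapping over the dataset first and then pointing the trainer at the new "text" column:

dataset = dataset.map(formatting_prompts_func, batched = True)
# then pass dataset_text_field = "text" to SFTTrainer instead of formatting_func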

Problematic old function:

def format_instruction(sample):
    # Returns a one-element list per call; under batched mapping, sample["prompt"]
    # is a whole list of prompts, so each batch collapses into one garbled example.
    return [f"""{sample['prompt']}\n\n{sample['selected_response']}"""]
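
For anyone hitting the same thing: SFTTrainer maps the formatting function over batches, so sample["prompt"] arrives as a whole list of prompts, and the old function stringifies the batch into a single example. A minimal repro with a hypothetical two-row batch:

batch = {"prompt": ["p1", "p2"], "selected_response": ["r1", "r2"]}
print(format_instruction(batch))
# ["['p1', 'p2']\n\n['r1', 'r2']"]  <- one garbled string for the whole batch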