huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

Error: Make sure that your dataset has enough samples to at least yield one packed sequence #1337

Closed: celsowm closed this issue 7 months ago

celsowm commented 8 months ago

Hi! I am trying to use this small dataset:

dataset = load_dataset("celsowm/auryn", split="train", download_mode="force_redownload")

with SFTTrainer,

and I got this error:

ValueError: Error occurred while packing the dataset. Make sure that your dataset has enough samples to at least yield one packed sequence.

So, how many "samples" are enough? It is not clear from the message.

younesbelkada commented 8 months ago

Hi @celsowm, thanks for the issue! This depends on the max_seq_length you have set in your SFTTrainer; I suspect the sentences in that dataset are too short. Can you share the full script you are using? Maybe decreasing the seqlen & batch_size would help here.
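
For context, packing has to accumulate at least one full sequence of max_seq_length tokens out of the whole dataset, so a tiny dataset with short sentences can fail. A rough sanity check might look like this (a sketch; the texts list and the plain-text tokenization are stand-ins, not the exact formatting TRL applies):

from transformers import AutoTokenizer

# Back-of-the-envelope check (assumption: packing concatenates the tokenized
# examples and slices them into chunks of max_seq_length tokens, so the whole
# dataset must contain at least max_seq_length tokens to yield one sequence).
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
max_seq_length = 2048

texts = ["short example 1", "short example 2"]  # stand-ins for the dataset rows
total_tokens = sum(len(tokenizer(t)["input_ids"]) for t in texts)
print(total_tokens, total_tokens >= max_seq_length)  # False -> the ValueError above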

celsowm commented 8 months ago


from datasets import load_dataset
from trl import SFTTrainer
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
import torch, sys

dataset = load_dataset("celsowm/auryn", split="train", download_mode="force_redownload")
dataset2 = load_dataset("philschmid/dolly-15k-oai-style", split="train")

print(dataset)
print(dataset2)
print(dataset['messages'][100])
print(dataset2['messages'][100])

# Specify the model
model_name = 'mistralai/Mistral-7B-Instruct-v0.2'

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    use_cache=False,
    device_map="auto",
    torch_dtype=torch.float16
)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir='outputs/summarize',
    num_train_epochs=3,
    per_device_train_batch_size=1,     # Mistral is a large model, reduce batch size to avoid OOM
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,     # increase gradient accumulation to compensate for the small batch size
    gradient_checkpointing=True,
    bf16=True,
    optim="adamw_bnb_8bit",
    warmup_steps=500,
    weight_decay=0.01,
    #save_steps=500,                   # save the model every 500 steps
    save_total_limit=3,                # keep only the last 3 checkpoints
)

trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    args=training_args,
    train_dataset=dataset,
    packing=True,
    max_seq_length=2048
)



I tried commenting out max_seq_length, but it still gives the error.
@younesbelkada it would be nice if the error message were more precise about the minimum needed.

younesbelkada commented 8 months ago

Hi @celsowm, thanks for getting back! Looking at the dataset page https://huggingface.co/datasets/celsowm/auryn, the dataset looks really small, with only 500 examples. If you comment out max_seq_length, SFTTrainer will use the default value, which is 2048 as well. Can you try setting it to 512 / 256 instead? The other solution is to not use packing at all. On another note, I agree we should enhance the error message; I'll work on that in the next days.
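
For reference, either suggestion is a small change to the trainer call in the script above (a sketch reusing model, peft_config, training_args, and dataset from that script; 512 is just an example value):

# option 1: keep packing but with a smaller sequence length
trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    args=training_args,
    train_dataset=dataset,
    packing=True,
    max_seq_length=512,
)

# option 2: drop packing entirely
trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    args=training_args,
    train_dataset=dataset,
    packing=False,
)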

celsowm commented 8 months ago


I discovered what happened:

One of the dicts had the value None for the key "content" (I blame the Google Gemini API for that).

For some reason, SFTTrainer hides this info/error and raises the "Make sure that your dataset has enough samples to at least yield one packed sequence" message instead.

So the error message does not make sense, because it is not a "dataset samples problem" but an error in a specific row.

And yes, I fixed my dataset, so if you want to reproduce the error you need to add something like:

[ { "content": "Você é um assistente que responde precisamente a perguntas.", "role": "system" }, { "content": "Qual é o prato tradicional mais famoso de Auryn?", "role": "user" }, { "content": None, "role": "assistant" } ]
younesbelkada commented 8 months ago

Thanks @celsowm - indeed, what you said looks correct; I can update the error message! Can you help me by writing a simple end-to-end script to repro the error?

celsowm commented 8 months ago


I discovered two things.

The first approach:

from datasets import Dataset, load_dataset
from trl import SFTTrainer
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
import torch, sys

model_name = 'recogna-nlp/bode-7b-alpaca-pt-br-no-peft'
tokenizer = AutoTokenizer.from_pretrained(model_name)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    use_cache=False,
    device_map="auto",
    torch_dtype=torch.float16
)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir='outputs/test_fail_sft',
    num_train_epochs=3,                     # number of training epochs
    per_device_train_batch_size=3,          # batch size per device during training
    gradient_accumulation_steps=2,          # number of steps before performing a backward/update pass
    gradient_checkpointing=True,            # use gradient checkpointing to save memory
    optim="adamw_torch_fused",              # use fused adamw optimizer
    logging_steps=10,                       # log every 10 steps
    save_strategy="epoch",                  # save checkpoint every epoch
    learning_rate=2e-4,                     # learning rate, based on QLoRA paper
    bf16=True,                              # use bfloat16 precision
    tf32=True,                              # use tf32 precision
    max_grad_norm=0.3,                      # max gradient norm based on QLoRA paper
    warmup_ratio=0.03,                      # warmup ratio based on QLoRA paper
    lr_scheduler_type="constant",
    overwrite_output_dir=True,
)

for i in range(1, 600):

    messages = [
        {"content": "Você é um bot que responde perguntas", "role": "system" },
        {"content": "Você já foi até Auryn?", "role": "user" },
        {"content": "Não", "role": "assistant" }
    ]

    dataset = Dataset.from_dict({"messages": [messages] * i})

    try:
        trainer = SFTTrainer(
            model=model,
            peft_config=peft_config,
            args=training_args,
            train_dataset=dataset,
            packing=True
        )
    except ValueError as e:
        print(f"no success at {i} rows {e}")
    else:
        print(f"Success at {i} rows !!!!!!")

It is crazy, but "Success at 21 rows !!!!!!". Why? Maybe 21 is not a minor anymore, hahaha.
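
If packing needs at least max_seq_length tokens in total, 21 is probably just the first size where the concatenated rows reach that budget. A back-of-the-envelope reading (a sketch; tokens_per_row = 100 is a made-up figure for one chat-templated row, not a measured value):

import math

# packing needs roughly max_seq_length tokens in total, so the minimum number
# of rows is about ceil(max_seq_length / tokens_per_row)
max_seq_length = 2048   # the default mentioned earlier in this thread
tokens_per_row = 100    # hypothetical token count for one templated row
print(math.ceil(max_seq_length / tokens_per_row))  # 21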

The second approach:

from datasets import Dataset, load_dataset
from trl import SFTTrainer
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
import torch, sys

model_name = 'recogna-nlp/bode-7b-alpaca-pt-br-no-peft'
tokenizer = AutoTokenizer.from_pretrained(model_name)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    use_cache=False,
    device_map="auto",
    torch_dtype=torch.float16
)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir='outputs/test_fail_sft',
    num_train_epochs=3,                     # number of training epochs
    per_device_train_batch_size=3,          # batch size per device during training
    gradient_accumulation_steps=2,          # number of steps before performing a backward/update pass
    gradient_checkpointing=True,            # use gradient checkpointing to save memory
    optim="adamw_torch_fused",              # use fused adamw optimizer
    logging_steps=10,                       # log every 10 steps
    save_strategy="epoch",                  # save checkpoint every epoch
    learning_rate=2e-4,                     # learning rate, based on QLoRA paper
    bf16=True,                              # use bfloat16 precision
    tf32=True,                              # use tf32 precision
    max_grad_norm=0.3,                      # max gradient norm based on QLoRA paper
    warmup_ratio=0.03,                      # warmup ratio based on QLoRA paper
    lr_scheduler_type="constant",
    overwrite_output_dir=True,
)

messages = [
        {"content": "Você é um bot que responde perguntas", "role": "system" },
        {"content": "Você já foi até Auryn?", "role": "user" },
        {"content": "Não", "role": "assistant" }
    ]

min_dataset = 21
dataset = Dataset.from_dict({"messages": [messages] * min_dataset})

# no error:

trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    args=training_args,
    train_dataset=dataset,
    packing=True
)

# trigger the error:

new_messages = [messages] * min_dataset
new_messages[0][-1]['content'] = None  # this causes the error (note: [messages] * min_dataset repeats references to the same dicts, so every row ends up with content=None)
dataset = Dataset.from_dict({"messages": new_messages})

trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    args=training_args,
    train_dataset=dataset,
    packing=True
)
github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.