Closed: celsowm closed this issue 7 months ago
Hi @celsowm, thanks for the issue! This depends on the max_seq_length you have set in your SFTTrainer; I suspect the sentences in that dataset are too short. Can you share the full script you are using? Maybe decreasing the max_seq_length and batch_size would help here.
from datasets import load_dataset
from trl import SFTTrainer
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
import torch, sys

dataset = load_dataset("celsowm/auryn", split="train", download_mode="force_redownload")
dataset2 = load_dataset("philschmid/dolly-15k-oai-style", split="train")

print(dataset)
print(dataset2)
print(dataset['messages'][100])
print(dataset2['messages'][100])

model_name = 'mistralai/Mistral-7B-Instruct-v0.2'
tokenizer = AutoTokenizer.from_pretrained(model_name)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    use_cache=False,
    device_map="auto",
    torch_dtype=torch.float16
)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir='outputs/summarize',
    num_train_epochs=3,
    per_device_train_batch_size=1,   # Mistral is a large model, reduce batch size to avoid OOM
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,   # increase gradient accumulation to compensate for the small batch size
    gradient_checkpointing=True,
    bf16=True,
    optim="adamw_bnb_8bit",
    weight_decay=0.01,
    #save_steps=500,                 # save the model every 500 steps
    save_total_limit=3,              # keep only the last 3 checkpoints
)

trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    args=training_args,
    train_dataset=dataset,
    packing=True,
    max_seq_length=2048
)
I tried commenting out max_seq_length, but it still raises the error.
@younesbelkada it would be nice if the error message were more precise about the minimum needed.
Hi @celsowm, thanks for getting back!
Looking at the dataset page (https://huggingface.co/datasets/celsowm/auryn), the dataset looks really small, with only 500 examples. If you comment out max_seq_length,
SFTTrainer will use the default value, which is 2048 as well. Can you try setting it to 512 / 256 instead? The other solution is to not use packing at all.
On another note, I agree we should enhance the error message; I'll work on that in the next few days.
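For reference, a minimal sketch of the two suggestions above, reusing the objects from the script earlier in the thread (the 512 value is just the example mentioned in the comment):

# Option 1: pack into shorter sequences so the small dataset can fill at least one packed block
trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    args=training_args,
    train_dataset=dataset,
    packing=True,
    max_seq_length=512,
)

# Option 2: disable packing entirely; each row then becomes its own training example
trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    args=training_args,
    train_dataset=dataset,
    packing=False,
)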
I discovered what happened:
One of the dicts had None as the value of the "content" key (I blame the Google Gemini API for that).
For some reason, SFTTrainer hides this information and instead reports "Make sure that your dataset has enough samples to at least yield one packed sequence".
So the error message does not make sense, because it is not a problem with the number of dataset samples but an error in one specific row.
And yes, I fixed my dataset, so if you want to reproduce the error you need to add something like:
[ { "content": "Você é um assistente que responde precisamente a perguntas.", "role": "system" }, { "content": "Qual é o prato tradicional mais famoso de Auryn?", "role": "user" }, { "content": None, "role": "assistant" } ]
Thanks @celsowm - indeed, what you said looks correct; I can update the error message! Can you help me by writing a simple end-to-end script to reproduce the error?
I discovered two things.
from datasets import Dataset, load_dataset
from trl import SFTTrainer
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
import torch, sys
model_name = 'recogna-nlp/bode-7b-alpaca-pt-br-no-peft'
tokenizer = AutoTokenizer.from_pretrained(model_name)
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
use_cache=False,
device_map="auto",
torch_dtype=torch.float16
)
peft_config = LoraConfig(
lora_alpha=16,
lora_dropout=0.1,
r=64,
bias="none",
task_type="CAUSAL_LM",
)
training_args = TrainingArguments(
output_dir='outputs/test_fail_sft',
num_train_epochs=3, # number of training epochs
per_device_train_batch_size=3, # batch size per device during training
gradient_accumulation_steps=2, # number of steps before performing a backward/update pass
gradient_checkpointing=True, # use gradient checkpointing to save memory
optim="adamw_torch_fused", # use fused adamw optimizer
logging_steps=10, # log every 10 steps
save_strategy="epoch", # save checkpoint every epoch
learning_rate=2e-4, # learning rate, based on QLoRA paper
bf16=True, # use bfloat16 precision
tf32=True, # use tf32 precision
max_grad_norm=0.3, # max gradient norm based on QLoRA paper
warmup_ratio=0.03, # warmup ratio based on QLoRA paper
lr_scheduler_type="constant",
overwrite_output_dir=True,
)
for i in range(1, 600):
    messages = [
        {"content": "Você é um bot que responde perguntas", "role": "system"},
        {"content": "Você já foi até Auryn?", "role": "user"},
        {"content": "Não", "role": "assistant"}
    ]
    dataset = Dataset.from_dict({"messages": [messages] * i})
    try:
        trainer = SFTTrainer(
            model=model,
            peft_config=peft_config,
            args=training_args,
            train_dataset=dataset,
            packing=True
        )
    except ValueError as e:
        print(f"no success at {i} rows: {e}")
    else:
        print(f"Success at {i} rows !!!!!!")
It's crazy, but: Success at 21 rows !!!!!! Why? Maybe 21 is not a minor anymore, hahaha
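For what it's worth, a plausible explanation (an assumption about how packing works, not something verified in this thread) is that packing only yields a training example once the concatenated, tokenized rows reach max_seq_length tokens in total, so the threshold scales with max_seq_length divided by the tokens per row. A very rough estimate, reusing the tokenizer and messages defined in the script above (SFTTrainer's exact chat formatting will add extra tokens, so this undercounts):

row_text = " ".join(turn["content"] for turn in messages)
tokens_per_row = len(tokenizer(row_text)["input_ids"])
max_seq_length = 2048  # the default value mentioned earlier in the thread
print(f"~{tokens_per_row} raw tokens per row -> roughly {max_seq_length // tokens_per_row + 1} rows to fill one packed block")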
from datasets import Dataset, load_dataset
from trl import SFTTrainer
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
import torch, sys
model_name = 'recogna-nlp/bode-7b-alpaca-pt-br-no-peft'
tokenizer = AutoTokenizer.from_pretrained(model_name)
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
use_cache=False,
device_map="auto",
torch_dtype=torch.float16
)
peft_config = LoraConfig(
lora_alpha=16,
lora_dropout=0.1,
r=64,
bias="none",
task_type="CAUSAL_LM",
)
training_args = TrainingArguments(
output_dir='outputs/test_fail_sft',
num_train_epochs=3, # number of training epochs
per_device_train_batch_size=3, # batch size per device during training
gradient_accumulation_steps=2, # number of steps before performing a backward/update pass
gradient_checkpointing=True, # use gradient checkpointing to save memory
optim="adamw_torch_fused", # use fused adamw optimizer
logging_steps=10, # log every 10 steps
save_strategy="epoch", # save checkpoint every epoch
learning_rate=2e-4, # learning rate, based on QLoRA paper
bf16=True, # use bfloat16 precision
tf32=True, # use tf32 precision
max_grad_norm=0.3, # max gradient norm based on QLoRA paper
warmup_ratio=0.03, # warmup ratio based on QLoRA paper
lr_scheduler_type="constant",
overwrite_output_dir=True,
)
messages = [
{"content": "Você é um bot que responde perguntas", "role": "system" },
{"content": "Você já foi até Auryn?", "role": "user" },
{"content": "Não", "role": "assistant" }
]
min_dataset = 21
dataset = Dataset.from_dict({"messages": [messages] * min_dataset})
#no error:
trainer = SFTTrainer(
model=model,
peft_config=peft_config,
args=training_args,
train_dataset=dataset,
packing=True
)
# trigger the error:
# copy each turn so that mutating the first row does not also mutate the other rows
new_messages = [[dict(turn) for turn in messages] for _ in range(min_dataset)]
new_messages[0][-1]['content'] = None  # a single None "content" value triggers the error
dataset = Dataset.from_dict({"messages": new_messages})
trainer = SFTTrainer(
model=model,
peft_config=peft_config,
args=training_args,
train_dataset=dataset,
packing=True
)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Hi! I am trying to use this small dataset:
dataset = load_dataset("celsowm/auryn", split="train", download_mode="force_redownload")
with SFTTrainer, and I got this error:
ValueError: Error occurred while packing the dataset. Make sure that your dataset has enough samples to at least yield one packed sequence.
So, how many "samples" are enough? It is not clear from the message.