JohnnyRacer opened 6 months ago
@JohnnyRacer Oh wait, you need to change train to text, I think! It's not the train or test split, but rather the column that you want: ['instruction', 'input', 'output', 'text'] are your columns, and text is the column you want.
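To illustrate the split-vs-column distinction, here is a self-contained sketch with plain dicts standing in for the DatasetDict (the row contents are made up, so no download is needed): the top-level keys are splits, and the keys inside a split are columns.

```python
# Hypothetical stand-in for what load_dataset("vicgalle/alpaca-gpt4") returns:
# a mapping of split names to data, where each split has named columns.
dataset_dict = {
    "train": {  # split name -- what target_dataset["train"] selects
        "instruction": ["Give three tips for staying healthy."],
        "input": [""],
        "output": ["1. Eat a balanced diet. ..."],
        "text": ["Below is an instruction that describes a task. ..."],
    }
}

splits = list(dataset_dict.keys())            # split names, e.g. ['train']
columns = list(dataset_dict["train"].keys())  # column names within the split

print(splits)   # ['train']
print(columns)  # ['instruction', 'input', 'output', 'text']
```

So "train" selects a split, while "text" names a column inside that split; dataset_text_field expects the latter.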
@danielhanchen Sorry, I don't really follow what you mean, since I already specified dataset_text_field="text" in the args when I initialized the SFTTrainer instance. If you don't mind, can you clarify what I need to alter? Here is the snippet I am trying to run, adapted from this example on HF:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import FastLanguageModel
from datasets import load_dataset, ClassLabel

dataset_path = "vicgalle/alpaca-gpt4"
target_dataset = load_dataset(dataset_path)
dataset = target_dataset["train"]  # Has the columns: ['instruction', 'input', 'output', 'text']

model, tokenizer = FastLanguageModel.from_pretrained(model_name = "unsloth/mistral-7b", **load_cfg)
model = FastLanguageModel.get_peft_model(model, **lora_cfg)

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",  # I already specified the 'text' column here
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 60,
        bf16 = True,
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_bnb_8bit",
    ),
)
trainer.train()
@JohnnyRacer I'll check it out! :)
@danielhanchen I think I have solved it: if I add packing=False to the trainer arguments, the SFTTrainer seems to initialize and train fine.
This only occurs with some datasets; I suspect this may be a bug.

Hey, mind giving me an example of a dataset that works normally with the settings in your first message so I can reproduce?
@OneCodeToRuleThemAll I don't actually remember the exact dataset that worked, since I was just testing a bunch of my own. I think it's this one that worked. It seems that if the training split is generated automatically instead of being explicitly specified, then packing=False is required to make the dataset load correctly. Hope this helps.
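For reference, a minimal sketch of the workaround described above. This is only a fragment, reusing the model, dataset, tokenizer, and max_seq_length names from the earlier snippet (so it assumes those are already defined, along with a TrainingArguments instance):

```python
# Sketch of the workaround: pass packing=False when constructing the
# SFTTrainer, alongside the dataset_text_field setting from before.
trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    packing = False,  # disable example packing to work around the init error
    args = training_args,  # the TrainingArguments from the earlier snippet
)
```

Note that packing is an SFTTrainer constructor argument, not a TrainingArguments field.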
Hello, I've been trying to use the SFTTrainer with the vicgalle/alpaca-gpt4 dataset. However, after prepping the dataset in the SFT format, I keep getting this error when I initialize the trainer. The dataset only has the train split when I print it. This only occurs with some datasets; I suspect this may be a bug.