unslothai / unsloth

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

SFTTrainer doesn't work with some datasets due to column key error #252

Open JohnnyRacer opened 6 months ago

JohnnyRacer commented 6 months ago

Hello, I've been trying to use the SFTTrainer with the vicgalle/alpaca-gpt4 dataset. However, after prepping the dataset in the SFT format, I keep getting this error when I initialize the trainer:

File /usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py:3025, in Dataset.map(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint, desc)
   3023     missing_columns = set(remove_columns) - set(self._data.column_names)
   3024     if missing_columns:
-> 3025         raise ValueError(
   3026             f"Column to remove {list(missing_columns)} not in the dataset. Current columns in the dataset: {self._data.column_names}"
   3027         )
   3029 load_from_cache_file = load_from_cache_file if load_from_cache_file is not None else is_caching_enabled()
   3031 if fn_kwargs is None:

ValueError: Column to remove ['train'] not in the dataset. Current columns in the dataset: ['instruction', 'input', 'output', 'text']

However, the dataset only has the train split when I print it. This only occurs with some datasets, so I suspect this may be a bug.

danielhanchen commented 6 months ago

@JohnnyRacer Oh wait, I think you need to change train to text! What goes there is not the train or test split, but rather the column you want: ['instruction', 'input', 'output', 'text'] are your columns, and text is the one you want.
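For illustration, a minimal sketch of the split-vs-column distinction (assuming the same vicgalle/alpaca-gpt4 dataset as above; the variable names are just for the example):

from datasets import load_dataset

ds = load_dataset("vicgalle/alpaca-gpt4")
print(ds)                          # DatasetDict with a single "train" split
train_split = ds["train"]          # selecting the split, not a column
print(train_split.column_names)    # ['instruction', 'input', 'output', 'text'] -- these are the columns
print(train_split["text"][0])      # the formatted prompt string the trainer should consume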

JohnnyRacer commented 6 months ago

@danielhanchen Sorry, I don't really follow what you mean, since I already specified dataset_text_field="text" in the args when I initialized the SFTTrainer instance. If you don't mind, can you clarify what I need to alter? Here is the snippet I am trying to run, adapted from this example on HF:

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import FastLanguageModel
from datasets import load_dataset, ClassLabel

dataset_path = "vicgalle/alpaca-gpt4"
target_dataset = load_dataset(dataset_path) 
dataset = target_dataset["train"] # Has the columns: ['instruction', 'input', 'output', 'text']

model, tokenizer = FastLanguageModel.from_pretrained(model_name = "unsloth/mistral-7b", **load_cfg)  # load_cfg defined elsewhere
model = FastLanguageModel.get_peft_model(model, **lora_cfg)  # lora_cfg defined elsewhere

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text", # The 'text' column is already specified here
    max_seq_length = max_seq_length,  # defined elsewhere
    tokenizer = tokenizer,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 60,
        bf16 = True,
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_bnb_8bit"
    ),
)

trainer.train()

danielhanchen commented 6 months ago

@JohnnyRacer I'll check it out! :)

JohnnyRacer commented 6 months ago

@danielhanchen I think I have solved it: if I add packing=False to the SFTTrainer arguments, it seems to initialize and train fine.
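For reference, a sketch of the trainer call with that workaround applied (same hypothetical config variables as the snippet above; note that packing is an SFTTrainer argument, not a TrainingArguments field):

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    packing = False,  # disable example packing; avoids the column-removal error here
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 10,
        max_steps = 60,
        bf16 = True,
        logging_steps = 1,
        output_dir = "outputs",
        optim = "adamw_bnb_8bit"
    ),
)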

NPap0 commented 6 months ago

"This only occurs with some datasets, I suspect this may be a bug."

Hey, mind giving me an example of a dataset that works normally with the settings in your first message so I can reproduce?

JohnnyRacer commented 6 months ago

@OneCodeToRuleThemAll I don't actually remember the exact dataset that worked since I was just testing a bunch of my own. I think it's this one that worked. It seems like if the training split is generated automatically instead of being explicitly specified, then packing=False is required to make the dataset load correctly. Hope this helps.
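As a side note, one sketch of a way to sidestep the split/column confusion entirely (assuming the same dataset repo as above) is to ask load_dataset for the split directly:

from datasets import load_dataset

# Passing split="train" returns a Dataset rather than a DatasetDict,
# so there is no extra indexing step before handing it to SFTTrainer.
dataset = load_dataset("vicgalle/alpaca-gpt4", split="train")
print(dataset.column_names)  # ['instruction', 'input', 'output', 'text']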