LostRuins opened 1 month ago
@LostRuins Apologies for the delay - it seems like it's saying the labels are nested? Would it be possible to print out maybe the first few rows of `trainer.train_dataset`? Thanks! Also, our Discord server can be more helpful for async help if that works!
Hi @danielhanchen, there are many rows; I've trimmed the output to show the format:

```
>>> trainer.train_dataset
Dataset({ features: ['input_ids', 'attention_mask', 'labels'], num_rows: 1835 })
>>> trainer.train_dataset[0]
{'input_ids': [1, 1595, 83779, 1877, 18746, ...], 'attention_mask': [1, 1, 1, 1, 1, ...], 'labels': [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 1877, 82236, 1321, 14969, 5978, ...]}
```
What other commands should I run?
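For reference, here's the kind of inspection I could run next - a rough sketch, assuming the `tokenizer` and `trainer` from the notebook are in scope:

```python
# Sketch: check how many tokens are masked (-100) vs. actually trained on,
# and decode the trained span to confirm only the response is kept.
row = trainer.train_dataset[0]
ids, labels = row["input_ids"], row["labels"]

masked = sum(1 for lab in labels if lab == -100)
trained = [tok for tok, lab in zip(ids, labels) if lab != -100]

print("masked tokens :", masked)
print("trained tokens:", len(trained))
print("trained text  :", tokenizer.decode(trained))
```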
Ok I'll check on my end and get back to you asap!
Did anyone find a fix?
Sadly no.
I am facing the same issue using `train_on_responses_only` with Qwen 2.5 7B, and the solution is using `DataCollatorForSeq2Seq` as the `data_collator`, as follows:
```python
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = joined_dataset["train"],
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    # ... remaining arguments unchanged
)
```
I found the above usage in the Llama 3.2 conversational notebook for the gradient accumulation fix. However, it looks like training with this `data_collator` takes more than 4X longer per training step; I assume it is due to the padding it performs. Currently it is much faster to train without `train_on_responses_only`, at least for my 32k+ context use case.
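If padding really is the bottleneck, one thing that might reduce it (a hedged sketch, not something I've verified on this setup) is letting the trainer bucket similar-length rows with `group_by_length`, so each batch pads less:

```python
from transformers import TrainingArguments

# Sketch: group_by_length batches rows of similar token length together,
# which should reduce the padding DataCollatorForSeq2Seq has to add.
# output_dir and the batch size here are placeholder values.
args = TrainingArguments(
    output_dir = "outputs",
    group_by_length = True,
    per_device_train_batch_size = 2,
)
```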
Thanks @marcelodiaz558, very helpful response! Yeah, using Seq2Seq is way too slow considering Unsloth should speed things up. I am currently at 0.01 it/s. I'll try getting rid of the `data_collator` argument tomorrow.
I'm trying to finetune Mistral-Nemo-Base-2407 with a `text` dataset of long inputs. Usually, the SFTTrainer will truncate it to fit the specified context size. However, I get an error when using `train_on_responses_only`. Running the same dataset without `train_on_responses_only` works fine and trains normally. Any help would be appreciated.
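In the meantime, a workaround sketch I'm considering (untested; assumes `tokenizer`, `max_seq_length`, and a `dataset` with a `text` column as in the notebooks): pre-filter rows that exceed the context length before training, so `train_on_responses_only` never sees sequences the trainer would have to truncate:

```python
# Hypothetical pre-filter: drop rows whose tokenized length exceeds
# max_seq_length, so truncation inside SFTTrainer never has to interact
# with the masking done by train_on_responses_only.
def fits_context(example):
    return len(tokenizer(example["text"]).input_ids) <= max_seq_length

dataset = dataset.filter(fits_context)
```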