train_on_responses_only doesn't map `eval_dataset`, breaking evaluation

fpgaminer commented 2 months ago

I followed the current Google Colab notebook for finetuning Llama 3.1 8B Instruct, which uses `train_on_responses_only`. `train_on_responses_only` adds a `labels` column to `trainer.train_dataset`. However, it doesn't process `eval_dataset`, so `eval_dataset` is left without a `labels` column. As a result, evaluation runs don't return any `eval_loss`, since the dataloader never provides labels.

Simply adding `trainer.eval_dataset = trainer.eval_dataset.map(_train_on_responses_only, batched = True)` to `train_on_responses_only` fixes the issue for me, and I get `eval_loss` again.

Perhaps train_on_responses_only should add that (behind an if check)?
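
To make the suggestion concrete, here is a minimal, dependency-free sketch of the guarded mapping. It is not the actual unsloth code: `ToyDataset`, `mask_prompt_tokens`, and `add_labels_to_trainer` are hypothetical stand-ins for a Hugging Face `Dataset`, the internal `_train_on_responses_only` function, and `train_on_responses_only` respectively, and the fixed `response_start` index is a simplification of the real response detection.

```python
from types import SimpleNamespace

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


class ToyDataset:
    """Hypothetical stand-in for a Hugging Face Dataset: a dict of columns plus a batched map."""

    def __init__(self, columns):
        self.columns = dict(columns)

    @property
    def column_names(self):
        return list(self.columns)

    def map(self, fn, batched=True):
        # Mimic a batched map: merge the columns returned by fn into a new dataset.
        assert batched, "this toy only supports batched mapping"
        return ToyDataset({**self.columns, **fn(self.columns)})


def mask_prompt_tokens(batch, response_start=2):
    """Toy masking function (assumption): tokens before `response_start` are the prompt."""
    labels = []
    for ids in batch["input_ids"]:
        n_masked = min(response_start, len(ids))
        labels.append([IGNORE_INDEX] * n_masked + ids[response_start:])
    return {"labels": labels}


def add_labels_to_trainer(trainer, mask_fn=mask_prompt_tokens):
    """Map the masking function over train_dataset and, only if one is set, eval_dataset."""
    trainer.train_dataset = trainer.train_dataset.map(mask_fn, batched=True)
    if getattr(trainer, "eval_dataset", None) is not None:
        trainer.eval_dataset = trainer.eval_dataset.map(mask_fn, batched=True)
    return trainer


# Usage with a fake trainer object that only carries the two dataset attributes.
trainer = SimpleNamespace(
    train_dataset=ToyDataset({"input_ids": [[1, 2, 3, 4]]}),
    eval_dataset=ToyDataset({"input_ids": [[5, 6, 7]]}),
)
trainer = add_labels_to_trainer(trainer)
print(trainer.eval_dataset.column_names)
```

The `is not None` guard is the "if check" I have in mind: trainers created without an `eval_dataset` are left untouched, while those with one get the same `labels` column as the training set.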

selalipop commented 2 months ago

Looks like a solution to https://github.com/unslothai/unsloth/issues/1019

danielhanchen commented 2 months ago

Oh nice catch @fpgaminer !! Just pushed a fix for it!

train_on_responses_only doesn't map `eval_dataset`, breaking evaluation #1041