Recommendations for obtaining validation dataset loss after each epoch

dcsuka commented 1 month ago

For finetuning using a custom dataset, message converter function, and csv column format, how do we obtain validation losses on a separate csv with the same format at the end of each epoch? Do we need to wait until after training to run on all the checkpointed files?

Also, how can we generate outputs using the same message converter function and tune run generate, using a csv file with a single row as input?

pbontrager commented 1 month ago

None of our recipes currently support doing a validation check each epoch. The easiest way to get this functionality would be to copy the recipe you want, duplicate _setup_data with something like _setup_val_data, call it, and then setup a for loop after the training for loop to run validation.

You can also make a request for this to be in our recipes by default, but we'd have to discuss whether it's worth the extra complexity. As for the generate recipe, it's not meant for generating from a csv, that seems more like an evaluation flow but we can discuss that too.

RdoubleA commented 1 month ago

I'm glad you brought this up because this is a common workflow (validation or generation while training) that we need to improve on. I would follow @pbontrager's suggestion of modifying our existing recipe with a validation step, but I do think this should eventually be a default recipe or an option in an existing recipe. @pbontrager maybe we should start setting up a location for community contributed recipes like this one that would be widely useful.

For generating on a single row of custom csv data, you can use utils.generate directly after loading your csv file, applying the same transforms (such as through InstructDataset or ChatDataset) and running generate on a single row of token IDs as a time. Or you can update the generate recipe locally to add this flexibility. We should consider providing more direct examples of this in our documentation or the option to run on a custom dataset (cc @ebsmothers)

pytorch / torchtune

Recommendations for obtaining validation dataset loss after each epoch #1042