pytorch / torchtune

A Native-PyTorch Library for LLM Fine-tuning
BSD 3-Clause "New" or "Revised" License

How to finetune a model for sequence classification tasks using a dataset like stanfordnlp/imdb from Hugging Face? #1124

Open JonasQN opened 4 days ago

JonasQN commented 4 days ago

Right now, all the tutorials focus on finetuning models with chat datasets; how do I prepare a sequence classification dataset for finetuning?
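For example (just a rough sketch of what I have in mind, not torchtune code), I'd want each example to pair tokenized text with a single class label rather than a chat transcript:

```python
import torch
from torch.utils.data import Dataset

class SequenceClassificationDataset(Dataset):
    """Pairs tokenized text with a single class label (e.g. 0 = neg, 1 = pos)."""

    def __init__(self, texts, labels, tokenize):
        self.texts = texts
        self.labels = labels
        self.tokenize = tokenize  # any callable: str -> list[int]

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return {
            "tokens": torch.tensor(self.tokenize(self.texts[idx])),
            "labels": torch.tensor(self.labels[idx]),  # one label per sequence
        }
```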

ebsmothers commented 2 days ago

Hi @JonasQN, this is a good question. I think sequence classification would need some minor changes to our dataset abstractions -- I assume you'd need to feed in labels from your dataset in that case (instead of just using shifted tokens as we currently do). You'd also need to append a head to the model that projects to some fixed number of classes. Is that the right understanding of what you're trying to do?
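To illustrate the second point, here's a minimal sketch (plain PyTorch, with made-up names, not torchtune's actual classifier builder) of wrapping a decoder with a linear head that projects the final hidden state to class logits:

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """Wraps a base model and projects its final hidden state to class logits."""

    def __init__(self, base_model: nn.Module, hidden_dim: int, num_classes: int):
        super().__init__()
        self.base_model = base_model  # anything returning [batch, seq_len, hidden_dim]
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        hidden = self.base_model(tokens)  # [batch, seq_len, hidden_dim]
        # Pool with the last position's hidden state (common for decoder-only models)
        pooled = hidden[:, -1, :]         # [batch, hidden_dim]
        return self.head(pooled)          # [batch, num_classes]
```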

If so, I think something like our text_completion_dataset could be a good starting point, but we would need to change the labels here to whatever is in your dataset (depending on the format). You would also need to make sure to use a model with a classification head (we have something like this for Mistral you can use as an example here). Finally, you would probably want to change this line in the training recipe which shifts the labels (because it assumes we are doing next-token prediction). Also cc @RdoubleA here, who may have more informed things to say than I do.
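To make the label-shifting point concrete, here's a rough contrast (hand-rolled helpers, not the recipe's actual code): next-token prediction shifts the input tokens to form per-position labels, while classification uses one unshifted label per example:

```python
import torch
import torch.nn.functional as F

def next_token_loss(token_logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Next-token prediction: labels are the input tokens shifted left by one."""
    shifted_logits = token_logits[:, :-1, :]  # position i predicts token i+1
    shifted_labels = tokens[:, 1:]
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        shifted_labels.reshape(-1),
    )

def classification_loss(class_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Sequence classification: one label per example, no shifting."""
    return F.cross_entropy(class_logits, labels)  # [batch, num_classes] vs [batch]
```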

(Btw, as you rightly point out, we don't really have any tutorials or docs on how to do this. I think as we provide better support for this type of task we can also improve our documentation on how to set up your dataset for it.)