pytorch / torchtune

A Native-PyTorch Library for LLM Fine-tuning
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License

Does torchtune support traditional text-generation tasks? #1249

Open sherlcok314159 opened 1 month ago

sherlcok314159 commented 1 month ago

I wonder whether torchtune can support traditional tasks such as translation, or more general text-generation tasks which have an input and an output column. I have read the datasets doc here, but it looks like we need to format everything with InstructTemplate. How can I do traditional text-generation tasks with Llama3.1-8B? Any help is welcome. I am new to torchtune.

SalmanMohammadi commented 1 month ago

Hey @sherlcok314159. Thanks for your interest in torchtune and for raising this!

It sounds like you're interested in a general next-token-prediction task? Could you possibly provide a link to a dataset you're interested in fine-tuning on?

It sounds like the text completion dataset could be suitable for your needs - it enables the use of unstructured datasets (and, optionally, sample packing!).
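For reference, here's a rough sketch of wiring that builder up in Python. The tokenizer path and dataset below are placeholders, and the argument names should be checked against the signature in your installed torchtune version:

```python
from torchtune.datasets import text_completion_dataset
from torchtune.models.llama3 import llama3_tokenizer

# Placeholder path; point this at your downloaded Llama 3.1 8B assets.
tokenizer = llama3_tokenizer("/tmp/Meta-Llama-3.1-8B/original/tokenizer.model")

# Any Hugging Face dataset with a raw text column works; "column" names the
# field holding the text, and packed=True enables sample packing.
ds = text_completion_dataset(
    tokenizer=tokenizer,
    source="stanfordnlp/imdb",
    column="text",
    max_seq_len=2048,
    packed=False,
)
```

The same arguments can also be set under the dataset section of a recipe's YAML config.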

Let me know if that's pointed you in the right direction - very happy to help more.

sherlcok314159 commented 1 month ago

Thanks for your quick response. For example, I want LLMs to predict the sentiment of movie reviews: https://huggingface.co/datasets/stanfordnlp/sst2. Generally speaking, we can have one or two inputs, and a label which can be a class (classification) or a sequence (generation).

SalmanMohammadi commented 1 month ago

Cool!!

So, for classification tasks we support classification models out of the box (e.g. here's a llama2 model for reward modelling which can be generalised through the num_classes arg).

You'd then need to add functionality for a classification dataset. @ebsmothers has far more intelligent things to say than I do - see his answer here. To quote:

I assume you'd need to feed in labels from your dataset in that case (instead of just using shifted tokens as we currently do). If so, I think something like our text_completion_dataset could be a good starting point, but we would need to change the labels here to whatever is in your dataset (depending on the format). Finally, you would probably want to change this line in the training recipe which shifts the labels (because it assumes we are doing next-token prediction). Also cc @RdoubleA here who may have more informed things to say than I do.
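To make the quoted point concrete, here's a rough, self-contained sketch (not the actual recipe code) contrasting the next-token shift with what a classification loss would look like:

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab_size, num_classes = 2, 8, 32, 4

# Next-token prediction (what the recipes currently assume): position i
# predicts token i+1, so logits and tokens are shifted by one.
tokens = torch.randint(0, vocab_size, (batch, seq_len))
lm_logits = torch.randn(batch, seq_len, vocab_size)
lm_loss = F.cross_entropy(
    lm_logits[:, :-1, :].reshape(-1, vocab_size),  # drop the last position
    tokens[:, 1:].reshape(-1),                     # drop the first token
)

# Sequence classification (the change described above): no shift; score the
# whole sequence with a num_classes head and compare against the dataset label.
cls_logits = torch.randn(batch, num_classes)            # output of a classifier head
class_labels = torch.randint(0, num_classes, (batch,))  # integer labels from the dataset
cls_loss = F.cross_entropy(cls_logits, class_labels)
```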

Also cc @RdoubleA to register interest in text classification dataset builders/docs (happy to help with this). This has also come up previously in #1124.

sherlcok314159 commented 1 month ago

Looks like I need to write my own dataset loader script and change the training script to train on text classification tasks.

SalmanMohammadi commented 1 month ago

I've been working on a reward modelling recipe, so expect this to come to torchtune sometime in the future!

I'll update this weekend with some more guidance and concrete examples to help you get started for now. Let me know how you get on otherwise : )

SalmanMohammadi commented 1 month ago

or a sequence (generation).

And then, I think you should be able to do this out-of-the-box with one of our base models and our text completion dataset. We also support sample packing here for super speedy training!
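For example, one way to fold a classification dataset like SST-2 into plain text for completion training - a rough sketch, assuming the SST-2 column names and an illustrative template:

```python
from datasets import load_dataset

def to_completion(example):
    # Fold the integer label into the text so each (input, label) pair becomes
    # a single text-completion sample trained via next-token prediction.
    sentiment = "positive" if example["label"] == 1 else "negative"
    return {"text": f"Review: {example['sentence']}\nSentiment: {sentiment}"}

sst2 = load_dataset("stanfordnlp/sst2", split="train").map(to_completion)
print(sst2[0]["text"])
```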

sherlcok314159 commented 1 month ago

Thanks for your kind help! I am trying to use the transformers Trainer and PEFT to build a baseline, so that I can compare torchtune with the transformers Trainer.

SalmanMohammadi commented 1 month ago

Thanks for your kind help! I am trying to use the transformers Trainer and PEFT to build a baseline, so that I can compare torchtune with the transformers Trainer.

Smart! I would've done exactly the same - starting with a baseline helps keep things sane.

I've been working on RLHF for torchtune, and to train a reward model for it I used TRL. Here's an example of a trained reward model and the command to replicate the training: https://huggingface.co/smohammadi/tinyllama_rm_sentiment_1b.

Here's the docs for the trainer.
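A minimal sketch of what that training setup looks like with TRL. The model and dataset names are placeholders, and the exact RewardTrainer/RewardConfig arguments may differ across TRL versions, so check the linked docs and the model card above for the real command:

```python
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
# A reward model is just a sequence classifier with a single output score.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

# Placeholder dataset: RewardTrainer expects preference data with
# chosen/rejected pairs (see the linked docs for the exact preprocessing).
train_dataset = load_dataset("your-org/your-preference-dataset", split="train")

trainer = RewardTrainer(
    model=model,
    args=RewardConfig(output_dir="reward_model", per_device_train_batch_size=2),
    tokenizer=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()
```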

Note: the trained model can then be used directly in torchtune.

rezadnayeri commented 1 month ago

@SalmanMohammadi I'd be super grateful if you could give us some more guidance and examples for text classification using torchtune. I have been trying unsuccessfully, as I am new to torchtune. It is a fantastic tool, and using it for text classification is super exciting. Can't wait to try it.

rezadnayeri commented 1 month ago

Any update on this? I'd appreciate your comments.

rezadnayeri commented 1 month ago

@RdoubleA any chance you could give us some guidance on text classification using torchtune? Basically, how to prepare the dataset, and how to use mistral_classifier (https://pytorch.org/torchtune/stable/generated/torchtune.models.mistral.mistral_classifier.html#torchtune.models.mistral.mistral_classifier)? Thanks a lot!
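From the linked docs, instantiating the classifier looks roughly like this (the hyperparameters below are just illustrative, not the Mistral-7B configuration, and the exact signature should be checked against the docs); it's the dataset and recipe side that is unclear:

```python
from torchtune.models.mistral import mistral_classifier

# Illustrative (non-7B) hyperparameters; num_classes sets the size of the
# classification head placed on top of the decoder.
model = mistral_classifier(
    num_classes=4,
    vocab_size=32_000,
    num_layers=4,
    num_heads=8,
    num_kv_heads=8,
    embed_dim=512,
    intermediate_dim=2048,
    max_seq_len=2048,
)
```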

SalmanMohammadi commented 1 month ago

Hey @rezadnayeri. Sorry for the late reply. We're on this now - I'm currently working on adding support for classification datasets out-of-the-box in torchtune alongside @RdoubleA's heroic refactor of our datasets functionality. I'll update very soon with a PR.

rezadnayeri commented 1 month ago

Hey @SalmanMohammadi, truly appreciate your effort (and @RdoubleA), this would be an amazing addition. Meanwhile I will try to be patient :-)

qqlabs commented 3 weeks ago

Hi @SalmanMohammadi, I'm just getting started with torchtune and saw that this classification discussion covers my exact use case. Wanted to check in on whether you had any updates on the classification dataset PR. Thanks!

SalmanMohammadi commented 2 weeks ago

Hey folks. The ball is rolling here. Follow along in https://github.com/pytorch/torchtune/issues/1464! There are a couple of TODOs - if you're interested in helping out, I'd be happy to guide you through contributing.

rezadnayeri commented 2 weeks ago

@SalmanMohammadi looking forward to this, thank you for your efforts!

SalmanMohammadi commented 2 weeks ago

Would you be able to share any examples of classification datasets you're interested in finetuning on, @rezadnayeri @qqlabs @sherlcok314159? The IMDB dataset was raised - it'd be really useful to see your use cases, particularly in terms of the format of labels/targets in the dataset.

rezadnayeri commented 1 week ago

Hi @SalmanMohammadi: the IMDB dataset is a good benchmark. What I usually do is follow the same format: "text" is the string of text to be classified, and "label" is the integer class ID (for example 0, 1, 2, 3 for a 4-class problem). Thanks!
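For concreteness, something like this (column names as described above; the dataset id is just an example):

```python
from datasets import load_dataset

# IMDB-style layout: a raw "text" column plus an integer "label" column
# taking values in [0, num_classes).
ds = load_dataset("stanfordnlp/imdb", split="train")
example = ds[0]  # looks like {"text": "<review string>", "label": <int>}
```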

qqlabs commented 1 week ago

@SalmanMohammadi The IMDB dataset is a good starting point for me as well, though I'm interested in multiclass classification, not just binary.

My current use case is something like the Amazon ESCI shopping queries dataset. I currently concatenate the query and product information into a prompt and am trying to predict one of the classes "exact", "substitute", "complement", and "irrelevant".

I followed the tips from the past comments and modified the current text completion dataset and recipes to do prompt -> completion, similar to the format for finetuning legacy OpenAI models (babbage-002 and davinci-002). I'm having some trouble getting the model to actually generate things correctly, since I think there are still a lot of next-token assumptions that I didn't account for at the generation step - so I'm going to try playing with the reward model to do direct classification next.
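Roughly, the formatting step looks something like this (field names, template, and label strings here are illustrative rather than the exact ESCI schema):

```python
def to_prompt_completion(example):
    # Concatenate query and product information into a prompt, and use the
    # class name as the completion the model should generate.
    prompt = (
        f"Query: {example['query']}\n"
        f"Product: {example['product_title']}\n"
        "Relevance:"
    )
    return {"prompt": prompt, "completion": " " + example["esci_label"]}
```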

If I get something working, I can post up a messy sample PR!

SalmanMohammadi commented 1 week ago

That's really useful @qqlabs. Thanks so much!

I'm interested in hearing more about how you get on - feel free to ping me on Discord if you have any questions too!

rezadnayeri commented 4 days ago

Hello @SalmanMohammadi, wondering if you have any updates on #1464. Thanks