meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama with composable FSDP & PEFT methods, covering single- and multi-node GPU setups. Supports default & custom datasets for applications such as summarization and Q&A, and a number of inference solutions such as HF TGI and vLLM for local or cloud deployment. Includes demo apps showcasing Meta Llama for WhatsApp & Messenger.

How to finetune with my own private data and then build a chatbot on that? #71

Closed rjtshrm closed 1 month ago

rjtshrm commented 1 year ago

So far the fine-tuning examples I see cover summarization, chatbots for specific use cases, etc. However, I want to build a chatbot based on my own private data (hundreds of PDF & Word files). How can I fine-tune on this? The approach I am thinking of is: 1) LoRA fine-tuning of the base Alpaca model on my own private data; 2) LoRA fine-tuning of the resulting model on some input/output prompts.

Is this a good technique for building a chatbot on private datasets? Can someone please suggest a good way of building a model based on private data?
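The two-stage plan above boils down to preparing two separate training files: raw domain text for the first LoRA pass and instruction/response pairs for the second. A minimal sketch of that data layout, assuming JSONL files and field names that are purely illustrative (not from llama-recipes):

```python
import json

# Stage 1: raw domain text extracted from the PDF/Word files
# (hypothetical content; extraction itself is out of scope here).
stage1 = [
    {"text": "Excerpt from an internal policy document ..."},
]

# Stage 2: instruction/response pairs grounded in that domain text.
stage2 = [
    {"instruction": "What does the policy say about data retention?",
     "response": "According to the policy, ..."},
]

# Write each stage as a JSONL file, one training record per line.
with open("stage1_domain.jsonl", "w") as f:
    for row in stage1:
        f.write(json.dumps(row) + "\n")

with open("stage2_instruct.jsonl", "w") as f:
    for row in stage2:
        f.write(json.dumps(row) + "\n")

print(len(stage1), len(stage2))
```

Keeping the two stages in separate files makes it easy to run them as two independent LoRA fine-tuning jobs, with the stage-2 job starting from the stage-1 adapter.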

HamidShojanazeri commented 9 months ago

@rjtshrm this looks fine. Has your method been successful? Happy to chat more about this.

IamExperimenting commented 7 months ago

@rjtshrm were you able to successfully complete your approach? How were the results?

@HamidShojanazeri, I'm also going to try the same approach. Did the Llama team open-source the instruction dataset, so that I can download it and use it to fine-tune the model?

@HamidShojanazeri I have one more question: is it mandatory that the instruction dataset (e.g. question-and-answer prompts) be built from my training dataset (my domain data)? Or can I use any instruction dataset to fine-tune the model, just to make it adapt to the instruction format?

Can you please share your thoughts?

HamidShojanazeri commented 7 months ago

@IamExperimenting the Llama team didn't open-source the instruction dataset; I am working on an end-to-end recipe for a chatbot. Overall, you can use this custom dataset example, which uses the OpenAssistant dataset. Note that if you are using Llama-chat as your base model, you will need the special tokens added as shown in the script.
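For reference, the Llama-2-chat special tokens mentioned above follow the `<s>[INST] ... [/INST] ... </s>` pattern, with an optional `<<SYS>> ... <</SYS>>` system block inside the first instruction. A hedged sketch of a helper that wraps a Q&A pair in that format (the function name is hypothetical; see the script in the repo for the canonical version):

```python
# Llama-2-chat special tokens.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def format_chat_sample(question: str, answer: str, system: str = "") -> str:
    """Wrap one (question, answer) pair in the Llama-2-chat prompt format."""
    sys_block = f"{B_SYS}{system}{E_SYS}" if system else ""
    return f"<s>{B_INST} {sys_block}{question.strip()} {E_INST} {answer.strip()} </s>"

sample = format_chat_sample("What is LoRA?",
                            "A parameter-efficient fine-tuning method.")
print(sample)
# → <s>[INST] What is LoRA? [/INST] A parameter-efficient fine-tuning method. </s>
```

Training samples formatted without these tokens will still train, but the chat model's behavior tends to degrade because the prompt no longer matches what it saw during its own instruction tuning.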

> is it mandatory that the instruction dataset (e.g. question-and-answer prompts) be built from my training dataset (my domain data)? Or can I use any instruction dataset to fine-tune the model, just to make it adapt to the instruction format?

If you want a chatbot for your specific domain, you definitely need that data, either in the form of Q&A pairs or instruction/output sets.
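To plug such domain Q&A pairs into llama-recipes, the repo's custom-dataset mechanism looks for a module exposing `get_custom_dataset(dataset_config, tokenizer, split)`. The sketch below assumes that entry point; the in-memory Q&A pairs, the whitespace "tokenizer", and the label-masking details are simplified illustrations, not the repo's actual code:

```python
# Hypothetical domain Q&A pairs (in practice, loaded from your own files).
DOMAIN_QA = [
    {"question": "What does section 2 cover?", "answer": "Data retention rules."},
    {"question": "Who approves exceptions?", "answer": "The compliance team."},
]

def get_custom_dataset(dataset_config, tokenizer, split):
    """Entry point llama-recipes expects from a custom dataset module."""
    samples = []
    for pair in DOMAIN_QA:
        prompt = f"[INST] {pair['question']} [/INST] "
        full = prompt + pair["answer"]
        prompt_ids = tokenizer(prompt)
        full_ids = tokenizer(full)
        # Mask the prompt tokens with -100 so the loss is computed
        # only on the answer tokens.
        labels = [-100] * len(prompt_ids) + full_ids[len(prompt_ids):]
        samples.append({"input_ids": full_ids, "labels": labels})
    return samples

# Dummy whitespace "tokenizer" just to show the flow end to end;
# a real run would pass the Llama tokenizer instead.
ds = get_custom_dataset(None, lambda s: s.split(), "train")
print(len(ds))
# → 2
```

Masking the prompt portion of the labels is a common choice so the model is only penalized for its answers, not for reproducing the question.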