meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama with composable FSDP & PEFT methods, covering single- and multi-node GPU setups. Supports default & custom datasets for applications such as summarization and Q&A, and a number of inference solutions such as HF TGI and vLLM for local or cloud deployment. Includes demo apps showcasing Meta Llama for WhatsApp & Messenger.

Fine tune on domain dataset and convert it as a chat based model #378

Closed: IamExperimenting closed this 2 months ago

IamExperimenting commented 7 months ago

Hi team,

I would like to fine-tune the Llama 2 model on my domain data and eventually turn it into a chat model, so I can directly ask questions related to my domain data and get correct responses from the model. This is my objective.

I would like to get your advice/guidance on my approach to this problem, i.e. whether or not I'm headed in the right direction toward my objective.

Step 1: collect domain data. Step 2: pick the base Llama model. Step 3: fine-tune the base Llama model on my domain data. Step 4: prepare an instruction dataset. Step 5: take the model from step 3 (fine-tuned on my domain data) and fine-tune it on the instruction dataset. Step 6: save the model. Step 7: load the model. Step 8: ask questions related to my domain data and get answers from the fine-tuned model.

Can you please share your thoughts on this approach? Would it help me achieve my goal?

Also, I have a few questions about the above approach:

  1. For step 2, should I use the base Llama model or the chat model directly? A related question: if I pick the chat model, will I still be able to fine-tune it on my domain dataset (multiple raw text files)? The chat model won't have any information about my dataset, since my domain data was not included in its training set.
  2. Did you open-source the instruction dataset you used to fine-tune the Llama chat model? If so, can you please point me to it?
  3. Do I need to add special tokens to my dataset, given that I have multiple files in my folder?
  4. Is there a required input data format for Llama 2? Do I need to combine my files into one, or can Llama pick up multiple files from the same folder during fine-tuning?

@HamidShojanazeri @albertodepaola

Note: domain data file size: each file is about 1 MB; number of files: 1,780.
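Regarding question 4 and the 1,780 small files: one common option is simply merging them into a single corpus file before fine-tuning. A minimal stdlib sketch (the folder layout and file names are hypothetical; HF's `load_dataset("text", data_files=...)` can alternatively take the file list directly):

```python
from pathlib import Path

def merge_text_files(folder: str, out_path: str, pattern: str = "*.txt") -> int:
    """Concatenate every matching file in `folder` into one corpus file.

    Files are separated by a blank line so documents stay distinguishable
    during tokenization. Returns the number of files merged.
    """
    files = sorted(Path(folder).glob(pattern))
    with open(out_path, "w", encoding="utf-8") as out:
        for f in files:
            out.write(f.read_text(encoding="utf-8").strip() + "\n\n")
    return len(files)
```

For example, `merge_text_files("domain_docs", "corpus.txt")` would collapse a folder of raw text files into one training file.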

HamidShojanazeri commented 7 months ago

@IamExperimenting sure, let me share my thoughts. It may depend on various factors, especially data accessibility and quality, but here is how I think about it.

re:1 Building an instruction dataset is usually harder than collecting raw domain data. If that's the case in your domain, I agree that starting with the base model, fine-tuning it on your domain data, and then instruction fine-tuning it might be the better path.

re:2 We haven't open-sourced the instruction dataset.

re:3 If you are using the base model you won't need special tokens, but if you are using the chat model, take a look at this preprocessing step that adds the special tokens.
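For context, a sketch of the single-turn prompt layout the Llama 2 chat model was trained on, using the `[INST]`/`<<SYS>>` markers (the `<s>`/`</s>` BOS/EOS tokens are normally added by the tokenizer, so they are deliberately omitted here; the helper name is ours, not part of llama-recipes):

```python
def format_llama2_chat(system: str, user: str, answer: str = "") -> str:
    """Build a single-turn Llama 2 chat prompt.

    Lays out the [INST]/<<SYS>> structure the chat model expects;
    BOS/EOS special tokens are left to the tokenizer.
    """
    prompt = f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"
    if answer:  # during training, the target completion follows the prompt
        prompt += f" {answer}"
    return prompt
```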

re:4 Here we mostly support HF datasets, so this doc should be helpful to get you started with your custom dataset.
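As an illustration of the custom-dataset route, a hedged sketch of the hook a user-supplied dataset module provides (the exact signature may vary by llama-recipes version; `load_jsonl`, the file layout, and the field names are assumptions for illustration):

```python
import json

def load_jsonl(path: str) -> list:
    """Read one JSON object per line, e.g. {"question": ..., "answer": ...}."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def get_custom_dataset(dataset_config, tokenizer, split: str):
    """Return tokenized examples for the requested split, in the shape a
    custom-dataset module hands back to the fine-tuning script."""
    samples = load_jsonl(f"{dataset_config.data_path}/{split}.jsonl")
    return [
        tokenizer.encode(f"Question: {s['question']}\nAnswer: {s['answer']}")
        for s in samples
    ]
```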

IamExperimenting commented 7 months ago

@HamidShojanazeri thanks for your response :). Let me start fine-tuning the Llama 2 base model on my domain data and get back to you. I have another question; I just want to clarify whether my understanding of domain fine-tuning is correct.

After fine-tuning the Llama 2 base model on my domain data, should I expect the fine-tuned model to generate the same sentences as my domain data?

Example:


Domain data:

DUKE OF YORK:
No; it is stopp'd with other flattering sounds,
As praises, of whose taste the wise are fond,
Lascivious metres, to whose venom sound
The open ear of youth doth always listen;
Report of fashions in proud Italy,
**Whose manners still our tardy apish nation**
Limps after in base imitation.
Where doth the world thrust forth a vanity--
So it be new, there's no respect how vile--
That is not quickly buzzed into his ears?
Then all too late comes counsel to be heard,
Where will doth mutiny with wit's regard.
Direct not him whose way himself will choose:
'Tis breath thou lack'st, and that breath wilt thou lose.

The prompt I'm sending to the fine-tuned model (pseudocode):

```python
prompt = "Whose manners still our tardy apish nation"
output = fine_tuned_llama2_base_model.generate(prompt)
print(output)
```

Expected output:
Whose manners still our tardy apish nation
Limps after in base imitation.
Where doth the world thrust forth a vanity--
So it be new, there's no respect how vile--
That is not quickly buzzed into his ears?
Then all too late comes counsel to be heard,
Where will doth mutiny with wit's regard.
Direct not him whose way himself will choose:
'Tis breath thou lack'st, and that breath wilt thou lose.
IamExperimenting commented 7 months ago

@HamidShojanazeri one more question: while preparing the instruction dataset, should I only prepare questions and answers from my domain dataset, or can I also use an open-source question-answering dataset?

Is there a minimum size requirement for the instruction dataset?