PoojaYuvaraj opened 1 year ago
The base LLaMA model has 7 billion trainable parameters and was trained on roughly 1 trillion tokens. Unless your custom dataset is of comparable size (~1T tokens), retraining such a large model from scratch will not work well. Instead, I suggest increasing the size of your custom dataset and raising num_epochs
when fine-tuning. The pre-training data gives LLaMA general language understanding and reasoning abilities that are likely useful in your custom application.
Thanks a lot for your response.
Are there any other LLMs, or any other methodology, that I could try in order to build my own domain-specific customized or fine-tuned chatbot? Thanks in advance.
Have you tried handling this in the prompt via few-shotting? For example, fine-tune on your domain-specific data in a certain language style, then at inference few-shot the prompt so the LLM prefers your domain's data patterns in its response.
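To make the suggestion concrete, here is a minimal sketch of few-shot prompting: a handful of in-domain (instruction, response) pairs are prepended to the new query so the model imitates their style. The "### Instruction / ### Response" layout and the example pairs are hypothetical placeholders — use whatever template your fine-tuned model was trained on.

```python
# Minimal few-shot prompt builder. The prompt template and the
# example pairs below are hypothetical; adapt them to your own
# fine-tuning format.

def build_few_shot_prompt(examples, query):
    """Format (instruction, response) pairs followed by the new query."""
    parts = []
    for instruction, response in examples:
        parts.append(
            f"### Instruction:\n{instruction}\n### Response:\n{response}\n"
        )
    # Leave the final response empty for the model to complete.
    parts.append(f"### Instruction:\n{query}\n### Response:\n")
    return "\n".join(parts)

examples = [
    ("Summarize the ticket in one line.",
     "Customer reports login failure after password reset."),
    ("Summarize the ticket in one line.",
     "Payment page times out on mobile devices."),
]
prompt = build_few_shot_prompt(examples, "Summarize the ticket in one line.")
print(prompt)
```

The few-shot examples steer generation without any additional training, which is why it is worth trying before collecting more fine-tuning data.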
Thank you so much, but I think the problem is the amount of data I'm using for training. Could you tell me the minimum amount of training data I would need to train it properly? I tried few-shotting, but it doesn't seem to work due to data insufficiency. Thanks in advance.
Around 100 samples per new task should be sufficient. For example, if you want to teach the model how to write poems about frogs, include 100 JSON entries like this:
{
"instruction": "Write a poem about three different colored frogs.",
"input": "",
"output": "The three frogs are red, blue, and yellow..."
}
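As a sanity check before training, you can load entries in the format above and verify they all carry the required keys. This is only a sketch — "data.json" is a hypothetical filename for a file containing a JSON list of such objects.

```python
import json

# Load instruction-tuning records in the format shown above and
# validate the required keys before training. "data.json" is a
# hypothetical name for your dataset file.
REQUIRED_KEYS = {"instruction", "input", "output"}

def load_records(path):
    with open(path) as f:
        records = json.load(f)  # expects a JSON list of objects
    for i, rec in enumerate(records):
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            raise ValueError(f"record {i} missing keys: {sorted(missing)}")
    return records

# records = load_records("data.json")
```

Catching a malformed record here is much cheaper than discovering it mid-training.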
More training data is always better. If you need to generate more data, look into using the GPT-3.5
API to create prompts and responses for you; this approach is a form of model distillation.
Heyy,
I have been trying hard to get this training done with a custom dataset. I have a list of dictionaries for training. Since the base model is always LLaMA, its weights will inevitably influence the answers to my custom questions. I would like the answers from my custom dataset to override the open web text LLaMA was trained on. In the best case, I want to train it completely from scratch, without the open-web-text knowledge in LLaMA. Is that possible? Could you point me to any leads? Thanks in advance.