tloen / alpaca-lora

Instruct-tune LLaMA on consumer hardware
Apache License 2.0

Training alpaca from scratch #460

Open PoojaYuvaraj opened 1 year ago

PoojaYuvaraj commented 1 year ago

Heyy,

I have been trying hard to get this training done with a custom dataset. I have a list of dictionaries for training. Since the base model is always LLaMA, its pre-trained weights will inevitably influence the answers to my custom questions. I would like the answers in my custom dataset to take precedence over the open web text LLaMA was trained on. Ideally, I want to train it completely from scratch, without any of the open web text knowledge in LLaMA. Is that possible? Could you give me any leads? Thanks in advance.

jb-01 commented 1 year ago

The base LLaMA model contains 7 billion trainable parameters and was trained on roughly 1 trillion tokens. Unless your custom dataset is of comparable size (~1T tokens), retraining a model that large from scratch will not work well. Instead, I suggest increasing the size of your custom dataset and `num_epochs` when finetuning. The pre-training data is what gives LLaMA its general language understanding and reasoning abilities, which are likely useful in your custom application.
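
If it helps, here is a minimal sketch (not from the repo itself) of converting a Python list of dictionaries into the JSON format that `finetune.py` reads via `--data_path`, and then launching a finetune with a higher epoch count. The flag names follow `finetune.py`'s `train()` arguments, and the base model and epoch count are just placeholders; double-check against your checkout:

```python
# Minimal sketch: dump a list of dicts into Alpaca-style JSON, then finetune on it.
import json

# Your custom data: one dict per training example (instruction/input/output).
custom_data = [
    {"instruction": "Answer a domain-specific question.",
     "input": "",
     "output": "The domain-specific answer you want the model to prefer."},
    # ... more examples ...
]

with open("custom_data.json", "w") as f:
    json.dump(custom_data, f, indent=2)

# Then launch training (shell command, flag names taken from finetune.py):
#   python finetune.py \
#       --base_model 'decapoda-research/llama-7b-hf' \
#       --data_path './custom_data.json' \
#       --output_dir './lora-custom' \
#       --num_epochs 10
```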

PoojaYuvaraj commented 1 year ago

Thanks a lot for your response.

Are there any other LLMs or methodologies I could try for building my own domain-specific, customized or finetuned chatbot? Thanks in advance.

mmealman commented 1 year ago

Have you tried handling this in the prompt via few-shotting? For example, fine-tune on your domain-specific data, which is written in a certain style, and then at inference time few-shot the prompt so the LLM prefers your domain data's patterns in its response.
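
Something along these lines, as a rough illustration using the Alpaca-style prompt template from this repo (the domain examples and question are placeholders):

```python
# Rough sketch of few-shot prompting: prepend a couple of in-domain
# instruction/response pairs so generation follows your domain's style.
domain_examples = [
    ("What is covered under the basic warranty?",
     "The basic warranty covers manufacturing defects for 12 months."),
    ("How do I file a warranty claim?",
     "Submit the claim form on our support portal with your proof of purchase."),
]

question = "Does the warranty cover water damage?"

prompt = ("Below is an instruction that describes a task. "
          "Write a response that appropriately completes the request.\n\n")
for instruction, response in domain_examples:
    prompt += f"### Instruction:\n{instruction}\n\n### Response:\n{response}\n\n"
prompt += f"### Instruction:\n{question}\n\n### Response:\n"

print(prompt)  # pass this string to generate.py / model.generate() as your prompt
```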

PoojaYuvaraj commented 1 year ago

Thank you so much, but I think the problem is the amount of data I'm using for training. Could you tell me the minimum amount of training data I would need to train it properly? I tried few-shotting, but it doesn't seem to work due to insufficient data. Thanks in advance.

jb-01 commented 1 year ago

Around 100 samples per new concept should be sufficient. For example, if you want to teach the net how to write poems about frogs, include 100 JSON entries like the following:

{
      "instruction": "Write a poem about three different colored frogs.",
      "input": "",
      "output": "The three frogs are red, blue, and yellow..."
}

More training data is always better. If you need to generate more, look into using the GPT-3.5 API to create prompts and responses for you; this approach is a form of model distillation.
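
As a rough sketch of that distillation step, using the pre-1.0 `openai` Python client (the prompt wording, seed topics, and output handling are illustrative and will need tuning):

```python
# Rough sketch: use gpt-3.5-turbo to generate extra instruction/response pairs.
# Assumes the pre-1.0 openai client and OPENAI_API_KEY set in the environment.
import json
import openai

seed_topics = ["red frogs", "blue frogs", "yellow frogs"]  # replace with your domain topics

samples = []
for topic in seed_topics:
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user",
             "content": (f"Write one instruction about {topic} and a good answer to it. "
                         "Return JSON with keys 'instruction' and 'output'.")}
        ],
    )
    text = resp["choices"][0]["message"]["content"]
    try:
        pair = json.loads(text)  # the model may not always return valid JSON
        samples.append({"instruction": pair["instruction"],
                        "input": "",
                        "output": pair["output"]})
    except (json.JSONDecodeError, KeyError):
        pass  # skip malformed generations

with open("distilled_data.json", "w") as f:
    json.dump(samples, f, indent=2)
```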