I don't think there is a definitive answer to that.
You could either add your instructions to the Alpaca dataset and train on the extended dataset, or train on your dataset only, starting from an Alpaca LoRA (what I do, from a large enough LoRA: https://huggingface.co/Angainor/alpaca-lora-13b).
Then it depends on how different what you want the model to learn is from Alpaca, both in terms of content and instructions. If you keep the instructions diverse and close to the Alpaca ones, something like 100 samples per instruction can be enough. If you have many new domain-specific facts to learn, maybe 10x or 100x that. (Just my 2 cents from my own experiments.)
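For anyone wanting to try the second route, here is a minimal sketch (not the exact script used in this thread) of loading an existing Alpaca LoRA as a trainable adapter with `peft` and `transformers`; the base model id, dtype and device settings are placeholders you would adapt to your setup.

```python
# Minimal sketch, assuming peft + transformers; model ids and dtype/device
# settings are placeholders, not the exact command used in this thread.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

base_id = "decapoda-research/llama-13b-hf"  # whichever 13B base you use
base = LlamaForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = LlamaTokenizer.from_pretrained(base_id)

# Load the published Alpaca LoRA with its weights marked trainable, so further
# training updates this adapter instead of starting from a random init.
model = PeftModel.from_pretrained(
    base, "Angainor/alpaca-lora-13b", is_trainable=True
)
model.print_trainable_parameters()
# `model` can now go through your usual fine-tuning loop on the custom dataset.
```

The `is_trainable=True` flag matters because peft otherwise loads adapters in inference mode with gradients disabled.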
@AngainorDev Question regarding epochs: I noticed your 13b model was trained for 10 epochs, with no validation set?
Did you continue to see the loss go down past 3 epochs? Is your thought that, when fine-tuning all 4 LoRA modules, 3 epochs is not enough?
Yep, see the loss graph on HF, I uploaded the pic. The loss only begins to flatten around epoch 10.
More modules + higher rank = more params; more params = more steps needed.
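To put rough numbers on that (my own back-of-the-envelope, not from the comment above): assuming LLaMA-13B shapes (hidden size 5120, 40 layers) and square d x d attention projections, each adapted linear layer adds about r * (d_in + d_out) LoRA params.

```python
# Back-of-the-envelope LoRA parameter count, assuming LLaMA-13B shapes
# (hidden size 5120, 40 layers) and square d x d attention projections.
hidden, layers = 5120, 40

def lora_params(r, n_modules, d=hidden, n_layers=layers):
    # each adapted linear layer adds an A (r x d_in) and a B (d_out x r) matrix
    return r * (d + d) * n_modules * n_layers

print(lora_params(r=8,  n_modules=2))   # ~6.6M  -- default alpaca-lora (q_proj, v_proj)
print(lora_params(r=16, n_modules=4))   # ~26.2M -- r=16 on q/k/v/o
```

Roughly 4x the trainable parameters, which is the intuition behind needing more steps.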
I take the loss curve as a first-order approximation of training success. When doing extra training from this same 13b, r=16, 4-module Alpaca LoRA with a custom dataset of mine (only 1000 high-quality samples, with an instruction set different from Alpaca's), I had to push to 20 epochs to see the loss flatten, with a final loss around 0.45.
I save every epoch instead of every N steps, so I keep all the epoch checkpoints, can evaluate each one on a custom eval set once training is done, and can continue training with the full optimizer state if needed (see the sketch below). In that run I did 10 epochs plus an extra 10, for instance.
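A minimal sketch of that save-per-epoch / resume workflow with the Hugging Face `Trainer`; the argument values are placeholders (not the exact settings used for the 13B LoRA), and `model` / `train_ds` are assumed to be defined as in the earlier sketch, with `train_ds` already tokenized.

```python
# Sketch only: hyperparameters are placeholders, `model` and `train_ds`
# (already tokenized, with input_ids/labels) are assumed from earlier.
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir="lora-custom-13b",
    num_train_epochs=10,
    save_strategy="epoch",        # one checkpoint per epoch instead of per N steps
    save_total_limit=None,        # keep every epoch so each can be evaluated later
    logging_steps=10,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=32,
    learning_rate=3e-4,
    fp16=True,
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()

# Later: add 10 more epochs, restoring optimizer/scheduler state from the last
# epoch checkpoint found in output_dir.
# args.num_train_epochs = 20
# trainer.train(resume_from_checkpoint=True)
```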
After this second fine-tune (the 13b LoRA from HF, used as continue_from_checkpoint with my custom dataset), the model is still able to work as Alpaca. It can also understand instructions that were not in the dataset and are a mix of Alpaca and custom instructions. When I did the same earlier with smaller, 3-epoch models, the model after fine-tuning was OK on the new tasks, but forgot too much about Alpaca and LLaMA, ending up with weaker generalization capabilities.
Generally speaking, and given the scaling laws from Cerebras/Chinchilla for instance, I believe there is indeed such a relationship between the number of params to be trained and the tokens to provide for optimal training. So I think yes, more layers or a higher rank need more tokens: either more data or more epochs. Not a formal proof, just my working hypothesis and impressions from my own tests.
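Purely as an illustration of that direction (my numbers, not the commenter's): naively applying the Chinchilla-style ~20 tokens per trainable parameter heuristic to the counts from the earlier back-of-the-envelope gives the following. The ratio was derived for pretraining from scratch, so for a LoRA fine-tune it only suggests the direction of the relationship, not an actual data requirement.

```python
# Naive illustration only: Chinchilla's ~20 tokens/param ratio comes from
# pretraining from scratch, so for a LoRA fine-tune it only hints that more
# trainable params call for more tokens (more data or more epochs).
tokens_per_param = 20
for trainable in (6_553_600, 26_214_400):   # r=8/2-module vs r=16/4-module counts
    print(f"{trainable:>11,d} params -> ~{trainable * tokens_per_param / 1e6:.0f}M tokens")
```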
I currently want to train a model for my own field, based on the LLaMA 7b model and the LoRA strategy with the Alpaca 52k dataset. How much data do I need to prepare to train my own model?