Open · 0xbitches opened this issue 1 year ago
Also, it would be great if we could have 4-bit support by incorporating GPTQ #2
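Not the GPTQ integration itself, but for illustration, here is a rough sketch of what 4-bit loading of the base model could look like using bitsandbytes' NF4 quantization through transformers (a different quantization scheme from GPTQ; the checkpoint name and config values below are assumptions, not anything from this repo):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch only: 4-bit NF4 quantization via bitsandbytes, not GPTQ.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "decapoda-research/llama-13b-hf",  # assumed base checkpoint, for illustration
    quantization_config=bnb_config,
    device_map="auto",
)
```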
With 256 tokens, the loss slowly drops further, to somewhere slightly above 0.8. You could maybe get away with using 2 epochs instead of 3, though.
Yeah, I definitely saw it drop below 0.75 somewhere between epochs 1 and 2. Could still achieve a pretty good loss with just one epoch, though. Was testing this in a hurry, so just sharing this information here.
Did you get below 0.75 with the current hyperparams? I wasn't able to get under 0.8. Wondering what others are getting (I'm using an A100 40GB).
I probably wouldn't anchor too much on the specific loss numbers until we've refactored the training code to use validation sets.
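For what it's worth, a minimal sketch of what a validation split could look like with the Hugging Face datasets/transformers APIs, assuming `data` is the tokenized instruction dataset and `model` is the LoRA-wrapped LLaMA (the 10% split and step counts are arbitrary, not values from this repo):

```python
from transformers import Trainer, TrainingArguments

# Assumes `data` is the tokenized dataset and `model` the LoRA-wrapped LLaMA.
# Hold out 10% so an eval loss is reported alongside the training loss.
split = data.train_test_split(test_size=0.1, seed=42)

trainer = Trainer(
    model=model,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    args=TrainingArguments(
        per_device_train_batch_size=2,   # MICRO_BATCH_SIZE from this thread
        num_train_epochs=1,
        evaluation_strategy="steps",
        eval_steps=200,                  # arbitrary: evaluate every 200 steps
        logging_steps=20,
        output_dir="lora-alpaca",
    ),
)
trainer.train()
```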
Not exactly an issue, but I have just been trying to run one epoch of finetuning with llama-13b. On a 4090, it looks like it will take roughly 4 hours with the setting `MICRO_BATCH_SIZE = 2`.
However, it looks like the loss had already converged to ~1 within the first 0.12 epochs (roughly 30 minutes into training), so it doesn't really make sense to use 3 epochs; a larger micro batch size could potentially be used instead.
I could be wrong here. Happy to hear some feedback on how to better tune the parameters.
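For reference, a rough sketch of how the numbers being discussed could map onto the training arguments (the constant names mirror the finetune script's style; the specific values are just the ones from this thread, and everything else is an assumption):

```python
from transformers import TrainingArguments

# Values under discussion; tune per GPU.
MICRO_BATCH_SIZE = 2                     # fits llama-13b on a 24 GB 4090
BATCH_SIZE = 128                         # assumed effective batch size
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE
EPOCHS = 1                               # loss plateaus early, so 1 may suffice
LEARNING_RATE = 3e-4                     # assumed default

training_args = TrainingArguments(
    per_device_train_batch_size=MICRO_BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    num_train_epochs=EPOCHS,
    learning_rate=LEARNING_RATE,
    fp16=True,
    logging_steps=20,
    output_dir="lora-alpaca-13b",
)
```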