Open gigadeplex opened 1 year ago
With TI you want the embedding to stay flexible enough to work with other parts of the prompt, so hitting a very low loss is not the goal. An average loss below 0.3 is ideal; I generally hit around 0.15. Check out the TensorBoard integration to see the average loss more easily.
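If TensorBoard isn't set up, a running mean over the per-step losses gives roughly the same picture, since the loss shown for a single step is noisy. A minimal sketch, assuming you have collected the per-step loss values in a Python list yourself; the function name and the window size are made up for illustration and are not part of any training script:

```python
from collections import deque

def moving_average(losses, window=100):
    """Yield the mean of the last `window` losses after each step."""
    recent = deque(maxlen=window)
    total = 0.0
    for loss in losses:
        # When the window is full, the oldest value is about to be evicted.
        if len(recent) == recent.maxlen:
            total -= recent[0]
        recent.append(loss)
        total += loss
        yield total / len(recent)

# Tiny example with a window of 2 steps.
print(list(moving_average([1.0, 2.0, 3.0, 4.0], window=2)))
# -> [1.0, 1.5, 2.5, 3.5]
```

Compare the last smoothed value against the 0.3 / 0.15 figures above rather than the raw loss of the final step.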
A very low loss often seems to indicate overfitting. It may be a good idea to try reducing the number of steps or lowering the learning rate.
It also depends on the steepness of the local minimum. You want to be in a fairly deep, wide hole, so you can wander around on fairly level ground, not down at the bottom of a well.
(At least one paper demonstrates that too large a batch size can actually hurt in this regard.)
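To make the "wide hole vs. bottom of a well" picture concrete, here is a toy sketch. It is not taken from any TI trainer: the two one-dimensional loss functions and the `sharpness` helper are assumptions invented for illustration. It nudges the parameter by small random amounts and measures how much the loss rises on average; a wide minimum barely reacts, a narrow one reacts a lot.

```python
import random

def sharpness(loss_fn, w, radius=0.1, samples=1000, seed=0):
    """Average loss increase when w is perturbed uniformly within +/- radius."""
    rng = random.Random(seed)
    base = loss_fn(w)
    rise = 0.0
    for _ in range(samples):
        rise += loss_fn(w + rng.uniform(-radius, radius)) - base
    return rise / samples

def wide(w):
    # Shallow curvature: level ground around the minimum at w = 0.
    return 0.5 * w ** 2

def narrow(w):
    # Steep curvature: the bottom of a well at w = 0.
    return 50.0 * w ** 2

# Both functions reach the same loss (0.0) at w = 0,
# but they tolerate perturbations very differently.
print(sharpness(wide, 0.0), sharpness(narrow, 0.0))
```

Both minima have the same loss value, so the loss number alone can't tell them apart; only the response to perturbation does.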
After training TI for 1500 steps, I can get down to a loss of about 0.05, much better than the previous 0.1. However, the results are still bad. Very bad. Here is what the training input parameters look like:
Moreover, here is the last recorded training log:
steps: 100%|███████████████████████████████████████████████████████████████| 1500/1500 [09:24<00:00, 2.66it/s, loss=0.0542]