suhmily closed this issue 9 months ago
Yes, I think it's definitely worth trying! Another thing to keep in mind is that the warmup might also serve to switch the model into instruction-tuning mode. So if the data distribution that produced those optimizer states is quite different from the data you are going to select from, the result might be suboptimal too.
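To make the distribution concern concrete, here is a minimal sketch in plain Python. It assumes the optimizer is Adam, which the thread does not state, and the function names, hyperparameters, and toy numbers are all hypothetical. It shows that the same gradient maps to a different update direction depending on which stored moment estimates it is combined with.

```python
import math

def adam_direction(grad, exp_avg, exp_avg_sq, step,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-style update direction for one gradient, given stored moments.

    Assumption: standard Adam with bias correction; the hyperparameters
    are common defaults, not values taken from the thread.
    """
    direction = []
    for g, m, v in zip(grad, exp_avg, exp_avg_sq):
        m_new = beta1 * m + (1 - beta1) * g        # first-moment update
        v_new = beta2 * v + (1 - beta2) * g * g    # second-moment update
        m_hat = m_new / (1 - beta1 ** step)        # bias correction
        v_hat = v_new / (1 - beta2 ** step)
        direction.append(m_hat / (math.sqrt(v_hat) + eps))
    return direction

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# The same per-example gradient, combined with two different sets of states.
grad = [1.0, 1.0]

# Hypothetical states accumulated on data similar to the selection pool.
dir_a = adam_direction(grad, exp_avg=[0.5, 0.5],
                       exp_avg_sq=[1.0, 1.0], step=1000)

# Hypothetical states accumulated on a very different distribution.
dir_b = adam_direction(grad, exp_avg=[0.5, -0.5],
                       exp_avg_sq=[1.0, 100.0], step=1000)

similarity = cosine(dir_a, dir_b)
print("direction with states A:", dir_a)
print("direction with states B:", dir_b)
print("cosine similarity:", similarity)
```

If the two directions disagree noticeably on real states, that would be one sign that the stored states reflect a distribution too far from the data being selected.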
Closing the issue now, feel free to reopen it if you have more questions!
I wonder whether we can use the optimizer states of the original large model instead of warming up with LoRA, if we have those optimizer states at the beginning?