Closed: takosalad closed this issue 1 year ago
Is this bitsandbytes/PyTorch? It looks like an error in one of those.
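For context: the message itself comes from PyTorch's AMP grad scaler, which raises exactly this ValueError when asked to unscale gradients that are stored in FP16. A minimal sketch of the failure mode, assuming a CUDA device (the layer and tensor sizes here are illustrative):

```python
import torch

# PyTorch's GradScaler refuses to unscale gradients that are stored in
# FP16, which happens whenever the trainable weights themselves are
# float16. Minimal repro sketch (assumes a CUDA device is available).
model = torch.nn.Linear(8, 8).cuda().half()   # FP16 weights -> FP16 grads
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(4, 8, device="cuda", dtype=torch.float16)
loss = model(x).sum()
scaler.scale(loss).backward()
scaler.unscale_(optimizer)  # raises ValueError: Attempting to unscale FP16 gradients.
```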
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.
I am having this issue with the TheBloke template on RunPod. For a while I was able to train StableLM 3B fine, but now, every time I start training, I get about 10 steps in and the training "completes" with "ValueError: Attempting to unscale FP16 gradients." in the logs. Was there ever a solution to this issue? I am unable to use axolotl with StableLM, so this is my only alternative.
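No confirmed fix appears in this thread. The workaround commonly suggested for this error in peft/transformers setups is to upcast only the trainable adapter parameters to FP32 while leaving the frozen FP16 base weights alone; a sketch under that assumption:

```python
import torch


def upcast_trainable_params(model: torch.nn.Module) -> None:
    """Assumed workaround, not a confirmed fix from this thread: the AMP
    GradScaler can only unscale FP32 gradients, so upcast just the
    trainable (e.g. LoRA adapter) parameters to float32 and leave the
    frozen FP16 base weights untouched."""
    for param in model.parameters():
        if param.requires_grad:
            param.data = param.data.float()
```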
Describe the bug
I have trained a lot of models with various parameters, and this error never came up. I have trained LoRAs in maybe 15 completely different ways so far, and they all completed fine; several of them also used this exact same model. With the parameters below, this error popped up for the first time ever, after running for around an hour (the expected total duration was actually 13 days).
Micro-batch 1, batch 1024, epochs 950, learning rate 3e-4, scheduler cosine with restarts; rank 256, alpha 512, cutoff 512, overlap 256, newline 256.
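For reference, a rough peft/transformers translation of those settings (hypothetical; the run actually used the text-generation-webui training tab, whose cutoff/overlap/newline fields control dataset chunking and have no TrainingArguments counterpart):

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Hypothetical mapping of the reported webui settings onto
# peft/transformers options.
lora_config = LoraConfig(r=256, lora_alpha=512, task_type="CAUSAL_LM")
training_args = TrainingArguments(
    output_dir="lora-out",             # placeholder path
    per_device_train_batch_size=1,     # "Micro-batch 1"
    gradient_accumulation_steps=1024,  # "batch 1024" with micro-batch 1
    num_train_epochs=950,
    learning_rate=3e-4,
    lr_scheduler_type="cosine_with_restarts",
    fp16=True,                         # the mixed-precision mode in which the error appears
)
```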
Is there an existing issue for this?
Reproduction
I don't have the time to run these 13-day runs again to check for reproducibility, sorry.
Screenshot
No response
Logs
System Info