oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

"ValueError: Attempting to unscale FP16 gradients." #3240

Closed takosalad closed 1 year ago

takosalad commented 1 year ago

Describe the bug

I have trained a lot of LoRAs with various parameters and this error never came up before: so far I have trained LoRAs in maybe 15 completely different ways, and all of them completed fine, several with this exact same model. With the parameters below the error popped up for the first time ever, after the run had been going for around an hour (the expected total duration was actually 13 days).

Micro-batch size 1, batch size 1024, epochs 950, learning rate 3e-4, scheduler cosine with restarts. Rank 256, alpha 512, cutoff length 512, overlap 256, newline 256.
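
For reference, here is a rough sketch of what these settings look like when expressed directly as PEFT/transformers options. This is an assumption about how the webui maps its training fields, not code from modules/training.py; the `q_proj`/`v_proj` targets come from the "(q, v) projections" line in the log below, and the overlap/newline values are webui-specific raw-text chunking options with no direct equivalent here.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Hypothetical mapping of the settings above onto PEFT/transformers options.
lora_config = LoraConfig(
    r=256,                                 # Rank 256
    lora_alpha=512,                        # Alpha 512
    target_modules=["q_proj", "v_proj"],   # "(q, v) projections" per the log
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="lora-out",                 # hypothetical output path
    per_device_train_batch_size=1,         # Micro-batch 1
    gradient_accumulation_steps=1024,      # Batch 1024 with micro-batch 1
    num_train_epochs=950,
    learning_rate=3e-4,
    lr_scheduler_type="cosine_with_restarts",
    fp16=True,                             # mixed-precision path in which the error below is raised
)
# (Cutoff length 512 would be applied at tokenization time, not here.)
```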

Is there an existing issue for this?

Reproduction

I don't have the time to run these 13-day runs again to check for reproducibility, sorry.

Screenshot

No response

Logs

2023-07-21 22:54:27 INFO:Loading raw text file dataset...
(Model has been modified by previous training, it needs to be reloaded...)
2023-07-21 22:54:36 INFO:Loading Wizard-Vicuna-13B-Uncensored-HF...
2023-07-21 22:54:36 WARNING:The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:13<00:00,  4.53s/it]
2023-07-21 22:54:50 WARNING:models/Wizard-Vicuna-13B-Uncensored-HF/tokenizer_config.json is different from the original LlamaTokenizer file. It is either customized or outdated.
2023-07-21 22:54:50 WARNING:models/Wizard-Vicuna-13B-Uncensored-HF/special_tokens_map.json is different from the original LlamaTokenizer file. It is either customized or outdated.
2023-07-21 22:54:50 INFO:Loaded the model in 14.19 seconds.

Model reloaded OK, continue with training.
2023-07-21 22:54:50 INFO:Getting model ready...
2023-07-21 22:54:50 INFO:Prepping for training...
2023-07-21 22:54:50 INFO:Creating LoRA model...
2023-07-21 22:55:24 INFO:Starting training...
Training 'llama' model using (q, v) projections
Trainable params: 209,715,200 (1.5857 %), All params: 13,225,579,520 (Model: 13,015,864,320)
2023-07-21 22:55:24 INFO:Log file 'train_dataset_sample.json' created in the 'logs' directory.
Exception in thread Thread-20 (threaded_run):
Traceback (most recent call last):
  File "/home/tako/Applications/oobabooga_linux/installer_files/env/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/tako/Applications/oobabooga_linux/installer_files/env/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/tako/Applications/oobabooga_linux/text-generation-webui/modules/training.py", line 665, in threaded_run
    trainer.train()
  File "/home/tako/Applications/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/transformers/trainer.py", line 1645, in train
    return inner_training_loop(
  File "/home/tako/Applications/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/transformers/trainer.py", line 1987, in _inner_training_loop
    self.accelerator.clip_grad_norm_(
  File "/home/tako/Applications/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/accelerate/accelerator.py", line 1893, in clip_grad_norm_
    self.unscale_gradients()
  File "/home/tako/Applications/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/accelerate/accelerator.py", line 1856, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/home/tako/Applications/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 284, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(optimizer, inv_scale, found_inf, False)
  File "/home/tako/Applications/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 212, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
2023-07-21 23:14:09 INFO:Training complete, saving...
2023-07-21 23:14:09 INFO:Training complete!

System Info

i7-8700K, 64 GB RAM, RTX 4090 (24 GB).
GPU power limited to around 350 W, usage around 80% on average, temperature stable at 73 °C, no heat issues.

Ph0rk0z commented 1 year ago

Is this bitsandbytes/PyTorch? It looks like an error in one of those.
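
The traceback points at PyTorch rather than bitsandbytes: `torch/cuda/amp/grad_scaler.py` raises this ValueError whenever the gradients it is asked to unscale are stored in FP16, which happens when the trainable parameters themselves are FP16. A minimal sketch that reproduces the same error outside the webui (assuming a CUDA device; none of these names come from the thread):

```python
import torch

# FP16 weights produce FP16 gradients, which GradScaler refuses to unscale.
model = torch.nn.Linear(8, 8).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(4, 8, device="cuda")
with torch.cuda.amp.autocast():
    loss = model(x).float().sum()

scaler.scale(loss).backward()
scaler.unscale_(optimizer)  # ValueError: Attempting to unscale FP16 gradients.
```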

github-actions[bot] commented 1 year ago

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.

bubbabug commented 9 months ago

I am having this issue with the TheBloke template on RunPod. For a while I was able to train StableLM 3B fine, but now, every time I start training, it gets about 10 steps in and then "completes" with "ValueError: Attempting to unscale FP16 gradients." in the logs. Was there ever a solution to this issue? I am unable to use axolotl with StableLM, so this is my only alternative.
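
No fix is confirmed in this thread, but a workaround often suggested for this error elsewhere is to upcast only the trainable (LoRA) parameters to FP32 before training starts, since the GradScaler only objects to FP16 gradients and the frozen base weights can stay in FP16. A minimal sketch, assuming `model` is the PEFT-wrapped model the trainer is about to train (this is not code from text-generation-webui):

```python
import torch

def cast_trainable_params_to_fp32(model: torch.nn.Module) -> None:
    """Upcast parameters that will receive gradients (the LoRA adapters)
    to float32 so torch.cuda.amp.GradScaler can unscale them."""
    for param in model.parameters():
        if param.requires_grad and param.dtype == torch.float16:
            param.data = param.data.float()
```

Applying this right before `trainer.train()` is the usual suggestion; alternatively, disabling mixed precision (`fp16=False` in the training arguments) avoids the GradScaler path entirely, at the cost of speed and memory.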