msmmpts opened this issue 1 year ago
Hi @msmmpts,
NaN values are more likely if you use a really high learning rate. I would recommend retrying with a learning rate that's an order of magnitude smaller, like 0.0001.
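To illustrate the advice above: a toy sketch (not Ludwig's actual optimizer, just plain gradient descent on a quadratic) showing why an oversized learning rate makes the iterates blow up while a small one converges:

```python
def gd(lr, steps=50, x=1.0):
    """Plain gradient descent on f(x) = x**2, whose gradient is 2*x."""
    for _ in range(steps):
        x = x - lr * 2 * x
    return x

# With lr > 1 each update overshoots the minimum and amplifies |x|,
# so the iterate grows without bound (and overflows to inf/NaN in
# float32 training). A small lr shrinks x toward the minimum instead.
print(gd(2.0))     # astronomically large: diverged
print(gd(0.0001))  # stays finite, slowly approaching 0
```

The same mechanism applies to SGD/Adam on a loss surface: once one parameter overflows, NaNs propagate through every activation that touches it.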
Hi @justinxzhao ,
I tried with a learning rate of 0.0001; the same issue persists.
```
Training: 18%|█▊ | 719/4000 [22:29<44:32, 1.23it/s]training: completed batch 719 memory used: 2984.25MB
/usr/local/lib/python3.10/dist-packages/torchmetrics/aggregation.py:77: UserWarning: Encounted `nan` values in tensor. Will be removed.
  warnings.warn("Encounted `nan` values in tensor. Will be removed.", UserWarning)
```
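For context on what that torchmetrics warning means: the aggregator drops NaN entries before averaging, so the reported metric is computed only over the finite values. A stdlib-only sketch of that behaviour (not the torchmetrics implementation itself):

```python
import math

def nanmean(values):
    """Mean that removes NaN entries first — the behaviour the
    torchmetrics warning describes ("Will be removed")."""
    finite = [v for v in values if not math.isnan(v)]
    return sum(finite) / len(finite) if finite else math.nan

# The NaN is silently dropped, so the mean of [1.0, nan, 3.0] is 2.0.
print(nanmean([1.0, float("nan"), 3.0]))
```

So the warning by itself is survivable; the real problem is whatever upstream computation produced the NaNs in the first place.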
+1 on this. I see this warning, and then get the following error at the end of the first epoch each time:
```
Starting with step 0, epoch: 0
Training: 33%|███▎ | 429/1287 [32:07<1:08:57, 4.82s/it, loss=nan]Found NaN or inf values in parameter 'model.base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight' of module 'LLM'
NaN or inf tensors found in the model. Stopping training.
Could not load best checkpoint state from /mnt/disk/AI/ludwig/ludwig-lora/results/experiment_run/model/training_checkpoints/best.ckpt. Best checkpoint may not exist.
Traceback (most recent call last):
  File "/home/constellate/anaconda3/envs/ludwig/bin/ludwig", line 8, in <module>
    sys.exit(main())
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/cli.py", line 197, in main
    CLI()
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/cli.py", line 72, in __init__
    getattr(self, args.command)()
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/cli.py", line 77, in train
    train.cli(sys.argv[2:])
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/train.py", line 395, in cli
    train_cli(**vars(args))
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/train.py", line 185, in train_cli
    model.train(
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/api.py", line 678, in train
    train_stats = trainer.train(
  File "/home/constellate/anaconda3/envs/ludwig/lib/python3.10/site-packages/ludwig/trainers/trainer.py", line 1130, in train
    raise RuntimeError(error_message)
RuntimeError: Training ran into an error. No checkpoint was saved. This is because training was terminated early due to the presence of NaN or Inf values in the model weights before a single valid checkpoint could be saved.
```
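The "Found NaN or inf values in parameter …" message comes from a scan over the model's parameters. A minimal stdlib sketch of that kind of check (the real trainer iterates torch's `model.named_parameters()`; here plain `(name, values)` pairs stand in for parameter tensors):

```python
import math

def find_nonfinite_params(named_params):
    """Return the names of parameters containing NaN or inf values.
    `named_params` is an iterable of (name, list_of_floats) pairs —
    a stand-in for a torch model's named_parameters()."""
    bad = []
    for name, values in named_params:
        if any(not math.isfinite(v) for v in values):
            bad.append(name)
    return bad

# Hypothetical parameter names echoing the ones in the log above.
params = [
    ("layers.0.self_attn.q_proj.lora_A.weight", [0.01, float("nan")]),
    ("layers.0.self_attn.v_proj.lora_B.weight", [0.02, -0.03]),
]
print(find_nonfinite_params(params))  # only the q_proj LoRA A matrix is flagged
```

Note that the flagged parameter in the log is a LoRA adapter weight, which suggests the divergence happens inside the trainable adapter rather than the frozen 4-bit base weights.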
Here's my model.yaml file:
```yaml
model_type: llm
backend:
  type: local
base_model: mistralai/Mistral-7B-v0.1
quantization:
  bits: 4
adapter:
  type: lora
prompt:
  template: >-
    You are given a premise and a hypothesis below. If the premise entails the hypothesis, return 0. If the premise contradicts the hypothesis, return 2. Otherwise, if the premise does neither, return 1.
    ### Premise: {premise}
    ### Hypothesis: {hypothesis}
    ### Label:
input_features:
  - name: input
    type: text
output_features:
  - name: label
    type: text
    preprocessing:
      max_sequence_length: 1
trainer:
  type: finetune
  batch_size: auto
  gradient_accumulation_steps: 16
  enable_gradient_checkpointing: true
  epochs: 3
  learning_rate: 2.0e-4
  optimizer:
    type: paged_adam
```
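One thing worth double-checking in that trainer section: with gradient accumulation, the optimizer only steps once per accumulation window, so the effective batch size is the micro-batch size times `gradient_accumulation_steps`. A tiny sketch of the arithmetic (the micro-batch size of 8 is hypothetical, since the config uses `batch_size: auto`):

```python
# Effective batch size under gradient accumulation: the optimizer steps
# once every `accumulation_steps` micro-batches, so gradients from
# micro_batch_size * accumulation_steps examples are averaged per update.
micro_batch_size = 8      # hypothetical; `batch_size: auto` picks this at runtime
accumulation_steps = 16   # gradient_accumulation_steps from the config above
effective_batch = micro_batch_size * accumulation_steps
print(effective_batch)    # 128 examples per optimizer step
```

A larger effective batch generally tolerates a larger learning rate, which is one reason the same `learning_rate: 2.0e-4` can be stable on one setup and diverge on another where `auto` resolves to a smaller micro-batch.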
Hi team,
I was fine-tuning an LLM with Ludwig on an NVIDIA A100 instance. I get the warning `Encounted nan values in tensor. Will be removed.`, and my loss and perplexity show NaN. My config begins with:

```yaml
model_type: llm
base_model: elyza/ELYZA-japanese-Llama-2-7b-instruct
```

Here are the reported metrics:
```json
{
  "evaluation_frequency": { "frequency": 1, "period": "epoch" },
  "test": {
    "combined": { "loss": [ NaN ] },
    "output": {
      "char_error_rate": [ 1.0 ],
      "loss": [ NaN ],
      "next_token_perplexity": [ NaN ],
      "perplexity": [ NaN ],
      "sequence_accuracy": [ 0.0 ],
      "token_accuracy": [ 0.0 ]
    }
  },
  "training": {
    "combined": { "loss": [ 1.7828550338745117 ] },
    "output": {
      "char_error_rate": [ 0.9905372858047485 ],
      "loss": [ 1.7828550338745117 ],
      "next_token_perplexity": [ 16787.67578125 ],
      "perplexity": [ NaN ],
      "sequence_accuracy": [ 0.0 ],
      "token_accuracy": [ 3.948421363020316e-05 ]
    }
  },
  "validation": {
    "combined": { "loss": [ NaN ] },
    "output": {
      "char_error_rate": [ 1.0 ],
      "loss": [ NaN ],
      "next_token_perplexity": [ NaN ],
      "perplexity": [ NaN ],
      "sequence_accuracy": [ 0.0 ],
      "token_accuracy": [ 0.0 ]
    }
  }
}
```
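A note on reading those metrics: perplexity is `exp(loss)`, so any NaN in the loss makes every derived metric NaN, and even a finite but large loss overflows to infinity. A small sketch of that relationship (not Ludwig's metric code, just the defining formula):

```python
import math

def perplexity(loss):
    """Perplexity is exp(cross-entropy loss). It propagates NaN from a
    NaN loss, and overflows to inf when the loss is very large."""
    try:
        return math.exp(loss)
    except OverflowError:
        return math.inf

print(perplexity(1.7828550338745117))  # ~5.95: a plausible training perplexity
print(perplexity(float("nan")))        # nan: a NaN loss poisons derived metrics
```

That the training loss above is a healthy 1.78 while validation/test losses are NaN suggests the NaNs appear during evaluation (e.g. in padded or empty sequences), not necessarily in the optimizer updates themselves.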