huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

Llama 3 8B fine tuning shows nan value as loss #660

Open BaiqingL opened 4 months ago

BaiqingL commented 4 months ago

System Info

Platform:

- Platform: Linux-5.15.0-1056-aws-x86_64-with-glibc2.29
- Python version: 3.8.10

Python packages:

- `optimum-neuron` version: 0.0.24.dev0
- `neuron-sdk` version: 2.18.0
- `optimum` version: 1.20.0
- `transformers` version: 4.41.1
- `huggingface_hub` version: 0.23.2
- `torch` version: 2.1.2+cu121
- `aws-neuronx-runtime-discovery` version: 2.9
- `libneuronxla` version: 2.0.965
- `neuronx-cc` version: 2.13.66.0+6dfecc895
- `neuronx-distributed` version: 0.7.0
- `neuronx-hwm` version: NA
- `torch-neuronx` version: 2.1.2.2.1.0
- `torch-xla` version: 2.1.2
- `transformers-neuronx` version: 0.10.0.21

Neuron Driver:

aws-neuronx-collectives/unknown,now 2.20.22.0-c101c322e amd64 [installed,upgradable to: 2.21.46.0-69b77134b]
aws-neuronx-dkms/unknown,now 2.16.7.0 amd64 [installed,upgradable to: 2.17.17.0]
aws-neuronx-oci-hook/unknown,now 2.3.0.0 amd64 [installed,upgradable to: 2.4.4.0]
aws-neuronx-runtime-lib/unknown,now 2.20.22.0-1b3ca6425 amd64 [installed,upgradable to: 2.21.41.0-fb1705f5f]
aws-neuronx-tools/unknown,now 2.17.1.0 amd64 [installed,upgradable to: 2.18.3.0]

Who can help?

@michaelbenayoun

Reproduction (minimal, reproducible, runnable)

In the notebook, change the model id to `meta-llama/Meta-Llama-3-8B` and set the environment variable with `os.environ['XLA_USE_BF16'] = "1"`. The training loss is then reported as `nan` at every logging step. Here is an example of the training log:

{'loss': nan, 'learning_rate': 4.166666666666667e-05, 'epoch': 0.48}            
{'loss': nan, 'learning_rate': 3.3333333333333335e-05, 'epoch': 0.96}           
{'loss': nan, 'learning_rate': 2.5e-05, 'epoch': 1.45}                          
{'loss': nan, 'learning_rate': 1.6666666666666667e-05, 'epoch': 1.93}           
{'loss': nan, 'learning_rate': 8.333333333333334e-06, 'epoch': 2.41}            
{'loss': nan, 'learning_rate': 0.0, 'epoch': 2.89}                              
100%|███████████████████████████████████████████| 60/60 [09:55<00:00,  9.09s/it]

Training completed. Do not forget to share your model on huggingface.co/models =)

{'train_runtime': 595.4478, 'train_samples_per_second': 1.673, 'train_steps_per_second': 0.101, 'train_loss': nan, 'epoch': 2.89}
100%|███████████████████████████████████████████| 60/60 [09:55<00:00,  9.92s/it]
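
For anyone triaging a similar run, a log like the one above can be scanned for non-finite losses with a few lines of Python. This is only a sketch: the helper name `non_finite_loss_steps` is hypothetical, and it assumes every log line uses the dict-style format shown above with a `'loss': <value>` entry.

```python
import math
import re

# Matches the "'loss': <value>" entry of a per-step log line.
# The leading quote keeps it from matching "'train_loss'" in the summary line.
LOSS_RE = re.compile(r"'loss':\s*([^,}\s]+)")


def non_finite_loss_steps(log_text):
    """Return (line_index, raw_value) for every logged loss that is nan or inf."""
    bad = []
    for i, line in enumerate(log_text.splitlines()):
        match = LOSS_RE.search(line)
        if match is None:
            continue  # progress bars, blank lines, summary line
        raw = match.group(1)
        try:
            value = float(raw)  # float() accepts "nan", "inf" and "-inf"
        except ValueError:
            continue  # not a number, ignore
        if not math.isfinite(value):
            bad.append((i, raw))
    return bad


sample = """{'loss': nan, 'learning_rate': 4.166666666666667e-05, 'epoch': 0.48}
{'loss': nan, 'learning_rate': 3.3333333333333335e-05, 'epoch': 0.96}"""
print(non_finite_loss_steps(sample))  # [(0, 'nan'), (1, 'nan')]
```

An empty result means every logged loss was finite. In the run above, every logged step would be flagged.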

In addition, running inference with the resulting model raises the following error:

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
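
This error is consistent with the `nan` loss, assuming the saved weights themselves ended up as `nan` (not verified here): any `nan` in the logits turns the whole probability vector into `nan`, so a sampling step has nothing valid to draw from. The pure-Python sketch below only illustrates that propagation. `softmax` and `check_probs` are illustrative helpers, not the code path the generation call actually uses.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def check_probs(probs):
    """Illustrative validity check: probabilities must be finite and non-negative."""
    if any(math.isnan(p) or math.isinf(p) or p < 0 for p in probs):
        raise ValueError("invalid probabilities: cannot sample")


healthy = softmax([2.0, 1.0, 0.5])
check_probs(healthy)  # passes silently

# A single nan logit makes the normalizing sum nan, which poisons every entry.
broken = softmax([2.0, float("nan"), 0.5])
print(all(math.isnan(p) for p in broken))  # True
```

If this is the cause, inspecting the saved checkpoint for `nan` values before running generation would confirm it.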

Expected behavior

Training should report a finite loss, and inference should run normally, as the notebook intends.

jianyinglangaws commented 4 months ago

I saw the same.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.