System Info
Platform:
- Platform: Linux-5.15.0-1056-aws-x86_64-with-glibc2.29
- Python version: 3.8.10
Python packages:
- `optimum-neuron` version: 0.0.24.dev0
- `neuron-sdk` version: 2.18.0
- `optimum` version: 1.20.0
- `transformers` version: 4.41.1
- `huggingface_hub` version: 0.23.2
- `torch` version: 2.1.2+cu121
- `aws-neuronx-runtime-discovery` version: 2.9
- `libneuronxla` version: 2.0.965
- `neuronx-cc` version: 2.13.66.0+6dfecc895
- `neuronx-distributed` version: 0.7.0
- `neuronx-hwm` version: NA
- `torch-neuronx` version: 2.1.2.2.1.0
- `torch-xla` version: 2.1.2
- `transformers-neuronx` version: 0.10.0.21
Neuron Driver:
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
aws-neuronx-collectives/unknown,now 2.20.22.0-c101c322e amd64 [installed,upgradable to: 2.21.46.0-69b77134b]
aws-neuronx-dkms/unknown,now 2.16.7.0 amd64 [installed,upgradable to: 2.17.17.0]
aws-neuronx-oci-hook/unknown,now 2.3.0.0 amd64 [installed,upgradable to: 2.4.4.0]
aws-neuronx-runtime-lib/unknown,now 2.20.22.0-1b3ca6425 amd64 [installed,upgradable to: 2.21.41.0-fb1705f5f]
aws-neuronx-tools/unknown,now 2.17.1.0 amd64 [installed,upgradable to: 2.18.3.0]
Who can help?
@michaelbenayoun
Information
[X] The official example scripts
[ ] My own modified scripts
Tasks
[X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)
Reproduction (minimal, reproducible, runnable)
In the example notebook, change the model id to `meta-llama/Meta-Llama-3-8B` and set the environment variable with `os.environ['XLA_USE_BF16'] = "1"`. The training loss then shows up as `nan` at every logging step. Here is an example of the training log:
{'loss': nan, 'learning_rate': 4.166666666666667e-05, 'epoch': 0.48}
{'loss': nan, 'learning_rate': 3.3333333333333335e-05, 'epoch': 0.96}
{'loss': nan, 'learning_rate': 2.5e-05, 'epoch': 1.45}
{'loss': nan, 'learning_rate': 1.6666666666666667e-05, 'epoch': 1.93}
{'loss': nan, 'learning_rate': 8.333333333333334e-06, 'epoch': 2.41}
{'loss': nan, 'learning_rate': 0.0, 'epoch': 2.89}
100%|███████████████████████████████████████████| 60/60 [09:55<00:00, 9.09s/it]
Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 595.4478, 'train_samples_per_second': 1.673, 'train_steps_per_second': 0.101, 'train_loss': nan, 'epoch': 2.89}
100%|███████████████████████████████████████████| 60/60 [09:55<00:00, 9.92s/it]
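For anyone triaging a similar run, the log above can be scanned for non-finite losses automatically. This is only a minimal sketch, assuming the log entries are printed as dict-style lines like the ones shown; the helper name `non_finite_loss_lines` and the regular expression are hypothetical and not part of `optimum-neuron` or `transformers`.

```python
import math
import re

# Matches the value of a 'loss' or 'train_loss' key in a dict-style log line.
# Assumption: lines look like the Trainer output pasted above.
LOSS_RE = re.compile(r"'(?:train_)?loss':\s*([^,}\s]+)")


def non_finite_loss_lines(lines):
    """Return (line_index, raw_value) for every line whose loss is nan or inf."""
    bad = []
    for i, line in enumerate(lines):
        match = LOSS_RE.search(line)
        if match is None:
            continue  # no loss key on this line (e.g. a progress bar)
        try:
            value = float(match.group(1))
        except ValueError:
            continue  # value is not a number, skip it
        if not math.isfinite(value):
            bad.append((i, match.group(1)))
    return bad


log = [
    "{'loss': nan, 'learning_rate': 4.166666666666667e-05, 'epoch': 0.48}",
    "{'loss': nan, 'learning_rate': 0.0, 'epoch': 2.89}",
    "{'train_runtime': 595.4478, 'train_loss': nan, 'epoch': 2.89}",
]
print(non_finite_loss_lines(log))
```

Every entry in this run is flagged, including the first logged step, so the problem is present from the start of training rather than appearing after a later divergence.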
In addition, running inference with the resulting model raises the following error:
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Expected behavior
Training and inference should both run normally, as the notebook intends.