huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

Cache files ignored when running multi-host distributed training #507

Open samir-souza opened 4 months ago

samir-souza commented 4 months ago

System Info

Platform:

- Platform: Linux-5.15.0-1053-aws-x86_64-with-glibc2.31
- Python version: 3.10.12

Python packages:

- `optimum-neuron` version: 0.0.19
- `neuron-sdk` version: 2.17.0
- `optimum` version: 1.17.1
- `transformers` version: 4.36.2
- `huggingface_hub` version: 0.20.3
- `torch` version: 1.13.1+cu117
- `aws-neuronx-runtime-discovery` version: 2.9
- `libneuronxla` version: 0.5.809
- `neuronx-cc` version: 2.12.68.0+4480452af
- `neuronx-distributed` version: 0.6.0
- `neuronx-hwm` version: 2.12.0.0+422c9037c
- `torch-neuronx` version: 1.13.1.1.13.1
- `torch-xla` version: 1.13.1+torchneurond
- `transformers-neuronx` version: 0.9.474

Neuron Driver:


aws-neuronx-collectives/now 2.20.11.0-c101c322e amd64 [installed,local]
aws-neuronx-runtime-lib/now 2.20.11.0-b7d33e68b amd64 [installed,local]
aws-neuronx-tools/now 2.17.0.0 amd64 [installed,local]

Who can help?

@michaelbenayoun

Reproduction (minimal, reproducible, runnable)

Please check the notebook: https://github.com/samir-souza/laboratory/blob/master/15_Trainium/01_LLMFineTuning/LLMFineTuning.ipynb; it contains the code and all the details needed to reproduce the error.
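For orientation, below is a minimal sketch of the kind of two-instance SageMaker launch the notebook performs. The entry point, role ARN, and version strings are illustrative placeholders, not the notebook's actual values:

```python
# Hedged sketch of a 2-instance Trainium training job on SageMaker.
# train.py, the role ARN, and the framework versions are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                # fine-tuning script (placeholder)
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder
    instance_count=2,                      # 2+ hosts is what triggers the bug
    instance_type="ml.trn1.32xlarge",
    framework_version="1.13.1",            # match your Neuron PyTorch DLC
    py_version="py39",
    distribution={"torch_distributed": {"enabled": True}},
)

estimator.fit()  # first run: compiles and uploads cache files to the Hub
estimator.fit()  # second, identical run: recompiles instead of hitting the cache
```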

Expected behavior

When training Llama or Mistral (and possibly other models) on 2 or more SageMaker instances, cache files are created and uploaded to the HF Hub repo during the first training run, but they are ignored when the same training is run a second time: the graphs are recompiled and the cache files regenerated on every run.

The second run should instead reuse the cached files and complete faster.
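For context, here is a minimal sketch of how a training script points the Neuron compilation cache at a Hub repo. It assumes the `CUSTOM_CACHE_REPO` environment variable described in the optimum-neuron cache documentation; the repo name below is a placeholder:

```python
import os

# Hedged sketch: point optimum-neuron's Neuron compilation cache at a Hub repo.
# CUSTOM_CACHE_REPO is the variable the optimum-neuron cache docs describe;
# "my-org/optimum-neuron-cache" is a placeholder, not the notebook's repo.
os.environ["CUSTOM_CACHE_REPO"] = "my-org/optimum-neuron-cache"

# On a multi-host job this must be set on every host before the first
# compilation is triggered; a worker that misses it falls back to
# compiling locally instead of fetching the cached artifacts.
```

The bug reported here is that even with the cache populated by the first run, the second multi-host run does not pick it up.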