huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

Cache files ignored when running multi-host distributed training #507

Open samir-souza opened 4 months ago

samir-souza commented 4 months ago

System Info

Platform:

- Platform: Linux-5.15.0-1053-aws-x86_64-with-glibc2.31
- Python version: 3.10.12

Python packages:

- `optimum-neuron` version: 0.0.19
- `neuron-sdk` version: 2.17.0
- `optimum` version: 1.17.1
- `transformers` version: 4.36.2
- `huggingface_hub` version: 0.20.3
- `torch` version: 1.13.1+cu117
- `aws-neuronx-runtime-discovery` version: 2.9
- `libneuronxla` version: 0.5.809
- `neuronx-cc` version: 2.12.68.0+4480452af
- `neuronx-distributed` version: 0.6.0
- `neuronx-hwm` version: 2.12.0.0+422c9037c
- `torch-neuronx` version: 1.13.1.1.13.1
- `torch-xla` version: 1.13.1+torchneurond
- `transformers-neuronx` version: 0.9.474

Neuron Driver:


aws-neuronx-collectives/now 2.20.11.0-c101c322e amd64 [installed,local]
aws-neuronx-runtime-lib/now 2.20.11.0-b7d33e68b amd64 [installed,local]
aws-neuronx-tools/now 2.17.0.0 amd64 [installed,local]

Who can help?

@michaelbenayoun

Reproduction (minimal, reproducible, runnable)

Please check the notebook: https://github.com/samir-souza/laboratory/blob/master/15_Trainium/01_LLMFineTuning/LLMFineTuning.ipynb; it contains the code and all the details needed to reproduce the error.
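For orientation, below is a minimal sketch of the kind of two-instance SageMaker launch the notebook performs. The entry point, role ARN, and version strings are illustrative placeholders, not the notebook's actual values:

```python
# Hedged sketch of a 2-instance Trainium training job on SageMaker.
# train.py, the role ARN, and the framework versions are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",                # fine-tuning script (placeholder)
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder
    instance_count=2,                      # 2+ hosts is what triggers the bug
    instance_type="ml.trn1.32xlarge",
    framework_version="1.13.1",            # match your Neuron PyTorch DLC
    py_version="py39",
    distribution={"torch_distributed": {"enabled": True}},
)

estimator.fit()  # first run: compiles and uploads cache files to the Hub
estimator.fit()  # second, identical run: recompiles instead of hitting the cache
```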

Expected behavior

When training Llama or Mistral (and possibly other models) on 2 or more SageMaker instances, cache files are created and uploaded to the HF Hub repo during the first training run, but they are ignored when the same training is run a second time: the graphs are recompiled and the cache files regenerated on every run.

The second run should instead reuse the cached files and complete faster.
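For context, here is a minimal sketch of how a training script points the Neuron compilation cache at a Hub repo. It assumes the `CUSTOM_CACHE_REPO` environment variable described in the optimum-neuron cache documentation; the repo name below is a placeholder:

```python
import os

# Hedged sketch: point optimum-neuron's Neuron compilation cache at a Hub repo.
# CUSTOM_CACHE_REPO is the variable the optimum-neuron cache docs describe;
# "my-org/optimum-neuron-cache" is a placeholder, not the notebook's repo.
os.environ["CUSTOM_CACHE_REPO"] = "my-org/optimum-neuron-cache"

# On a multi-host job this must be set on every host before the first
# compilation is triggered; a worker that misses it falls back to
# compiling locally instead of fetching the cached artifacts.
```

The bug reported here is that even with the cache populated by the first run, the second multi-host run does not pick it up.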