When training Llama or Mistral (and possibly other models) on SageMaker with 2 or more instances, the compilation cache files are created and uploaded to the HF Hub repo during the first training run, but they are ignored when the same training is run a second time: the cache files are recompiled from scratch every run.
The cached files should be reused so the job completes faster.
System Info
Who can help?
@michaelbenayoun
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction (minimal, reproducible, runnable)
Please check the notebook: https://github.com/samir-souza/laboratory/blob/master/15_Trainium/01_LLMFineTuning/LLMFineTuning.ipynb. It contains the code and all the details needed to reproduce the error.
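For context, a minimal sketch of how the cache repo can be pinned explicitly when launching the job, so that both runs resolve the same Hub repo. This is an assumption about the setup, not code from the notebook: the repo id `my-org/optimum-neuron-cache` is hypothetical, and `CUSTOM_CACHE_REPO` is the environment variable optimum-neuron reads to locate its compilation cache on the Hugging Face Hub.

```python
import os

def build_training_environment(cache_repo_id: str) -> dict:
    """Environment variables to pass to the SageMaker estimator so that
    every training run points at the same compilation cache repo."""
    return {
        # optimum-neuron reads CUSTOM_CACHE_REPO to find the Neuron
        # compilation cache repo on the Hugging Face Hub
        "CUSTOM_CACHE_REPO": cache_repo_id,
        # token forwarded so the job can pull/push cached artifacts
        "HF_TOKEN": os.environ.get("HF_TOKEN", ""),
    }

# Hypothetical repo id; pass this dict as `environment=` to the estimator
env = build_training_environment("my-org/optimum-neuron-cache")
```

Even with the cache repo pinned this way, the second run still recompiles, which is the behavior being reported.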
Expected behavior
The cache files created and uploaded to the HF Hub repo during the first training run should be reused by subsequent runs of the same training, so the job completes faster without recompiling.