intel / llm-on-ray

Pretrain, finetune and serve LLMs on Intel platforms with Ray
Apache License 2.0

[Finetune] Fix fault-tolerant training #245

Closed: xwu99 closed this issue 5 months ago

xwu99 commented 5 months ago

We should report metrics and checkpoints from the local Transformers Trainer to Ray Train to ensure fault-tolerant training.

Relevant docs:
https://docs.ray.io/en/latest/train/getting-started-transformers.html#report-checkpoints-and-metrics
https://docs.ray.io/en/latest/train/user-guides/checkpoints.html
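As a rough sketch of what the linked docs describe: the Hugging Face `Trainer` running inside the Ray Train worker function can be given Ray's `RayTrainReportCallback` so that metrics and checkpoints flow back to Ray Train, and the outer `TorchTrainer` can be configured with a `FailureConfig` so training restores from the latest reported checkpoint on failure. All names below other than the Ray/Transformers APIs (`train_func`, `main`, worker counts, `max_failures`) are illustrative assumptions, not code from this repo:

```python
# Illustrative sketch only: reporting checkpoints/metrics from a local
# Hugging Face Trainer to Ray Train, per the linked Ray docs.
# Model/dataset setup is elided; function names are hypothetical.

def train_func():
    # Runs on each Ray Train worker.
    import transformers
    import ray.train.huggingface.transformers as ray_transformers

    trainer = transformers.Trainer(...)  # model, args, datasets elided

    # Report metrics and checkpoints to Ray Train on each save, so Ray
    # can restore from the latest checkpoint after a worker failure.
    trainer.add_callback(ray_transformers.RayTrainReportCallback())

    # Prepare the Trainer for distributed execution under Ray Train.
    trainer = ray_transformers.prepare_trainer(trainer)
    trainer.train()


def main():
    from ray.train import FailureConfig, RunConfig, ScalingConfig
    from ray.train.torch import TorchTrainer

    ray_trainer = TorchTrainer(
        train_func,
        scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
        # max_failures enables automatic retry from the last checkpoint.
        run_config=RunConfig(failure_config=FailureConfig(max_failures=3)),
    )
    ray_trainer.fit()
```

The key change relative to a plain local run is the callback plus `prepare_trainer`: without them, checkpoints stay on the worker and Ray Train has nothing to restore from.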

@harborn @KepingYan Could you study this and clarify the correct process for the new TorchTrainer?