houminz / paper-reading

Paper Reading: covering distributed systems, virtualization, networking, and machine learning
https://houmin.cc/papers

Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints #24

Open houminz opened 4 months ago

houminz commented 4 months ago

Paper: https://zhuangwang93.github.io/docs/Gemini_SOSP23.pdf