pentium3 / sys_reading

system paper reading notes

Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints #305

Open: pentium3 opened this issue 8 months ago

pentium3 commented 8 months ago

https://dl.acm.org/doi/10.1145/3600006.3613145

pentium3 commented 7 months ago

https://zhuanlan.zhihu.com/p/660282411