pentium3 / sys_reading

system paper reading notes

Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints #305

Open pentium3 opened 1 year ago

pentium3 commented 1 year ago

https://dl.acm.org/doi/10.1145/3600006.3613145

pentium3 commented 11 months ago

https://zhuanlan.zhihu.com/p/660282411