pfnet / pytorch-pfn-extras

Supplementary components to accelerate research and development in PyTorch
https://medium.com/pytorch/migration-from-chainer-to-pytorch-8ed92c12c8
MIT License

Snapshot consistency discussion revival #81

Open kuenishi opened 3 years ago

kuenishi commented 3 years ago

Looking around the snapshot code, I noticed that it was almost directly ported from Chainer. As such, the discussion in https://github.com/chainer/chainer/issues/6763 also applies to PPE. Just a heads up...

emcastillo commented 3 years ago

Just to make sure I understand: the issue is that the parameter buffers could be changed while we are writing them to storage?

kuenishi commented 3 years ago

Yes, exactly. We haven't had any consistency issues so far, because storage is not that slow compared to training speed (5-10 iters/sec). But in cases where training iterations are much faster than the storage write, this will become more problematic.
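To illustrate the race being discussed, here is a minimal, hypothetical sketch (not PPE's actual snapshot implementation): a slow writer serializes parameter buffers while the training loop keeps mutating them in place, so the saved snapshot could mix values from different iterations. One common mitigation, sketched below with plain dicts standing in for parameter buffers, is to take a cheap in-memory copy synchronously and let the slow storage write operate on the frozen copy.

```python
import copy
import threading
import time

# Stand-in for a model's parameter buffers; updated in place each step,
# as an optimizer would do.
params = {"w": [0.0], "b": [0.0]}
lock = threading.Lock()

def train_step(step):
    # In-place updates: if a writer reads params concurrently without
    # coordination, it may see "w" from one step and "b" from another.
    with lock:
        params["w"][0] = float(step)
        params["b"][0] = float(step)

def snapshot():
    # Copy under the lock: fast and internally consistent.
    with lock:
        frozen = copy.deepcopy(params)
    # The slow storage write then proceeds on the frozen copy, off the
    # training hot path (simulated here with a sleep).
    time.sleep(0.01)
    return frozen

for step in range(1, 5):
    train_step(step)
snap = snapshot()
train_step(99)  # training continues; the copy is unaffected
assert snap["w"][0] == snap["b"][0] == 4.0  # snapshot is consistent
```

The copy-then-write pattern trades a brief pause and extra memory for consistency; with real tensors the equivalent would be cloning the state dict before handing it to the (possibly asynchronous) writer.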