pytorch / examples

A set of examples around pytorch in Vision, Text, Reinforcement Learning, etc.
https://pytorch.org/examples
BSD 3-Clause "New" or "Revised" License

distributed validate #739

Open wuzhi19931128 opened 4 years ago

wuzhi19931128 commented 4 years ago

It seems that validation runs in full on every GPU in distributed mode. What should be changed to save time by running validation in a distributed manner?

soumyasanyal commented 4 years ago

Same question. Given that the models across GPUs are synced after optimizer.step(), every validation run is effectively the same. As an optimization, if we run validation in a distributed manner (like training), how do we average the accuracies across GPUs and nodes to decide when to save checkpoints?
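
One possible approach, sketched here rather than taken from the example scripts: give the validation DataLoader a `torch.utils.data.distributed.DistributedSampler` so each process evaluates only its own shard, then sum the raw correct/total counts across processes with `torch.distributed.all_reduce` and divide once. Summing counts avoids the bias of averaging per-rank accuracies when shards differ in size. The helper name `global_accuracy` and its `device` parameter are hypothetical, not part of the repository.

```python
import torch
import torch.distributed as dist


def global_accuracy(correct: int, total: int, device: str = "cpu") -> float:
    """Accuracy over all ranks, computed from summed counts (hypothetical helper)."""
    # Pack both counts into one tensor so a single all_reduce is enough.
    # Assumption: `device` matches the backend (a CUDA device for NCCL, CPU for Gloo).
    counts = torch.tensor([correct, total], dtype=torch.float64, device=device)

    # Reduce only when a process group exists; otherwise use the local counts,
    # which keeps the function usable in a single-process run.
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(counts, op=dist.ReduceOp.SUM)

    return (counts[0] / counts[1]).item()
```

After the reduction every rank holds the same value, so each can make the same "is this the best so far" decision while only rank 0 writes the checkpoint file. Note that `DistributedSampler` by default pads the dataset with repeated samples so it divides evenly across processes, so the result can differ slightly from a single-GPU evaluation when the validation set size is not a multiple of the world size.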