pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

Hangs when trying to save model on master ordinal with `xm.rendezvous` #2224

Closed harpone closed 4 years ago

harpone commented 4 years ago

🐛 Bug

Trying to save the model from the master ordinal when training on TPU pods. This works fine on e.g. a v3-8, but hangs on pods.

To Reproduce

import os

import torch_xla.core.xla_model as xm
from termcolor import colored

# Non-master processes wait at the rendezvous while the master saves.
if not xm.is_master_ordinal():
    xm.rendezvous('save_model')
if xm.is_master_ordinal():
    os.makedirs(train_args.run_name, exist_ok=True)
    xm.save(model.state_dict(), os.path.join(train_args.run_name, 'model_best.pth'))
    xm.save(optimizer.state_dict(), os.path.join(train_args.run_name, 'optimizer_best.pth'))
    print(colored('Saved best model params', 'green'))
# Master joins the rendezvous last, releasing the waiting processes.
if xm.is_master_ordinal():
    xm.rendezvous('save_model')

Getting

2020-06-15 10:45:21.337106: I 3221 tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) 10.164.0.77:8477

so apparently the master process is somehow unable to rendezvous with the others? I checked on the master that the model params are saved just fine...

Environment

Please clarify the xm.rendezvous docs, they're really confusing at the moment...

harpone commented 4 years ago

Oops, didn't realize xm.save already has a built-in rendezvous :/