I'm trying to save the model on each master ordinal when training on TPU pods. This works without problems on e.g. a v3-8, but it doesn't work on pods.
To Reproduce
# Non-master processes wait here until the master has finished saving.
if not xm.is_master_ordinal():
    xm.rendezvous('save_model')
if xm.is_master_ordinal():
    os.makedirs(train_args.run_name, exist_ok=True)
    xm.save(model.state_dict(), os.path.join(train_args.run_name, 'model_best.pth'))
    xm.save(optimizer.state_dict(), os.path.join(train_args.run_name, 'optimizer_best.pth'))
    print(colored('Saved best model params', 'green'))
# The master joins the rendezvous only after saving.
if xm.is_master_ordinal():
    xm.rendezvous('save_model')
This is the output I get:
2020-06-15 10:45:21.337106: I 3221 tensorflow/compiler/xla/xla_client/mesh_service.cc:208] Waiting to connect to client mesh master (300 seconds) 10.164.0.77:8477
So apparently the master process is somehow unable to rendezvous with the others? I checked on the master, and the model params are saved just fine.
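For reference, here is a minimal sketch of the synchronization I expect from the snippet above. It is plain Python, not torch_xla: threads stand in for TPU processes, `threading.Barrier` stands in for `xm.rendezvous('save_model')`, and ordinal 0 stands in for the master. The function name `run_save_pattern` and its parameters are hypothetical, and the whole thing assumes that a rendezvous behaves like a barrier that releases only once every participant has arrived.

```python
import threading


def run_save_pattern(world_size, missing=0, timeout=5.0):
    """Simulate 'master saves, then everyone meets at one barrier'.

    `missing` participants never start, which models a process that
    never reaches the rendezvous (an assumption made for illustration).
    """
    barrier = threading.Barrier(world_size)  # stand-in for xm.rendezvous
    events = []
    lock = threading.Lock()

    def worker(ordinal):
        if ordinal == 0:  # stand-in for xm.is_master_ordinal()
            with lock:
                events.append("saved")  # stand-in for xm.save(...)
        try:
            # Every participant reaches the same barrier exactly once.
            barrier.wait(timeout=timeout)
            outcome = "released"
        except threading.BrokenBarrierError:
            # Raised when the barrier times out because someone is absent.
            outcome = "timeout"
        with lock:
            events.append(f"{outcome}:{ordinal}")

    threads = [
        threading.Thread(target=worker, args=(i,))
        for i in range(world_size - missing)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


ok_events = run_save_pattern(world_size=4)
hung_events = run_save_pattern(world_size=3, missing=1, timeout=0.2)
```

In the first call every participant is released, and only after the save has happened. In the second call one participant never arrives, so the others block until the timeout expires. That looks like the kind of hang I'm seeing on pods, although I can't tell from the log which process is the one not arriving.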
Environment
Reproducible on XLA backend [TPU]:
torch_xla version: torch-xla-1.5
Please clarify the xm.rendezvous docs; they are really confusing at the moment.