Open chongxiaoc opened 2 years ago
The driver node and rank 0 use the same path to save and load weights in ModelCheckpointCallback.
However, the driver node and rank 0 may not be on the same machine, and may not even share a file system.
In that case, sending the local `best_model_path` on rank 0 back to the driver node is meaningless, because the driver cannot open that path.
Rank 0 probably has to push the best model weights to persistent storage in a custom callback on `train_end_stage`.
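A rough sketch of what such a callback could look like, under several assumptions: the class name `PushBestCheckpoint` and the `persistent_dir` argument are hypothetical, the hook is written in the style of PyTorch Lightning's `on_train_end` (which may not match the `train_end_stage` hook meant here), and a plain directory copy stands in for whatever persistent storage is actually used (HDFS, S3, a shared mount, and so on). It reads `trainer.global_rank` and `trainer.checkpoint_callback.best_model_path`, which is how Lightning exposes them.

```python
import os
import shutil


class PushBestCheckpoint:
    """Hypothetical callback: copy rank 0's best checkpoint to persistent storage.

    In real use this would subclass the framework's callback base class; it is
    kept as a plain class here so the sketch runs without extra dependencies.
    """

    def __init__(self, persistent_dir):
        # Assumption: persistent_dir is reachable from both rank 0 and the driver.
        self.persistent_dir = persistent_dir
        self.remote_path = None

    def on_train_end(self, trainer, pl_module=None):
        # Only rank 0 owns the checkpoint files, so other ranks do nothing.
        if trainer.global_rank != 0:
            return None

        local_path = trainer.checkpoint_callback.best_model_path
        if not local_path or not os.path.isfile(local_path):
            # No checkpoint was written, so there is nothing to push.
            return None

        os.makedirs(self.persistent_dir, exist_ok=True)
        self.remote_path = os.path.join(
            self.persistent_dir, os.path.basename(local_path)
        )
        # Stand-in for an upload to real persistent storage.
        shutil.copyfile(local_path, self.remote_path)
        return self.remote_path
```

With something like this, the driver would load weights from the persistent location rather than from the rank-0-local `best_model_path`.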