ray-project / ray_lightning

Pytorch Lightning Distributed Accelerators using Ray
Apache License 2.0

[Code] best_model_path in ModelCheckpointCallback (rank 0 and driver node) #202

Open chongxiaoc opened 2 years ago

chongxiaoc commented 2 years ago

The driver node and the rank 0 worker use the same path to save and load weights in ModelCheckpointCallback.

However, the driver node and rank 0 may not be on the same machine, and they may not even share a file system.

In that case, sending the local best_model_path from rank 0 back to the driver node is meaningless, because that path may not exist on the driver.

Rank 0 probably has to push the best model weights to persistent storage in a custom callback at the end of training (train_end_stage).
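A rough sketch of what such a callback could look like, not a tested fix for ray_lightning. The names `PublishBestCheckpoint`, `publish_checkpoint` and `persistent_dir` are made up for illustration, and "persistent storage" is assumed here to be a directory that both rank 0 and the driver can reach (for example a network mount). Uploading to an object store would need a different copy step. It relies only on the Lightning `on_train_end` hook, `trainer.is_global_zero` and `trainer.checkpoint_callback.best_model_path`.

```python
import os
import shutil

try:
    from pytorch_lightning import Callback
except ImportError:
    # Fallback so the sketch can be read and exercised without Lightning installed.
    Callback = object


def publish_checkpoint(local_path, persistent_dir):
    """Copy a checkpoint file into persistent_dir and return the new path.

    Returns None if there is no local checkpoint file to copy.
    """
    if not local_path or not os.path.isfile(local_path):
        return None
    os.makedirs(persistent_dir, exist_ok=True)
    target = os.path.join(persistent_dir, os.path.basename(local_path))
    shutil.copy2(local_path, target)
    return target


class PublishBestCheckpoint(Callback):
    """Hypothetical callback: rank 0 copies its best checkpoint to shared storage."""

    def __init__(self, persistent_dir):
        super().__init__()
        # Assumption: this directory is visible from both rank 0 and the driver.
        self.persistent_dir = persistent_dir
        self.published_path = None

    def on_train_end(self, trainer, pl_module):
        # Only rank 0 owns the checkpoint files written by the checkpoint callback.
        if not trainer.is_global_zero:
            return
        ckpt_cb = trainer.checkpoint_callback
        if ckpt_cb is None:
            return
        self.published_path = publish_checkpoint(
            ckpt_cb.best_model_path, self.persistent_dir
        )
```

Note that `published_path` is set on the worker's copy of the callback, so the driver would still have to look in `persistent_dir` itself (or receive the published path from rank 0) rather than rely on the worker-local best_model_path.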
