pytorch / torchtitan

A native PyTorch Library for large model training
BSD 3-Clause "New" or "Revised" License

Only include checkpoints that have .metadata written #315

Closed liangluofb closed 1 month ago

liangluofb commented 1 month ago

A checkpoint may be missing its `.metadata` file if some ranks did not finish checkpointing properly. This PR filters out checkpoints that do not contain `.metadata`, so only complete checkpoints are considered.
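For illustration, the filtering idea could look roughly like the sketch below. This is not the code from this PR: the function name `complete_checkpoint_steps` is hypothetical, and it assumes each checkpoint lives in a subfolder named `step-<N>` under a common root, with `.metadata` stored directly inside that subfolder.

```python
import os
import re
from typing import List

# Assumed folder naming for illustration: one subfolder per step, e.g. "step-1000".
STEP_DIR_PATTERN = re.compile(r"step-(\d+)")


def complete_checkpoint_steps(checkpoint_root: str) -> List[int]:
    """Return the sorted step numbers of checkpoints that contain a .metadata file.

    Hypothetical helper: folders without .metadata are treated as incomplete
    and skipped.
    """
    steps: List[int] = []
    if not os.path.isdir(checkpoint_root):
        return steps
    for name in os.listdir(checkpoint_root):
        match = STEP_DIR_PATTERN.fullmatch(name)
        if match is None:
            continue  # not a checkpoint folder under the assumed naming
        metadata_path = os.path.join(checkpoint_root, name, ".metadata")
        if os.path.isfile(metadata_path):
            steps.append(int(match.group(1)))
    return sorted(steps)
```

With a helper like this, the latest usable checkpoint would be the last element of the returned list, or none at all if the list is empty.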