Open 5cp opened 7 months ago
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!
When training with example scripts such as run_clm.py using tensor parallelism (TP), the checkpoints are saved in sharded format and require consolidation before they can be used with downstream tools.
For example, the following code can be added to the end of the training block in run_clm.py to perform consolidation and remove the shards:
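The original snippet is not shown here, but the consolidation step could look something like the sketch below. This is illustrative only: it assumes each TP rank saved a plain state dict and that every parameter is sharded along one known dimension. A real consolidator (and the helper names used here, like `consolidate_tp_shards`) would need to follow the actual parallel layout, since real TP implementations shard different parameters along different dimensions or replicate them.

```python
import torch

def consolidate_tp_shards(shard_state_dicts, shard_dim=0):
    """Merge per-TP-rank shard state dicts into one full state dict.

    Hypothetical sketch: assumes every parameter is sharded along
    `shard_dim`. A production version must consult the model's
    parallel layout to decide, per parameter, whether to concatenate
    (and on which dim) or to keep a single replicated copy.
    """
    full = {}
    for name in shard_state_dicts[0]:
        full[name] = torch.cat(
            [sd[name] for sd in shard_state_dicts], dim=shard_dim
        )
    return full

# Toy example: two TP ranks, each holding half of a 4x2 weight.
shards = [{"linear.weight": torch.full((2, 2), float(rank))} for rank in range(2)]
merged = consolidate_tp_shards(shards)
print(merged["linear.weight"].shape)  # torch.Size([4, 2])
```

In the training script, the per-rank shard files would be loaded from the checkpoint directory, merged this way, saved as a single checkpoint, and the shard files deleted afterwards.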
Should this be the default behaviour, or should we add a flag that triggers the consolidation automatically?