Closed: 3outeille closed this 4 months ago.
The code to merge checkpoints already works, but for one very specific use case (resuming training with a different TP value) there is a small bug.
Assume you want to do the following: train and save a checkpoint with one tensor parallel (TP) size, then resume training from that checkpoint with a different TP size (see the sketch below).
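A hedged sketch of that scenario, with illustrative parameter names rather than the exact nanotron config schema:

```python
# Hedged sketch of the scenario (field names are illustrative, not the exact
# nanotron config schema): a checkpoint is produced with one tensor parallel
# (TP) size and then loaded with a different one.

# Step 1: train and save a checkpoint with TP = 2.
save_run = {
    "parallelism": {"tp": 2, "dp": 1, "pp": 1},
    "checkpoints": {"checkpoints_path": "checkpoints/"},
}

# Step 2: relaunch training with TP = 1, resuming from that checkpoint.
resume_run = {
    "parallelism": {"tp": 1, "dp": 1, "pp": 1},
    "checkpoints": {"resume_checkpoint_path": "checkpoints/10"},
}

# Loading the sharded optimizer state during step 2 is where the error
# below is raised.
```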
This will return the following error:
```
[default0]:   File "/fsx/ferdinandmom/ferdinand-hf/nanotron/src/nanotron/serialize/optimizer.py", line 188, in load_optimizer
[default0]:     OPTIMIZER_STATE_NAMES = sorted(ckp_sharded_optim_states[(0, 0)]["state"][0].keys() - ["step"])
```
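For context, here is a minimal sketch of the data that line operates on. The shard keys and state contents below are assumptions for illustration, not the actual checkpoint layout: `ckp_sharded_optim_states` is treated as a mapping from a `(pp_rank, tp_rank)` tuple to that shard's optimizer state dict, and the line collects the per-parameter state names (e.g. Adam's `exp_avg` / `exp_avg_sq`) from the `(0, 0)` shard, dropping the `"step"` counter. When the checkpoint was written with a different TP value than the one used to resume, the load fails at this line.

```python
# Hedged illustration (shard keys and state contents are made up for this
# sketch, not the real checkpoint layout): how OPTIMIZER_STATE_NAMES is
# derived from one shard of the checkpointed optimizer state.

# Optimizer states as loaded per rank, keyed by (pp_rank, tp_rank).
ckp_sharded_optim_states = {
    (0, 0): {
        "state": {
            0: {  # first parameter owned by this shard
                "step": 100,
                "exp_avg": [0.0, 0.0],     # stand-in for an Adam moment tensor
                "exp_avg_sq": [0.0, 0.0],  # stand-in for an Adam moment tensor
            },
        },
    },
}

# The line from the traceback: collect the per-parameter state names of the
# (pp=0, tp=0) shard, excluding the "step" counter.
OPTIMIZER_STATE_NAMES = sorted(
    ckp_sharded_optim_states[(0, 0)]["state"][0].keys() - ["step"]
)
print(OPTIMIZER_STATE_NAMES)  # ['exp_avg', 'exp_avg_sq']
```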
The bug is fixed.