huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

Resume training with a different Tensor parallel value #106

Closed 3outeille closed 4 months ago

3outeille commented 6 months ago

The code to merge checkpoints already works, but for one very specific use case (resuming training with a different TP value) there is a small bug:

Assume you want to do the following:

This will return the following error:

[default0]:  File "/fsx/ferdinandmom/ferdinand-hf/nanotron/src/nanotron/serialize/optimizer.py", line 188, in load_optimizer
[default0]:    OPTIMIZER_STATE_NAMES = sorted(ckp_sharded_optim_states[(0, 0)]["state"][0].keys() - ["step"])
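The line in the traceback assumes that the shard keyed `(0, 0)` exists in the loaded checkpoint and that it contains a state entry for parameter index `0` — an assumption that can break once the checkpoint was saved with a different TP layout. Below is a minimal, hypothetical sketch of that pattern; the dictionary layout (shards keyed by `(pp_rank, tp_rank)`, an inner `"state"` dict keyed by parameter index) is illustrative and not nanotron's actual data structures:

```python
# Hypothetical sharded optimizer states, keyed by (pp_rank, tp_rank).
# Layout is illustrative only, not nanotron's real checkpoint format.
ckp_sharded_optim_states = {
    (0, 0): {"state": {0: {"step": 10, "exp_avg": [0.1], "exp_avg_sq": [0.2]}}},
    (0, 1): {"state": {0: {"step": 10, "exp_avg": [0.3], "exp_avg_sq": [0.4]}}},
}

# Pattern from the traceback (fragile): hard-codes shard (0, 0) and param index 0.
OPTIMIZER_STATE_NAMES = sorted(
    ckp_sharded_optim_states[(0, 0)]["state"][0].keys() - {"step"}
)

# A more defensive variant: read the state names from whichever shard and
# parameter entry happens to be present, instead of hard-coded indices.
first_shard = next(iter(ckp_sharded_optim_states.values()))
first_param_state = next(iter(first_shard["state"].values()))
robust_names = sorted(first_param_state.keys() - {"step"})

print(OPTIMIZER_STATE_NAMES)  # both yield the non-"step" state names
print(robust_names)
```

Either way the result is the per-parameter state names minus `"step"`; the defensive variant just avoids raising a `KeyError` when shard `(0, 0)` or param index `0` is absent after a TP-resharded save.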
xrsrke commented 4 months ago

The bug is fixed