huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

Resume training with a different Tensor parallel value #106

Closed 3outeille closed 4 months ago

3outeille commented 6 months ago

The code to merge checkpoints already works, but for one very specific use case (resuming training with a different TP value) there is a small bug:

Assume you want to do the following:

This will return the following error:

[default0]:  File "/fsx/ferdinandmom/ferdinand-hf/nanotron/src/nanotron/serialize/optimizer.py", line 188, in load_optimizer
[default0]:    OPTIMIZER_STATE_NAMES = sorted(ckp_sharded_optim_states[(0, 0)]["state"][0].keys() - ["step"])
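The line in the traceback assumes that the shard keyed `(0, 0)` exists in the loaded checkpoint and that it contains a state entry for parameter index `0` — an assumption that can break once the checkpoint was saved with a different TP layout. Below is a minimal, hypothetical sketch of that pattern; the dictionary layout (shards keyed by `(pp_rank, tp_rank)`, an inner `"state"` dict keyed by parameter index) is illustrative and not nanotron's actual data structures:

```python
# Hypothetical sharded optimizer states, keyed by (pp_rank, tp_rank).
# Layout is illustrative only, not nanotron's real checkpoint format.
ckp_sharded_optim_states = {
    (0, 0): {"state": {0: {"step": 10, "exp_avg": [0.1], "exp_avg_sq": [0.2]}}},
    (0, 1): {"state": {0: {"step": 10, "exp_avg": [0.3], "exp_avg_sq": [0.4]}}},
}

# Pattern from the traceback (fragile): hard-codes shard (0, 0) and param index 0.
OPTIMIZER_STATE_NAMES = sorted(
    ckp_sharded_optim_states[(0, 0)]["state"][0].keys() - {"step"}
)

# A more defensive variant: read the state names from whichever shard and
# parameter entry happens to be present, instead of hard-coded indices.
first_shard = next(iter(ckp_sharded_optim_states.values()))
first_param_state = next(iter(first_shard["state"].values()))
robust_names = sorted(first_param_state.keys() - {"step"})

print(OPTIMIZER_STATE_NAMES)  # both yield the non-"step" state names
print(robust_names)
```

Either way the result is the per-parameter state names minus `"step"`; the defensive variant just avoids raising a `KeyError` when shard `(0, 0)` or param index `0` is absent after a TP-resharded save.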
xrsrke commented 4 months ago

The bug is fixed