huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

[Bug] Resuming training for data stages #144

xrsrke closed this issue 6 months ago

xrsrke commented 6 months ago

Step 1: Train for 15 steps, saving a checkpoint only at step 10:

```bash
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=4 run_train.py --config-file examples/config_tiny_llama.yaml
```
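For reference, a minimal sketch of the config fields that produce this schedule, assuming the `checkpoints` and `tokens` sections of `examples/config_tiny_llama.yaml` (field names may differ across nanotron versions):

```yaml
checkpoints:
  checkpoints_path: checkpoints   # directory where the step-10 checkpoint is written
  checkpoint_interval: 10         # with train_steps = 15, the only save lands at step 10
  resume_checkpoint_path: null    # first run starts from scratch
tokens:
  train_steps: 15                 # stop after 15 optimizer steps
```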

Step 2: Resume training from the step-10 checkpoint. The training losses from step 10 to step 15 in this resumed run should be the same as in Step 1.
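One way to trigger the resume, as a sketch: point `resume_checkpoint_path` at the saved checkpoint and rerun the same `torchrun` command. The checkpoint path below is illustrative; the actual layout depends on how nanotron names checkpoint directories.

```yaml
checkpoints:
  checkpoints_path: checkpoints
  checkpoint_interval: 10
  resume_checkpoint_path: checkpoints/10   # hypothetical path to the step-10 checkpoint
```

If the data-stage state is restored correctly, the resumed run should replay the same batches, so the per-step losses after the resume should match those from the first run exactly.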