Step 1: Train for 15 steps, saving a checkpoint only at step 10:
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=4 run_train.py --config-file examples/config_tiny_llama.yaml
Step 2: Resume training from the step-10 checkpoint. The training losses for steps 10 through 15 in Step 2 should match the losses from Step 1 exactly.
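The check in Step 2 can be sketched as a comparison of per-step losses between the two runs. This is a minimal illustration: the `losses_match` helper and the loss values are hypothetical, and in practice the step-to-loss mappings would be parsed from each run's training logs.

```python
# Hypothetical helper for Step 2: verify that the losses logged after
# resuming from the checkpoint reproduce the original run's losses.

def losses_match(run_a, run_b, steps, tol=0.0):
    """Return True if both runs report the same loss at every given step."""
    return all(abs(run_a[s] - run_b[s]) <= tol for s in steps)

# Illustrative loss values for steps 10..15 (not real training output).
step1_losses = {10: 3.21, 11: 3.18, 12: 3.15, 13: 3.11, 14: 3.08, 15: 3.05}
step2_losses = {10: 3.21, 11: 3.18, 12: 3.15, 13: 3.11, 14: 3.08, 15: 3.05}

assert losses_match(step1_losses, step2_losses, range(10, 16))
```

A nonzero `tol` can be used if minor floating-point drift between runs is acceptable, but a bitwise-deterministic resume should match with `tol=0.0`.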