neulab / awesome-align

A neural word aligner based on multilingual BERT
https://arxiv.org/abs/2101.08231
BSD 3-Clause "New" or "Revised" License

Continue with checkpoint #39

Closed kirklandWater1 closed 2 years ago

kirklandWater1 commented 2 years ago

Hi, I was using the script for supervised training. My dataset is fairly large, and some checkpoints were saved during training. However, the training was not completed, and I was wondering how I can use the checkpoints to continue training?

I tried setting the checkpoint as the model name, but it seems like training starts over from the beginning. Here is the script I used:

```shell
python3 run_train.py \
    --output_dir=$OUTPUT_DIR \
    --model_name_or_path=checkpoint-12000 \
    --extraction 'softmax' \
    --do_train \
    --train_so \
    --train_data_file=$TRAIN_FILE \
    --train_gold_file=$TRAIN_GOLD_FILE \
    --per_gpu_train_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --num_train_epochs 5 \
    --learning_rate 1e-4 \
    --save_steps 2000
```

zdou0830 commented 2 years ago

Hi, for now the code only supports reloading the model weights, not the optimizer state, so training does restart from step 0 in that sense. You may have to tune the learning rate and max_steps when continuing training from a checkpoint.
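
For context, a true resume would also need the optimizer state saved and restored, which `run_train.py` does not do here. A minimal sketch of that pattern in plain PyTorch, independent of the repo's code (the tiny `Linear` model and the `optimizer.pt` file name are illustrative stand-ins, not part of awesome-align):

```python
import os
import tempfile

import torch

model = torch.nn.Linear(4, 2)  # stand-in for the mBERT aligner
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Take one training step so the optimizer accumulates state
# (AdamW keeps per-parameter moment estimates).
model(torch.randn(3, 4)).sum().backward()
optimizer.step()

# Save the optimizer state alongside the model checkpoint.
ckpt_dir = tempfile.mkdtemp()
torch.save(optimizer.state_dict(), os.path.join(ckpt_dir, "optimizer.pt"))

# Later: rebuild the optimizer and restore its state before continuing training,
# so moment estimates and the learning-rate schedule position are not lost.
resumed = torch.optim.AdamW(model.parameters(), lr=1e-4)
resumed.load_state_dict(torch.load(os.path.join(ckpt_dir, "optimizer.pt")))
```

Without this, restarting from a checkpoint effectively re-warms the optimizer from scratch, which is why re-tuning the learning rate (and capping steps) is suggested above.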

kirklandWater1 commented 2 years ago

Thanks a lot!