timoschick / pet

This repository contains the code for "Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference"
https://arxiv.org/abs/2001.07676
Apache License 2.0

How to recover training process? #72

Open super-buster opened 2 years ago

super-buster commented 2 years ago

Hi @timoschick, thanks for generously providing such a good code reproduction environment. I sometimes have to stop my experiments for various reasons, but I found there is no way to continue training afterwards. It is frustrating to throw everything away and restart, since iPET is not fast. So, is there an argument for resuming training? (I hope I just missed it.)

timoschick commented 2 years ago

Hi @super-buster, unfortunately, there aren't any arguments for that. However, iPET basically works by independently training many different models with each model having its own folder, and it detects when a folder already exists (and doesn't retrain the corresponding model). Each folder is of the form g<X>/p<Y>-i<Z> where <X> is the iPET generation, <Y> is the pattern id and <Z> is the iteration for that particular pattern. So, what you can do is the following:

  1. Search for the folder g<X>/p<Y>-i<Z> that was created last before training stopped and delete that folder (it is very likely that training of the corresponding model didn't complete, so it needs to be started from scratch).

  2. Restart iPET with the exact same arguments as before, but add --overwrite_output_dir. In the output, you should see many lines of the form Path g<X>/p<Y>-i<Z> already exists, skipping it... (see here), indicating that PET detects that model g<X>/p<Y>-i<Z> already exists and doesn't need to be trained from scratch.

There's one caveat: Continuing training this way will reset the random number generator's state, so results will be slightly different compared to doing the entire training in one run. However, this only affects models within one generation, as the RNG is reset after each generation anyway. That means that if your iPET run has completed g0 and g1 and was halfway through g2 when training stopped, then instead of removing only the folder g2/p<Y>-i<Z> that was created last, you would have to remove the entire g2 folder if you want the exact same results.
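For illustration, here is a minimal sketch of this recovery procedure as a shell script. The output directory name below is an assumption; use whatever you passed as --output_dir in the interrupted run, and remember to rerun cli.py afterwards with the exact same arguments plus --overwrite_output_dir.

#! /usr/bin/bash
set -e

# Illustrative only: replace with the --output_dir of your interrupted run.
OUTPUT_DIR=ipet_outputs

# Step 1: find the most recently modified g<X>/p<Y>-i<Z> folder (the model that
# was being trained when the run stopped) and delete it, since its training most
# likely did not complete. For results identical to an uninterrupted run, delete
# the whole current generation folder (e.g. "$OUTPUT_DIR/g2") instead.
LAST_MODEL_DIR=$(ls -dt "$OUTPUT_DIR"/g*/p*-i* | head -n 1)
echo "Removing incomplete model folder: $LAST_MODEL_DIR"
rm -r "$LAST_MODEL_DIR"

# Step 2: rerun cli.py with exactly the same arguments as the interrupted run,
# adding --overwrite_output_dir so that existing model folders are skipped.

Whether you delete only the last folder or the entire current generation depends on whether you need results identical to a single uninterrupted run, as explained above.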

super-buster commented 2 years ago

Very helpful, thank you again @timoschick. Today I finished running the WSC experiment and found the test predictions file, namely predictions.jsonl. Since my command didn't evaluate on the validation dataset, I modified it as follows:

#! /usr/bin/bash
set -e 

METHOD=pet
PATTERN_IDS="0 1 2" #modify here
OUTPUT_DIR=wsc_outputs #modify here
DATA_DIR=fewglue/FewGLUE/WSC #modify here
MODEL_TYPE=albert
MODEL_NAME_OR_PATH=albert-xxlarge-v2
TASK=wsc #modify here
EVAL_SET=dev
SC_MAX_STEPS=5000
SC_MAX_SEQ_LENGTH=128
SC_PER_GPU_TRAIN_BSZ=4
SC_PER_GPU_UNLABELED_BSZ=4
SC_GRADIENT_ACC=4
PET_MAX_SEQ_LENGTH=128
PET_PER_GPU_TRAIN_BSZ=4
PET_MAX_STEPS=250
PET_GRADIENT_ACC=4
PET_PER_GPU_EVAL_BSZ=1

device=1 #modify here
export CUDA_VISIBLE_DEVICES=$device

python3 cli.py \
--method $METHOD \
--pattern_ids $PATTERN_IDS \
--data_dir $DATA_DIR \
--model_type $MODEL_TYPE \
--model_name_or_path $MODEL_NAME_OR_PATH \
--task_name $TASK \
--output_dir $OUTPUT_DIR \
--sc_max_steps $SC_MAX_STEPS \
--sc_per_gpu_train_batch_size $SC_PER_GPU_TRAIN_BSZ \
--sc_per_gpu_unlabeled_batch_size $SC_PER_GPU_UNLABELED_BSZ \
--sc_gradient_accumulation_steps $SC_GRADIENT_ACC \
--sc_max_seq_length $SC_MAX_SEQ_LENGTH \
--pet_max_seq_length $PET_MAX_SEQ_LENGTH \
--pet_per_gpu_train_batch_size $PET_PER_GPU_TRAIN_BSZ \
--pet_max_steps $PET_MAX_STEPS \
--pet_gradient_accumulation_steps $PET_GRADIENT_ACC \
--pet_per_gpu_eval_batch_size $PET_PER_GPU_EVAL_BSZ \
--eval_set $EVAL_SET \
--no_distillation \
--do_eval

As you can see, the only difference compared with the training command is that I removed --do_train and changed EVAL_SET=dev. But it just outputs:

2021-12-14 16:43:18,128 - INFO - cli - Parameters: Namespace(adam_epsilon=1e-08, alpha=0.9999, cache_dir='', data_dir='fewglue/FewGLUE/WSC', decoding_strategy='default', do_eval=True, do_train=False, eval_set='dev', ipet_generations=3, ipet_logits_percentage=0.25, ipet_n_most_likely=-1, ipet_scale_factor=5, learning_rate=1e-05, lm_training=False, logging_steps=50, max_grad_norm=1.0, method='pet', model_name_or_path='albert-xxlarge-v2', model_type='albert', no_cuda=False, no_distillation=True, output_dir='wsc_outputs', overwrite_output_dir=False, pattern_ids=[0, 1, 2], pet_gradient_accumulation_steps=4, pet_max_seq_length=128, pet_max_steps=250, pet_num_train_epochs=3, pet_per_gpu_eval_batch_size=1, pet_per_gpu_train_batch_size=4, pet_per_gpu_unlabeled_batch_size=4, pet_repetitions=3, priming=False, reduction='wmean', sc_gradient_accumulation_steps=4, sc_max_seq_length=128, sc_max_steps=5000, sc_num_train_epochs=3, sc_per_gpu_eval_batch_size=8, sc_per_gpu_train_batch_size=4, sc_per_gpu_unlabeled_batch_size=4, sc_repetitions=1, seed=42, split_examples_evenly=False, task_name='wsc', temperature=2, test_examples=-1, train_examples=-1, unlabeled_examples=-1, verbalizer_file=None, warmup_steps=0, weight_decay=0.01, wrapper_type='mlm')
2021-12-14 16:43:18,235 - INFO - tasks - Creating features from dataset file at fewglue/FewGLUE/WSC (num_examples=-1, set_type=train)
2021-12-14 16:43:18,236 - INFO - tasks - Returning 32 train examples with label dist.: [('True', 32)]
2021-12-14 16:43:18,236 - INFO - tasks - Creating features from dataset file at fewglue/FewGLUE/WSC (num_examples=-1, set_type=dev)
2021-12-14 16:43:18,248 - INFO - tasks - Returning 104 dev examples with label dist.: [('False', 66), ('True', 38)]
2021-12-14 16:43:18,248 - INFO - tasks - Creating features from dataset file at fewglue/FewGLUE/WSC (num_examples=-1, set_type=unlabeled)
2021-12-14 16:43:18,262 - WARNING - tasks - Got '["emma's"]' but expected '['emma']' at index 0 for '["Emma's", 'mother', 'had', 'died', 'long', 'ago,', 'and', 'her', 'place', 'had', 'been', 'taken', 'by', 'an', 'excellent', 'woman', 'as', 'governess.']'
2021-12-14 16:43:18,267 - INFO - tasks - Returning 554 unlabeled examples with label dist.: [('False', 554)]
2021-12-14 16:43:18,269 - WARNING - modeling - Path wsc_outputs/p0-i0 already exists, skipping it...
2021-12-14 16:43:18,270 - WARNING - modeling - Path wsc_outputs/p0-i1 already exists, skipping it...
2021-12-14 16:43:18,270 - WARNING - modeling - Path wsc_outputs/p0-i2 already exists, skipping it...
2021-12-14 16:43:18,270 - WARNING - modeling - Path wsc_outputs/p1-i0 already exists, skipping it...
2021-12-14 16:43:18,270 - WARNING - modeling - Path wsc_outputs/p1-i1 already exists, skipping it...
2021-12-14 16:43:18,270 - WARNING - modeling - Path wsc_outputs/p1-i2 already exists, skipping it...
2021-12-14 16:43:18,270 - WARNING - modeling - Path wsc_outputs/p2-i0 already exists, skipping it...
2021-12-14 16:43:18,270 - WARNING - modeling - Path wsc_outputs/p2-i1 already exists, skipping it...
2021-12-14 16:43:18,270 - WARNING - modeling - Path wsc_outputs/p2-i2 already exists, skipping it...
2021-12-14 16:43:18,270 - INFO - modeling - === OVERALL RESULTS ===

However, I don't find anything like result_dev.txt in my output folder (final/p0-i0/). So I hope you can point out what's wrong with that command and share a correct one. Thank you.