Closed MohitIntel closed 2 years ago
Actually, the issue was caused by a wrong checkpoint location. Previously we passed the location as '--resume_from_checkpoint ./output/checkpoint-3500', but it is supposed to be just ./output
It's working fine with the correct checkpoint path. This is an example command to verify it.
$ python run_qa.py --model_name_or_path roberta-base --gaudi_config_name ../gaudi_config.json --dataset_name squad --do_train --do_eval --per_device_train_batch_size 24 --per_device_eval_batch_size 8 --use_habana --use_lazy_mode --learning_rate 3e-5 --num_train_epochs 2 --max_seq_length 384 --doc_stride 128 --output_dir ./output/ --resume_from_checkpoint ./output/
Actually, it is supposed to work when given the last saved checkpoint folder, e.g. --resume_from_checkpoint ./output/checkpoint-3500
We found that there is an issue on the trainer side.
Currently, checkpoint resume does not work if the training run ends abruptly in the middle of an epoch. It does not pick up the last globally saved checkpoint step. Instead, it picks up the last step that ended gracefully.
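To check which checkpoint a resume should start from, here is a minimal sketch that finds the highest-numbered checkpoint-<step> folder under the output directory and reads the step recorded in it. It assumes the usual Trainer layout (checkpoint-<step> subfolders, each with a trainer_state.json containing a global_step field). The helper names find_latest_checkpoint and saved_global_step are hypothetical and are not the trainer's actual resume logic.

```python
import json
import re
from pathlib import Path
from typing import Optional

# Assumed folder naming: checkpoint-<step>, e.g. checkpoint-3500
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step> subfolder with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = []
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match:
            # Compare steps as integers; as strings, "500" sorts after "3500".
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


def saved_global_step(checkpoint_dir: Path) -> Optional[int]:
    """Read global_step from trainer_state.json, or None if the file is absent."""
    state_file = checkpoint_dir / "trainer_state.json"
    if not state_file.is_file():
        return None
    with state_file.open() as handle:
        return json.load(handle).get("global_step")


if __name__ == "__main__":
    latest = find_latest_checkpoint("./output")
    if latest is not None:
        print(latest, saved_global_step(latest))
```

If the step printed here is higher than the step the resumed run reports in its log, the resume did not pick up the last saved checkpoint.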
Could you tell me if you still encounter this issue with an up-to-date version of the package?
Can't reproduce after pull request 11.
Error Message:
Command used to run training:
Method for reproducing the issue:
Attached Log file: albert_xxlarge_bf16_squad_continued.log