huggingface / transformers

πŸ€— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Bug in trainer: substantially different results from restarting from a checkpoint and without #11323

Closed dorooddorood606 closed 3 years ago

dorooddorood606 commented 3 years ago

Environment info

Who can help

@sgugger @patrickvonplaten, @patil-suraj

Information

Then I find the last checkpoint to resume from among the files saved in the output directory, as below:

import os

def get_last_checkpoint(output_dir):
    # Resume from output_dir only if a saved model exists there.
    if os.path.exists(os.path.join(output_dir, 'pytorch_model.bin')):
        return output_dir
    return None
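For reference, a more robust way to locate the latest checkpoint is to scan for the `checkpoint-<step>` subdirectories the Trainer writes (this mirrors what `transformers.trainer_utils.get_last_checkpoint` does; the sketch below is a standalone reimplementation, not the library code, and assumes the Trainer's default `checkpoint-<step>` naming):

```python
import os
import re

def find_last_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    checkpoints = [
        d for d in os.listdir(output_dir)
        if pattern.match(d) and os.path.isdir(os.path.join(output_dir, d))
    ]
    if not checkpoints:
        return None
    # Pick the directory with the largest numeric step suffix.
    latest = max(checkpoints, key=lambda d: int(pattern.match(d).group(1)))
    return os.path.join(output_dir, latest)
```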

Here are the results without resuming, evaluated 10 times:

{'loss': 5.0483, 'learning_rate': 6e-07, 'epoch': 0.02}
 10/60000 [00:07<11:11:04, 1.49it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 5.382528305053711, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 0.0, 'mrpc_en_eval_gen_len': 3.0, 'mrpc_en_eval_runtime': 1.8421, 'mrpc_en_eval_samples_per_second': 110.741, 'epoch': 0.22, 'eval_average_metrics': 0.0}
 20/60000 [00:20<11:57:29, 1.39it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 5.180729389190674, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 0.0, 'mrpc_en_eval_gen_len': 3.0, 'mrpc_en_eval_runtime': 1.8179, 'mrpc_en_eval_samples_per_second': 112.218, 'epoch': 0.43, 'eval_average_metrics': 0.0}
 30/60000 [00:33<12:01:13, 1.39it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 4.810805320739746, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 0.0, 'mrpc_en_eval_gen_len': 3.0, 'mrpc_en_eval_runtime': 1.8421, 'mrpc_en_eval_samples_per_second': 110.743, 'epoch': 0.65, 'eval_average_metrics': 0.0}
 40/60000 [00:45<11:17:50, 1.47it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 4.203256607055664, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 0.0, 'mrpc_en_eval_gen_len': 3.0098039215686274, 'mrpc_en_eval_runtime': 2.031, 'mrpc_en_eval_samples_per_second': 100.441, 'epoch': 0.87, 'eval_average_metrics': 0.0}
 50/60000 [00:58<11:42:57, 1.42it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 3.262455463409424, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 0.0, 'mrpc_en_eval_gen_len': 3.0098039215686274, 'mrpc_en_eval_runtime': 2.1069, 'mrpc_en_eval_samples_per_second': 96.825, 'epoch': 1.09, 'eval_average_metrics': 0.0}
 60/60000 [01:13<11:57:15, 1.39it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 1.9655567407608032, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 0.49019607843137253, 'mrpc_en_eval_gen_len': 3.053921568627451, 'mrpc_en_eval_runtime': 2.8657, 'mrpc_en_eval_samples_per_second': 71.186, 'epoch': 1.3, 'eval_average_metrics': 0.24509803921568626}
 70/60000 [01:27<12:14:11, 1.36it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 0.7519775032997131, 'mrpc_en_eval_f1': 18.404907975460123, 'mrpc_en_eval_accuracy': 34.80392156862745, 'mrpc_en_eval_gen_len': 2.9411764705882355, 'mrpc_en_eval_runtime': 2.6193, 'mrpc_en_eval_samples_per_second': 77.884, 'epoch': 1.52, 'eval_average_metrics': 26.60441477204379}
 80/60000 [01:41<12:02:22, 1.38it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 0.4142318665981293, 'mrpc_en_eval_f1': 75.62500000000001, 'mrpc_en_eval_accuracy': 61.76470588235294, 'mrpc_en_eval_gen_len': 2.1176470588235294, 'mrpc_en_eval_runtime': 1.7878, 'mrpc_en_eval_samples_per_second': 114.109, 'epoch': 1.74, 'eval_average_metrics': 68.69485294117648}
 90/60000 [01:54<11:41:23, 1.42it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 0.3786551058292389, 'mrpc_en_eval_f1': 51.18483412322274, 'mrpc_en_eval_accuracy': 49.50980392156863, 'mrpc_en_eval_gen_len': 2.6519607843137254, 'mrpc_en_eval_runtime': 1.8265, 'mrpc_en_eval_samples_per_second': 111.69, 'epoch': 1.96, 'eval_average_metrics': 50.34731902239569}
 100/60000 [02:07<12:01:27, 1.38it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 0.29472649097442627, 'mrpc_en_eval_f1': 71.01449275362319, 'mrpc_en_eval_accuracy': 60.78431372549019, 'mrpc_en_eval_gen_len': 2.3333333333333335, 'mrpc_en_eval_runtime': 1.812, 'mrpc_en_eval_samples_per_second': 112.581, 'epoch': 2.17, 'eval_average_metrics': 65.89940323955669}

Now let's resume from step 40. The first 40 steps give the same results, but after resuming the results differ a lot:

 40/60000 [00:07<9:49:41, 1.69it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 4.203643321990967, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 0.0, 'mrpc_en_eval_gen_len': 3.0098039215686274, 'mrpc_en_eval_runtime': 2.0033, 'mrpc_en_eval_samples_per_second': 101.834, 'epoch': 0.87, 'eval_average_metrics': 0.0}
 50/60000 [00:21<12:09:50, 1.37it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 3.2706634998321533, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 0.0, 'mrpc_en_eval_gen_len': 3.0098039215686274, 'mrpc_en_eval_runtime': 2.2048, 'mrpc_en_eval_samples_per_second': 92.524, 'epoch': 1.09, 'eval_average_metrics': 0.0}
 60/60000 [00:35<12:27:28, 1.34it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 1.9863247871398926, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 0.49019607843137253, 'mrpc_en_eval_gen_len': 3.019607843137255, 'mrpc_en_eval_runtime': 2.4126, 'mrpc_en_eval_samples_per_second': 84.557, 'epoch': 1.3, 'eval_average_metrics': 0.24509803921568626}
 70/60000 [00:49<12:02:36, 1.38it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 0.7721647620201111, 'mrpc_en_eval_f1': 18.404907975460123, 'mrpc_en_eval_accuracy': 34.80392156862745, 'mrpc_en_eval_gen_len': 2.946078431372549, 'mrpc_en_eval_runtime': 2.5655, 'mrpc_en_eval_samples_per_second': 79.518, 'epoch': 1.52, 'eval_average_metrics': 26.60441477204379}
 80/60000 [01:02<12:08:06, 1.37it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 0.42692506313323975, 'mrpc_en_eval_f1': 74.28571428571428, 'mrpc_en_eval_accuracy': 60.29411764705882, 'mrpc_en_eval_gen_len': 2.142156862745098, 'mrpc_en_eval_runtime': 1.8243, 'mrpc_en_eval_samples_per_second': 111.824, 'epoch': 1.74, 'eval_average_metrics': 67.28991596638654}
 90/60000 [01:16<12:00:53, 1.39it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 0.39015302062034607, 'mrpc_en_eval_f1': 45.685279187817265, 'mrpc_en_eval_accuracy': 47.549019607843135, 'mrpc_en_eval_gen_len': 2.7205882352941178, 'mrpc_en_eval_runtime': 1.856, 'mrpc_en_eval_samples_per_second': 109.915, 'epoch': 1.96, 'eval_average_metrics': 46.617149397830204}
 100/60000 [01:31<12:02:17, 1.38it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 0.30966323614120483, 'mrpc_en_eval_f1': 68.48249027237354, 'mrpc_en_eval_accuracy': 60.29411764705882, 'mrpc_en_eval_gen_len': 2.426470588235294, 'mrpc_en_eval_runtime': 1.8275, 'mrpc_en_eval_samples_per_second': 111.625, 'epoch': 2.17, 'eval_average_metrics': 64.38830395971618}

Expected behavior

Resuming from a checkpoint should produce the same results as training without interruption.

Thank you for your help @sgugger

sgugger commented 3 years ago

You will only have perfectly reproducible results using checkpointing if the only randomness comes from the shuffling in your data (this is enforced by the CI). The way this is programmed inside the Trainer is to go through each epoch before the current one (which triggers the random shuffling) and then each batch (which puts you in the same position as before the checkpoint).

Since your results differ slightly, it looks like there are other random calls in your training code, which you did not share. There is no way to have the exact same results while resuming from a checkpoint if this is the case.
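The replay described above can be sketched in plain Python (an illustration of the idea, not the Trainer's actual code; the data, batch size, and seed below are hypothetical):

```python
import random

def batches_after_resume(data, batch_size, resume_step, epochs, seed=42):
    """Replay the per-epoch shuffles, skip the first `resume_step` batches,
    and yield the remaining batches in the same order as an uninterrupted run."""
    rng = random.Random(seed)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)           # same shuffle sequence as the original run
        for i in range(0, len(order), batch_size):
            step += 1
            if step <= resume_step:  # fast-forward through already-seen batches
                continue
            yield order[i:i + batch_size]
```

Because the shuffles are replayed from the same seed, the batches yielded after skipping match the tail of the uninterrupted run exactly; any extra RNG consumption in user code breaks this alignment.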

dorooddorood606 commented 3 years ago

Hi @sgugger, thanks for the reply. I do not have any other randomness in my code: I am using the run_seq2seq.py script to train T5 models on the MRPC dataset, without modifications. I would really appreciate your help on this issue, as getting this to work is crucial for me. Thanks a lot.

I only initialize the weights randomly, but I assume Hugging Face takes care of setting seeds, and there is really no other randomness.
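That assumption is reasonable: the Trainer calls `set_seed` at the start of training, so seeded random weight initialization is itself reproducible. A toy illustration of the principle (hypothetical helper, not transformers code):

```python
import random

def init_weights(n, seed):
    # Toy stand-in for random weight initialization: two runs with the
    # same seed produce identical "weights"; different seeds do not.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.02) for _ in range(n)]
```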

dorooddorood606 commented 3 years ago

@sgugger I confirm the same issue also exists when training vanilla T5. Here is the run for t5-base for 100 steps:

{'loss': 6.1045, 'learning_rate': 6e-07, 'epoch': 0.02}                                                                                                                                       
  0%|                                                                                                                                                   | 10/60000 [00:06<10:25:12,  1.60it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
                                                                                                                                                                                             ### n_samples  204β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:01<00:00,  2.44it/s]
{'mrpc_en_eval_loss': 6.924696445465088, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 0.0, 'mrpc_en_eval_gen_len': 3.3137254901960786, 'mrpc_en_eval_runtime': 1.9287, 'mrpc_en_eval_samples_per_second': 105.771, 'epoch': 0.22}                                                                                                                                                       
{'mrpc_en_eval_loss': 6.924696445465088, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 0.0, 'mrpc_en_eval_gen_len': 3.3137254901960786, 'mrpc_en_eval_runtime': 1.9287, 'mrpc_en_eval_samples_per_second': 105.771, 'epoch': 0.22, 'eval_average_metrics': 0.0}                                                                                                                          
  0%|                                                                                                                                                   | 20/60000 [00:27<13:37:00,  1.22it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:01<00:00,  2.49it/s]
{'mrpc_en_eval_loss': 5.22016716003418, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 0.0, 'mrpc_en_eval_gen_len': 3.764705882352941, 'mrpc_en_eval_runtime': 1.8761, 'mrpc_en_eval_samples_per_second': 108.737, 'epoch': 0.43}                                                                                                                                                         
{'mrpc_en_eval_loss': 5.22016716003418, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 0.0, 'mrpc_en_eval_gen_len': 3.764705882352941, 'mrpc_en_eval_runtime': 1.8761, 'mrpc_en_eval_samples_per_second': 108.737, 'epoch': 0.43, 'eval_average_metrics': 0.0}                                                                                                                            
  0%|                                                                                                                                                   | 30/60000 [00:47<12:58:53,  1.28it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:01<00:00,  2.37it/s]
{'mrpc_en_eval_loss': 1.3517154455184937, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 18.137254901960784, 'mrpc_en_eval_gen_len': 3.2205882352941178, 'mrpc_en_eval_runtime': 1.9678, 'mrpc_en_eval_samples_per_second': 103.67, 'epoch': 0.65}                                                                                                                                        
{'mrpc_en_eval_loss': 1.3517154455184937, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 18.137254901960784, 'mrpc_en_eval_gen_len': 3.2205882352941178, 'mrpc_en_eval_runtime': 1.9678, 'mrpc_en_eval_samples_per_second': 103.67, 'epoch': 0.65, 'eval_average_metrics': 9.068627450980392}                                                                                             
  0%|                                                                                                                                                   | 40/60000 [01:08<13:00:06,  1.28it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:00<00:00,  4.62it/s]
{'mrpc_en_eval_loss': 0.4487058222293854, 'mrpc_en_eval_f1': 81.3953488372093, 'mrpc_en_eval_accuracy': 68.62745098039215, 'mrpc_en_eval_gen_len': 2.0, 'mrpc_en_eval_runtime': 1.0261, 'mrpc_en_eval_samples_per_second': 198.811, 'epoch': 0.87}                                                                                                                                          
{'mrpc_en_eval_loss': 0.4487058222293854, 'mrpc_en_eval_f1': 81.3953488372093, 'mrpc_en_eval_accuracy': 68.62745098039215, 'mrpc_en_eval_gen_len': 2.0, 'mrpc_en_eval_runtime': 1.0261, 'mrpc_en_eval_samples_per_second': 198.811, 'epoch': 0.87, 'eval_average_metrics': 75.01139990880073}                                                                                               
  0%|                                                                                                                                                   | 50/60000 [01:27<12:31:06,  1.33it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:00<00:00,  3.72it/s]
{'mrpc_en_eval_loss': 0.25695744156837463, 'mrpc_en_eval_f1': 83.79204892966361, 'mrpc_en_eval_accuracy': 74.01960784313727, 'mrpc_en_eval_gen_len': 2.0833333333333335, 'mrpc_en_eval_runtime': 1.2653, 'mrpc_en_eval_samples_per_second': 161.228, 'epoch': 1.09}                                                                                                                         
{'mrpc_en_eval_loss': 0.25695744156837463, 'mrpc_en_eval_f1': 83.79204892966361, 'mrpc_en_eval_accuracy': 74.01960784313727, 'mrpc_en_eval_gen_len': 2.0833333333333335, 'mrpc_en_eval_runtime': 1.2653, 'mrpc_en_eval_samples_per_second': 161.228, 'epoch': 1.09, 'eval_average_metrics': 78.90582838640043}                                                                              
  0%|▏                                                                                                                                                  | 60/60000 [01:47<12:36:18,  1.32it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:00<00:00,  4.29it/s]
{'mrpc_en_eval_loss': 0.27573078870773315, 'mrpc_en_eval_f1': 82.11143695014663, 'mrpc_en_eval_accuracy': 70.09803921568627, 'mrpc_en_eval_gen_len': 2.014705882352941, 'mrpc_en_eval_runtime': 1.1521, 'mrpc_en_eval_samples_per_second': 177.063, 'epoch': 1.3}                                                                                                                           
{'mrpc_en_eval_loss': 0.27573078870773315, 'mrpc_en_eval_f1': 82.11143695014663, 'mrpc_en_eval_accuracy': 70.09803921568627, 'mrpc_en_eval_gen_len': 2.014705882352941, 'mrpc_en_eval_runtime': 1.1521, 'mrpc_en_eval_samples_per_second': 177.063, 'epoch': 1.3, 'eval_average_metrics': 76.10473808291644}                                                                                
  0%|▏                                                                                                                                                  | 70/60000 [02:09<13:15:00,  1.26it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:00<00:00,  3.75it/s]
{'mrpc_en_eval_loss': 0.16758881509304047, 'mrpc_en_eval_f1': 87.04318936877075, 'mrpc_en_eval_accuracy': 80.88235294117648, 'mrpc_en_eval_gen_len': 2.2107843137254903, 'mrpc_en_eval_runtime': 1.2665, 'mrpc_en_eval_samples_per_second': 161.075, 'epoch': 1.52}                                                                                                                         
{'mrpc_en_eval_loss': 0.16758881509304047, 'mrpc_en_eval_f1': 87.04318936877075, 'mrpc_en_eval_accuracy': 80.88235294117648, 'mrpc_en_eval_gen_len': 2.2107843137254903, 'mrpc_en_eval_runtime': 1.2665, 'mrpc_en_eval_samples_per_second': 161.075, 'epoch': 1.52, 'eval_average_metrics': 83.96277115497361}                                                                              
  0%|▏                                                                                                                                                  | 80/60000 [02:30<13:18:49,  1.25it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:00<00:00,  3.64it/s]
{'mrpc_en_eval_loss': 0.1627584546804428, 'mrpc_en_eval_f1': 89.86486486486486, 'mrpc_en_eval_accuracy': 85.29411764705883, 'mrpc_en_eval_gen_len': 2.235294117647059, 'mrpc_en_eval_runtime': 1.2734, 'mrpc_en_eval_samples_per_second': 160.198, 'epoch': 1.74}                                                                                                                           
{'mrpc_en_eval_loss': 0.1627584546804428, 'mrpc_en_eval_f1': 89.86486486486486, 'mrpc_en_eval_accuracy': 85.29411764705883, 'mrpc_en_eval_gen_len': 2.235294117647059, 'mrpc_en_eval_runtime': 1.2734, 'mrpc_en_eval_samples_per_second': 160.198, 'epoch': 1.74, 'eval_average_metrics': 87.57949125596184}                                                                                
  0%|▏                                                                                                                                                  | 90/60000 [02:50<12:35:38,  1.32it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:00<00:00,  3.71it/s]
{'mrpc_en_eval_loss': 0.178583025932312, 'mrpc_en_eval_f1': 90.78014184397163, 'mrpc_en_eval_accuracy': 87.25490196078431, 'mrpc_en_eval_gen_len': 2.303921568627451, 'mrpc_en_eval_runtime': 1.2507, 'mrpc_en_eval_samples_per_second': 163.108, 'epoch': 1.96}                                                                                                                            
{'mrpc_en_eval_loss': 0.178583025932312, 'mrpc_en_eval_f1': 90.78014184397163, 'mrpc_en_eval_accuracy': 87.25490196078431, 'mrpc_en_eval_gen_len': 2.303921568627451, 'mrpc_en_eval_runtime': 1.2507, 'mrpc_en_eval_samples_per_second': 163.108, 'epoch': 1.96, 'eval_average_metrics': 89.01752190237798}                                                                                 
  0%|▏                                                                                                                                                 | 100/60000 [03:09<12:29:36,  1.33it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:00<00:00,  3.70it/s]
{'mrpc_en_eval_loss': 0.18296584486961365, 'mrpc_en_eval_f1': 88.72727272727272, 'mrpc_en_eval_accuracy': 84.80392156862744, 'mrpc_en_eval_gen_len': 2.338235294117647, 'mrpc_en_eval_runtime': 1.2762, 'mrpc_en_eval_samples_per_second': 159.845, 'epoch': 2.17}                                                                                                                          
{'mrpc_en_eval_loss': 0.18296584486961365, 'mrpc_en_eval_f1': 88.72727272727272, 'mrpc_en_eval_accuracy': 84.80392156862744, 'mrpc_en_eval_gen_len': 2.338235294117647, 'mrpc_en_eval_runtime': 1.2762, 'mrpc_en_eval_samples_per_second': 159.845, 'epoch': 2.17, 'eval_average_metrics': 86.76559714795007}   

Now let's see the results of t5-base after resuming from step 60:

  0%|▏                                                                                                                                                   | 60/60000 [00:06<9:21:55,  1.78it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:00<00:00,  4.00it/s]
{'mrpc_en_eval_loss': 0.2794328033924103, 'mrpc_en_eval_f1': 82.11143695014663, 'mrpc_en_eval_accuracy': 70.09803921568627, 'mrpc_en_eval_gen_len': 2.014705882352941, 'mrpc_en_eval_runtime': 1.2224, 'mrpc_en_eval_samples_per_second': 166.887, 'epoch': 1.3}                                                                                                                            
{'mrpc_en_eval_loss': 0.2794328033924103, 'mrpc_en_eval_f1': 82.11143695014663, 'mrpc_en_eval_accuracy': 70.09803921568627, 'mrpc_en_eval_gen_len': 2.014705882352941, 'mrpc_en_eval_runtime': 1.2224, 'mrpc_en_eval_samples_per_second': 166.887, 'epoch': 1.3, 'eval_average_metrics': 76.10473808291644}                                                                                 
  0%|▏                                                                                                                                                  | 70/60000 [00:28<13:22:56,  1.24it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:00<00:00,  3.59it/s]
{'mrpc_en_eval_loss': 0.16057834029197693, 'mrpc_en_eval_f1': 88.43537414965986, 'mrpc_en_eval_accuracy': 83.33333333333334, 'mrpc_en_eval_gen_len': 2.2450980392156863, 'mrpc_en_eval_runtime': 1.3058, 'mrpc_en_eval_samples_per_second': 156.222, 'epoch': 1.52}                                                                                                                         
{'mrpc_en_eval_loss': 0.16057834029197693, 'mrpc_en_eval_f1': 88.43537414965986, 'mrpc_en_eval_accuracy': 83.33333333333334, 'mrpc_en_eval_gen_len': 2.2450980392156863, 'mrpc_en_eval_runtime': 1.3058, 'mrpc_en_eval_samples_per_second': 156.222, 'epoch': 1.52, 'eval_average_metrics': 85.8843537414966}                                                                               
  0%|▏                                                                                                                                                  | 80/60000 [00:48<12:55:04,  1.29it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:00<00:00,  3.69it/s]
{'mrpc_en_eval_loss': 0.15957750380039215, 'mrpc_en_eval_f1': 88.81118881118881, 'mrpc_en_eval_accuracy': 84.31372549019608, 'mrpc_en_eval_gen_len': 2.284313725490196, 'mrpc_en_eval_runtime': 1.291, 'mrpc_en_eval_samples_per_second': 158.021, 'epoch': 1.74}                                                                                                                           
{'mrpc_en_eval_loss': 0.15957750380039215, 'mrpc_en_eval_f1': 88.81118881118881, 'mrpc_en_eval_accuracy': 84.31372549019608, 'mrpc_en_eval_gen_len': 2.284313725490196, 'mrpc_en_eval_runtime': 1.291, 'mrpc_en_eval_samples_per_second': 158.021, 'epoch': 1.74, 'eval_average_metrics': 86.56245715069244}                                                                                
  0%|▏                                                                                                                                                  | 90/60000 [01:11<13:47:58,  1.21it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:00<00:00,  3.67it/s]
{'mrpc_en_eval_loss': 0.19618992507457733, 'mrpc_en_eval_f1': 87.17948717948718, 'mrpc_en_eval_accuracy': 82.84313725490196, 'mrpc_en_eval_gen_len': 2.3480392156862746, 'mrpc_en_eval_runtime': 1.2811, 'mrpc_en_eval_samples_per_second': 159.235, 'epoch': 1.96}                                                                                                                         
{'mrpc_en_eval_loss': 0.19618992507457733, 'mrpc_en_eval_f1': 87.17948717948718, 'mrpc_en_eval_accuracy': 82.84313725490196, 'mrpc_en_eval_gen_len': 2.3480392156862746, 'mrpc_en_eval_runtime': 1.2811, 'mrpc_en_eval_samples_per_second': 159.235, 'epoch': 1.96, 'eval_average_metrics': 85.01131221719457}                                                                              
  0%|▏                                                                                                                                                 | 100/60000 [01:33<12:55:11,  1.29it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:00<00:00,  3.75it/s]
{'mrpc_en_eval_loss': 0.21464459598064423, 'mrpc_en_eval_f1': 87.96992481203009, 'mrpc_en_eval_accuracy': 84.31372549019608, 'mrpc_en_eval_gen_len': 2.3823529411764706, 'mrpc_en_eval_runtime': 1.2654, 'mrpc_en_eval_samples_per_second': 161.214, 'epoch': 2.17}                                                                                                                         
{'mrpc_en_eval_loss': 0.21464459598064423, 'mrpc_en_eval_f1': 87.96992481203009, 'mrpc_en_eval_accuracy': 84.31372549019608, 'mrpc_en_eval_gen_len': 2.3823529411764706, 'mrpc_en_eval_runtime': 1.2654, 'mrpc_en_eval_samples_per_second': 161.214, 'epoch': 2.17, 'eval_average_metrics': 86.14182515111308}                                                                              
dorooddorood606 commented 3 years ago

Dear @sgugger @patrickvonplaten @patil-suraj, could you kindly look into this issue? It is really important to have checkpointing working, as in many cases one cannot train a model in a single uninterrupted run. Thanks.

stas00 commented 3 years ago

Following up on @sgugger's suggestion: if I understand the methodology correctly, it doesn't quite apply to the generic checkpointing method, but one could subclass the Trainer to save the RNG state at the moment of saving the checkpoint, and then restore the same RNG state on resume. You'd probably need to do that for at least Python and PyTorch (and NumPy and other libraries if you use them).

@dorooddorood606, look into:

import random
import numpy
import torch

# before saving
py_rng_state = random.getstate()
pt_rng_state = torch.get_rng_state()
np_rng_state = numpy.random.get_state()

# post resume
random.setstate(py_rng_state)
torch.set_rng_state(pt_rng_state)
numpy.random.set_state(np_rng_state)

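A self-contained sketch of bundling these states with a checkpoint, assuming a simple pickle file written next to the model (the file name and helper names are illustrative, not Trainer API):

```python
import pickle
import random

import numpy as np
import torch

def save_rng_states(path):
    """Capture the Python, NumPy, and PyTorch RNG states to a file."""
    states = {
        "python": random.getstate(),
        "numpy": np.random.get_state(),
        "torch": torch.get_rng_state(),
    }
    with open(path, "wb") as f:
        pickle.dump(states, f)

def load_rng_states(path):
    """Restore the RNG states captured by save_rng_states()."""
    with open(path, "rb") as f:
        states = pickle.load(f)
    random.setstate(states["python"])
    np.random.set_state(states["numpy"])
    torch.set_rng_state(states["torch"])
```

On resume, calling `load_rng_states()` before the first random draw should make the generators continue exactly where the interrupted run left off; a quick sanity check is that the first random numbers drawn after saving and after restoring are identical.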
dorooddorood606 commented 3 years ago

Dear @stas00
Thank you very much for following up on this. I implemented this suggestion, and I still see the discrepancies after resuming from checkpoints. I emphasize that I tried with vanilla t5-base, so no changes to the Hugging Face code. In my own code, I have some initialization, which is the only part with randomness; I would be grateful if you could tell me whether there might be an issue with these lines:

nn.init.normal_(linear_layer.weight, std=std)
nn.init.zeros_(linear_layer.bias)

but since vanilla t5-base also has this issue, I was wondering whether you think this might be a general issue in the Trainer code? I would greatly appreciate it if you could kindly look into this.

thanks a lot in advance for the great work you do and your hard efforts.

stas00 commented 3 years ago

Thank you very much for following up on this, I implemented this suggestion,

Could we first validate that this was done correctly?

To test, you can debug-print a random number generated immediately after saving the checkpoint and RNG state, and do the same right after the checkpoint and RNG states are restored when you run the program a second time with resume. If you get the same number, we know you restored the RNG state correctly. You probably want to check one number for torch and one for Python.

I have some initialization which is the only part with randomness, I would be grateful if you could tell me if there might be an issue with these lines:


nn.init.normal_(linear_layer.weight, std=std)

This line would definitely impact the RNG state. If you're uncertain you can always debug and generate a random number with that line of code and w/o it and see if it's the same.
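To make the point concrete, here is a quick check (a toy example, not taken from the issue's code) showing that an `nn.init` call advances the global torch RNG, so every draw after it changes:

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
without_init = torch.rand(1)

torch.manual_seed(42)
layer = nn.Linear(4, 3)                  # constructing the layer already draws from the RNG
nn.init.normal_(layer.weight, std=0.02)  # ...and the explicit init draws again
with_init = torch.rand(1)

# The RNG state has advanced, so the next draw differs between the two runs.
assert not torch.equal(without_init, with_init)
```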

So, for example, one workaround is to restore the RNG state after your custom code above.

Or better, don't re-run this line: save its outcome with the checkpoint and restore it on subsequent runs, rather than needing to fiddle with RNG states.
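A sketch of that init-once-and-persist idea, with a hypothetical helper and file path (the names are illustrative): draw the random init a single time, save the result, and reload it on resume instead of re-drawing from the RNG.

```python
import os

import torch
import torch.nn as nn

def init_or_restore(linear_layer, path, std=0.02):
    """Run the random init exactly once; later calls restore the saved weights."""
    if os.path.exists(path):
        # Resuming: reload the previously drawn weights, leaving the RNG untouched.
        linear_layer.load_state_dict(torch.load(path))
    else:
        # First run: draw the random init and persist the outcome.
        nn.init.normal_(linear_layer.weight, std=std)
        nn.init.zeros_(linear_layer.bias)
        torch.save(linear_layer.state_dict(), path)
    return linear_layer
```

Every run then ends up with identical weights for this layer, no matter how many times training was interrupted and resumed.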

dorooddorood606 commented 3 years ago

Dear @stas00, first I would like to thank you very much for taking your precious time to answer my question. I observe that my code generates different results between runs. I was assuming that since the Hugging Face run_glue.py script sets the seeds initially, randomness was taken care of. All my code adds is some initialization, like what I sent, all coming after the set_seed() call. Considering only a single run, putting checkpointing aside, could you kindly tell me whether one needs to set the seed before each initialization? Should I move them all into the init_weights function of BERT? I appreciate your response a lot. Thank you.

stas00 commented 3 years ago

First a few requests, @dorooddorood606

Thank you!


Now, let's try to summarize what doesn't work.

  1. From what I understand, you extended the library with your own modifications, and now you're experiencing inconsistent randomness issues when you resume training, correct?

    Does the library produce the expected results if you remove your modifications?

  2. Is there an easy way to provide a reproducible example that shows how the main library works correctly and then breaks with your modification? Perhaps a simple Google Colab notebook? If you do that, please make sure it's very easy to quickly see what the problem is and where it comes from. So no production-level hundreds of lines of code, but toy examples if possible.

dorooddorood606 commented 3 years ago

Dear @stas00, thank you for the reminder; I will follow the points you mentioned. I thought there was also a bug in the trainer, as I was observing it for an unmodified bert-base model too, but that randomness issue was resolved by upgrading to version 4.6.0 of transformers.

dorooddorood606 commented 3 years ago

Dear @stas00

I appreciate your input on the issue of reproducibility from resuming from checkpoints a lot. I tried to follow your points to state it in a clearer way.

Problem statement

If a user trains a model for some steps and then reloads it from a checkpoint, the results differ from training the model without interruption.

How to reproduce the issue

Transformers version: I am using the 4.6.0dev version of transformers:

https://github.com/huggingface/transformers/commit/04ab2ca639ee6fd1002ce0a498d892245f0c9093

Please clone this repository, which contains a minimal example:

git clone git@github.com:dorooddorood606/reproducibility.git

To run the code, please run the command below; between runs, after each save of the model (every 50 steps), kill the process 2-3 times. Then compare the final results of running all iterations with resuming against training without any interruption. The results will differ.

TASK_NAME=mrpc
python run_glue.py   --model_name_or_path bert-base-cased   --task_name $TASK_NAME   --do_train   --do_eval   --max_seq_length 128   --per_device_train_batch_size 32   --learning_rate 2e-5   --num_train_epochs 2   --output_dir /temp/$TASK_NAME/  --eval_steps 50 --evaluation_strategy steps --load_best_model_at_end --fp16 --do_predict

Please let me know if you need any further information on this.

Modifications made to the Trainer class to make it reproducible:

I apply the following modifications to the Trainer class: 1) Following your suggestions, I save the random states and reload them before reloading the checkpoint in the Trainer class. Please see https://github.com/dorooddorood606/reproducibility/blob/f5902af4669bba8aaee326efdb0cd459e25be675/trainer.py#L126 and https://github.com/dorooddorood606/reproducibility/blob/f5902af4669bba8aaee326efdb0cd459e25be675/trainer.py#L200

2) On each checkpoint save, I also save a copy of the checkpoint in output_dir. This is because I believe we need to keep the last checkpoint to resume from, in addition to the checkpoint of the best model so far, to be able to continue training from the last state. Please see https://github.com/dorooddorood606/reproducibility/blob/f5902af4669bba8aaee326efdb0cd459e25be675/trainer.py#L87

3) I get the last checkpoint in run_glue.py based on the checkpoint saved in the main output_dir; please see https://github.com/dorooddorood606/reproducibility/blob/f5902af4669bba8aaee326efdb0cd459e25be675/run_glue.py#L46

Larger impact of this issue

Fixing this issue with resuming from a checkpoint would also benefit all other users who need this option. I would appreciate it a lot if you could spare me some of your precious time and help with this issue.

stas00 commented 3 years ago

Thank you for your detailed followup, @dorooddorood606. And sharing what experiments you have tried.

I agree that it'd be awesome to be able to resume as if there was no stopping.

Please give us some time; we are going to discuss whether it is feasible to make this happen, as there are many moving parts to consider, and if so we will build it from the ground up.

We will keep you posted.

dorooddorood606 commented 3 years ago

Dear @stas00, thank you. Sure; meanwhile, if you have any ideas or suggestions for me to try, I would greatly appreciate your help. I have searched for this issue a lot, and apart from what the HuggingFace repo has already implemented, I could not find any more tricks to solve it. Thanks a lot in advance for your time and assistance.

stas00 commented 3 years ago

@sgugger is working on it in https://github.com/huggingface/transformers/pull/11582

dorooddorood606 commented 3 years ago

Hi, I cannot express how much I appreciate this. Thank you both very much for working on this. It would be wonderful to have resuming fixed in the Trainer. Thanks for your efforts.

stas00 commented 3 years ago

I totally agree!

All kudos go to @sgugger , who has a much better understanding of the nooks and crannies of the HF Trainer.

dorooddorood606 commented 3 years ago

Dear @sgugger

Thanks for the hard work. I tested it, but the issue is not resolved; especially for small datasets it can cause large changes in the final results. I would appreciate any suggestions on how to resolve the issue:

The original one:

checkpoint: 200
{'eval_loss': 0.44332757592201233, 'eval_accuracy': 0.7941176470588235, 'eval_f1': 0.8521126760563381, 'eval_combined_score': 0.8231151615575808, 'eval_runtime': 1.5259, 'eval_samples_per_second': 133.692, 'eval_average_metrics': 0.8231151615575808, 'epoch': 1.74}

The resumed one:

checkpoint: 200
{'eval_loss': 0.4352119266986847, 'eval_accuracy': 0.7941176470588235, 'eval_f1': 0.85, 'eval_combined_score': 0.8220588235294117, 'eval_runtime': 1.4451, 'eval_samples_per_second': 141.165, 'eval_average_metrics': 0.8220588235294117, 'epoch': 1.74}                                                                                                                                   

The differences accumulate a lot over time

To reproduce please run:

TASK_NAME=mrpc 
python run_glue.py   --model_name_or_path bert-base-cased   --task_name $TASK_NAME   --do_train   --do_eval   --max_seq_length 128   --per_device_train_batch_size 32   --learning_rate 2e-5   --num_train_epochs 3   --output_dir /temp/results   --eval_steps 50 --evaluation_strategy steps --load_best_model_at_end --fp16 --do_test   --save_total_limit 1 

Here are the final results without drop:

[INFO|trainer_pt_utils.py:907] 2021-05-09 17:35:14,973 >> ***** eval metrics *****
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:14,973 >>   epoch                     =                3.0
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:14,973 >>   eval_accuracy             =              0.701
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:14,974 >>   eval_average_metrics      = 0.7605196946035051
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:14,974 >>   eval_combined_score       =             0.7605
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:14,974 >>   eval_f1                   =             0.8201
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:14,974 >>   eval_loss                 =              0.604
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:14,974 >>   eval_mem_cpu_alloc_delta  =                2MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:14,974 >>   eval_mem_cpu_peaked_delta =                2MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:14,974 >>   eval_mem_gpu_alloc_delta  =                0MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:14,974 >>   eval_mem_gpu_peaked_delta =               33MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:14,974 >>   eval_runtime              =         0:00:01.95
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:14,974 >>   eval_samples              =                204
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:14,974 >>   eval_samples_per_second   =            104.502
05/09/2021 17:35:14 - INFO - __main__ -   *** Test ***
[INFO|trainer.py:515] 2021-05-09 17:35:15,036 >> The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: idx, sentence1, sentence2.
[INFO|trainer.py:2089] 2021-05-09 17:35:15,040 >> ***** Running Evaluation *****
[INFO|trainer.py:2091] 2021-05-09 17:35:15,041 >>   Num examples = 204
[INFO|trainer.py:2094] 2021-05-09 17:35:15,041 >>   Batch size = 8
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 26/26 [00:01<00:00, 13.77it/s]
[INFO|trainer_pt_utils.py:907] 2021-05-09 17:35:17,070 >> ***** test metrics *****
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:17,070 >>   epoch                     =                3.0
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:17,070 >>   eval_accuracy             =             0.6863
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:17,070 >>   eval_average_metrics      = 0.7490196078431373
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:17,070 >>   eval_combined_score       =              0.749
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:17,070 >>   eval_f1                   =             0.8118
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:17,070 >>   eval_loss                 =             0.6198
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:17,070 >>   eval_mem_cpu_alloc_delta  =                0MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:17,070 >>   eval_mem_cpu_peaked_delta =                2MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:17,070 >>   eval_mem_gpu_alloc_delta  =                0MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:17,070 >>   eval_mem_gpu_peaked_delta =               33MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:17,070 >>   eval_runtime              =         0:00:01.95
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:17,070 >>   eval_samples_per_second   =            104.281
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:17,070 >>   test_samples              =                204

with breaking in between:

[INFO|trainer_pt_utils.py:907] 2021-05-09 17:41:22,953 >> ***** eval metrics *****
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:22,953 >>   epoch                     =                3.0
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:22,953 >>   eval_accuracy             =             0.6863
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:22,953 >>   eval_average_metrics      = 0.7467517127332861
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:22,953 >>   eval_combined_score       =             0.7468
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:22,953 >>   eval_f1                   =             0.8072
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:22,953 >>   eval_loss                 =             0.6106
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:22,953 >>   eval_mem_cpu_alloc_delta  =                2MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:22,953 >>   eval_mem_cpu_peaked_delta =                1MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:22,953 >>   eval_mem_gpu_alloc_delta  =                0MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:22,953 >>   eval_mem_gpu_peaked_delta =               33MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:22,953 >>   eval_runtime              =         0:00:01.82
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:22,953 >>   eval_samples              =                204
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:22,954 >>   eval_samples_per_second   =            111.603
05/09/2021 17:41:22 - INFO - __main__ -   *** Test ***
[INFO|trainer.py:515] 2021-05-09 17:41:23,014 >> The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: idx, sentence2, sentence1.
[INFO|trainer.py:2089] 2021-05-09 17:41:23,018 >> ***** Running Evaluation *****
[INFO|trainer.py:2091] 2021-05-09 17:41:23,019 >>   Num examples = 204
[INFO|trainer.py:2094] 2021-05-09 17:41:23,019 >>   Batch size = 8
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 26/26 [00:01<00:00, 14.71it/s]
[INFO|trainer_pt_utils.py:907] 2021-05-09 17:41:24,916 >> ***** test metrics *****
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:24,916 >>   epoch                     =                3.0
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:24,916 >>   eval_accuracy             =              0.701
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:24,916 >>   eval_average_metrics      = 0.7572180248246088
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:24,916 >>   eval_combined_score       =             0.7572
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:24,916 >>   eval_f1                   =             0.8135
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:24,916 >>   eval_loss                 =             0.6068
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:24,917 >>   eval_mem_cpu_alloc_delta  =                0MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:24,917 >>   eval_mem_cpu_peaked_delta =                1MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:24,917 >>   eval_mem_gpu_alloc_delta  =                0MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:24,917 >>   eval_mem_gpu_peaked_delta =               33MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:24,917 >>   eval_runtime              =         0:00:01.83
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:24,917 >>   eval_samples_per_second   =            111.455
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:24,917 >>   test_samples              =                204

The difference is large enough that checkpointing still cannot be used. I only have access to GPUs that are interruptible, so I would really appreciate your help.

I have also added CUBLAS_WORKSPACE_CONFIG=:16:8, as described in https://discuss.pytorch.org/t/random-seed-with-external-gpu/102260/3, to make torch deterministic, but it still does not work.
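For reference, the usual determinism setup looks roughly like the sketch below. The environment variable must be set before any CUDA context is created, so setting it before importing torch (or in the shell before launching) is the safe order; the torch calls are guarded here so the snippet also runs where torch is not installed:

```python
import os

# CUBLAS reads this at CUDA-context creation time, so set it before
# torch is imported anywhere in the program.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"

try:
    import torch

    torch.manual_seed(42)
    # Raise an error whenever a nondeterministic op is used, instead of
    # silently producing run-to-run differences.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False
except ImportError:
    pass  # torch not installed in this environment; the env var alone is harmless
```

Note that determinism alone cannot fix resume divergence: it makes each full run repeatable, but a resumed run still diverges if any training state (RNG, optimizer, scheduler, grad scaler) is not restored.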

sgugger commented 3 years ago

Are you sure you are running on a source install of Transformers? The command produces the exact same results on my end.

dorooddorood606 commented 3 years ago

Dear Sylvain, thanks for the response. Yes, I install transformers with pip install git+https://github.com/huggingface/transformers.git

but the results differ a lot. Please kindly run this command and break it after the first checkpoint (step 50):

TASK_NAME=mrpc
python run_glue.py   --model_name_or_path bert-base-cased   --task_name $TASK_NAME   --do_train   --do_eval   --max_seq_length 128   --per_device_train_batch_size 32   --learning_rate 2e-5   --num_train_epochs 3   --output_dir /tmp/  --eval_steps 50 --evaluation_strategy steps --load_best_model_at_end --fp16 --do_test 

sgugger commented 3 years ago

This might be due to the FP16 parameter. Could you check whether you get the same result without FP16? The reason is that we don't save the state of the gradient scaler in mixed-precision training, which is another piece of state that needs to be restored. I can make a PR to fix that tomorrow.

dorooddorood606 commented 3 years ago

Dear Sylvain

Thank you for taking your precious time to answer this issue. You are absolutely right: I checked without fp16 and I confirm it works fine. It would be wonderful to have fp16 mode working as well when you have time.

Thank you for your hard work and great job you do :)

sgugger commented 3 years ago

Problem was fixed on my side with the PR above. Let me know if this is not the case for you.

dorooddorood606 commented 3 years ago

Dear @sgugger

Thank you for the PR. I checked it with the latest version of transformers, and the issue still exists. Please kindly run this command and break it after the first 50 steps:

TASK_NAME=mrpc
python run_glue.py   --model_name_or_path bert-base-cased   --task_name $TASK_NAME   --do_train   --do_eval   --max_seq_length 128   --per_device_train_batch_size 32   --learning_rate 2e-5   --num_train_epochs 3   --output_dir /tmp/$TASK_NAME/  --eval_steps 50 --evaluation_strategy steps --load_best_model_at_end --fp16 --do_test  

Here are the results if you do not break:

After 50 steps:

{'eval_loss': 0.6383711695671082, 'eval_accuracy': 0.6764705882352942, 'eval_f1': 0.8070175438596491, 'eval_combined_score': 0.7417440660474717, 'eval_runtime': 2.1914, 'eval_samples_per_second': 93.091, 'eval_average_metrics': 0.7417440660474717, 'epoch': 0.43}

After 100 steps:
{'eval_loss': 0.6184656023979187, 'eval_accuracy': 0.6862745098039216, 'eval_f1': 0.813953488372093, 'eval_combined_score': 0.7501139990880072, 'eval_runtime': 2.1089, 'eval_samples_per_second': 96.731, 'eval_average_metrics': 0.7501139990880072, 'epoch': 0.87}

If you break after 50 steps:

After 100 steps
{'eval_loss': 0.6308265328407288, 'eval_accuracy': 0.6862745098039216, 'eval_f1': 0.813953488372093, 'eval_combined_score': 0.7501139990880072, 'eval_runtime': 2.1549, 'eval_samples_per_second': 94.668, 'eval_average_metrics': 0.7501139990880072, 'epoch': 0.87}                                                    

The differences accumulate, and by the end the results vary so much that the resumed run is not usable. I would really appreciate it if you could have another look. Could you also kindly reopen this issue?

thanks.

sgugger commented 3 years ago

I sadly cannot reproduce this (I get the exact same results with the command you indicated, using a source install on current master), so at this stage it must come from something in your particular setup.