huggingface / transformers

πŸ€— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Bug in trainer: substantially different results from restarting from a checkpoint and without #11323

Closed dorooddorood606 closed 3 years ago

dorooddorood606 commented 3 years ago

Environment info

Who can help

@sgugger @patrickvonplaten, @patil-suraj

Information

Then I find the last checkpoint to resume from among the files saved in the output directory, as below:

import os

def get_last_checkpoint(output_dir):
    # Resume from output_dir only if a saved model exists there.
    if os.path.exists(os.path.join(output_dir, 'pytorch_model.bin')):
        return output_dir
    return None
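For reference, a more robust way to locate the latest checkpoint is to scan for the `checkpoint-<step>` subdirectories the Trainer writes (this mirrors what `transformers.trainer_utils.get_last_checkpoint` does; the sketch below is a standalone reimplementation, not the library code, and assumes the Trainer's default `checkpoint-<step>` naming):

```python
import os
import re

def find_last_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    checkpoints = [
        d for d in os.listdir(output_dir)
        if pattern.match(d) and os.path.isdir(os.path.join(output_dir, d))
    ]
    if not checkpoints:
        return None
    # Pick the directory with the largest numeric step suffix.
    latest = max(checkpoints, key=lambda d: int(pattern.match(d).group(1)))
    return os.path.join(output_dir, latest)
```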

Here are the results without resuming, evaluated 10 times:

{'loss': 5.0483, 'learning_rate': 6e-07, 'epoch': 0.02}
 10/60000 [00:07<11:11:04, 1.49it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 5.382528305053711, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 0.0, 'mrpc_en_eval_gen_len': 3.0, 'mrpc_en_eval_runtime': 1.8421, 'mrpc_en_eval_samples_per_second': 110.741, 'epoch': 0.22, 'eval_average_metrics': 0.0}
 20/60000 [00:20<11:57:29, 1.39it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 5.180729389190674, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 0.0, 'mrpc_en_eval_gen_len': 3.0, 'mrpc_en_eval_runtime': 1.8179, 'mrpc_en_eval_samples_per_second': 112.218, 'epoch': 0.43, 'eval_average_metrics': 0.0}
 30/60000 [00:33<12:01:13, 1.39it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 4.810805320739746, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 0.0, 'mrpc_en_eval_gen_len': 3.0, 'mrpc_en_eval_runtime': 1.8421, 'mrpc_en_eval_samples_per_second': 110.743, 'epoch': 0.65, 'eval_average_metrics': 0.0}
 40/60000 [00:45<11:17:50, 1.47it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 4.203256607055664, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 0.0, 'mrpc_en_eval_gen_len': 3.0098039215686274, 'mrpc_en_eval_runtime': 2.031, 'mrpc_en_eval_samples_per_second': 100.441, 'epoch': 0.87, 'eval_average_metrics': 0.0}
 50/60000 [00:58<11:42:57, 1.42it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 3.262455463409424, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 0.0, 'mrpc_en_eval_gen_len': 3.0098039215686274, 'mrpc_en_eval_runtime': 2.1069, 'mrpc_en_eval_samples_per_second': 96.825, 'epoch': 1.09, 'eval_average_metrics': 0.0}
 60/60000 [01:13<11:57:15, 1.39it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 1.9655567407608032, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 0.49019607843137253, 'mrpc_en_eval_gen_len': 3.053921568627451, 'mrpc_en_eval_runtime': 2.8657, 'mrpc_en_eval_samples_per_second': 71.186, 'epoch': 1.3, 'eval_average_metrics': 0.24509803921568626}
 70/60000 [01:27<12:14:11, 1.36it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 0.7519775032997131, 'mrpc_en_eval_f1': 18.404907975460123, 'mrpc_en_eval_accuracy': 34.80392156862745, 'mrpc_en_eval_gen_len': 2.9411764705882355, 'mrpc_en_eval_runtime': 2.6193, 'mrpc_en_eval_samples_per_second': 77.884, 'epoch': 1.52, 'eval_average_metrics': 26.60441477204379}
 80/60000 [01:41<12:02:22, 1.38it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 0.4142318665981293, 'mrpc_en_eval_f1': 75.62500000000001, 'mrpc_en_eval_accuracy': 61.76470588235294, 'mrpc_en_eval_gen_len': 2.1176470588235294, 'mrpc_en_eval_runtime': 1.7878, 'mrpc_en_eval_samples_per_second': 114.109, 'epoch': 1.74, 'eval_average_metrics': 68.69485294117648}
 90/60000 [01:54<11:41:23, 1.42it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 0.3786551058292389, 'mrpc_en_eval_f1': 51.18483412322274, 'mrpc_en_eval_accuracy': 49.50980392156863, 'mrpc_en_eval_gen_len': 2.6519607843137254, 'mrpc_en_eval_runtime': 1.8265, 'mrpc_en_eval_samples_per_second': 111.69, 'epoch': 1.96, 'eval_average_metrics': 50.34731902239569}
 100/60000 [02:07<12:01:27, 1.38it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 0.29472649097442627, 'mrpc_en_eval_f1': 71.01449275362319, 'mrpc_en_eval_accuracy': 60.78431372549019, 'mrpc_en_eval_gen_len': 2.3333333333333335, 'mrpc_en_eval_runtime': 1.812, 'mrpc_en_eval_samples_per_second': 112.581, 'epoch': 2.17, 'eval_average_metrics': 65.89940323955669}

Now let's resume from step 40. The first 40 steps give the same results, but after resuming the results differ a lot:

 40/60000 [00:07<9:49:41, 1.69it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 4.203643321990967, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 0.0, 'mrpc_en_eval_gen_len': 3.0098039215686274, 'mrpc_en_eval_runtime': 2.0033, 'mrpc_en_eval_samples_per_second': 101.834, 'epoch': 0.87, 'eval_average_metrics': 0.0}
 50/60000 [00:21<12:09:50, 1.37it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 3.2706634998321533, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 0.0, 'mrpc_en_eval_gen_len': 3.0098039215686274, 'mrpc_en_eval_runtime': 2.2048, 'mrpc_en_eval_samples_per_second': 92.524, 'epoch': 1.09, 'eval_average_metrics': 0.0}
 60/60000 [00:35<12:27:28, 1.34it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 1.9863247871398926, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 0.49019607843137253, 'mrpc_en_eval_gen_len': 3.019607843137255, 'mrpc_en_eval_runtime': 2.4126, 'mrpc_en_eval_samples_per_second': 84.557, 'epoch': 1.3, 'eval_average_metrics': 0.24509803921568626}
 70/60000 [00:49<12:02:36, 1.38it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 0.7721647620201111, 'mrpc_en_eval_f1': 18.404907975460123, 'mrpc_en_eval_accuracy': 34.80392156862745, 'mrpc_en_eval_gen_len': 2.946078431372549, 'mrpc_en_eval_runtime': 2.5655, 'mrpc_en_eval_samples_per_second': 79.518, 'epoch': 1.52, 'eval_average_metrics': 26.60441477204379}
 80/60000 [01:02<12:08:06, 1.37it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 0.42692506313323975, 'mrpc_en_eval_f1': 74.28571428571428, 'mrpc_en_eval_accuracy': 60.29411764705882, 'mrpc_en_eval_gen_len': 2.142156862745098, 'mrpc_en_eval_runtime': 1.8243, 'mrpc_en_eval_samples_per_second': 111.824, 'epoch': 1.74, 'eval_average_metrics': 67.28991596638654}
 90/60000 [01:16<12:00:53, 1.39it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 0.39015302062034607, 'mrpc_en_eval_f1': 45.685279187817265, 'mrpc_en_eval_accuracy': 47.549019607843135, 'mrpc_en_eval_gen_len': 2.7205882352941178, 'mrpc_en_eval_runtime': 1.856, 'mrpc_en_eval_samples_per_second': 109.915, 'epoch': 1.96, 'eval_average_metrics': 46.617149397830204}
 100/60000 [01:31<12:02:17, 1.38it/s] ***** Running Evaluation *****  Num examples = 204, Batch size = 80
{'mrpc_en_eval_loss': 0.30966323614120483, 'mrpc_en_eval_f1': 68.48249027237354, 'mrpc_en_eval_accuracy': 60.29411764705882, 'mrpc_en_eval_gen_len': 2.426470588235294, 'mrpc_en_eval_runtime': 1.8275, 'mrpc_en_eval_samples_per_second': 111.625, 'epoch': 2.17, 'eval_average_metrics': 64.38830395971618}

Expected behavior

Resuming from a checkpoint should produce the same results as training without interruption.

Thank you for your help @sgugger

sgugger commented 3 years ago

You will only have perfectly reproducible results using checkpointing if the only randomness comes from the shuffling in your data (this is enforced by the CI). The way this is programmed inside the Trainer is to go through each epoch before the current one (which triggers the random shuffling) and then each batch (which puts you in the same position as before the checkpoint).

Since your results differ slightly, it looks like there are other random calls in your training code, which you did not share. There is no way to have the exact same results while resuming from a checkpoint if this is the case.
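The replay described above can be sketched in plain Python (an illustration of the idea, not the Trainer's actual code; the data, batch size, and seed below are hypothetical):

```python
import random

def batches_after_resume(data, batch_size, resume_step, epochs, seed=42):
    """Replay the per-epoch shuffles, skip the first `resume_step` batches,
    and yield the remaining batches in the same order as an uninterrupted run."""
    rng = random.Random(seed)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)           # same shuffle sequence as the original run
        for i in range(0, len(order), batch_size):
            step += 1
            if step <= resume_step:  # fast-forward through already-seen batches
                continue
            yield order[i:i + batch_size]
```

Because the shuffles are replayed from the same seed, the batches yielded after skipping match the tail of the uninterrupted run exactly; any extra RNG consumption in user code breaks this alignment.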

dorooddorood606 commented 3 years ago

Hi @sgugger, thanks for the reply. I do not have any other randomness in my code: I am using the run_seq2seq.py script to train T5 models on the MRPC dataset, without modifications. I would really appreciate your help on this issue, as getting this to work is crucial for me. Thanks a lot.

I only initialize the weights randomly, but I assume Hugging Face takes care of setting seeds, and there is really no other randomness.
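That assumption is reasonable: the Trainer calls `set_seed` at the start of training, so seeded random weight initialization is itself reproducible. A toy illustration of the principle (hypothetical helper, not transformers code):

```python
import random

def init_weights(n, seed):
    # Toy stand-in for random weight initialization: two runs with the
    # same seed produce identical "weights"; different seeds do not.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.02) for _ in range(n)]
```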

dorooddorood606 commented 3 years ago

@sgugger I confirm the same issue also exists when training vanilla T5. Here is the run for t5-base for 100 steps:

{'loss': 6.1045, 'learning_rate': 6e-07, 'epoch': 0.02}                                                                                                                                       
  0%|                                                                                                                                                   | 10/60000 [00:06<10:25:12,  1.60it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
                                                                                                                                                                                             ### n_samples  204β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:01<00:00,  2.44it/s]
{'mrpc_en_eval_loss': 6.924696445465088, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 0.0, 'mrpc_en_eval_gen_len': 3.3137254901960786, 'mrpc_en_eval_runtime': 1.9287, 'mrpc_en_eval_samples_per_second': 105.771, 'epoch': 0.22}                                                                                                                                                       
{'mrpc_en_eval_loss': 6.924696445465088, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 0.0, 'mrpc_en_eval_gen_len': 3.3137254901960786, 'mrpc_en_eval_runtime': 1.9287, 'mrpc_en_eval_samples_per_second': 105.771, 'epoch': 0.22, 'eval_average_metrics': 0.0}                                                                                                                          
  0%|                                                                                                                                                   | 20/60000 [00:27<13:37:00,  1.22it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:01<00:00,  2.49it/s]
{'mrpc_en_eval_loss': 5.22016716003418, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 0.0, 'mrpc_en_eval_gen_len': 3.764705882352941, 'mrpc_en_eval_runtime': 1.8761, 'mrpc_en_eval_samples_per_second': 108.737, 'epoch': 0.43}                                                                                                                                                         
{'mrpc_en_eval_loss': 5.22016716003418, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 0.0, 'mrpc_en_eval_gen_len': 3.764705882352941, 'mrpc_en_eval_runtime': 1.8761, 'mrpc_en_eval_samples_per_second': 108.737, 'epoch': 0.43, 'eval_average_metrics': 0.0}                                                                                                                            
  0%|                                                                                                                                                   | 30/60000 [00:47<12:58:53,  1.28it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:01<00:00,  2.37it/s]
{'mrpc_en_eval_loss': 1.3517154455184937, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 18.137254901960784, 'mrpc_en_eval_gen_len': 3.2205882352941178, 'mrpc_en_eval_runtime': 1.9678, 'mrpc_en_eval_samples_per_second': 103.67, 'epoch': 0.65}                                                                                                                                        
{'mrpc_en_eval_loss': 1.3517154455184937, 'mrpc_en_eval_f1': 0.0, 'mrpc_en_eval_accuracy': 18.137254901960784, 'mrpc_en_eval_gen_len': 3.2205882352941178, 'mrpc_en_eval_runtime': 1.9678, 'mrpc_en_eval_samples_per_second': 103.67, 'epoch': 0.65, 'eval_average_metrics': 9.068627450980392}                                                                                             
  0%|                                                                                                                                                   | 40/60000 [01:08<13:00:06,  1.28it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:00<00:00,  4.62it/s]
{'mrpc_en_eval_loss': 0.4487058222293854, 'mrpc_en_eval_f1': 81.3953488372093, 'mrpc_en_eval_accuracy': 68.62745098039215, 'mrpc_en_eval_gen_len': 2.0, 'mrpc_en_eval_runtime': 1.0261, 'mrpc_en_eval_samples_per_second': 198.811, 'epoch': 0.87}                                                                                                                                          
{'mrpc_en_eval_loss': 0.4487058222293854, 'mrpc_en_eval_f1': 81.3953488372093, 'mrpc_en_eval_accuracy': 68.62745098039215, 'mrpc_en_eval_gen_len': 2.0, 'mrpc_en_eval_runtime': 1.0261, 'mrpc_en_eval_samples_per_second': 198.811, 'epoch': 0.87, 'eval_average_metrics': 75.01139990880073}                                                                                               
  0%|                                                                                                                                                   | 50/60000 [01:27<12:31:06,  1.33it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:00<00:00,  3.72it/s]
{'mrpc_en_eval_loss': 0.25695744156837463, 'mrpc_en_eval_f1': 83.79204892966361, 'mrpc_en_eval_accuracy': 74.01960784313727, 'mrpc_en_eval_gen_len': 2.0833333333333335, 'mrpc_en_eval_runtime': 1.2653, 'mrpc_en_eval_samples_per_second': 161.228, 'epoch': 1.09}                                                                                                                         
{'mrpc_en_eval_loss': 0.25695744156837463, 'mrpc_en_eval_f1': 83.79204892966361, 'mrpc_en_eval_accuracy': 74.01960784313727, 'mrpc_en_eval_gen_len': 2.0833333333333335, 'mrpc_en_eval_runtime': 1.2653, 'mrpc_en_eval_samples_per_second': 161.228, 'epoch': 1.09, 'eval_average_metrics': 78.90582838640043}                                                                              
  0%|▏                                                                                                                                                  | 60/60000 [01:47<12:36:18,  1.32it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:00<00:00,  4.29it/s]
{'mrpc_en_eval_loss': 0.27573078870773315, 'mrpc_en_eval_f1': 82.11143695014663, 'mrpc_en_eval_accuracy': 70.09803921568627, 'mrpc_en_eval_gen_len': 2.014705882352941, 'mrpc_en_eval_runtime': 1.1521, 'mrpc_en_eval_samples_per_second': 177.063, 'epoch': 1.3}                                                                                                                           
{'mrpc_en_eval_loss': 0.27573078870773315, 'mrpc_en_eval_f1': 82.11143695014663, 'mrpc_en_eval_accuracy': 70.09803921568627, 'mrpc_en_eval_gen_len': 2.014705882352941, 'mrpc_en_eval_runtime': 1.1521, 'mrpc_en_eval_samples_per_second': 177.063, 'epoch': 1.3, 'eval_average_metrics': 76.10473808291644}                                                                                
  0%|▏                                                                                                                                                  | 70/60000 [02:09<13:15:00,  1.26it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:00<00:00,  3.75it/s]
{'mrpc_en_eval_loss': 0.16758881509304047, 'mrpc_en_eval_f1': 87.04318936877075, 'mrpc_en_eval_accuracy': 80.88235294117648, 'mrpc_en_eval_gen_len': 2.2107843137254903, 'mrpc_en_eval_runtime': 1.2665, 'mrpc_en_eval_samples_per_second': 161.075, 'epoch': 1.52}                                                                                                                         
{'mrpc_en_eval_loss': 0.16758881509304047, 'mrpc_en_eval_f1': 87.04318936877075, 'mrpc_en_eval_accuracy': 80.88235294117648, 'mrpc_en_eval_gen_len': 2.2107843137254903, 'mrpc_en_eval_runtime': 1.2665, 'mrpc_en_eval_samples_per_second': 161.075, 'epoch': 1.52, 'eval_average_metrics': 83.96277115497361}                                                                              
  0%|▏                                                                                                                                                  | 80/60000 [02:30<13:18:49,  1.25it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:00<00:00,  3.64it/s]
{'mrpc_en_eval_loss': 0.1627584546804428, 'mrpc_en_eval_f1': 89.86486486486486, 'mrpc_en_eval_accuracy': 85.29411764705883, 'mrpc_en_eval_gen_len': 2.235294117647059, 'mrpc_en_eval_runtime': 1.2734, 'mrpc_en_eval_samples_per_second': 160.198, 'epoch': 1.74}                                                                                                                           
{'mrpc_en_eval_loss': 0.1627584546804428, 'mrpc_en_eval_f1': 89.86486486486486, 'mrpc_en_eval_accuracy': 85.29411764705883, 'mrpc_en_eval_gen_len': 2.235294117647059, 'mrpc_en_eval_runtime': 1.2734, 'mrpc_en_eval_samples_per_second': 160.198, 'epoch': 1.74, 'eval_average_metrics': 87.57949125596184}                                                                                
  0%|▏                                                                                                                                                  | 90/60000 [02:50<12:35:38,  1.32it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:00<00:00,  3.71it/s]
{'mrpc_en_eval_loss': 0.178583025932312, 'mrpc_en_eval_f1': 90.78014184397163, 'mrpc_en_eval_accuracy': 87.25490196078431, 'mrpc_en_eval_gen_len': 2.303921568627451, 'mrpc_en_eval_runtime': 1.2507, 'mrpc_en_eval_samples_per_second': 163.108, 'epoch': 1.96}                                                                                                                            
{'mrpc_en_eval_loss': 0.178583025932312, 'mrpc_en_eval_f1': 90.78014184397163, 'mrpc_en_eval_accuracy': 87.25490196078431, 'mrpc_en_eval_gen_len': 2.303921568627451, 'mrpc_en_eval_runtime': 1.2507, 'mrpc_en_eval_samples_per_second': 163.108, 'epoch': 1.96, 'eval_average_metrics': 89.01752190237798}                                                                                 
  0%|▏                                                                                                                                                 | 100/60000 [03:09<12:29:36,  1.33it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:00<00:00,  3.70it/s]
{'mrpc_en_eval_loss': 0.18296584486961365, 'mrpc_en_eval_f1': 88.72727272727272, 'mrpc_en_eval_accuracy': 84.80392156862744, 'mrpc_en_eval_gen_len': 2.338235294117647, 'mrpc_en_eval_runtime': 1.2762, 'mrpc_en_eval_samples_per_second': 159.845, 'epoch': 2.17}                                                                                                                          
{'mrpc_en_eval_loss': 0.18296584486961365, 'mrpc_en_eval_f1': 88.72727272727272, 'mrpc_en_eval_accuracy': 84.80392156862744, 'mrpc_en_eval_gen_len': 2.338235294117647, 'mrpc_en_eval_runtime': 1.2762, 'mrpc_en_eval_samples_per_second': 159.845, 'epoch': 2.17, 'eval_average_metrics': 86.76559714795007}   

Now let's see the results of t5-base after resuming from step 60:

  0%|▏                                                                                                                                                   | 60/60000 [00:06<9:21:55,  1.78it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:00<00:00,  4.00it/s]
{'mrpc_en_eval_loss': 0.2794328033924103, 'mrpc_en_eval_f1': 82.11143695014663, 'mrpc_en_eval_accuracy': 70.09803921568627, 'mrpc_en_eval_gen_len': 2.014705882352941, 'mrpc_en_eval_runtime': 1.2224, 'mrpc_en_eval_samples_per_second': 166.887, 'epoch': 1.3}                                                                                                                            
{'mrpc_en_eval_loss': 0.2794328033924103, 'mrpc_en_eval_f1': 82.11143695014663, 'mrpc_en_eval_accuracy': 70.09803921568627, 'mrpc_en_eval_gen_len': 2.014705882352941, 'mrpc_en_eval_runtime': 1.2224, 'mrpc_en_eval_samples_per_second': 166.887, 'epoch': 1.3, 'eval_average_metrics': 76.10473808291644}                                                                                 
  0%|▏                                                                                                                                                  | 70/60000 [00:28<13:22:56,  1.24it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:00<00:00,  3.59it/s]
{'mrpc_en_eval_loss': 0.16057834029197693, 'mrpc_en_eval_f1': 88.43537414965986, 'mrpc_en_eval_accuracy': 83.33333333333334, 'mrpc_en_eval_gen_len': 2.2450980392156863, 'mrpc_en_eval_runtime': 1.3058, 'mrpc_en_eval_samples_per_second': 156.222, 'epoch': 1.52}                                                                                                                         
{'mrpc_en_eval_loss': 0.16057834029197693, 'mrpc_en_eval_f1': 88.43537414965986, 'mrpc_en_eval_accuracy': 83.33333333333334, 'mrpc_en_eval_gen_len': 2.2450980392156863, 'mrpc_en_eval_runtime': 1.3058, 'mrpc_en_eval_samples_per_second': 156.222, 'epoch': 1.52, 'eval_average_metrics': 85.8843537414966}                                                                               
  0%|▏                                                                                                                                                  | 80/60000 [00:48<12:55:04,  1.29it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:00<00:00,  3.69it/s]
{'mrpc_en_eval_loss': 0.15957750380039215, 'mrpc_en_eval_f1': 88.81118881118881, 'mrpc_en_eval_accuracy': 84.31372549019608, 'mrpc_en_eval_gen_len': 2.284313725490196, 'mrpc_en_eval_runtime': 1.291, 'mrpc_en_eval_samples_per_second': 158.021, 'epoch': 1.74}                                                                                                                           
{'mrpc_en_eval_loss': 0.15957750380039215, 'mrpc_en_eval_f1': 88.81118881118881, 'mrpc_en_eval_accuracy': 84.31372549019608, 'mrpc_en_eval_gen_len': 2.284313725490196, 'mrpc_en_eval_runtime': 1.291, 'mrpc_en_eval_samples_per_second': 158.021, 'epoch': 1.74, 'eval_average_metrics': 86.56245715069244}                                                                                
  0%|▏                                                                                                                                                  | 90/60000 [01:11<13:47:58,  1.21it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:00<00:00,  3.67it/s]
{'mrpc_en_eval_loss': 0.19618992507457733, 'mrpc_en_eval_f1': 87.17948717948718, 'mrpc_en_eval_accuracy': 82.84313725490196, 'mrpc_en_eval_gen_len': 2.3480392156862746, 'mrpc_en_eval_runtime': 1.2811, 'mrpc_en_eval_samples_per_second': 159.235, 'epoch': 1.96}                                                                                                                         
{'mrpc_en_eval_loss': 0.19618992507457733, 'mrpc_en_eval_f1': 87.17948717948718, 'mrpc_en_eval_accuracy': 82.84313725490196, 'mrpc_en_eval_gen_len': 2.3480392156862746, 'mrpc_en_eval_runtime': 1.2811, 'mrpc_en_eval_samples_per_second': 159.235, 'epoch': 1.96, 'eval_average_metrics': 85.01131221719457}                                                                              
  0%|▏                                                                                                                                                 | 100/60000 [01:33<12:55:11,  1.29it/s]***** Running Evaluation *****
  Num examples = 204
  Batch size = 80
### n_samples  204 | 3/3 [00:00<00:00,  3.75it/s]
{'mrpc_en_eval_loss': 0.21464459598064423, 'mrpc_en_eval_f1': 87.96992481203009, 'mrpc_en_eval_accuracy': 84.31372549019608, 'mrpc_en_eval_gen_len': 2.3823529411764706, 'mrpc_en_eval_runtime': 1.2654, 'mrpc_en_eval_samples_per_second': 161.214, 'epoch': 2.17}                                                                                                                         
{'mrpc_en_eval_loss': 0.21464459598064423, 'mrpc_en_eval_f1': 87.96992481203009, 'mrpc_en_eval_accuracy': 84.31372549019608, 'mrpc_en_eval_gen_len': 2.3823529411764706, 'mrpc_en_eval_runtime': 1.2654, 'mrpc_en_eval_samples_per_second': 161.214, 'epoch': 2.17, 'eval_average_metrics': 86.14182515111308}                                                                              
dorooddorood606 commented 3 years ago

Dear @sgugger @patrickvonplaten @patil-suraj, could you kindly look into this issue? It is really important to have checkpointing working, as in many cases one cannot train a model in a single uninterrupted run. Thanks.

stas00 commented 3 years ago

Following up on @sgugger's suggestion: if I understand the methodology correctly, it doesn't quite apply to the generic checkpointing method, but one could subclass the Trainer to save the RNG state at the moment of saving the checkpoint, and then restore the same RNG state on resume. You'd probably need to do that for at least Python and PyTorch (and NumPy and other libraries if you use them).

@dorooddorood606, look into:

import random
import numpy
import torch

# before saving
py_rng_state = random.getstate()
pt_rng_state = torch.get_rng_state()
np_rng_state = numpy.random.get_state()

# post resume
random.setstate(py_rng_state)
torch.set_rng_state(pt_rng_state)
numpy.random.set_state(np_rng_state)

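A self-contained sketch of bundling these states with a checkpoint, assuming a simple pickle file written next to the model (the file name and helper names are illustrative, not Trainer API):

```python
import pickle
import random

import numpy as np
import torch

def save_rng_states(path):
    """Capture the Python, NumPy, and PyTorch RNG states to a file."""
    states = {
        "python": random.getstate(),
        "numpy": np.random.get_state(),
        "torch": torch.get_rng_state(),
    }
    with open(path, "wb") as f:
        pickle.dump(states, f)

def load_rng_states(path):
    """Restore the RNG states captured by save_rng_states()."""
    with open(path, "rb") as f:
        states = pickle.load(f)
    random.setstate(states["python"])
    np.random.set_state(states["numpy"])
    torch.set_rng_state(states["torch"])
```

On resume, calling `load_rng_states()` before the first random draw should make the generators continue exactly where the interrupted run left off; a quick sanity check is that the first random numbers drawn after saving and after restoring are identical.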
dorooddorood606 commented 3 years ago

Dear @stas00
Thank you very much for following up on this. I implemented this suggestion, and I still see the discrepancies after resuming from checkpoints. I emphasize that I tried with vanilla t5-base, so no changes to the Hugging Face code. In my own code, I have some initialization, which is the only part with randomness; I would be grateful if you could tell me whether there might be an issue with these lines:

nn.init.normal_(linear_layer.weight, std=std)
nn.init.zeros_(linear_layer.bias)

but since vanilla t5-base also has this issue, I was wondering whether you think this might be a general issue in the Trainer code? I would greatly appreciate it if you could kindly look into this.

thanks a lot in advance for the great work you do and your hard efforts.

stas00 commented 3 years ago

Thank you very much for following up on this, I implemented this suggestion,

Could we first validate that this was done correctly?

To test, you can debug-print a random number generated immediately after saving the checkpoint and RNG state, and do the same right after the checkpoint and RNG states are restored when you run the program a second time with resume. If you get the same number, we know you restored the RNG state correctly. You probably want to check one number for torch and one for Python.

I have some initialization which is the only part with randomness, I would be grateful if you could tell me if there might be an issue with these lines:


nn.init.normal_(linear_layer.weight, std=std)

This line would definitely impact the RNG state. If you're uncertain you can always debug and generate a random number with that line of code and w/o it and see if it's the same.
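To make the point concrete, here is a quick check (a toy example, not taken from the issue's code) showing that an `nn.init` call advances the global torch RNG, so every draw after it changes:

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
without_init = torch.rand(1)

torch.manual_seed(42)
layer = nn.Linear(4, 3)                  # constructing the layer already draws from the RNG
nn.init.normal_(layer.weight, std=0.02)  # ...and the explicit init draws again
with_init = torch.rand(1)

# The RNG state has advanced, so the next draw differs between the two runs.
assert not torch.equal(without_init, with_init)
```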

So, for example, one workaround is to restore the RNG state after your custom code above.

Or better, don't re-run this line: save its outcome with the checkpoint and restore it on subsequent runs, rather than needing to fiddle with RNG states.
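A sketch of that init-once-and-persist idea, with a hypothetical helper and file path (the names are illustrative): draw the random init a single time, save the result, and reload it on resume instead of re-drawing from the RNG.

```python
import os

import torch
import torch.nn as nn

def init_or_restore(linear_layer, path, std=0.02):
    """Run the random init exactly once; later calls restore the saved weights."""
    if os.path.exists(path):
        # Resuming: reload the previously drawn weights, leaving the RNG untouched.
        linear_layer.load_state_dict(torch.load(path))
    else:
        # First run: draw the random init and persist the outcome.
        nn.init.normal_(linear_layer.weight, std=std)
        nn.init.zeros_(linear_layer.bias)
        torch.save(linear_layer.state_dict(), path)
    return linear_layer
```

Every run then ends up with identical weights for this layer, no matter how many times training was interrupted and resumed.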

dorooddorood606 commented 3 years ago

Dear @stas00, first I would like to thank you very much for taking your precious time to answer my question. I observe that my code generates different results between runs. I was assuming that since the Hugging Face run_glue.py script sets the seeds initially, randomness was taken care of. All my code adds is some initialization, like what I sent, all coming after the set_seed() call. Considering only a single run, putting checkpointing aside, could you kindly tell me whether one needs to set the seed before each initialization? Should I move them all into the init_weights function of BERT? I appreciate your response a lot. Thank you.

stas00 commented 3 years ago

First a few requests, @dorooddorood606

Thank you!


Now, let's try to summarize what doesn't work.

  1. From what I understand, you extended the library with your own modifications, and now you're experiencing inconsistent randomness issues when you resume training, correct?

    Does the library produce the expected results if you remove your modifications?

  2. Is there an easy way to provide a reproducible example that shows how the main library works correctly and then breaks with your modification? Perhaps a simple Google Colab notebook? If you do that, please make sure it's very easy to quickly see what the problem is and where it comes from. So no production-level hundreds of lines of code, but toy examples if possible.

dorooddorood606 commented 3 years ago

Dear @stas00, thank you for the reminder; I will follow the points you mentioned. I thought there was also a bug in the trainer, as I was observing it for an unmodified bert-base model too, but that randomness issue was resolved by upgrading to version 4.6.0 of transformers.

dorooddorood606 commented 3 years ago

Dear @stas00

I appreciate your input on the issue of reproducibility from resuming from checkpoints a lot. I tried to follow your points to state it in a clearer way.

Problem statement

If a user trains a model for some steps and then reloads it from a checkpoint, the results differ from training the model without interruption.

How to reproduce the issue

Transformers version: I am using the 4.6.0dev version of transformers:

https://github.com/huggingface/transformers/commit/04ab2ca639ee6fd1002ce0a498d892245f0c9093

Please clone this repository, which contains a minimal example:

git clone git@github.com:dorooddorood606/reproducibility.git

To run the code, please run the command below; between runs, after each save of the model (every 50 steps), kill the process 2-3 times. Then compare the final results of running all iterations with resuming against training without any interruption. The results will differ.

TASK_NAME=mrpc
python run_glue.py   --model_name_or_path bert-base-cased   --task_name $TASK_NAME   --do_train   --do_eval   --max_seq_length 128   --per_device_train_batch_size 32   --learning_rate 2e-5   --num_train_epochs 2   --output_dir /temp/$TASK_NAME/  --eval_steps 50 --evaluation_strategy steps --load_best_model_at_end --fp16 --do_predict

Please let me know if you need any further information on this.

Modifications made to the Trainer class to make it reproducible:

I apply the following modifications to the Trainer class: 1) Following your suggestions, I save the random states and reload them before reloading the checkpoint in the Trainer class. Please see https://github.com/dorooddorood606/reproducibility/blob/f5902af4669bba8aaee326efdb0cd459e25be675/trainer.py#L126 and https://github.com/dorooddorood606/reproducibility/blob/f5902af4669bba8aaee326efdb0cd459e25be675/trainer.py#L200

2) On each checkpoint save, I also save a copy of the checkpoint in output_dir. This is because I believe we need to keep the last checkpoint to resume from, in addition to the checkpoint of the best model so far, to be able to continue training from the last state. Please see https://github.com/dorooddorood606/reproducibility/blob/f5902af4669bba8aaee326efdb0cd459e25be675/trainer.py#L87

3) I get the last checkpoint in run_glue.py based on the checkpoint saved in the main output_dir; please see https://github.com/dorooddorood606/reproducibility/blob/f5902af4669bba8aaee326efdb0cd459e25be675/run_glue.py#L46

Larger impact of this issue

Fixing this issue with resuming from a checkpoint would also benefit all other users who need this option. I would appreciate it a lot if you could spare me some of your precious time and help with this issue.

stas00 commented 3 years ago

Thank you for your detailed followup, @dorooddorood606. And sharing what experiments you have tried.

I agree that it'd be awesome to be able to resume as if there was no stopping.

Please give us some time; we are going to discuss whether it is feasible to make this happen, as there are many moving parts to consider, and if so we will build it from the ground up.

We will keep you posted.

dorooddorood606 commented 3 years ago

Dear @stas00, thank you. Sure; meanwhile, if you have any ideas or suggestions for me to try, I would greatly appreciate your help. I have searched for this issue a lot, and apart from what the HuggingFace repo has already implemented, I could not find any more tricks to solve it. Thanks a lot in advance for your time and assistance.

stas00 commented 3 years ago

@sgugger is working on it in https://github.com/huggingface/transformers/pull/11582

dorooddorood606 commented 3 years ago

Hi, I cannot express how much I appreciate this. Thank you both very much for working on this. It would be wonderful to have resuming fixed in the Trainer. Thanks for your efforts.

stas00 commented 3 years ago

I totally agree!

All kudos go to @sgugger , who has a much better understanding of the nooks and crannies of the HF Trainer.

dorooddorood606 commented 3 years ago

Dear @sgugger

Thanks for the hard work. I tested it, but the issue is not resolved; especially for small datasets it can cause large changes in the final results. I would appreciate any suggestions on how to resolve the issue:

The original one:

checkpoint: 200
{'eval_loss': 0.44332757592201233, 'eval_accuracy': 0.7941176470588235, 'eval_f1': 0.8521126760563381, 'eval_combined_score': 0.8231151615575808, 'eval_runtime': 1.5259, 'eval_samples_per_second': 133.692, 'eval_average_metrics': 0.8231151615575808, 'epoch': 1.74}

The resumed one:

checkpoint: 200
{'eval_loss': 0.4352119266986847, 'eval_accuracy': 0.7941176470588235, 'eval_f1': 0.85, 'eval_combined_score': 0.8220588235294117, 'eval_runtime': 1.4451, 'eval_samples_per_second': 141.165, 'eval_average_metrics': 0.8220588235294117, 'epoch': 1.74}                                                                                                                                   

The differences accumulate a lot over time

To reproduce please run:

TASK_NAME=mrpc 
python run_glue.py   --model_name_or_path bert-base-cased   --task_name $TASK_NAME   --do_train   --do_eval   --max_seq_length 128   --per_device_train_batch_size 32   --learning_rate 2e-5   --num_train_epochs 3   --output_dir /temp/results   --eval_steps 50 --evaluation_strategy steps --load_best_model_at_end --fp16 --do_test   --save_total_limit 1 

Here are the final results without drop:

[INFO|trainer_pt_utils.py:907] 2021-05-09 17:35:14,973 >> ***** eval metrics *****
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:14,973 >>   epoch                     =                3.0
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:14,973 >>   eval_accuracy             =              0.701
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:14,974 >>   eval_average_metrics      = 0.7605196946035051
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:14,974 >>   eval_combined_score       =             0.7605
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:14,974 >>   eval_f1                   =             0.8201
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:14,974 >>   eval_loss                 =              0.604
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:14,974 >>   eval_mem_cpu_alloc_delta  =                2MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:14,974 >>   eval_mem_cpu_peaked_delta =                2MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:14,974 >>   eval_mem_gpu_alloc_delta  =                0MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:14,974 >>   eval_mem_gpu_peaked_delta =               33MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:14,974 >>   eval_runtime              =         0:00:01.95
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:14,974 >>   eval_samples              =                204
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:14,974 >>   eval_samples_per_second   =            104.502
05/09/2021 17:35:14 - INFO - __main__ -   *** Test ***
[INFO|trainer.py:515] 2021-05-09 17:35:15,036 >> The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: idx, sentence1, sentence2.
[INFO|trainer.py:2089] 2021-05-09 17:35:15,040 >> ***** Running Evaluation *****
[INFO|trainer.py:2091] 2021-05-09 17:35:15,041 >>   Num examples = 204
[INFO|trainer.py:2094] 2021-05-09 17:35:15,041 >>   Batch size = 8
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 26/26 [00:01<00:00, 13.77it/s]
[INFO|trainer_pt_utils.py:907] 2021-05-09 17:35:17,070 >> ***** test metrics *****
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:17,070 >>   epoch                     =                3.0
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:17,070 >>   eval_accuracy             =             0.6863
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:17,070 >>   eval_average_metrics      = 0.7490196078431373
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:17,070 >>   eval_combined_score       =              0.749
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:17,070 >>   eval_f1                   =             0.8118
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:17,070 >>   eval_loss                 =             0.6198
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:17,070 >>   eval_mem_cpu_alloc_delta  =                0MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:17,070 >>   eval_mem_cpu_peaked_delta =                2MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:17,070 >>   eval_mem_gpu_alloc_delta  =                0MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:17,070 >>   eval_mem_gpu_peaked_delta =               33MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:17,070 >>   eval_runtime              =         0:00:01.95
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:17,070 >>   eval_samples_per_second   =            104.281
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:35:17,070 >>   test_samples              =                204

with breaking in between:

[INFO|trainer_pt_utils.py:907] 2021-05-09 17:41:22,953 >> ***** eval metrics *****
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:22,953 >>   epoch                     =                3.0
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:22,953 >>   eval_accuracy             =             0.6863
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:22,953 >>   eval_average_metrics      = 0.7467517127332861
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:22,953 >>   eval_combined_score       =             0.7468
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:22,953 >>   eval_f1                   =             0.8072
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:22,953 >>   eval_loss                 =             0.6106
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:22,953 >>   eval_mem_cpu_alloc_delta  =                2MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:22,953 >>   eval_mem_cpu_peaked_delta =                1MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:22,953 >>   eval_mem_gpu_alloc_delta  =                0MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:22,953 >>   eval_mem_gpu_peaked_delta =               33MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:22,953 >>   eval_runtime              =         0:00:01.82
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:22,953 >>   eval_samples              =                204
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:22,954 >>   eval_samples_per_second   =            111.603
05/09/2021 17:41:22 - INFO - __main__ -   *** Test ***
[INFO|trainer.py:515] 2021-05-09 17:41:23,014 >> The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: idx, sentence2, sentence1.
[INFO|trainer.py:2089] 2021-05-09 17:41:23,018 >> ***** Running Evaluation *****
[INFO|trainer.py:2091] 2021-05-09 17:41:23,019 >>   Num examples = 204
[INFO|trainer.py:2094] 2021-05-09 17:41:23,019 >>   Batch size = 8
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 26/26 [00:01<00:00, 14.71it/s]
[INFO|trainer_pt_utils.py:907] 2021-05-09 17:41:24,916 >> ***** test metrics *****
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:24,916 >>   epoch                     =                3.0
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:24,916 >>   eval_accuracy             =              0.701
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:24,916 >>   eval_average_metrics      = 0.7572180248246088
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:24,916 >>   eval_combined_score       =             0.7572
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:24,916 >>   eval_f1                   =             0.8135
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:24,916 >>   eval_loss                 =             0.6068
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:24,917 >>   eval_mem_cpu_alloc_delta  =                0MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:24,917 >>   eval_mem_cpu_peaked_delta =                1MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:24,917 >>   eval_mem_gpu_alloc_delta  =                0MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:24,917 >>   eval_mem_gpu_peaked_delta =               33MB
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:24,917 >>   eval_runtime              =         0:00:01.83
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:24,917 >>   eval_samples_per_second   =            111.455
[INFO|trainer_pt_utils.py:912] 2021-05-09 17:41:24,917 >>   test_samples              =                204

The difference is large enough that checkpointing still cannot be used. I only have access to GPUs that are interruptible, so I would really appreciate your help.

I have also added CUBLAS_WORKSPACE_CONFIG=:16:8, as described in https://discuss.pytorch.org/t/random-seed-with-external-gpu/102260/3, to make torch deterministic, but it still does not work.
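For reference, the usual determinism setup looks roughly like the sketch below. The environment variable must be set before any CUDA context is created, so setting it before importing torch (or in the shell before launching) is the safe order; the torch calls are guarded here so the snippet also runs where torch is not installed:

```python
import os

# CUBLAS reads this at CUDA-context creation time, so set it before
# torch is imported anywhere in the program.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"

try:
    import torch

    torch.manual_seed(42)
    # Raise an error whenever a nondeterministic op is used, instead of
    # silently producing run-to-run differences.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False
except ImportError:
    pass  # torch not installed in this environment; the env var alone is harmless
```

Note that determinism alone cannot fix resume divergence: it makes each full run repeatable, but a resumed run still diverges if any training state (RNG, optimizer, scheduler, grad scaler) is not restored.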

sgugger commented 3 years ago

Are you sure you are running on a source install of Transformers? The command produces the exact same results on my end.

dorooddorood606 commented 3 years ago

Dear Sylvain, thanks for the response. Yes, I install transformers with pip install git+https://github.com/huggingface/transformers.git

but the results differ a lot. Please kindly run this command and break it after the first checkpoint (step 50):

TASK_NAME=mrpc
python run_glue.py   --model_name_or_path bert-base-cased   --task_name $TASK_NAME   --do_train   --do_eval   --max_seq_length 128   --per_device_train_batch_size 32   --learning_rate 2e-5   --num_train_epochs 3   --output_dir /tmp/  --eval_steps 50 --evaluation_strategy steps --load_best_model_at_end --fp16 --do_test 

sgugger commented 3 years ago

This might be due to the FP16 parameter. Could you check whether you get the same result without FP16? The reason is that we don't save the state of the gradient scaler in mixed-precision training, which is another piece of state that needs to be restored. I can make a PR to fix that tomorrow.

dorooddorood606 commented 3 years ago

Dear Sylvain

Thank you for taking your precious time to answer this issue. You are absolutely right: I checked without fp16 and I confirm it works fine. It would be wonderful to have fp16 mode working as well when you have time.

Thank you for your hard work and great job you do :)

sgugger commented 3 years ago

Problem was fixed on my side with the PR above. Let me know if this is not the case for you.

dorooddorood606 commented 3 years ago

Dear @sgugger

Thank you for the PR. I checked it with the latest version of transformers, and the issue still exists. Please kindly run this command and break it after the first 50 steps:

TASK_NAME=mrpc
python run_glue.py   --model_name_or_path bert-base-cased   --task_name $TASK_NAME   --do_train   --do_eval   --max_seq_length 128   --per_device_train_batch_size 32   --learning_rate 2e-5   --num_train_epochs 3   --output_dir /tmp/$TASK_NAME/  --eval_steps 50 --evaluation_strategy steps --load_best_model_at_end --fp16 --do_test  

Here are the results if you do not break:

After 50 steps:

{'eval_loss': 0.6383711695671082, 'eval_accuracy': 0.6764705882352942, 'eval_f1': 0.8070175438596491, 'eval_combined_score': 0.7417440660474717, 'eval_runtime': 2.1914, 'eval_samples_per_second': 93.091, 'eval_average_metrics': 0.7417440660474717, 'epoch': 0.43}

After 100 steps:
{'eval_loss': 0.6184656023979187, 'eval_accuracy': 0.6862745098039216, 'eval_f1': 0.813953488372093, 'eval_combined_score': 0.7501139990880072, 'eval_runtime': 2.1089, 'eval_samples_per_second': 96.731, 'eval_average_metrics': 0.7501139990880072, 'epoch': 0.87}

If you break after 50 steps:

After 100 steps
{'eval_loss': 0.6308265328407288, 'eval_accuracy': 0.6862745098039216, 'eval_f1': 0.813953488372093, 'eval_combined_score': 0.7501139990880072, 'eval_runtime': 2.1549, 'eval_samples_per_second': 94.668, 'eval_average_metrics': 0.7501139990880072, 'epoch': 0.87}                                                    

The differences accumulate, and by the end the results vary so much that the resumed run is not usable. I would really appreciate it if you could have another look. Could you also kindly reopen this issue?

thanks.

sgugger commented 3 years ago

I sadly cannot reproduce this (I get the exact same results with the command you indicated, using a source install on current master), so at this stage it must come from something in your particular setup.