Low accuracy after loading a custom pretrained model in a binary text classification problem

Environment info

transformers version: 3.4.0
Platform: Linux-4.15.0-122-generic-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.6.9
PyTorch version (GPU?): 1.5.1+cpu (False)
Tensorflow version (GPU?): 2.3.0 (False)
Using GPU in script?: No
Using distributed or parallel set-up in script?: Distributed (not really sure)

Who can help

@LysandreJik

Information

Posted on Stack Overflow, where I received a comment pointing to two similar issues about saving and loading custom models. The original question can be found at: https://stackoverflow.com/questions/64666510/huggingface-transformers-low-accuracy-after-load-custom-pretrained-model-in-a-t?noredirect=1#comment114344159_64666510

In a nutshell, I am using BertForSequenceClassification (PyTorch) with dccuchile/bert-base-spanish-wwm-cased to solve a binary classification problem. I trained the network and evaluated it on a testing dataset (different from the training dataset), reaching an acc and val_acc between 0.85 and 0.9. However, after I save the model and load it again in another script, the accuracy is similar to that of a random classifier (0.41).

The problem arises when using:

[ ] the official example scripts: (give details below)
[X] my own modified scripts: (give details below)

The task I am working on is:

[ ] an official GLUE/SQuAD task: (give the name)
[X] my own task or dataset: (give details below)

To reproduce

This is the code I am using for training and evaluating (during training):

# model, tokenizer, device, batch_size, train_loader and test_loader are defined earlier in the script
criterion = torch.nn.CrossEntropyLoss()
criterion = criterion.to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)

for epoch in range(4):

    # Reset the counters so the training accuracy does not carry over the previous eval count
    i = 0
    correct_predictions = 0

    # Train this epoch
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        # With labels passed in, the model returns (loss, logits)
        loss, logits = model(input_ids, attention_mask=attention_mask, labels=labels)
        _, preds = torch.max(logits, dim=1)
        correct_predictions += torch.sum(preds == labels)
        i += 1
        acc = correct_predictions.item() / (batch_size * i)

        loss.backward()
        optimizer.step()

    # Eval this epoch with the testing dataset
    model = model.eval()
    correct_predictions = 0
    with torch.no_grad():
        for batch in test_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)

            loss, logits = model(input_ids, attention_mask=attention_mask, labels=labels)
            _, preds = torch.max(logits, dim=1)
            correct_predictions += torch.sum(preds == labels)

model.bert.save_pretrained("my-model")
tokenizer.save_pretrained("my-model")

With this code, I get good accuracy after the first epoch.
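As a side note on how the accuracy is tracked: the running training accuracy above divides by batch_size * i, which assumes every batch is full. The following is a minimal sketch in plain Python, using made-up predictions and labels (the function name and data are hypothetical), that divides by the number of examples actually seen instead:

```python
def running_accuracy(batches):
    """Yield the accuracy after each batch, counting the examples actually seen."""
    correct = 0
    seen = 0
    for preds, labels in batches:
        correct += sum(int(p == y) for p, y in zip(preds, labels))
        seen += len(labels)
        yield correct / seen


# Hypothetical data: two full batches of 4 and a final partial batch of 2.
batches = [
    ([1, 0, 1, 1], [1, 0, 0, 1]),  # 3 of 4 correct
    ([0, 0, 1, 0], [0, 1, 1, 0]),  # 3 of 4 correct
    ([1, 0], [1, 0]),              # 2 of 2 correct
]

accuracies = list(running_accuracy(batches))
final_accuracy = accuracies[-1]

# Dividing by batch_size * i instead treats the last batch as if it were full.
naive_accuracy = 8 / (4 * len(batches))

print(final_accuracy, naive_accuracy)
```

With a smaller last batch, the batch_size * i denominator slightly understates the accuracy. That effect is small and would not by itself explain a drop to 0.41.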

Then I load the model again in another script:

model = BertForSequenceClassification.from_pretrained("my-model")

# Evaluate the reloaded model with the testing dataset
model = model.eval()
correct_predictions = 0
with torch.no_grad():
    for batch in test_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        loss, logits = model(input_ids, attention_mask=attention_mask, labels=labels)
        _, preds = torch.max(logits, dim=1)
        correct_predictions += torch.sum(preds == labels)

print(correct_predictions.item() / len(test_df))

However, the accuracy is about the same as if I had loaded an untrained model.
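One thing that may be worth checking here, as an assumption rather than a confirmed diagnosis: model.bert is only the encoder, so model.bert.save_pretrained may not write the classification head to disk. The sketch below compares the two save calls on a tiny, randomly initialized model so that nothing has to be downloaded. The small BertConfig values and the temporary directories are made up for illustration and are not the real model.

```python
import tempfile

import torch
from transformers import BertConfig, BertForSequenceClassification

# Tiny illustrative config (assumption): stands in for the real Spanish BERT.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=32,
    num_labels=2,
)
model = BertForSequenceClassification(config)

with tempfile.TemporaryDirectory() as enc_dir, tempfile.TemporaryDirectory() as full_dir:
    model.bert.save_pretrained(enc_dir)  # encoder only, as in the script above
    model.save_pretrained(full_dir)      # encoder plus classification head

    from_encoder = BertForSequenceClassification.from_pretrained(enc_dir)
    from_full = BertForSequenceClassification.from_pretrained(full_dir)

    # Does the reloaded classification head match the one that was saved?
    head_kept_encoder_only = torch.equal(
        model.classifier.weight, from_encoder.classifier.weight
    )
    head_kept_full = torch.equal(
        model.classifier.weight, from_full.classifier.weight
    )
    # The encoder weights themselves should survive both round trips.
    encoder_kept_encoder_only = torch.equal(
        model.bert.embeddings.word_embeddings.weight,
        from_encoder.bert.embeddings.word_embeddings.weight,
    )

print("head kept (model.bert.save_pretrained):", head_kept_encoder_only)
print("head kept (model.save_pretrained):", head_kept_full)
print("encoder kept (model.bert.save_pretrained):", encoder_kept_encoder_only)
```

If the first flag is False while the other two are True, the reloaded model has a newly initialized classifier on top of the trained encoder, which would be consistent with near-random accuracy. In that case from_pretrained typically also logs a warning listing weights such as classifier.weight as newly initialized.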

Expected behavior

After loading a model saved with save_pretrained, it should give similar accuracy and loss on the same data.
