huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Bug: SqueezeBERT stops with no error #9238

Closed HRezaeiM closed 3 years ago

HRezaeiM commented 3 years ago

Environment info

The GPUs available were:

GeForce GTX 980 (4 GB)
GeForce GTX Titan (12 GB)

transformers==4.1.1
torch==1.7.0
torchvision==0.8.1

Who can help

Information

Model I am using (Bert, XLNet ...): SqueezeBERT

The problem arises when using:

  1. the Yelp reviews dataset (its preparation is not shown in the report; a sketch follows this list),
  2. SqueezeBERT in place of DistilBERT,
  3. a 5-label sentiment classification setup.
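Since the report does not include the dataset preparation, here is a minimal sketch of how train_dataset and val_dataset might be built. The yelp_review_full dataset from 🤗 Datasets, the SqueezeBERT tokenizer checkpoint, and max_length=128 are all assumptions, not taken from the original report:

from datasets import load_dataset
from transformers import SqueezeBertTokenizer

# Assumption: the Yelp data is the `yelp_review_full` dataset (labels 0-4).
tokenizer = SqueezeBertTokenizer.from_pretrained('squeezebert/squeezebert-mnli-headless')
dataset = load_dataset('yelp_review_full')

def tokenize(batch):
    # max_length=128 is an assumed value, not from the report
    return tokenizer(batch['text'], padding='max_length', truncation=True, max_length=128)

train_dataset = dataset['train'].map(tokenize, batched=True)
val_dataset = dataset['test'].map(tokenize, batched=True)
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
val_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

The training script itself: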
from torch import nn
from transformers import (
    SqueezeBertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

training_args = TrainingArguments(
    output_dir='./SqueezeBERT_10ep_result',          # output directory
    per_device_train_batch_size=3,  # batch size per device during training
    per_device_eval_batch_size=3,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./SqueezeBERT_10ep_log',            # directory for storing logs
    logging_steps=500,
    num_train_epochs=10,              # total number of training epochs
    evaluation_strategy="epoch",
    do_train=True,
    do_eval=True,
)

# Load the headless MNLI checkpoint and re-head it for 5 classes.
model = SqueezeBertForSequenceClassification.from_pretrained(
    'squeezebert/squeezebert-mnli-headless', return_dict=True
)
model.num_labels = 5
model.classifier = nn.Linear(768, 5)  # fresh, randomly initialized 5-way head
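Note that this manual swap leaves model.config.num_labels at the checkpoint's original value, which can be confusing later. An alternative sketch that keeps the config consistent; num_labels is a standard from_pretrained kwarg, though its behavior with this particular headless checkpoint is an assumption:

model = SqueezeBertForSequenceClassification.from_pretrained(
    'squeezebert/squeezebert-mnli-headless',
    num_labels=5,  # builds a 5-way head and sets config.num_labels to match
    return_dict=True,
)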

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)

    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
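As a quick sanity check, compute_metrics can be exercised in isolation with dummy inputs (hypothetical values, only to confirm the shapes it expects):

import numpy as np
from transformers import EvalPrediction

# Hypothetical batch: 4 examples over 5 classes.
dummy = EvalPrediction(predictions=np.random.rand(4, 5),
                       label_ids=np.array([0, 1, 2, 4]))
print(compute_metrics(dummy))  # {'accuracy': ..., 'f1': ..., 'precision': ..., 'recall': ...}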

print("Displaying model architecture... !\n")
print(model)
print("Training model starting...!\n")

trainer = Trainer(
    model=model,             # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,             # evaluation dataset
    compute_metrics=compute_metrics,
)

trainer.train()

The task I am working on is:

To reproduce

Steps to reproduce the behavior:

  1. Run the script as described above.
  2. Around epoch 3 it suddenly stops using the GPU; no error appears, and nothing progresses afterwards (a diagnostic sketch follows this list).
  3. The last checkpoint saved is checkpoint-310000.
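Because the process hangs silently instead of crashing, Python's standard faulthandler module can reveal where it is stuck. A minimal diagnostic sketch to place near the top of the training script (generic Python, not Transformers-specific):

import sys
import faulthandler

# Periodically dump every thread's traceback to stderr, so a silent
# hang leaves a trail showing where the process stopped.
faulthandler.dump_traceback_later(600, repeat=True, file=sys.stderr)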

Expected behavior

Training should have simply continued and finished.
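As a stopgap, training can be resumed from the last saved checkpoint. The keyword argument is version-dependent (model_path in the 4.1.x Trainer, later renamed to resume_from_checkpoint), so treat the exact name for your install as an assumption:

# Resume from the last checkpoint mentioned above (path assumed from
# output_dir plus the reported checkpoint number).
trainer.train(model_path='./SqueezeBERT_10ep_result/checkpoint-310000')
# On newer transformers releases:
# trainer.train(resume_from_checkpoint='./SqueezeBERT_10ep_result/checkpoint-310000')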

LysandreJik commented 3 years ago

Hi! Is there a way for you to reproduce this error in a Colab notebook?

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale and closed because it has not had recent activity. Thank you for your contributions.

If you think this still needs to be addressed please comment on this thread.