huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Bug: SqueezeBERT stops with no error #9238

Closed HRezaeiM closed 3 years ago

HRezaeiM commented 3 years ago

Environment info

The GPUs available were:

GeForce GTX 980 (4 GB)
GeForce GTX Titan (12 GB)

transformers==4.1.1
torch==1.7.0
torchvision==0.8.1

Who can help

Information

Model I am using (Bert, XLNet ...): SqueezeBERT

The problem arises when using:

  1. the Yelp reviews dataset (its preparation is not shown in the report; a sketch follows this list),
  2. SqueezeBERT in place of DistilBERT,
  3. a 5-label sentiment classification setup.
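Since the report does not include the dataset preparation, here is a minimal sketch of how train_dataset and val_dataset might be built. The yelp_review_full dataset from 🤗 Datasets, the SqueezeBERT tokenizer checkpoint, and max_length=128 are all assumptions, not taken from the original report:

from datasets import load_dataset
from transformers import SqueezeBertTokenizer

# Assumption: the Yelp data is the `yelp_review_full` dataset (labels 0-4).
tokenizer = SqueezeBertTokenizer.from_pretrained('squeezebert/squeezebert-mnli-headless')
dataset = load_dataset('yelp_review_full')

def tokenize(batch):
    # max_length=128 is an assumed value, not from the report
    return tokenizer(batch['text'], padding='max_length', truncation=True, max_length=128)

train_dataset = dataset['train'].map(tokenize, batched=True)
val_dataset = dataset['test'].map(tokenize, batched=True)
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
val_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

The training script itself: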
from torch import nn
from transformers import (
    SqueezeBertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

training_args = TrainingArguments(
    output_dir='./SqueezeBERT_10ep_result',          # output directory
    per_device_train_batch_size=3,  # batch size per device during training
    per_device_eval_batch_size=3,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./SqueezeBERT_10ep_log',            # directory for storing logs
    logging_steps=500,
    num_train_epochs=10,              # total number of training epochs
    evaluation_strategy="epoch",
    do_train=True,
    do_eval=True,
)

# Load the headless MNLI checkpoint and re-head it for 5 classes.
model = SqueezeBertForSequenceClassification.from_pretrained(
    'squeezebert/squeezebert-mnli-headless', return_dict=True
)
model.num_labels = 5
model.classifier = nn.Linear(768, 5)  # fresh, randomly initialized 5-way head
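Note that this manual swap leaves model.config.num_labels at the checkpoint's original value, which can be confusing later. An alternative sketch that keeps the config consistent; num_labels is a standard from_pretrained kwarg, though its behavior with this particular headless checkpoint is an assumption:

model = SqueezeBertForSequenceClassification.from_pretrained(
    'squeezebert/squeezebert-mnli-headless',
    num_labels=5,  # builds a 5-way head and sets config.num_labels to match
    return_dict=True,
)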

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)

    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
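As a quick sanity check, compute_metrics can be exercised in isolation with dummy inputs (hypothetical values, only to confirm the shapes it expects):

import numpy as np
from transformers import EvalPrediction

# Hypothetical batch: 4 examples over 5 classes.
dummy = EvalPrediction(predictions=np.random.rand(4, 5),
                       label_ids=np.array([0, 1, 2, 4]))
print(compute_metrics(dummy))  # {'accuracy': ..., 'f1': ..., 'precision': ..., 'recall': ...}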

print("Displaying model architecture... !\n")
print(model)
print("Training model starting...!\n")

trainer = Trainer(
    model=model,             # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,             # evaluation dataset
    compute_metrics=compute_metrics,
)

trainer.train()

The task I am working on is:

To reproduce

Steps to reproduce the behavior:

  1. Run the script as described above.
  2. Around epoch 3 it suddenly stops using the GPU; no error appears, and nothing progresses afterwards (a diagnostic sketch follows this list).
  3. The last checkpoint saved is checkpoint-310000.
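Because the process hangs silently instead of crashing, Python's standard faulthandler module can reveal where it is stuck. A minimal diagnostic sketch to place near the top of the training script (generic Python, not Transformers-specific):

import sys
import faulthandler

# Periodically dump every thread's traceback to stderr, so a silent
# hang leaves a trail showing where the process stopped.
faulthandler.dump_traceback_later(600, repeat=True, file=sys.stderr)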

Expected behavior

Training should have simply continued and finished.
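As a stopgap, training can be resumed from the last saved checkpoint. The keyword argument is version-dependent (model_path in the 4.1.x Trainer, later renamed to resume_from_checkpoint), so treat the exact name for your install as an assumption:

# Resume from the last checkpoint mentioned above (path assumed from
# output_dir plus the reported checkpoint number).
trainer.train(model_path='./SqueezeBERT_10ep_result/checkpoint-310000')
# On newer transformers releases:
# trainer.train(resume_from_checkpoint='./SqueezeBERT_10ep_result/checkpoint-310000')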

LysandreJik commented 3 years ago

Hi! Is there a way for you to reproduce this error in a Colab notebook?

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale and closed because it has not had recent activity. Thank you for your contributions.

If you think this still needs to be addressed please comment on this thread.