huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

MNLI evaluation on pretrained models #10386

Closed AliHadizadeh closed 3 years ago

AliHadizadeh commented 3 years ago

Environment info

Who can help

@patil-suraj, @sgugger, @LysandreJik

Information

Model I am using (Bert, XLNet ...): huggingface/distilbert-base-uncased-finetuned-mnli, microsoft/deberta-v2-xxlarge-mnli, roberta-large-mnli, squeezebert/squeezebert-mnli, BERT-Base-MNLI, ...

The problem arises when using:

* [x] the official example scripts: (give details below)
* [ ] my own modified scripts: (give details below)

The task I am working on is:

* [x] an official GLUE/SQUaD task: (give the name)
* [ ] my own task or dataset: (give details below)

I use run_glue.py on fine-tuned models to reproduce the evaluation results (only `--do_eval`), but the accuracy is about 7%. Other tasks like MRPC or STS-B are fine when I use their fine-tuned models.

To reproduce

Steps to reproduce the behavior:

1. Run `python run_glue.py --model_name_or_path huggingface/distilbert-base-uncased-finetuned-mnli --task_name mnli --do_eval --max_seq_length 128 --output_dir temp/distill` or any other MNLI fine-tuned model. I even tried a model that I fine-tuned myself using v2.10.0, and that again results in 6%-7% accuracy.
python run_glue.py --model_name_or_path huggingface/distilbert-base-uncased-finetuned-mnli --task_name mnli --do_eval --max_seq_length 128 --output_dir temp/distill
02/24/2021 11:38:34 - WARNING - main - Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: False
02/24/2021 11:38:34 - INFO - main - Training/evaluation parameters TrainingArguments(output_dir=temp/distill, overwrite_output_dir=False, do_train=False, do_eval=True, do_predict=False, evaluation_strategy=EvaluationStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_steps=0, logging_dir=runs\Feb24_11-38-34_Ali_Workstation, logging_first_step=False, logging_steps=500, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level=O1, fp16_backend=auto, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=temp/distill, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=False, deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, report_to=[], ddp_find_unused_parameters=None, dataloader_pin_memory=True, n_gpu=1)
02/24/2021 11:38:36 - WARNING - datasets.builder - Reusing dataset glue (C:\Users\Ali.cache\huggingface\datasets\glue\mnli\1.0.0\7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4)
[INFO|configuration_utils.py:449] 2021-02-24 11:38:36,777 >> loading configuration file https://huggingface.co/huggingface/distilbert-base-uncased-finetuned-mnli/resolve/main/config.json from cache at C:\Users\Ali/.cache\huggingface\transformers\240bd330b0e7919215436efe944c4073bfcc0bac4b7ed0a3378ab3d1793beb1a.acfb235b208288614b764ad50394132d4751a48a6c81fc382dc669e4d8a80a55
[INFO|configuration_utils.py:485] 2021-02-24 11:38:36,779 >> Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "dim": 768,
  "dropout": 0.1,
  "eos_token_ids": 0,
  "finetuning_task": "mnli",
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.3.2",
  "vocab_size": 30522
}
[INFO|configuration_utils.py:449] 2021-02-24 11:38:36,923 >> loading configuration file https://huggingface.co/huggingface/distilbert-base-uncased-finetuned-mnli/resolve/main/config.json from cache at C:\Users\Ali/.cache\huggingface\transformers\240bd330b0e7919215436efe944c4073bfcc0bac4b7ed0a3378ab3d1793beb1a.acfb235b208288614b764ad50394132d4751a48a6c81fc382dc669e4d8a80a55
[INFO|configuration_utils.py:485] 2021-02-24 11:38:36,924 >> Model config DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "dim": 768,
  "dropout": 0.1,
  "eos_token_ids": 0,
  "finetuning_task": "mnli",
  "hidden_dim": 3072,
  "id2label": {
    "0": "contradiction",
    "1": "neutral",
    "2": "entailment"
  },
  "initializer_range": 0.02,
  "label2id": {
    "contradiction": "0",
    "entailment": "2",
    "neutral": "1"
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.3.2",
  "vocab_size": 30522
}
[INFO|tokenization_utils_base.py:1688] 2021-02-24 11:38:36,928 >> Model name 'huggingface/distilbert-base-uncased-finetuned-mnli' not found in model shortcut name list (distilbert-base-uncased, distilbert-base-uncased-distilled-squad, distilbert-base-cased, distilbert-base-cased-distilled-squad, distilbert-base-german-cased, distilbert-base-multilingual-cased). Assuming 'huggingface/distilbert-base-uncased-finetuned-mnli' is a path, a model identifier, or url to a directory containing tokenizer files.
[INFO|tokenization_utils_base.py:1786] 2021-02-24 11:38:37,946 >> loading file https://huggingface.co/huggingface/distilbert-base-uncased-finetuned-mnli/resolve/main/vocab.txt from cache at C:\Users\Ali/.cache\huggingface\transformers\3aa49bfb368cde995cea246a5c5ca4d75f769e74b3e6d450776805f998c78366.d789d64ebfe299b0e416afc4a169632f903f693095b4629a7ea271d5a0cf2c99
[INFO|tokenization_utils_base.py:1786] 2021-02-24 11:38:37,947 >> loading file https://huggingface.co/huggingface/distilbert-base-uncased-finetuned-mnli/resolve/main/tokenizer.json from cache at None
[INFO|tokenization_utils_base.py:1786] 2021-02-24 11:38:37,950 >> loading file https://huggingface.co/huggingface/distilbert-base-uncased-finetuned-mnli/resolve/main/added_tokens.json from cache at C:\Users\Ali/.cache\huggingface\transformers\603dca04f5c89cbdcdb8021ec21c4376c7334fa6393347c80a54c942a93e50cb.5cc6e825eb228a7a5cfd27cb4d7151e97a79fb962b31aaf1813aa102e746584b
[INFO|tokenization_utils_base.py:1786] 2021-02-24 11:38:37,951 >> loading file https://huggingface.co/huggingface/distilbert-base-uncased-finetuned-mnli/resolve/main/special_tokens_map.json from cache at C:\Users\Ali/.cache\huggingface\transformers\dea17c39d149e23cb97e2a2829c6170489551d2454352fd18488f17bf90c54db.dd8bd9bfd3664b530ea4e645105f557769387b3da9f79bdb55ed556bdd80611d
[INFO|tokenization_utils_base.py:1786] 2021-02-24 11:38:37,952 >> loading file https://huggingface.co/huggingface/distilbert-base-uncased-finetuned-mnli/resolve/main/tokenizer_config.json from cache at C:\Users\Ali/.cache\huggingface\transformers\ce6fb0f339483f5ca331e9631b13bc5e9c842e64e9a40aa60defb3898b99dbed.11d9edb6b1301b5af13d33c1585ff45ff84dd55cc6915c2872f856d1ee2dc409
[INFO|modeling_utils.py:1027] 2021-02-24 11:38:38,148 >> loading weights file https://huggingface.co/huggingface/distilbert-base-uncased-finetuned-mnli/resolve/main/pytorch_model.bin from cache at C:\Users\Ali/.cache\huggingface\transformers\16516ebd442e5f41cd8caf2de88c478fe8a3a0948e20eaf1fdae0bf2d4998be6.73881288e7255a28dacc8ad53661dde9248c11f6e2d10f3b6db193dddee2a2bc
[INFO|modeling_utils.py:1143] 2021-02-24 11:38:39,218 >> All model checkpoint weights were used when initializing DistilBertForSequenceClassification.
[INFO|modeling_utils.py:1152] 2021-02-24 11:38:39,221 >> All the weights of DistilBertForSequenceClassification were initialized from the model checkpoint at huggingface/distilbert-base-uncased-finetuned-mnli.
If your task is similar to the task the model of the checkpoint was trained on, you can already use DistilBertForSequenceClassification for predictions without further training.
02/24/2021 11:38:39 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at C:\Users\Ali.cache\huggingface\datasets\glue\mnli\1.0.0\7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4\cache-0a88ac8e6b3bd378.arrow
02/24/2021 11:38:39 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at C:\Users\Ali.cache\huggingface\datasets\glue\mnli\1.0.0\7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4\cache-e1993e6695981db0.arrow
02/24/2021 11:38:39 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at C:\Users\Ali.cache\huggingface\datasets\glue\mnli\1.0.0\7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4\cache-133d62ae090971a5.arrow
02/24/2021 11:38:39 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at C:\Users\Ali.cache\huggingface\datasets\glue\mnli\1.0.0\7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4\cache-497afbfcce3a8a9d.arrow
02/24/2021 11:38:39 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at C:\Users\Ali.cache\huggingface\datasets\glue\mnli\1.0.0\7c99657241149a24692c402a5c3f34d4c9f1df5ac2e4c3759fadea38f6cb29c4\cache-7146b31017748988.arrow
02/24/2021 11:38:39 - INFO - main - Sample 335243 of the training set: {‘attention_mask’: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ‘hypothesis’: “Parents are busy and it’s sometimes hard to get them out.”, ‘idx’: 335243, ‘input_ids’: [101, 2017, 2113, 2043, 2037, 3008, 2272, 1998, 2009, 1005, 1055, 2524, 2000, 2131, 2068, 2041, 1998, 1037, 2843, 1997, 3008, 2031, 3182, 2000, 2175, 1998, 1998, 2477, 2066, 2008, 1998, 2009, 1005, 1055, 2397, 2012, 2305, 2061, 102, 3008, 2024, 5697, 1998, 2009, 1005, 1055, 2823, 2524, 2000, 2131, 2068, 2041, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ‘label’: 0, ‘premise’: “you know when their parents come and it’s hard to get them out and a lot of parents have places to go and and things like that and it’s late at night so”}.
02/24/2021 11:38:39 - INFO - main - Sample 58369 of the training set: {‘attention_mask’: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ‘hypothesis’: 'Where and what is art? ', ‘idx’: 58369, ‘input_ids’: [101, 2073, 2003, 2396, 1029, 102, 2073, 1998, 2054, 2003, 2396, 1029, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ‘label’: 1, ‘premise’: ‘Where is art?’}.
02/24/2021 11:38:39 - INFO - main - Sample 13112 of the training set: {‘attention_mask’: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ‘hypothesis’: ‘The list says alcohol and injury are negatives facing staff.’, ‘idx’: 13112, ‘input_ids’: [101, 6544, 1998, 4544, 1010, 2004, 2092, 2004, 4766, 19388, 1010, 2024, 2006, 1996, 2862, 1012, 102, 1996, 2862, 2758, 6544, 1998, 4544, 2024, 4997, 2015, 5307, 3095, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ‘label’: 1, ‘premise’: ‘Alcohol and injury, as well as brief interventions, are on the list.’}.
[INFO|trainer.py:432] 2021-02-24 11:38:41,361 >> The following columns in the training set don’t have a corresponding argument in DistilBertForSequenceClassification.forward and have been ignored: premise, hypothesis, idx.
[INFO|trainer.py:432] 2021-02-24 11:38:41,362 >> The following columns in the evaluation set don’t have a corresponding argument in DistilBertForSequenceClassification.forward and have been ignored: premise, hypothesis, idx.
02/24/2021 11:38:41 - INFO - main - *** Evaluate ***
[INFO|trainer.py:432] 2021-02-24 11:38:41,366 >> The following columns in the evaluation set don’t have a corresponding argument in DistilBertForSequenceClassification.forward and have been ignored: premise, hypothesis, idx.
[INFO|trainer.py:1600] 2021-02-24 11:38:41,371 >> ***** Running Evaluation *****
[INFO|trainer.py:1601] 2021-02-24 11:38:41,371 >> Num examples = 9815
[INFO|trainer.py:1602] 2021-02-24 11:38:41,372 >> Batch size = 8
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1227/1227 [00:10<00:00, 122.19it/s]
02/24/2021 11:38:52 - INFO - main - ***** Eval results mnli *****
02/24/2021 11:38:52 - INFO - main - eval_accuracy = 0.07865511971472236
02/24/2021 11:38:52 - INFO - main - eval_loss = 4.536623954772949
02/24/2021 11:38:52 - INFO - main - eval_runtime = 10.733
02/24/2021 11:38:52 - INFO - main - eval_samples_per_second = 914.471
[INFO|trainer.py:432] 2021-02-24 11:38:52,120 >> The following columns in the evaluation set don’t have a corresponding argument in DistilBertForSequenceClassification.forward and have been ignored: premise, hypothesis, idx.
[INFO|trainer.py:1600] 2021-02-24 11:38:52,124 >> ***** Running Evaluation *****
[INFO|trainer.py:1601] 2021-02-24 11:38:52,124 >> Num examples = 9832
[INFO|trainer.py:1602] 2021-02-24 11:38:52,125 >> Batch size = 8
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1229/1229 [00:10<00:00, 121.59it/s]
02/24/2021 11:39:02 - INFO - main - ***** Eval results mnli-mm *****
02/24/2021 11:39:02 - INFO - main - eval_accuracy = 0.08482506102522376
02/24/2021 11:39:02 - INFO - main - eval_loss = 4.487601280212402
02/24/2021 11:39:02 - INFO - main - eval_runtime = 10.127
02/24/2021 11:39:02 - INFO - main - eval_samples_per_second = 970.87

Expected behavior

It seems all the weights are loaded in the correct place, yet the accuracy is below 10% when it should be above 80%.

[INFO|modeling_utils.py:1143] 2021-02-24 11:38:39,218 >> All model checkpoint weights were used when initializing DistilBertForSequenceClassification.
[INFO|modeling_utils.py:1152] 2021-02-24 11:38:39,221 >> All the weights of DistilBertForSequenceClassification were initialized from the model checkpoint at huggingface/distilbert-base-uncased-finetuned-mnli.
If your task is similar to the task the model of the checkpoint was trained on, you can already use DistilBertForSequenceClassification for predictions without further training.
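
A quick sanity check on a single pair makes the symptom easier to see; the sketch below assumes the same Hub checkpoint as above, with an invented premise/hypothesis pair:

```python
# Sketch of a single-example sanity check (checkpoint name from the report above;
# the premise/hypothesis pair is made up for illustration).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "huggingface/distilbert-base-uncased-finetuned-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

premise = "A man is playing a guitar on stage."
hypothesis = "A man is playing an instrument."

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

pred_id = logits.argmax(dim=-1).item()
print(pred_id, model.config.id2label[pred_id])
# If the predicted label name looks sensible here but run_glue.py still reports
# ~7% accuracy, the problem is the mapping between the dataset's label order and
# the checkpoint's label2id, not the weights themselves.
```
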
LysandreJik commented 3 years ago

Hello! This may be because of labels being switched around for the MNLI task. See this thread https://github.com/huggingface/transformers/pull/10203 for more context.
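
A quick way to see the mismatch is to compare the checkpoint's `label2id` with the label order that `run_glue.py` gets from the `datasets` library; a minimal sketch, using `roberta-large-mnli` as an example checkpoint:

```python
# Sketch: compare a checkpoint's label mapping with the GLUE MNLI label order
# used by run_glue.py. The checkpoint name here is just an example.
from datasets import load_dataset
from transformers import AutoConfig

config = AutoConfig.from_pretrained("roberta-large-mnli")
mnli = load_dataset("glue", "mnli", split="validation_matched")

dataset_labels = mnli.features["label"].names  # dataset order: index -> name
print("dataset order :", dataset_labels)
print("model label2id:", config.label2id)
# If the integer a label name gets in config.label2id differs from that name's
# index in dataset_labels, the gold labels and the model's predictions are
# permuted relative to each other, and accuracy collapses to near-random.
```
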

AliHadizadeh commented 3 years ago

Hello, many thanks for your response. Yes, that seems to be the source of my issue, and now I can reproduce the expected accuracy. Thanks!

sgugger commented 3 years ago

I think there is also a specific problem with huggingface/distilbert-base-uncased-finetuned-mnli: its labels seem to be wrongly encoded. Evaluating that model specifically gives me 34% accuracy.

AliHadizadeh commented 3 years ago

Yes, but other models seem to work with the modification I made in https://github.com/huggingface/transformers/pull/10203#discussion_r582971857.
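
For reference, the modification boils down to remapping the dataset's label indices onto the checkpoint's `label2id` before computing metrics. A rough sketch of that idea (the helper and variable names are illustrative, not the exact run_glue.py code):

```python
# Illustrative sketch of the remapping idea discussed in #10203: align the
# dataset's integer labels with the checkpoint's label2id by matching names.
def build_label_to_id(model_config, label_list):
    """Map dataset label index -> model label id, matching on (lowercased) names."""
    name_to_id = {name.lower(): int(idx) for name, idx in model_config.label2id.items()}
    if sorted(name_to_id) != sorted(name.lower() for name in label_list):
        return None  # names don't line up; keep the default order
    return {i: name_to_id[label_list[i].lower()] for i in range(len(label_list))}

# Usage sketch: remap gold labels before evaluation so they live in the model's
# label space (dataset/variable names below are assumptions, not run_glue.py's).
# label_to_id = build_label_to_id(model.config, eval_dataset.features["label"].names)
# if label_to_id is not None:
#     eval_dataset = eval_dataset.map(lambda ex: {"label": label_to_id[ex["label"]]})
```
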

jxmorris12 commented 2 years ago

@sgugger Can someone fix this, or remove the model from the model hub? This is a serious gotcha and cost me a couple of weeks of confusion!

sgugger commented 2 years ago

The model was fixed a year ago, in this commit.

jxmorris12 commented 2 years ago

Thank you for clarifying, @sgugger! I think we had an old copy.