huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

OpenAI GPT language modeling shape mismatch: 512 position embeddings, 1024 input embeddings #12048

Closed: avi-otterai closed this issue 3 years ago

avi-otterai commented 3 years ago

Environment info

Who can help

Information

Model I am using (Bert, XLNet ...): openai-gpt

The problem arises when using:

The tasks I am working on are:

To reproduce

Steps to reproduce the behaviour:

  1. new environment, editable installation from source
  2. CUDA_VISIBLE_DEVICES=, nice python transformers/examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-gpt --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train --do_eval --output_dir /tmp/test-clm --per_device_train_batch_size 2 --gradient_accumulation_steps 4
06/07/2021 05:58:13 - WARNING - __main__ -   Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
06/07/2021 05:58:13 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(
_n_gpu=0,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=500,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=4,
greater_is_better=None,
group_by_length=False,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_on_each_node=True,
logging_dir=runs/Jun07_05-58-13_fermi-debug,
logging_first_step=False,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
output_dir=/tmp/test-clm,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=2,
prediction_loss_only=False,
push_to_hub=False,
remove_unused_columns=True,
report_to=[],
resume_from_checkpoint=None,
run_name=/tmp/test-clm,
save_steps=500,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
06/07/2021 05:58:14 - WARNING - datasets.builder -   Reusing dataset wikitext (/home/avit/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/aa5e094000ec7afeb74c3be92c88313cd6f132d564c7effd961c10fd47c76f20)
[INFO|configuration_utils.py:517] 2021-06-07 05:58:14,482 >> loading configuration file https://huggingface.co/openai-gpt/resolve/main/config.json from cache at /home/avit/.cache/huggingface/transformers/bebb46f5735701bc248ef9faa26f12577944fa7fc8e9be1a774b94d4cb8b79b6.ba6f10a5446f364b92311c09e55e49aa27024a4aeefc1ea50fd733b77bcd997d
[INFO|configuration_utils.py:553] 2021-06-07 05:58:14,483 >> Model config OpenAIGPTConfig {
  "afn": "gelu",
  "architectures": [
    "OpenAIGPTLMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "embd_pdrop": 0.1,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "openai-gpt",
  "n_ctx": 512,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 12,
  "n_positions": 512,
  "n_special": 0,
  "predict_special_tokens": true,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.7.0.dev0",
  "vocab_size": 40478
}

[INFO|tokenization_utils_base.py:1717] 2021-06-07 05:58:16,461 >> loading file https://huggingface.co/openai-gpt/resolve/main/vocab.json from cache at /home/avit/.cache/huggingface/transformers/918c57540c636a2a662770d208fcf20aa8c3faea78201fc612e5c84f052f1119.ac55819e76b0f8b0c32cbb407436947d090d98f8952f38376ee249ed382927ab
[INFO|tokenization_utils_base.py:1717] 2021-06-07 05:58:16,461 >> loading file https://huggingface.co/openai-gpt/resolve/main/merges.txt from cache at /home/avit/.cache/huggingface/transformers/a682e219a788dde0e4f77bc5a470d85a4d7e493420506ce7e3266f7be122cf9e.2150b9689fda7ca7c6224ff32672c004259f974e96934e8eb69d8dd546d682db
[INFO|tokenization_utils_base.py:1717] 2021-06-07 05:58:16,461 >> loading file https://huggingface.co/openai-gpt/resolve/main/tokenizer.json from cache at /home/avit/.cache/huggingface/transformers/325373fcbb0daa99905371727842a87ae9ca0f02f71db071720bb4d5a59076cf.b1810f3c6ed9fc0632664008484a9b569103559c04ac90321723cd808a3a96f9
[INFO|tokenization_utils_base.py:1717] 2021-06-07 05:58:16,461 >> loading file https://huggingface.co/openai-gpt/resolve/main/added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1717] 2021-06-07 05:58:16,461 >> loading file https://huggingface.co/openai-gpt/resolve/main/special_tokens_map.json from cache at None
[INFO|tokenization_utils_base.py:1717] 2021-06-07 05:58:16,461 >> loading file https://huggingface.co/openai-gpt/resolve/main/tokenizer_config.json from cache at None
[INFO|modeling_utils.py:1155] 2021-06-07 05:58:16,805 >> loading weights file https://huggingface.co/openai-gpt/resolve/main/pytorch_model.bin from cache at /home/avit/.cache/huggingface/transformers/3e867ce638da986403594a5acbb39846ecb9c3b360a3b526348dd54b06938e55.93527980a112896731f93175b7c1cbc6b0fd771fad85fcc777ff5d49d249782e
[INFO|modeling_utils.py:1339] 2021-06-07 05:58:18,886 >> All model checkpoint weights were used when initializing OpenAIGPTLMHeadModel.

[WARNING|modeling_utils.py:1341] 2021-06-07 05:58:18,886 >> Some weights of OpenAIGPTLMHeadModel were not initialized from the model checkpoint at openai-gpt and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

[WARNING|tokenization_utils_base.py:3170] 2021-06-07 05:58:19,096 >> Token indices sequence length is longer than the specified maximum sequence length for this model (535 > 512). Running this sequence through the model will result in indexing errors
[WARNING|run_clm.py:347] 2021-06-07 05:58:19,097 >> ^^^^^^^^^^^^^^^^ Please ignore the warning above - this long input will be chunked into smaller bits before being passed to the model.
100%|██████████| 5/5 [00:00<00:00, 24.33ba/s]
100%|██████████| 37/37 [00:01<00:00, 22.54ba/s]
100%|██████████| 4/4 [00:00<00:00, 24.22ba/s]
100%|██████████| 5/5 [00:01<00:00,  3.54ba/s]
100%|██████████| 37/37 [00:12<00:00,  3.05ba/s]
100%|██████████| 4/4 [00:01<00:00,  3.31ba/s]
[INFO|trainer.py:1147] 2021-06-07 05:58:35,755 >> ***** Running training *****
[INFO|trainer.py:1148] 2021-06-07 05:58:35,755 >>   Num examples = 2282
[INFO|trainer.py:1149] 2021-06-07 05:58:35,755 >>   Num Epochs = 3
[INFO|trainer.py:1150] 2021-06-07 05:58:35,755 >>   Instantaneous batch size per device = 2
[INFO|trainer.py:1151] 2021-06-07 05:58:35,755 >>   Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:1152] 2021-06-07 05:58:35,755 >>   Gradient Accumulation steps = 4
[INFO|trainer.py:1153] 2021-06-07 05:58:35,756 >>   Total optimization steps = 855

Traceback (most recent call last):
  File "transformers/examples/pytorch/language-modeling/run_clm.py", line 488, in <module>
    main()
  File "transformers/examples/pytorch/language-modeling/run_clm.py", line 438, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/avit/trial2/transformers/src/transformers/trainer.py", line 1263, in train
    tr_loss += self.training_step(model, inputs)
  File "/home/avit/trial2/transformers/src/transformers/trainer.py", line 1741, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/avit/trial2/transformers/src/transformers/trainer.py", line 1773, in compute_loss
    outputs = model(**inputs)
  File "/home/avit/miniconda3/envs/try2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/avit/trial2/transformers/src/transformers/models/openai/modeling_openai.py", line 581, in forward
    transformer_outputs = self.transformer(
  File "/home/avit/miniconda3/envs/try2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/avit/trial2/transformers/src/transformers/models/openai/modeling_openai.py", line 501, in forward
    hidden_states = inputs_embeds + position_embeds + token_type_embeds
RuntimeError: The size of tensor a (1024) must match the size of tensor b (512) at non-singleton dimension 1

  0%|          | 0/855 [00:00<?, ?it/s]

Expected behaviour

There should not be a mismatch in tensor shapes. Apparently the maximum sequence lengths do not match: the position embeddings expect 512 positions, while the input embeddings cover 1024 tokens.
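
For completeness, the mismatch should be reproducible outside the training script with a minimal snippet (a sketch, assuming the stock openai-gpt checkpoint; any input longer than the 512 learned position embeddings triggers the same error):

import torch
from transformers import OpenAIGPTLMHeadModel

model = OpenAIGPTLMHeadModel.from_pretrained("openai-gpt")
# run_clm.py groups the dataset into blocks of 1024 tokens by default,
# but openai-gpt only has n_positions = 512 learned position embeddings.
input_ids = torch.randint(0, model.config.vocab_size, (1, 1024))
outputs = model(input_ids=input_ids, labels=input_ids)
# RuntimeError: The size of tensor a (1024) must match the size of tensor b (512) ...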

patrickvonplaten commented 3 years ago

Note that openai-gpt has a max_length of 512. See under n_positions in the config here: https://huggingface.co/openai-gpt/blob/main/config.json.

The run_clm.py script, however, sets the block size (max_length) to 1024 by default. To fix your bug, you should run:

python transformers/examples/pytorch/language-modeling/run_clm.py --model_name_or_path openai-gpt --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train --do_eval --output_dir /tmp/test-clm --per_device_train_batch_size 2 --gradient_accumulation_steps 4 --block_size 512
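
If in doubt, the limit can be checked programmatically before choosing a block size (a quick sketch, not part of the fix itself):

from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("openai-gpt")
tokenizer = AutoTokenizer.from_pretrained("openai-gpt")
print(config.n_positions)          # 512 -> pass --block_size 512 (or smaller)
print(tokenizer.model_max_length)  # also 512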
patrickvonplaten commented 3 years ago

Actually, it's weird that you get this error since:

from transformers import OpenAIGPTTokenizer
tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
tokenizer.model_max_length   # prints 512

=> so the block size should have been set correctly automatically

sgugger commented 3 years ago

There is a small bug with a line that is not properly indented; fixing it now.
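
For context, here is a hedged sketch of how a mis-indented cap can force the block size to 1024 even when the tokenizer reports 512; pick_block_size is a hypothetical helper that mirrors the selection logic in run_clm.py, not the literal source:

import logging

logger = logging.getLogger(__name__)

def pick_block_size(requested_block_size, tokenizer_model_max_length):
    # Illustrative only: the intended block-size selection.
    if requested_block_size is None:
        block_size = tokenizer_model_max_length  # 512 for openai-gpt
        if block_size > 1024:
            logger.warning("Tokenizer reports a very large model_max_length; capping block_size at 1024.")
            block_size = 1024  # correct placement: cap only when the limit really exceeds 1024
    else:
        block_size = min(requested_block_size, tokenizer_model_max_length)
    return block_size

# If the capping line is indented one level too shallow, it runs unconditionally,
# overrides the 512 derived from the tokenizer, and the model then sees
# 1024-token inputs against 512 position embeddings, i.e. the error reported above.
print(pick_block_size(None, 512))  # 512 with correct indentation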