GPT2 IndexError: index out of range in functional.py by running run_clm.py when adding any special tokens (even eos and bos only)

ghost commented 3 years ago

Hi all, I need your help as I'm stuck on an issue IndexError trying to finetune GPT2 using run_clm.py while adding special tokens. The error is trigger at this line of functional.py: return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

run_clm.py has been "barely" modified just adding the tokens with tokenizer.add_special_tokens See below details of the modification, the args used and the error log.

After weeks of preparing datasets, we hope to use your amazing scripts and library for an awesome AI project, I need your help please! 👍

Environment info

transformers version: 4.5.0
Platform: Darwin-20.2.0-x86_64-i386-64bit
Python version: 3.7.9
PyTorch version (GPU?): 1.8.1 (False)
Tensorflow version (GPU?): not installed (NA)
Using GPU in script?: NO
Using distributed or parallel set-up in script?: NO

Also tried on Windows OS with CUDA 11.1 same transformers version, same Python version, etc = same issue.

Who can help

@patrickvonplaten, @LysandreJik, @sgugger

Information

Model I am using (Bert, XLNet ...): GPT2 Medium

The problem arises when using:

[X] the official example scripts: (give details below)

The tasks I am working on is:

[X] my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

Run transformers/examples/language-modeling/run_clm.py with the following args (see below). You can probably have the exact same issue using any dataset. It doesn't look to be a dataset related issue as the training works without the special tokens added.
The file run_clm.py has been modified slightly just to include eos token, bos token and additional special tokens (see below). The issue persists as long as I add any of these special token. The only solution seems to be to have no special token at all with this GPT2 fine-tuning code which is unfortunate because I need those for my purpose. :)

ARGS

python transformers/examples/language-modeling/run_clm.py \
--output_dir "models/output/" \
--model_type "gpt2" \
--model_name_or_path "models/original/" \
--tokenizer_name "gpt2" \
--cache_dir "models/cache/" \
--no_use_fast_tokenizer \
--do_train True \
--train_file "models/datasets/dataset-training-05042021.txt" \
--do_eval True \
--validation_file "models/datasets/dataset-validation-05042021.txt" \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--save_steps 500 \
--num_train_epochs 5 \
--learning_rate 5e-5 \
--weight_decay 0 \
--adam_beta1 0.9 \
--adam_beta2 0.999 \
--adam_epsilon 1e-8 \
--max_grad_norm 1.0 \
--no_cuda True \
--seed 123456 \
--fp16 False \
--fp16_opt_level "O1" \
--fp16_backend "auto" \
--fp16_full_eval False \

CODE MODIFICATION

I added this code on line 308 of run_clm.py just before the model.resize_token_embeddings(len(tokenizer)):

    special_tokens_dict = {
        'bos_token': '<|startoftext|>',
        'eos_token': '<|endoftext|>',
        'additional_special_tokens': [
             "<A>",
             "<B>",
             "<C>",
             "<D>",
             "<E>",
             "<F>",
             "<G>",
             "<H>"
         ]
    }
    num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)

ISSUE LOGS

04/06/2021 17:48:36 - WARNING - __main__ -   Process rank: -1, device: cpu, n_gpu: 0distributed training: False, 16-bits training: False
04/06/2021 17:48:36 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=models/output/, overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=IntervalStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=1, per_device_eval_batch_size=1, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=5.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_ratio=0.0, warmup_steps=0, logging_dir=runs/Apr06_17-48-36_BLABLABLA-MacBook-Air.local, logging_strategy=IntervalStrategy.STEPS, logging_first_step=False, logging_steps=500, save_strategy=IntervalStrategy.STEPS, save_steps=500, save_total_limit=None, no_cuda=True, seed=261184, fp16=False, fp16_opt_level=O1, fp16_backend=auto, fp16_full_eval=False, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=models/output/, disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=[], deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, length_column_name=length, report_to=['tensorboard'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, _n_gpu=0, mp_parameters=)
04/06/2021 17:48:36 - WARNING - datasets.builder -   Using custom data configuration default-544362d6d13a5db7
04/06/2021 17:48:36 - WARNING - datasets.builder -   Reusing dataset text (/Users/blablabla/.cache/huggingface/datasets/text/default-544362d6d13a5db7/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5)
[INFO|configuration_utils.py:488] 2021-04-06 17:48:36,800 >> loading configuration file models/original/config.json
[INFO|configuration_utils.py:526] 2021-04-06 17:48:36,802 >> Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.5.0",
  "use_cache": true,
  "vocab_size": 50257
}

[INFO|configuration_utils.py:490] 2021-04-06 17:48:37,245 >> loading configuration file https://huggingface.co/gpt2/resolve/main/config.json from cache at models/cache/fc674cd6907b4c9e933cb42d67662436b89fa9540a1f40d7c919d0109289ad01.7d2e0efa5ca20cef4fb199382111e9d3ad96fd77b849e1d4bed13a66e1336f51
[INFO|configuration_utils.py:526] 2021-04-06 17:48:37,247 >> Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.5.0",
  "use_cache": true,
  "vocab_size": 50257
}

[INFO|tokenization_utils_base.py:1707] 2021-04-06 17:48:39,085 >> loading file https://huggingface.co/gpt2/resolve/main/vocab.json from cache at models/cache/684fe667923972fb57f6b4dcb61a3c92763ad89882f3da5da9866baf14f2d60f.c7ed1f96aac49e745788faa77ba0a26a392643a50bb388b9c04ff469e555241f
[INFO|tokenization_utils_base.py:1707] 2021-04-06 17:48:39,085 >> loading file https://huggingface.co/gpt2/resolve/main/merges.txt from cache at models/cache/c0c761a63004025aeadd530c4c27b860ec4ecbe8a00531233de21d865a402598.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
[INFO|tokenization_utils_base.py:1707] 2021-04-06 17:48:39,086 >> loading file https://huggingface.co/gpt2/resolve/main/added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1707] 2021-04-06 17:48:39,086 >> loading file https://huggingface.co/gpt2/resolve/main/special_tokens_map.json from cache at None
[INFO|tokenization_utils_base.py:1707] 2021-04-06 17:48:39,086 >> loading file https://huggingface.co/gpt2/resolve/main/tokenizer_config.json from cache at None
[INFO|tokenization_utils_base.py:1707] 2021-04-06 17:48:39,086 >> loading file https://huggingface.co/gpt2/resolve/main/tokenizer.json from cache at models/cache/16a2f78023c8dc511294f0c97b5e10fde3ef9889ad6d11ffaa2a00714e73926e.cf2d0ecb83b6df91b3dbb53f1d1e4c311578bfd3aa0e04934215a49bf9898df0
[INFO|modeling_utils.py:1050] 2021-04-06 17:48:39,223 >> loading weights file models/original/pytorch_model.bin
[INFO|modeling_utils.py:1168] 2021-04-06 17:48:45,948 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.

[INFO|modeling_utils.py:1177] 2021-04-06 17:48:45,949 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at models/original/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
[INFO|tokenization_utils_base.py:873] 2021-04-06 17:48:45,949 >> Assigning <|startoftext|> to the bos_token key of the tokenizer
[INFO|tokenization_utils.py:207] 2021-04-06 17:48:45,950 >> Adding <|startoftext|> to the vocabulary
[INFO|tokenization_utils_base.py:873] 2021-04-06 17:48:45,950 >> Assigning <|endoftext|> to the eos_token key of the tokenizer
[INFO|tokenization_utils_base.py:873] 2021-04-06 17:48:45,950 >> Assigning ['<A>', '<B>', '<C>', '<D>', '<E>', '<F>', '<G>', '<H>'] to the additional_special_tokens key of the tokenizer
[INFO|tokenization_utils.py:207] 2021-04-06 17:48:45,950 >> Adding <A> to the vocabulary
[INFO|tokenization_utils.py:207] 2021-04-06 17:48:45,950 >> Adding <B> to the vocabulary
[INFO|tokenization_utils.py:207] 2021-04-06 17:48:45,950 >> Adding <C> to the vocabulary
[INFO|tokenization_utils.py:207] 2021-04-06 17:48:45,950 >> Adding <D> to the vocabulary
[INFO|tokenization_utils.py:207] 2021-04-06 17:48:45,950 >> Adding <E> to the vocabulary
[INFO|tokenization_utils.py:207] 2021-04-06 17:48:45,950 >> Adding <F> to the vocabulary
[INFO|tokenization_utils.py:207] 2021-04-06 17:48:45,950 >> Adding <G> to the vocabulary
[INFO|tokenization_utils.py:207] 2021-04-06 17:48:45,950 >> Adding <H> to the vocabulary
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 199/199 [01:15<00:00,  2.62ba/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:03<00:00,  2.69ba/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 199/199 [01:02<00:00,  3.17ba/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:02<00:00,  3.39ba/s]
[INFO|trainer.py:921] 2021-04-06 17:51:21,859 >> Loading model from models/original/).
[INFO|configuration_utils.py:488] 2021-04-06 17:51:21,924 >> loading configuration file models/original/config.json
[INFO|configuration_utils.py:526] 2021-04-06 17:51:21,931 >> Model config GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.5.0",
  "use_cache": true,
  "vocab_size": 50257
}

[INFO|modeling_utils.py:1050] 2021-04-06 17:51:21,950 >> loading weights file models/original/pytorch_model.bin
[INFO|modeling_utils.py:1168] 2021-04-06 17:51:31,409 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.

[INFO|modeling_utils.py:1177] 2021-04-06 17:51:31,409 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at models/original/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
[INFO|trainer.py:1013] 2021-04-06 17:51:31,478 >> ***** Running training *****
[INFO|trainer.py:1014] 2021-04-06 17:51:31,483 >>   Num examples = 8199
[INFO|trainer.py:1015] 2021-04-06 17:51:31,489 >>   Num Epochs = 5
[INFO|trainer.py:1016] 2021-04-06 17:51:31,489 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:1017] 2021-04-06 17:51:31,489 >>   Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:1018] 2021-04-06 17:51:31,489 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1019] 2021-04-06 17:51:31,489 >>   Total optimization steps = 40995
  0%|                                                                                                                                                                              | 0/40995 [00:00<?, ?it/s]Traceback (most recent call last):
  File "transformers/examples/language-modeling/run_clm.py", line 459, in <module>
    main()
  File "transformers/examples/language-modeling/run_clm.py", line 424, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/transformers/trainer.py", line 1120, in train
    tr_loss += self.training_step(model, inputs)
  File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/transformers/trainer.py", line 1524, in training_step
    loss = self.compute_loss(model, inputs)
  File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/transformers/trainer.py", line 1556, in compute_loss
    outputs = model(**inputs)
  File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 917, in forward
    return_dict=return_dict,
  File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 694, in forward
    inputs_embeds = self.wte(input_ids)
  File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/Users/blablabla/Developer/Training/env/lib/python3.7/site-packages/torch/nn/functional.py", line 1921, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
  0%|                                                                                                                                                                              | 0/40995 [00:00<?, ?it/s]

LysandreJik commented 3 years ago

Hi @MoonshotQuest, this is because you haven't resized your token embedding matrix after adding new tokens to your vocabulary. The tokenizer can therefore generate new tokens, but the model doesn't know how to handle them. The add_tokens method has a note for that

It was unfortunately forgotten for the add_special_tokens method and only put in the example, so I'm updating it.

ghost commented 3 years ago

Hi @LysandreJik - Thanks so much for looking into it! I did check your note and I think I'm already resizing the token embedding matrix as I add my code on line 308. Line 309 (unchanged) is already: model.resize_token_embeddings(len(tokenizer)) https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_clm.py

That gives the following:

    special_tokens_dict = {
        'bos_token': '<|startoftext|>',
        'eos_token': '<|endoftext|>',
        'additional_special_tokens': [
             "<A>",
             "<B>",
             "<C>",
             "<D>",
             "<E>",
             "<F>",
             "<G>",
             "<H>"
         ]
    }
    num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
    model.resize_token_embeddings(len(tokenizer))

Is there another matrix to resize?

sgugger commented 3 years ago

I am unable to reproduce your issue. In my side adding the code you mentioned to the script runs perfectly with

python examples/language-modeling/run_clm.py \
  --model_type gpt2 \
  --tokenizer_name gpt2 \
  --dataset_name wikitext \
  --dataset_config_name wikitext-2-raw-v1 \
  --do_train \
  --do_eval \
  --output_dir ~/tmp/tst-clm \
  --block_size 128 \
  --max_train_samples 100 \
  --overwrite_output_dir \
  --no_use_fast_tokenizer

I am using wikitext2 since I don't have access to your dataset. Note that your command contains statements that are a bit contradictory:

you are mixing a model_name_or_path (fine-tuning an existing model) with model_type (training from scratch).
you are mixiing no_cuda (so training on CPU) with FP16 options, which are not supported on CPU.

ghost commented 3 years ago

Hi @sgugger thanks for trying to reproduce!

I removed the GPU related args and isolated the issue to the use of my own folder & pre-trained model: --model_type gpt2 # Works perfectly --model_name_or_path "models/original/" # Doesn't work and throw the IndexError

I believe the issue is the model files I'm using in my folder models/original/ as the pre-trained GPT2 Medium. They seem to be different than the ones downloaded and cached when using the --model_type gpt2 argument. I only have 2 files in the folder the .bin and the .config. I would like to keep these files in a folder offline as a precaution. I pulled the files from these URL. Is there a different bin file used for fine-tuning vs a file for inferences only? https://huggingface.co/gpt2-medium/resolve/main/config.json https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-pytorch_model.bin

I see in some other part of transformers code the following, which may suggest that different pre-trained models .bin and .config files are used for different purpose. Maybe I'm completely wrong! 😂 Thanks for your guidance on this.

    _import_structure["models.gpt2"].extend(
        [
            "GPT2_PRETRAINED_MODEL_ARCHIVE_LIST",
            "GPT2DoubleHeadsModel",
            "GPT2ForSequenceClassification",
            "GPT2LMHeadModel",
            "GPT2Model",
            "GPT2PreTrainedModel",
            "load_tf_weights_in_gpt2",
        ]
    )

sgugger commented 3 years ago

Ah, this is because your checkpoint should have the resized weights: it's resized inside the script but since it's a local folder, it's also passed as a checkpoint to the Trainer later in the script, which then reloads the model from that folder without the model.resize_token_embeddings(len(tokenizer)) this time. So you have two solutions:

either load your model, apply model.resize_token_embeddings(len(tokenizer)) then resave it.
or remove the line that interprets the folder as a checkpoint here

ghost commented 3 years ago

Great thank you so much! @sgugger @LysandreJik That makes sense now, I removed the line and it works perfectly. 👍

I will let you know when we get closer to a launch date for our AI based game. It's going to be awesome! Sorry to troll this thread but does Huggingface has a place to showcase apps made using your incredible libraries? 😊

sgugger commented 3 years ago

We have a community page in the documentation, otherwise we'll be happy to help you share on social media!

ghost commented 3 years ago

Awesome!! Take care 🤗

huggingface / transformers