Hi @MoonshotQuest, this is because you haven't resized your token embedding matrix after adding new tokens to your vocabulary. The tokenizer can therefore generate new token ids, but the model doesn't know how to handle them. The add_tokens method has a note about this. It was unfortunately forgotten for the add_special_tokens method and only put in the example, so I'm updating it.
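For reference, a minimal sketch of the pattern that note describes (a standalone example, not part of run_clm.py; the placeholder tokens are hypothetical):
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Extend the vocabulary with (hypothetical) new tokens.
num_added = tokenizer.add_tokens(["<new_token_1>", "<new_token_2>"])

# Without this call, inputs containing the new ids index past the end of the
# token embedding matrix and raise an IndexError.
model.resize_token_embeddings(len(tokenizer))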
Hi @LysandreJik, thanks so much for looking into it! I did check your note, and I think I'm already resizing the token embedding matrix: I add my code on line 308, and line 309 (unchanged) is already model.resize_token_embeddings(len(tokenizer)). https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_clm.py
That gives the following:
special_tokens_dict = {
    'bos_token': '<|startoftext|>',
    'eos_token': '<|endoftext|>',
    'additional_special_tokens': [
        "<A>",
        "<B>",
        "<C>",
        "<D>",
        "<E>",
        "<F>",
        "<G>",
        "<H>"
    ]
}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))
Is there another matrix to resize?
I am unable to reproduce your issue. On my side, adding the code you mentioned to the script runs perfectly with
python examples/language-modeling/run_clm.py \
--model_type gpt2 \
--tokenizer_name gpt2 \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--do_train \
--do_eval \
--output_dir ~/tmp/tst-clm \
--block_size 128 \
--max_train_samples 100 \
--overwrite_output_dir \
--no_use_fast_tokenizer
I am using wikitext-2 since I don't have access to your dataset. Note that your command contains options that are a bit contradictory:
- --model_name_or_path (fine-tuning an existing model) with --model_type (training from scratch)
- --no_cuda (so training on CPU) with FP16 options, which are not supported on CPU
Hi @sgugger, thanks for trying to reproduce!
I removed the GPU-related args and isolated the issue to the use of my own folder and pre-trained model:
--model_type gpt2
# Works perfectly
--model_name_or_path "models/original/"
# Doesn't work and throws the IndexError
I believe the issue is with the model files I'm using in my folder models/original/ as the pre-trained GPT-2 Medium. They seem to be different from the ones downloaded and cached when using the --model_type gpt2 argument. I only have two files in the folder: the .bin and the .config. I would like to keep these files in an offline folder as a precaution.
I pulled the files from these URLs. Is there a different .bin file used for fine-tuning vs. one for inference only?
https://huggingface.co/gpt2-medium/resolve/main/config.json
https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-pytorch_model.bin
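For what it's worth, a minimal sketch (assuming the models/original/ path above) of building the offline folder with save_pretrained so it contains all model and tokenizer files:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Download once from the Hub, then write config, weights, and tokenizer files
# into the local folder used with --model_name_or_path.
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model.save_pretrained("models/original/")
tokenizer.save_pretrained("models/original/")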
I see the following in another part of the transformers code, which may suggest that different pre-trained .bin and .config files are used for different purposes. Maybe I'm completely wrong! Thanks for your guidance on this.
_import_structure["models.gpt2"].extend(
    [
        "GPT2_PRETRAINED_MODEL_ARCHIVE_LIST",
        "GPT2DoubleHeadsModel",
        "GPT2ForSequenceClassification",
        "GPT2LMHeadModel",
        "GPT2Model",
        "GPT2PreTrainedModel",
        "load_tf_weights_in_gpt2",
    ]
)
Ah, this is because your checkpoint needs to contain the resized weights: the model is resized inside the script, but since your model path is a local folder, it is also passed as a checkpoint to the Trainer later in the script, which then reloads the model from that folder, this time without the model.resize_token_embeddings(len(tokenizer)) call. So you have two solutions:
1. Remove the part of the script that passes your local folder as a checkpoint to the Trainer, so the already-resized in-memory model is used; or
2. Load the model from your folder, call model.resize_token_embeddings(len(tokenizer)), then resave it (see the sketch below).
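A minimal sketch of the second option (the folder path, the tokenizer name, and reusing the special_tokens_dict from earlier are assumptions based on this thread):
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Same special tokens as earlier in the thread.
special_tokens_dict = {
    'bos_token': '<|startoftext|>',
    'eos_token': '<|endoftext|>',
    'additional_special_tokens': ["<A>", "<B>", "<C>", "<D>", "<E>", "<F>", "<G>", "<H>"],
}

# Rebuild the extended tokenizer, then resize and resave the local checkpoint so
# the Trainer reloads a model whose embedding matrix already matches it.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
tokenizer.add_special_tokens(special_tokens_dict)

model = GPT2LMHeadModel.from_pretrained("models/original/")
model.resize_token_embeddings(len(tokenizer))
model.save_pretrained("models/original/")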
Great, thank you so much @sgugger @LysandreJik! That makes sense now. I removed the line and it works perfectly.
I will let you know when we get closer to a launch date for our AI-based game. It's going to be awesome! Sorry to hijack this thread, but does Hugging Face have a place to showcase apps made using your incredible libraries?
We have a community page in the documentation, otherwise we'll be happy to help you share on social media!
Awesome!! Take care!
Hi all, I need your help: I'm stuck on an IndexError while trying to fine-tune GPT-2 using run_clm.py with added special tokens. The error is triggered at this line of functional.py:
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
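(A toy sketch of how that call fails when a token id falls outside the embedding matrix; the shapes below are made up:)
import torch
import torch.nn.functional as F

weight = torch.randn(10, 4)      # pretend the vocabulary size is 10
ids = torch.tensor([3, 9, 12])   # 12 is outside the embedding table
F.embedding(ids, weight)         # raises IndexError: index out of range in self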
run_clm.py has been "barely" modified, just adding the tokens with tokenizer.add_special_tokens. See below for details of the modification, the args used, and the error log.
After weeks of preparing datasets, we hope to use your amazing scripts and library for an awesome AI project. I need your help, please!
Environment info
transformers version: 4.5.0
Also tried on Windows with CUDA 11.1 (same transformers version, same Python version, etc.): same issue.
Who can help
@patrickvonplaten, @LysandreJik, @sgugger
Information
Model I am using (Bert, XLNet ...): GPT2 Medium
The problem arises when using:
The tasks I am working on is:
To reproduce
Steps to reproduce the behavior:
ARGS
CODE MODIFICATION
I added this code on line 308 of run_clm.py just before the model.resize_token_embeddings(len(tokenizer)):
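The added block is the same special-token snippet quoted earlier in this thread:
special_tokens_dict = {
    'bos_token': '<|startoftext|>',
    'eos_token': '<|endoftext|>',
    'additional_special_tokens': [
        "<A>", "<B>", "<C>", "<D>", "<E>", "<F>", "<G>", "<H>"
    ]
}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)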
ISSUE LOGS