Yes! Quick fix, use the slow tokenizer. Otherwise I'll open a PR to add template processing! Thanks for reporting!
But it shouldn't add an eos token, right? The LM is not trained to generate a token after the eos, I believe.
By default it shouldn't, but if add_eos_token=True is specified it should. You can always fine-tune the model to make it learn when to stop.
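For reference, a minimal sketch of what that looks like with the slow tokenizer (the huggyllama/llama-7b checkpoint here is just an example):

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("huggyllama/llama-7b", add_eos_token=True)
ids = tokenizer("Hello world").input_ids
# With add_eos_token=True the last id should be the eos token (2 for Llama).
print(ids[-1] == tokenizer.eos_token_id)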
I guess they would set the pad_token_id using the eos_token_id?
model.config.pad_token_id = model.config.eos_token_id
Same here, doing add_eos_token=True doesn't do anything
This should have been fixed by #22959
I believe if you just set pad_token = eos_token, the model is still not learning to predict the eos_token, because the corresponding attn_mask does not include the token and the labels ignore that token - i.e. no loss is computed for it. Not 100% sure about this, but that was what it seemed like from some self exploration.
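A rough way to check this claim (a sketch, assuming the huggyllama/llama-7b checkpoint used elsewhere in this thread; the token ids are arbitrary):

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")
tokenizer.pad_token = tokenizer.eos_token
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

eos = tokenizer.eos_token_id
features = [
    {"input_ids": [1, 15, 27, eos]},  # arbitrary ids ending in eos
    {"input_ids": [1, 15, eos]},
]
batch = collator(features)
# Because pad_token_id == eos_token_id, the eos positions show up as -100 in
# `labels`, so no loss is ever computed for predicting eos.
print(batch["labels"])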
The same is happening with Falcon...
When you say the same, what do you mean?
That it doesn't generate <|endoftext|> (token id 11) when calling generate, therefore it never stops generating. I have tried setting eos_token_id to 193, which corresponds to \n, but I don't think that's a clean fix. I have noticed that when tokenizing the inputs with the Falcon-40b tokenizer, it's not adding eos_token_id at the end of the input ids.
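A minimal sketch of appending the eos manually during preprocessing, as a stop-gap (assuming the Falcon-40b checkpoint mentioned above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b")

def tokenize_with_eos(text):
    ids = tokenizer(text).input_ids
    # Append <|endoftext|> (id 11) manually if the tokenizer didn't add it.
    if ids[-1] != tokenizer.eos_token_id:
        ids.append(tokenizer.eos_token_id)
    return ids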
A few things here. Llama has no official model, so make sure the one you are using is up to date and has the same eos token id in the model.config / generation config and the tokenizer.
For Falcon, the code is on the Hub, but the latest code of transformers adds the eos if you set add_eos_token=True. In the docs for Llama you can find that initializing the tokenizer with add_eos_token=True will make it add the eos when tokenizing.
Actually I was talking about Falcon, not Llama, because I'm facing an issue similar to the ones people are reporting with Llama. In fact I upgraded my transformers version to the latest version on the main branch, and the problem persists... The model never generates an EOS token, so it never stops generating...
I have tried to explicitly add the string "<|endoftext|>" at the end of the inputs for fine-tuning, but it still doesn't work.
What can I do to make Falcon generate an eos token?
The issue is different: the model not stopping does not mean that the eos_token is not being added, but rather that the model is not predicting it.
The problem with Llama has already been mentioned here: #23230
I thought it could be related; my hypothesis was that Falcon wasn't generating the EOS token because it wasn't being included in the inputs when tokenizing, so when we train the model on inputs without an EOS token at the end, the model doesn't learn to generate the EOS token.
@avacaondata - I have noticed this same issue, where the model is not learning to predict the EOS token. After doing some digging through several examples and source code, I've noticed something a bit strange, particularly related to the DataCollatorForLanguageModeling. A very typical pattern that I have seen suggested is the following:
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
However, the problem I see with this approach is that when the DataCollator overrides OR generates the labels field for the batch, it sets all tokens == pad_token to -100:
labels = batch["input_ids"].clone()
if self.tokenizer.pad_token_id is not None:
    labels[labels == self.tokenizer.pad_token_id] = -100
batch["labels"] = labels
Since the CrossEntropy loss ignores tokens with label -100, even if the tokenizer we are using properly adds the eos_token, the loss function will actually ignore this token.
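To make that concrete, a tiny illustration of the ignore_index behaviour (not tied to any particular model):

import torch
import torch.nn.functional as F

logits = torch.randn(4, 32000)           # (sequence positions, vocab size)
labels = torch.tensor([5, 9, 12, -100])  # the last (eos) position has been masked out
# -100 is the default ignore_index, so that position contributes nothing to the loss.
loss = F.cross_entropy(logits, labels, ignore_index=-100)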
Ways that I have worked around this issue are either (1) to ensure that the eos_token_id != pad_token_id and make sure that the tokenizer includes the eos_token when tokenizing (some automatically do this, such as the T5 tokenizer), OR (2) to create the labels column myself when tokenizing - by cloning input_ids - and then use the DataCollatorForSeq2Seq. I actually really like the DataCollatorForSeq2Seq because it automatically pads the inputs and labels, but does not mess with tokens in unexpected ways, such as the eos_token.
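A condensed sketch of option (1) - the '[PAD]' string is an arbitrary choice, and tokenizer/model are assumed to be already loaded:

# Keep pad and eos distinct so the collator only masks real padding, not eos.
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
model.resize_token_embeddings(len(tokenizer))        # room for the new token
model.config.pad_token_id = tokenizer.pad_token_id
tokenizer.add_eos_token = True                       # supported by the slow Llama tokenizer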
Hope this is helpful!
@jonathangomesselman Thank you very much for the clear explanation, it makes much sense!
I will change the label for the eos token so that it's not ignored by cross entropy anymore.
Ideally I think that for instruction-tuning we shouldn't use DataCollatorForLanguageModeling; in this paper they did some experiments and found that only training over the outputs typically works better: https://arxiv.org/pdf/2305.14314.pdf . However, I haven't found a way to make DataCollatorForSeq2Seq work for decoder-only models such as Llama or Falcon. Do you have any code on how to do that?
@avacaondata - You're welcome!
I have generally followed this practice as well - just fine-tuning over the model outputs, since generally I don't need the model to directly learn the statistical distribution over human instructions, but rather just how to "react" to them.
Continuing from above, to use the DataCollatorForSeq2Seq for decoder-only models we need to manually create the labels field when tokenizing our data - i.e. ensuring we have the fields input_ids, attention_mask, and labels. Since we create the labels ourselves, we have control over which tokens we explicitly train over vs. which we want to ignore (using -100 as a label). Here is the skeleton of some code you could use to tokenize the inputs:
from transformers import LlamaTokenizerFast

tokenizer = LlamaTokenizerFast.from_pretrained("hf-internal-testing/llama-tokenizer")
# By default the bos_token is added and not the eos_token. For instruction tuning I often ignore bos_token.
tokenizer.add_bos_token = False
tokenizer.add_eos_token = True


def create_instruction_tuned_format(data_row):
    return f"""<User Instruction>:{data_row["instruct"]}
<Agent Response>: {data_row['response']}
""".strip()


def tokenize(data_row):
    """Format and tokenize instruction tuning data
    1) Combine the user input (instruction) and agent response
    2) Create `labels` - ensuring we only fine tune over the
       desired agent response
    """
    model_input_text = create_instruction_tuned_format(data_row)
    # Tokenize the full model input
    model_input = tokenizer(
        model_input_text,
        truncation=True,
        padding=False,
        return_tensors=None
    )
    # Create `labels` - ignoring user input (instructions)
    agent_response = tokenizer(data_row['response']).input_ids
    num_tokens_ignore = len(model_input['input_ids']) - len(agent_response)
    ignored_tokens = [-100] * num_tokens_ignore
    # Copy over the ids for the desired agent response
    model_input['labels'] = ignored_tokens \
        + model_input['input_ids'][-len(agent_response):]
    # Just to demonstrate length equality
    assert len(model_input['labels']) == len(model_input['input_ids'])
    return model_input


tokenized_ds = ds.map(tokenize, remove_columns=ds.column_names)
A couple of things to note/highlight:
Now that we have our data tokenized and formatted, we can use the DataCollatorForSeq2Seq as follows:
from torch.utils.data import DataLoader
from transformers import DataCollatorForSeq2Seq

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForSeq2Seq(
    tokenizer, return_tensors="pt", padding=True
)

batch_size = 8
train_dataloader = DataLoader(
    tokenized_ds, shuffle=True, collate_fn=data_collator, batch_size=batch_size, pin_memory=True
)
Note that the LLAMA tokenizer by default does not have a pad_token, so we have to set it. Because we are using the DataCollatorForSeq2Seq, it is okay for us to set the padding token to the eos_token: the collator does not create the labels tensor but rather just pads our existing labels tensor with -100 - i.e. the eos_token will not be ignored/replaced.
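As a quick sanity check of that behaviour (a sketch reusing the data_collator defined above, with made-up token ids where 2 stands in for eos):

features = [
    {"input_ids": [10, 11, 2], "attention_mask": [1, 1, 1], "labels": [-100, 11, 2]},
    {"input_ids": [10, 2], "attention_mask": [1, 1], "labels": [10, 2]},
]
batch = data_collator(features)
# The shorter `labels` is padded with -100, but the existing eos ids (2) survive.
print(batch["labels"])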
This may not be the most standard approach for doing this - but this is an example of what I have found to work / have seen some repos roughly follow. The main idea being that by creating the labels ourselves, we are able to set -100 for tokens that we don't want to fine-tune over + ensure that we learn to generate the eos_token.
Wow @jonathangomesselman Thank you so much for such a clear explanation... :heart_eyes:
I tried it and yes it works flawlessly. I will check the LIMA paper in detail too to check for that EOT special token, I think that's an interesting approach.
Again, thank you so much, you were extremely helpful!! :heart:
@avacaondata you're welcome! I had very similar questions to what you asked and found myself a bit surprised to not find many good resources. Thankfully the HuggingFace code repos are actually quite readable, especially in separating the complex model logic of the base pre-trained transformer models (encoder-decoder + decoder-only) vs. adding the "language modeling" head (see sub-classes with ...ConditionalGeneration, ...CausalLM, ...LMHeadModel).
If you're curious yourself, I would definitely recommend looking at the code to learn more. Each model has a slightly different naming convention but you will see that the logic is nearly identical. Some to check out are:
Have fun exploring!
@jonathangomesselman thanks a lot!
I was also running into this issue where the model was unable to output the eos_token after fine-tuning. I also followed examples where they set tokenizer.pad_token = tokenizer.eos_token. From your earlier comment, I made sure tokenizer.pad_token != tokenizer.eos_token by setting tokenizer.add_special_tokens({'pad_token': '[PAD]'}) and using DataCollatorForLanguageModeling as before, e.g.

tokenizer.add_special_tokens({'pad_token': '[PAD]'})
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

Now the model finally outputs the eos_token as intended!
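One caveat worth adding (it comes up further down in this thread): if '[PAD]' is genuinely new to the vocabulary, the model's embedding matrix likely needs to be resized too, roughly:

tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))       # only needed if the vocab actually grew
model.config.pad_token_id = tokenizer.pad_token_id  # keep the model config in sync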
@georgesung Thanks for sharing this approach! Adding a new [PAD] token is a great way to differentiate between that and the EOS token - which, as you say, allows you to then use the native DataCollatorForLanguageModeling. It is very interesting / odd to me that this is such a common problem, given it seems sort of obvious that we want this behavior. But regardless, it is exciting to see the model finally start outputting the eos_token. An interesting thing that I noticed is that this is generally not an issue with encoder-decoder models such as T5. With these models the tokenizer generally adds the eos_token by default, and the collators used don't have this problem of ignoring the eos_token by treating it as a padding token.
@avacaondata We can use a similar approach to add the EOT token described in the LIMA paper for separating the instruction and the response.
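A hypothetical sketch of that idea - the "<EOT>" string and the formatting are my own assumptions, not LIMA's exact setup:

tokenizer.add_special_tokens({"additional_special_tokens": ["<EOT>"]})
model.resize_token_embeddings(len(tokenizer))

def format_example(instruction: str, response: str) -> str:
    # <EOT> separates the instruction from the response; eos still ends the example.
    return f"{instruction}<EOT>{response}{tokenizer.eos_token}"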
I think this could be a great TIP addition to the documentation / blog! If any of you has time to open a PR, feel free to do so and ping me!
@ArthurZucker - I would be happy to work on this! Where do you think it would be best to add this TIP?
Probably in the llama.md!
What is the correct code for Falcon? I'm still puzzled.
Related links:
@georgesung question:

tokenizer.add_special_tokens({'pad_token': '[PAD]'})

But this assumes the model has a pad_token. I think an additional check has to be done that it does have an embedding for the pad_token, so that there are no run time errors (~type errors in the extraction from the embedding "table"/matrix).
But if one does that, some care might be needed to initialize the new token so that it doesn't dominate the generation: https://nlp.stanford.edu/~johnhew/vocab-expansion.html
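A sketch of the mean-initialization idea from that link (my own paraphrase of the recipe, assuming model has already been resized):

import torch

with torch.no_grad():
    emb = model.get_input_embeddings().weight   # (vocab_size, hidden_dim)
    num_new = 1                                 # number of freshly added tokens
    # Initialize the new row(s) to the mean of the existing embeddings instead
    # of random values, so they don't get outsized logits at generation time.
    emb[-num_new:] = emb[:-num_new].mean(dim=0, keepdim=True)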
@brando90

But this assumes the model has a pad_token

I haven't confirmed, but I think tokenizer.add_special_tokens({'pad_token': '[PAD]'}) is equivalent to tokenizer.pad_token = '[PAD]' (edit: might be wrong about that). So if there are runtime errors with tokenizer.add_special_tokens({'pad_token': '[PAD]'}), then there would also be runtime errors with tokenizer.pad_token = tokenizer.eos_token -- note tokenizer.eos_token is just a string. But I observed runtime errors with neither. I just observed that when I set tokenizer.pad_token = tokenizer.eos_token during training, the model won't stop generating during inference, since it was trained to not output the eos token (per discussions above).

Since I was working with open_llama_7b, I saw that even though the model's tokenizer didn't specify a pad token string in its tokenizer_config.json, it still had a row in its token embedding matrix for the pad token. If you run print(model), you can see its token embedding matrix has an index reserved for the pad token (idx 0 in this case):
> print(model)
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    ..
You can also see the pad token's embedding itself: model.state_dict()['model.embed_tokens.weight'][0]. Although from discussions above and also this discussion, it doesn't seem to matter what the actual embeddings are for the pad token.
@georgesung unfortunately I'm working with Falcon. It doesn't have a pad token to my surprise (I'm not sure how this even happens in the first place tbh):
Loading checkpoint shards: 100%|██████████| 8/8 [00:10<00:00, 1.36s/it]
type(model)=<class 'transformers_modules.tiiuae.falcon-7b.2f5c3cd4eace6be6c0f12981f377fb35e5bf6ee5.modelling_RW.RWForCausalLM'>
type(tokenizer)=<class 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'>
Using pad_token, but it is not set yet.
tokenizer.pad_token=None
type(peft_config)=<class 'peft.tuners.lora.LoraConfig'>
model=RWForCausalLM(
  (transformer): RWModel(
    (word_embeddings): Embedding(65024, 4544)
    (h): ModuleList(
      (0-31): 32 x DecoderLayer(
        (input_layernorm): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
        (self_attention): Attention(
          (maybe_rotary): RotaryEmbedding()
          (query_key_value): Linear4bit(in_features=4544, out_features=4672, bias=False)
          (dense): Linear4bit(in_features=4544, out_features=4544, bias=False)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): MLP(
          (dense_h_to_4h): Linear4bit(in_features=4544, out_features=18176, bias=False)
          (act): GELU(approximate='none')
          (dense_4h_to_h): Linear4bit(in_features=18176, out_features=4544, bias=False)
        )
      )
    )
    (ln_f): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=4544, out_features=65024, bias=False)
)
---- start Print all special tokens
eos_token: <|endoftext|>
additional_special_tokens: ['>>TITLE<<', '>>ABSTRACT<<', '>>INTRODUCTION<<', '>>SUMMARY<<', '>>COMMENT<<', '>>ANSWER<<', '>>QUESTION<<', '>>DOMAIN<<', '>>PREFIX<<', '>>SUFFIX<<', '>>MIDDLE<<']
---- end Print all special tokens
model.get_input_embeddings().weight.size()=torch.Size([65024, 4544])
pad_embedding=tensor([[[-0.0179, 0.0201, -0.0273, ..., -0.0275, -0.0396, -0.0131],
[-0.0510, -0.0079, -0.0383, ..., -0.0481, 0.0581, 0.0282],
[-0.0217, -0.0216, -0.0064, ..., -0.0508, 0.0554, -0.0013],
...,
[ 0.0425, 0.0452, -0.0131, ..., 0.0019, 0.0476, 0.0342],
[-0.0170, -0.0085, 0.0449, ..., -0.0074, 0.0178, 0.0043],
[-0.0439, -0.0859, -0.0820, ..., 0.0130, 0.0669, 0.0884]]],
device='cuda:0', dtype=torch.float16, grad_fn=<UnsqueezeBackward0>)
Success!
/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/transformers/generation/utils.py:1259: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
warnings.warn(
Traceback (most recent call last):
File "/lfs/hyperturing1/0/brando9/ultimate-utils/ultimate-utils-proj-src/uutils/hf_uu/model_tokenizer/falcon_uu_mdl_tok.py", line 190, in <module>
example_test_model_already_has_pad_token()
File "/lfs/hyperturing1/0/brando9/ultimate-utils/ultimate-utils-proj-src/uutils/hf_uu/model_tokenizer/falcon_uu_mdl_tok.py", line 182, in example_test_model_already_has_pad_token
tokenizer.decode(model.generate(**tokenizer(sent, return_tensors='pt'), do_sample=True)[0])
File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/transformers/generation/utils.py", line 1271, in generate
self._validate_model_kwargs(model_kwargs.copy())
File "/lfs/hyperturing1/0/brando9/miniconda/envs/data_quality/lib/python3.10/site-packages/transformers/generation/utils.py", line 1144, in _validate_model_kwargs
raise ValueError(
ValueError: The following `model_kwargs` are not used by the model: ['token_type_ids'] (note: typos in the generate arguments will also show up in this list)
code:
# qlora flacon7b
from uutils.hf_uu.model_tokenizer.falcon_uu_mdl_tok import get_model_tokenizer_qlora_falcon7b

model, tokenizer, peft_config = get_model_tokenizer_qlora_falcon7b()
print(f'{model=}')

sent = 'Dogs are great because they are '
print()

# print to see if pad tokens are present and if it ignores the tokens at the end
# encoded_input = tokenizer(sent, padding='max_length', max_length=10, return_tensors='pt')
# sys.exit()

# Print all special tokens
print('\n---- start Print all special tokens')
for token_name, token in tokenizer.special_tokens_map.items():
    print(f"{token_name}: {token}")
print('\n---- end Print all special tokens')

# Get the ID for the '[PAD]' token
try:
    pad_token_id = tokenizer.convert_tokens_to_ids('[PAD]')
except KeyError:
    raise ValueError("Token [PAD] is not present in the tokenizer vocabulary.")

# Index into the model's embedding table
try:
    print(f'{model.get_input_embeddings().weight.size()=}')
    pad_embedding = model.get_input_embeddings().weight[pad_token_id]
except IndexError:
    raise ValueError(f"Token ID {pad_token_id} is not present in the model's embedding matrix.")
print(f'{pad_embedding=}')
print('Success!')

# check it generates something sensible
tokenizer.decode(model.generate(**tokenizer(sent, return_tensors='pt'), do_sample=True)[0])
print('Success2!')
I think I just need to add it to the tokenizer and the model. Since during fine-tuning/training the pad token would be ignored anyway, adding a random set of weights to the embedding table wouldn't matter - it wouldn't be updated.
Code:
# - Get falcon 4bit model
# todo, where is this being saved & how to download quicker
model = AutoModelForCausalLM.from_pretrained(
pretrained_model_name_or_path=pretrained_model_name_or_path,
quantization_config=bnb_config,
trust_remote_code=True # allows to execute custom code you download from the uploaded model code you are using
)
# this is here to save gpu vram. Likely only needed when using 40b or when oom issues happen ref: https://stackoverflow.com/questions/76633335/why-does-hugging-face-falcon-model-use-mode-config-use-cache-false-why-wouldn
model.config.use_cache = use_cache
print(f'{type(model)=}')
# - Get falcon tokenizer
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path,
trust_remote_code=True) # execs code downloaded from hf hub
# tokenizer.pad_token = tokenizer.eos_token # todo: why? https://stackoverflow.com/questions/76633368/why-does-the-falcon-qlora-tutorial-code-use-eos-token-as-pad-token
tokenizer.add_special_tokens({'pad_token': '[PAD]'}) # I think this is fine if during the training pad is ignored
model.resize_token_embeddings(len(tokenizer)) # todo: I think this is fine if during the training pad is ignored
print(f'{type(tokenizer)=}')
print(f'{tokenizer.pad_token=}')
So close....
Darn, this still doesn't work:
UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
code:
""" sfttrainer (likely using peft) best practices: https://huggingface.co/docs/trl/main/en/sft_trainer#best-practices
Best practices
Pay attention to the following best practices when training a model with that trainer:
todo: why trust_remote_code? I want more details. """ import sys
import torch from peft import LoraConfig
from transformers.modeling_utils import PreTrainedModel
from pdb import set_trace as st
def test_bfloat16_int4(compute_dtype: torch.dtype, use_4bit, ): """ python -c "import torch; print(torch.cuda.get_device_capability());" todo: check other code test_bfloat16() do we need use_4bit? """ if compute_dtype == torch.float16 and use4bit: major, = torch.cuda.get_device_capability() if major >= 8: print("=" 80) print("Your GPU supports bfloat16, you can accelerate training with the argument --bfloat16") print("=" 80)
def get_model_tokenizer_qlora_falcon7b(
# model_id = "tiiuae/falcon-7b"
pretrained_model_name_or_path: str = "ybelkada/falcon-7b-sharded-bf16",
use_cache: bool = True,
# -- lora args
lora_alpha=16, # todo
lora_dropout=0.1, # todo, evidence drop out really help? google, crfm, gpt4
lora_r=64, # todo
bnb_4bit_compute_dtype=torch.float16, # changed it from Guanaco hf
# -- training args
output_dir="./results",
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
# paging so that the sudden mem gpu spikes don't cause the run to shut down
# (I think usually caused by too long seqs)
# todo: why 32 bit opt?
# todo: paged nadamw opt?
optim="paged_adamw_32bit",
save_steps=10,
logging_steps=10,
learning_rate=2e-4,
max_grad_norm=0.3,
max_steps=500,
warmup_ratio=0.03,
lr_scheduler_type="constant",
# -- quant. args (not recommended to be changed unless you know what your doing?)
load_in_4bit=True, # load (usually huge) base model in 4 bits
bnb_4bit_quant_type="nf4", # normal float 4 for the (large) base models qlora
) -> tuple:
    """ Load the Falcon 7B model, quantize it in 4bit and attach LoRA adapters on it.

    bf16 = 1S, 7Exp, 8Mantissa
    hypothesis: 7b trained due to 6.7 emergence rumour, I still don't think emergence is real.
    Notes:
        - ft a model is very specific to the model, tokenizer and training scheme. Thus we return
            - model, tokenizer, ft config (peft config), training args
    ref:
        - https://colab.research.google.com/drive/1DOi8MFv4SWN9NImVornZ7t6BgmLoPQO-#scrollTo=AjB0WAqFSzlD
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # - Get bnb config for bit-4 base model (bnb lib for using 4bit qlora quantization techniques by tim dettmers)
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=load_in_4bit,  # load (usually huge) base model in 4 bits
        bnb_4bit_quant_type=bnb_4bit_quant_type,  # normal float 4 for the (usually huge) base model
        bnb_4bit_compute_dtype=bnb_4bit_compute_dtype,  # if you can, during computation use bf16
    )

    # - Get falcon 4bit model
    # todo, where is this being saved & how to download quicker
    model = AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        quantization_config=bnb_config,
        trust_remote_code=True  # allows to execute custom code you download from the uploaded model code you are using
    )
    print(f'{type(model)=}')
    print(f'{model=}')

    # this is here to save gpu vram. Likely only needed when using 40b or when oom issues happen ref: https://stackoverflow.com/questions/76633335/why-does-hugging-face-falcon-model-use-mode-config-use-cache-false-why-wouldn
    model.config.use_cache = use_cache
    print(f'{type(model)=}')

    # - Get falcon tokenizer
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path,
                                              trust_remote_code=True)  # execs code downloaded from hf hub
    # tokenizer.pad_token = tokenizer.eos_token  # ref: https://stackoverflow.com/questions/76633368/why-does-the-falcon-qlora-tutorial-code-use-eos-token-as-pad-token
    # tokenizer.add_special_tokens({'pad_token': '[PAD]'})  # I think this is fine if during the training pad is ignored
    tokenizer.add_special_tokens({'pad_token': '<|pad|>'})  # I think this is fine if during the training pad is ignored

    # - Modify model
    # add pad token embed
    model.resize_token_embeddings(len(tokenizer))  # todo: I think this is fine if during the training pad is ignored
    model.transformer.word_embeddings.padding_idx = len(tokenizer) - 1
    model.config.max_new_tokens = len(tokenizer)
    # model.config.min_length = 1
    print(f'{model=}')
    print(f'{type(tokenizer)=}')
    print(f'{tokenizer.pad_token=}')
    # data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False) todo

    # - Get falcon lora config
    peft_config = LoraConfig(
        lora_alpha=lora_alpha,
        lora_dropout=lora_dropout,
        r=lora_r,
        bias="none",
        task_type="CAUSAL_LM",
        # model card for falcon tiiuae/falcon-7b: https://huggingface.co/tiiuae/falcon-7b/blob/main/modelling_RW.py
        # does seem to include all trainable params as done by qlora on their own paper
        target_modules=[
            # word_embeddings,
            "query_key_value",
            "dense",
            "dense_h_to_4h",
            "dense_4h_to_h",
            # "lm_head"
        ]
    )
    print(f'{type(peft_config)=}')

    # todo: print the num params of the lora = D1*r + D2*r and num of bytes by prec. (bytes) * num params
    return model, tokenizer, peft_config
def example_test_model_already_has_pad_token():
    """ if it already has pad token, it likely has a small prob, so we are done.
    compare it's norm with other tokens to verify this is true.

    python ~/ultimate-utils/ultimate-utils-proj-src/uutils/hf_uu/model_tokenizer/falcon_uu_mdl_tok.py
    """
    from uutils.hf_uu.data_hf.common import get_guanaco_datsets_add_splits_train_test_only

    trainset, _, testset = get_guanaco_datsets_add_splits_train_test_only()

    # qlora flacon7b
    from uutils.hf_uu.model_tokenizer.falcon_uu_mdl_tok import get_model_tokenizer_qlora_falcon7b
    model, tokenizer, peft_config = get_model_tokenizer_qlora_falcon7b()
    model: PreTrainedModel = model
    print(f'{model=}')

    sent = 'Dogs are great because they are '
    print()

    # print to see if pad tokens are present and if it ignores the tokens at the end
    encoded_input = tokenizer(sent, padding='max_length', max_length=10, return_tensors='pt')
    print(f'{encoded_input=}')

    # Print all special tokens
    print('\n---- start Print all special tokens')
    for token_name, token in tokenizer.special_tokens_map.items():
        print(f"{token_name}: {token}")
    print('\n---- end Print all special tokens')

    # Get the ID for the '[PAD]' token
    try:
        pad_token_id = tokenizer.convert_tokens_to_ids('[PAD]')
    except KeyError:
        raise ValueError("Token [PAD] is not present in the tokenizer vocabulary.")

    # Index into the model's embedding table
    try:
        print(f'{model.get_input_embeddings().weight.size()=}')
        pad_embedding = model.get_input_embeddings().weight[pad_token_id]
    except IndexError:
        raise ValueError(f"Token ID {pad_token_id} is not present in the model's embedding matrix.")
    print(f'{pad_embedding=}')
    print('Success!\n')

    # check it generates something sensible
    # tokenizer.decode(model.generate(**tokenizer(sent, return_tensors='pt'), do_sample=True)[0])
    input_ids, attention_mask = encoded_input['input_ids'], encoded_input['attention_mask']
    predicted_tokens_ids_options = model.generate(input_ids=input_ids, attention_mask=attention_mask, do_sample=True)
    predicted_tokens_ids = predicted_tokens_ids_options[0]
    predicted_sent = tokenizer.decode(predicted_tokens_ids)
    print(f'original sentence: {sent=}')
    print(f'predicted sentence: {predicted_sent=}')
    print('Success2!')
if __name__ == '__main__':
    import time

    start_time = time.time()
    example_test_model_already_has_pad_token()
    print(f"The main function executed in {time.time() - start_time} seconds.\a")
it doesn't like the modifications to the model:
model.transformer.word_embeddings.padding_idx = len(tokenizer) - 1
model.config.max_new_tokens = len(tokenizer)
Hey @brando90! Thanks a lot for reporting and using transformers. This particular thread is not exactly the right place for such huge chunks of code and for discussing another issue. My best recommendation is to set things like pad_token, max_new_tokens etc. through a generation config - you should have a look here. This will remove the warning that you were seeing. Reading the post you created on the HF forum, you mention:

it doesn't like the modifications to the model:

But since there is no traceback, this is very vague! A colab will show the outputs you got, making it easier to understand. Also, regarding padding token vs. no padding token, I believe this is a very important question, and if we should review how we resize the embedding, so be it! Some models' embeddings are usually bigger than the length of the tokenizer to allow adding new tokens / to be a power of X to make it faster.
As a temporary fix I was able to accomplish the inference (of a Falcon 7b training with tokenizer.pad_token = tokenizer.eos_token) stopping correctly like this: eos_token_id=39735.
This makes the inference generate the token ***** at the end of the answer (because it is in all the training examples), at which point it will stop because it is set as the ending token.
output_tokens = model.generate(
input_ids = batch.input_ids,
max_new_tokens=100,
temperature=0.001,
top_p=0.7,
num_return_sequences=1,
pad_token_id=39735, # *****
eos_token_id=39735, # *****
)
Finally found the correct way to do this here: https://georgesung.github.io/ai/qlora-ift/
You need to do tokenizer.add_special_tokens({'pad_token': '[PAD]'}) instead of tokenizer.pad_token = tokenizer.eos_token.
And you need to add the tokenizer.eos_token at the end of EACH training example.
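A sketch of that second point with datasets.map (the "text" field name is a placeholder for whatever your dataset uses):

def add_eos(example):
    example["text"] = example["text"] + tokenizer.eos_token
    return example

dataset = dataset.map(add_eos)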
In my case, for some reason, eos_token_id and ... was not being added to the model.generate configs.
If you want help, feel free to open an issue with more details.
@robertheessels your answer solved my problem. you save my life. thank you so much!!!
You need to do tokenizer.add_special_tokens({'pad_token': '[PAD]'}) instead of tokenizer.pad_token = tokenizer.eos_token
And you need to add the tokenizer.eos_token at the end of EACH training example.
Adding the eos_token at the end of each training example can be activated using:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b", add_eos_token = True)
Or simply:
>>> tokenizer.add_eos_token = True
When I set up:
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
my code crashes every time I run it on a T4 Colab:
# Load dataset (you can process it here)
dataset = load_dataset(dataset_name, split="train")
# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)
bnb_config = BitsAndBytesConfig(
load_in_4bit=use_4bit,
bnb_4bit_quant_type=bnb_4bit_quant_type,
bnb_4bit_compute_dtype=compute_dtype,
bnb_4bit_use_double_quant=use_nested_quant,
)
# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)
# Load base model
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map=device_map
)
# model.to(device)
model.config.use_cache = False
model.config.pretraining_tp = 1
# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# tokenizer.pad_token = tokenizer.eos_token
# For LLaMA models, the default tokenizer does not specify a pad token
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
# tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training
#Adding the eos_token: </s> at the end of each training example
#tokenizer.add_eos_token = True
# Load LoRA configuration
peft_config = LoraConfig(
lora_alpha=lora_alpha,
lora_dropout=lora_dropout,
r=lora_r,
bias="none",
task_type="CAUSAL_LM",
)
# Set training parameters
training_arguments = TrainingArguments(
output_dir=output_dir,
num_train_epochs=num_train_epochs,
per_device_train_batch_size=per_device_train_batch_size,
gradient_accumulation_steps=gradient_accumulation_steps,
optim=optim,
save_steps=save_steps,
logging_steps=logging_steps,
learning_rate=learning_rate,
weight_decay=weight_decay,
fp16=fp16,
bf16=bf16,
max_grad_norm=max_grad_norm,
max_steps=max_steps,
warmup_ratio=warmup_ratio,
group_by_length=group_by_length,
lr_scheduler_type=lr_scheduler_type,
report_to="tensorboard"
)
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
peft_config=peft_config,
dataset_text_field="text",
max_seq_length=max_seq_length,
tokenizer=tokenizer,
args=training_arguments,
packing=packing,
)
# Train model
trainer.train()
# Save trained model
trainer.model.save_pretrained(new_model)
https://drive.google.com/file/d/1l6HsYH8iHgEA8H3jHMwgcMgA5_OdTbmh/view?usp=sharing
Hey, please open a different issue as this is not related! Also make sure that you can run nvidia-smi and that it shows the GPU, otherwise you might just not have set the Colab to use a GPU instance.
@dtthanh1971 Your issue may be because len(tokenizer) != model.vocab_size, i.e. len(tokenizer) == model.vocab_size + 1. That was my experience. See Kumar Saurabh's answer here: https://stackoverflow.com/questions/76633368/how-does-one-set-the-pad-token-correctly-not-to-eos-during-fine-tuning-to-avoi
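A small sketch of that check (assuming model and tokenizer are already loaded):

print(len(tokenizer), model.get_input_embeddings().weight.shape[0])
# After adding '[PAD]', the tokenizer can be one entry larger than the embedding
# matrix, which triggers CUDA indexing asserts; resizing fixes the mismatch.
if len(tokenizer) != model.get_input_embeddings().weight.shape[0]:
    model.resize_token_embeddings(len(tokenizer))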
I am running into the same issue on stablecode-instruct-alpha-3b, which is GPT-NeoX based. What's the recommended approach in this case?
Adding a [pad] token results in issues during training: ../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [94,0,0], thread: [96,0,0] Assertion srcIndex < srcSelectDimSize failed.
@LazerJesus you probably did not resize your embedding matrix. Run on CPU to get a better idea of where the issue arises, and open a separate issue if there is indeed a bug. Also, could you make sure to check other issues or the forum, as this is quite a common problem! (meaning it has been reported a lot)
I solved it by changing AutoTokenizer to LlamaTokenizer to force to use the slow tokenizer and not the fast tokenizer that is automatically imported, I lost some functions but it works.
I solved it by changing AutoTokenizer to LlamaTokenizer to force to use the slow tokenizer and not the fast tokenizer that is automatically imported, I lost some functions but it works.
This worked for me, too.
You can also pass use_fast=False when loading with AutoTokenizer. With tokenizer("text", add_special_tokens=True), you should get the same results for both fast and slow.
If not, and you are on main, feel free to open a new issue @brando90
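A sketch of that comparison (using the huggyllama/llama-7b checkpoint as an example):

from transformers import AutoTokenizer

slow = AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=False, add_eos_token=True)
fast = AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=True, add_eos_token=True)
# On a recent main, both should end with the eos id (2).
print(slow("text", add_special_tokens=True).input_ids)
print(fast("text", add_special_tokens=True).input_ids)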
It works well when I use the following code:
DEFAULT_PAD_TOKEN = "[PAD]"
DEFAULT_EOS_TOKEN = "</s>"
DEFAULT_BOS_TOKEN = "<s>"
DEFAULT_UNK_TOKEN = "<unk>"
tokenizer = AutoTokenizer.from_pretrained(
"meta-llama/Llama-2-7b-hf",
return_tensors="pt",
model_max_length=512,
add_eos_token=True,
add_bos_token=True,
padding='longest',
padding_side="right",
use_fast=False,
trust_remote_code=True,
# use_auth_token=use_auth_token,
device_map="auto",
)
tokenizer.add_special_tokens(
{
"pad_token": DEFAULT_PAD_TOKEN,
"eos_token": DEFAULT_EOS_TOKEN,
"bos_token": DEFAULT_BOS_TOKEN,
"unk_token": DEFAULT_UNK_TOKEN,
}
)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
Best regards,
Shuyue Nov. 27th, 2023
I have one question about adding these special tokens:
A warning occurred:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
After we add these special tokens, we resize input token embeddings matrix of the model:
# Resize input token embeddings matrix of the model if new_num_tokens != config.vocab_size.
model.resize_token_embeddings(len(tokenizer))
So, one problem here is how we can know whether these newly added embeddings are actually fine-tuned.
Thank you very much!
Best regards,
Shuyue Nov. 27th, 2023
Of course they are not if the size of the matrix / these tokens are new. The warning is more general than it seems, but if you add new special tokens, they were not part of the vocab before thus not seen.
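One pragmatic way to check is to snapshot the new rows before training and compare them afterwards (a sketch; '[PAD]' stands for whatever tokens you added):

new_ids = tokenizer.convert_tokens_to_ids(['[PAD]'])
before = model.get_input_embeddings().weight[new_ids].detach().clone()
# ... run training ...
after = model.get_input_embeddings().weight[new_ids].detach()
print((after - before).abs().max())  # stays ~0 if the rows were never updated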
Of course they are not if the size of the matrix / these tokens are new. The warning is more general than it seems, but if you add new special tokens, they were not part of the vocab before thus not seen.
Dear Arthur,
Thank you very much for you reply!
I also have one question about training the model. In practice, what length should we set when we train the model? Should we use the max length of the dataset, or the average length? Should we use group_by_length? Or should we pad all the sequences to the same length, for example the maximum length in a batch, or should we just pad all the sequences in the dataset to the max length? Which is the most effective in practice?
Thank you very much, and have a nice day!
Best regards,
Shuyue Dec. 2nd, 2023
Not sure at all about this, but you can ask on the forum instead!
Thanks!
System Info

transformers version: 4.29.0.dev0

Who can help?

@ArthurZucker

Information

Tasks

examples folder (such as GLUE/SQuAD, ...)

Reproduction
As mentioned in the title, the LLaMA tokenizer does not add the eos_token at the end of the inputs. This only happens with the fast version (use_fast=True).

Steps to reproduce the behaviour: tokenize an input with add_eos_token=True and inspect the input_ids to check if the eos_token_id (2) is added at the end.

Expected behavior

Expected output: the eos_token_id (2) appended at the end of the input_ids.