RDouglasSharp opened this issue 6 months ago
File "/home/doug/FastChat/fastchat/train/train.py", line 114, in preprocess
raise ValueError(
input_ids = tokenizer(
ValueError: File "/home/doug/FastChat/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2858, in call
Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token
(tokenizer.pad_token = tokenizer.eos_token e.g.)
or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'})
.
encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
File "/home/doug/FastChat/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2944, in _call_one
return self.batch_encode_plus(
File "/home/doug/FastChat/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3126, in batch_encode_plus
padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
File "/home/doug/FastChat/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2763, in _get_padding_truncation_strategies
raise ValueError(
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token
(tokenizer.pad_token = tokenizer.eos_token e.g.)
or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'})
.
I think this is the issue: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/discussions/4
same issue
So here is my workaround while we wait for proper training support...
First, implement the fixes suggested in the comments on this issue:
https://github.com/lm-sys/FastChat/issues/3263
This adds a Llama-3 model adapter and fixes inference.
Next, in fastchat/train/train.py, I added four lines just above these lines:
# Load data
data_module = make_supervised_data_module(tokenizer=tokenizer, data_args=data_args)
Here are the lines added:
eot = "<|eot_id|>"
eot_id = tokenizer.convert_tokens_to_ids(eot)
tokenizer.pad_token = eot
tokenizer.pad_token_id = eot_id
This sets the pad token to one of the two end-of-text tokens and fixes the crash above.
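Put together, the relevant part of fastchat/train/train.py looks roughly like this (the surrounding code may differ slightly between FastChat versions):

# Use <|eot_id|> as the pad token so batch tokenization can pad.
eot = "<|eot_id|>"
eot_id = tokenizer.convert_tokens_to_ids(eot)
tokenizer.pad_token = eot
tokenizer.pad_token_id = eot_id

# Load data
data_module = make_supervised_data_module(tokenizer=tokenizer, data_args=data_args)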
Then, for inference to work, you need to make a corresponding change in fastchat/model/model_adapter.py. In class Llama3Adapter (which was added by the fixes in the issue cited above), add the following lines to the load_model method, just below the line: model, tokenizer = super().load_model(model_path, from_pretrained_kwargs)
eot = "<|eot_id|>"
eot_id = tokenizer.convert_tokens_to_ids(eot)
model.config.eos_token = eot
model.config.eos_token_id = eot_id
Now, training and inference should both work. Make sure you have selected the llama3 model adapter at inference time.
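For reference, here is roughly what the modified adapter ends up looking like. The Llama3Adapter class itself comes from the patch in the issue linked above, so the details in your copy may differ; only the load_model change is from this workaround:

class Llama3Adapter(BaseModelAdapter):
    # match() and get_default_conv_template() come from the patch in the
    # linked issue and are omitted here.

    def load_model(self, model_path: str, from_pretrained_kwargs: dict):
        model, tokenizer = super().load_model(model_path, from_pretrained_kwargs)
        # Make generation stop at <|eot_id|> (end of turn) instead of <|end_of_text|>.
        eot = "<|eot_id|>"
        eot_id = tokenizer.convert_tokens_to_ids(eot)
        model.config.eos_token = eot
        model.config.eos_token_id = eot_id
        return model, tokenizer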
@RDouglasSharp Do we also set unk_token?
tokenizer.unk_token = eot
tokenizer.unk_token_id = eot_id
Only if you want output/training on a sample to end at any unknown token. At least, that is my interpretation...
Hi everyone,
In my experience with LLaMA-3, the following should work.
A better workaround for the pad token in LLaMA-3 is to add a special token to the tokenizer and then save it along with the model config. For example, you may use the following code to achieve this:
from typing import Dict

import transformers


def smart_tokenizer_and_embedding_resize(
    special_tokens_dict: Dict,
    tokenizer: transformers.PreTrainedTokenizer,
    model: transformers.PreTrainedModel,
):
    """Resize tokenizer and embedding.

    Note: This is the unoptimized version that may make your embedding size not be divisible by 64.
    """
    num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
    model.resize_token_embeddings(len(tokenizer))

    if num_new_tokens > 0:
        input_embeddings = model.get_input_embeddings().weight.data
        output_embeddings = model.get_output_embeddings().weight.data

        # Initialize the new embedding rows with the mean of the existing ones.
        input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
        output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)

        input_embeddings[-num_new_tokens:] = input_embeddings_avg
        output_embeddings[-num_new_tokens:] = output_embeddings_avg


if tokenizer.pad_token is None:
    print("Adding pad token as '<pad>'")
    smart_tokenizer_and_embedding_resize(
        special_tokens_dict=dict(pad_token="<pad>"),
        tokenizer=tokenizer,
        model=model,
    )
And finally:
model.config.pad_token_id = tokenizer.pad_token_id
Here you add the <pad> token, resize the embeddings, and finally save this information in your model config.
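Putting it together, a minimal sketch of how this can be wired up; the model path and output directory below are just placeholders for illustration:

import transformers

# Placeholder paths for illustration only.
model_path = "meta-llama/Meta-Llama-3-8B"
output_dir = "./llama3-with-pad-token"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)
model = transformers.AutoModelForCausalLM.from_pretrained(model_path)

# Add <pad> and resize the embeddings using the helper defined above.
if tokenizer.pad_token is None:
    smart_tokenizer_and_embedding_resize(
        special_tokens_dict=dict(pad_token="<pad>"),
        tokenizer=tokenizer,
        model=model,
    )

# Record the pad token id in the model config and save both artifacts so the
# fine-tuned checkpoint carries the new token with it.
model.config.pad_token_id = tokenizer.pad_token_id
tokenizer.save_pretrained(output_dir)
model.save_pretrained(output_dir)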
Following the above, we trained a LLaMA-3 based LLaVA-v1.5 model and achieved very good results. All the code (fully supported with the official LLaVA framework), pretrained checkpoints, and evaluation results are available in our GitHub repo, LLaVA++.
@mmaaz60 Thanks for sharing your better solution, that is really great! Besides, I have starred your project!
Thank You @Oscarjia
Thanks all for suggesting the solution, but it is still a problem for me.
I use the latest FastChat repo and the official Meta-Llama-3-8B model for fine-tuning. After getting the error (shown below), I tried the changes following @mmaaz60's solution, but it still does not work: it keeps raising warnings like these, and the loss is stuck at zero.
WARNING: tokenization mismatch: 69 vs. 70. #turn = 1. (ignored)
WARNING: tokenization mismatch: 392 vs. 393. #turn = 1. (ignored)
WARNING: tokenization mismatch: 1043 vs. 1044. #turn = 1. (ignored)
WARNING: tokenization mismatch: 281 vs. 282. #turn = 1. (ignored)
WARNING: tokenization mismatch: 435 vs. 436. #turn = 1. (ignored)
WARNING: tokenization mismatch: 611 vs. 612. #turn = 1. (ignored)
WARNING: tokenization mismatch: 463 vs. 464. #turn = 1. (ignored)
WARNING: tokenization mismatch: 461 vs. 462. #turn = 1. (ignored)
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 9.021199819576004e-08, 'epoch': 0.0}
0%| | 1/44328 [00:53<655:37:51, 53.25s/it]
WARNING: tokenization mismatch: 1187 vs. 1188. #turn = 1. (ignored)
WARNING: tokenization mismatch: 722 vs. 723. #turn = 1. (ignored)
WARNING: tokenization mismatch: 947 vs. 948. #turn = 1. (ignored)
WARNING: tokenization mismatch: 68 vs. 69. #turn = 1. (ignored)
Look at those numbers: cur_len is always 1 less than total_len. See the source code:
...
if cur_len < tokenizer.model_max_length:
    if cur_len != total_len:
        target[:] = IGNORE_TOKEN_ID
        rank0_print(
            f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}."
            f" #turn = {len(turns) - 1}. (ignored)"
        )
...
What else should I change? Many thanks to all.
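To debug, something like the following might show which token accounts for the off-by-one; conversation and turns here are rough stand-ins for what preprocess() builds locally, not the exact FastChat variables:

# Compare the full conversation tokenization with the sum of per-turn
# tokenizations, then inspect the trailing tokens of the full sequence.
full_ids = tokenizer(conversation, add_special_tokens=True).input_ids
turn_lens = [len(tokenizer(t, add_special_tokens=False).input_ids) for t in turns]
print(f"total = {len(full_ids)}, per-turn sum = {sum(turn_lens)}")
print("last tokens:", tokenizer.convert_ids_to_tokens(full_ids[-5:]))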
I ended up switching to Axolotl to train. I find it runs significantly faster and supports Llama3 properly.
Good to see, thanks @RDouglasSharp, I will give Axolotl a try!
I attempted a workaround, but the output from finetuning doesn't look quite right. Has anyone made a working fix for this issue?