lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

Finetuning Llama3-8B fails due to lack of a PAD token in new tokenizer #3266

Open RDouglasSharp opened 6 months ago

RDouglasSharp commented 6 months ago

I attempted a workaround, but the output from finetuning doesn't look quite right. Has anyone made a working fix for this issue?

RDouglasSharp commented 6 months ago

File "/home/doug/FastChat/fastchat/train/train.py", line 114, in preprocess raise ValueError( input_ids = tokenizer( ValueError: File "/home/doug/FastChat/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2858, in call Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token (tokenizer.pad_token = tokenizer.eos_token e.g.) or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'}). encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs) File "/home/doug/FastChat/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2944, in _call_one return self.batch_encode_plus( File "/home/doug/FastChat/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3126, in batch_encode_plus padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies( File "/home/doug/FastChat/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2763, in _get_padding_truncation_strategies raise ValueError( ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token (tokenizer.pad_token = tokenizer.eos_token e.g.) or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'}).

RDouglasSharp commented 6 months ago

I think this is the issue: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/discussions/4
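
For reference, the root cause is easy to reproduce outside FastChat. A minimal check (this assumes you have access to the gated meta-llama repo on the Hugging Face Hub):

    from transformers import AutoTokenizer

    # Requires accepting the Meta-Llama-3 license on the Hugging Face Hub.
    tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

    print(tok.pad_token)           # None -- this is what triggers the ValueError
    print(tok.special_tokens_map)  # bos/eos are defined, but no pad token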

cy565025164 commented 6 months ago

same issue

RDouglasSharp commented 6 months ago

So here is my workaround while we wait for proper training support...

First, apply the fixes suggested in the comments on this issue:

https://github.com/lm-sys/FastChat/issues/3263

This will add a Llama-3 model adapter and fix inference.

Next, in fastchat/train/train.py, I added four lines just above these lines:

# Load data
data_module = make_supervised_data_module(tokenizer=tokenizer, data_args=data_args)

Here are the lines added:

    eot = "<|eot_id|>"
    eot_id = tokenizer.convert_tokens_to_ids(eot)
    tokenizer.pad_token = eot
    tokenizer.pad_token_id = eot_id

This sets the pad token to one of Llama-3's two end-of-text special tokens and fixes the crash above.
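
As an optional sanity check, you can confirm that padding now works before the data module is built; the padded positions should contain the <|eot_id|> id:

    # Optional sanity check: padding should now succeed and fill with eot_id
    # instead of raising "Asking to pad but the tokenizer does not have a
    # padding token".
    check = tokenizer(["hello"], padding="max_length", max_length=8)
    assert tokenizer.pad_token_id in check["input_ids"][0]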

Then, for inference to work, you need a corresponding change in fastchat/model/model_adapter.py. In class Llama3Adapter (which was added by the fixes in the issue cited at the top), add the following lines to the load_model method, just below the line model, tokenizer = super().load_model(model_path, from_pretrained_kwargs):

    eot = "<|eot_id|>"
    eot_id = tokenizer.convert_tokens_to_ids(eot)

    model.config.eos_token = eot
    model.config.eos_token_id = eot_id

Now, training and inference should both work. Make sure you have selected the llama3 model adapter at inference time.
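
For inference I just use the standard CLI; as far as I understand, FastChat picks the adapter by matching on the model path, so keep something like "llama-3" in the checkpoint directory name (the path below is only an example):

    python3 -m fastchat.serve.cli --model-path /path/to/llama-3-8b-finetuned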

Oscarjia commented 6 months ago

@RDouglasSharp Do we also set unk_token?

    tokenizer.unk_token = eot
    tokenizer.unk_token_id = eot_id

RDouglasSharp commented 6 months ago

Only if you want output/training on any sample to end at any unknown token. Or that is my interpretation...

mmaaz60 commented 6 months ago

Hi Everyone,

As per my experience with LLaMA-3, the following should work.


A better workaround for the missing pad token in LLaMA-3 is to add a dedicated special token to the tokenizer and then save it along with the model config. For example, you can use the following code to achieve this:

from typing import Dict

import transformers


def smart_tokenizer_and_embedding_resize(
    special_tokens_dict: Dict,
    tokenizer: transformers.PreTrainedTokenizer,
    model: transformers.PreTrainedModel,
):
    """Resize tokenizer and embedding.

    Note: This is the unoptimized version that may make your embedding size not be divisible by 64.
    """
    num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
    model.resize_token_embeddings(len(tokenizer))

    if num_new_tokens > 0:
        # Initialize the new embedding rows with the mean of the existing ones.
        input_embeddings = model.get_input_embeddings().weight.data
        output_embeddings = model.get_output_embeddings().weight.data

        input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(
            dim=0, keepdim=True)
        output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(
            dim=0, keepdim=True)

        input_embeddings[-num_new_tokens:] = input_embeddings_avg
        output_embeddings[-num_new_tokens:] = output_embeddings_avg


# Add a real <pad> token if the tokenizer does not define one.
if tokenizer.pad_token is None:
    print("Adding pad token as '<pad>'")
    smart_tokenizer_and_embedding_resize(
        special_tokens_dict=dict(pad_token="<pad>"),
        tokenizer=tokenizer,
        model=model,
    )

And finally,

model.config.pad_token_id = tokenizer.pad_token_id

Here you add a <pad> token, resize the embeddings, and finally record this information in your model config.
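
If you go this route, also make sure the resized model and the tokenizer (which now carries <pad>) are saved together so the vocabulary sizes stay in sync at inference time, for example (the output directory is illustrative):

    # Illustrative output path; save model and tokenizer together so the
    # resized embeddings and the new <pad> token match at load time.
    output_dir = "./llama3-8b-finetuned"
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)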


Following the above, we trained a LLaMA-3-based LLaVA-v1.5 model and achieved very good results. All the code (fully supported with the official LLaVA framework), pretrained checkpoints, and evaluation results are available on our GitHub repo, LLaVA++.

Oscarjia commented 6 months ago

@mmaaz60 Thanks for sharing this better solution, that is really great! I have also starred your project!

mmaaz60 commented 6 months ago

Thank You @Oscarjia

Dandelionym commented 4 months ago

Thanks, everyone, for the suggested solutions, but this is still a problem for me. I am using the latest FastChat repo and the official Meta-Llama-3-8B model for fine-tuning. After hitting the error, I tried the changes from @mmaaz60's solution, but it still does not work: it raises the warnings shown below, and the loss is stuck at zero.

WARNING: tokenization mismatch: 69 vs. 70. #turn = 1. (ignored)
WARNING: tokenization mismatch: 392 vs. 393. #turn = 1. (ignored)
WARNING: tokenization mismatch: 1043 vs. 1044. #turn = 1. (ignored)
WARNING: tokenization mismatch: 281 vs. 282. #turn = 1. (ignored)
WARNING: tokenization mismatch: 435 vs. 436. #turn = 1. (ignored)
WARNING: tokenization mismatch: 611 vs. 612. #turn = 1. (ignored)
WARNING: tokenization mismatch: 463 vs. 464. #turn = 1. (ignored)
WARNING: tokenization mismatch: 461 vs. 462. #turn = 1. (ignored)
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 9.021199819576004e-08, 'epoch': 0.0}                          
  0%|                                           | 1/44328 [00:53<655:37:51, 53.25s/it]
WARNING: tokenization mismatch: 1187 vs. 1188. #turn = 1. (ignored)
WARNING: tokenization mismatch: 722 vs. 723. #turn = 1. (ignored)
WARNING: tokenization mismatch: 947 vs. 948. #turn = 1. (ignored)
WARNING: tokenization mismatch: 68 vs. 69. #turn = 1. (ignored)

Look at those numbers: cur_len is always one less than total_len. See the source code:

...
if cur_len < tokenizer.model_max_length:
    if cur_len != total_len:
        target[:] = IGNORE_TOKEN_ID
        rank0_print(
            f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}."
            f" #turn = {len(turns) - 1}. (ignored)"
        )
...
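
My guess, and it is only a guess, is that the constant off-by-one comes from the hard-coded token-count adjustments in preprocess() (fixed offsets that were tuned for the Llama-2 sentencepiece tokenizer) not lining up with Llama-3's tokenizer. A rough way to see that the two tokenizers count the same turn differently (model names assume Hub access; this is a diagnostic sketch, not a fix):

    from transformers import AutoTokenizer

    # Diagnostic sketch only: preprocess() subtracts fixed offsets from
    # per-turn token counts; offsets tuned for the Llama-2 tokenizer do not
    # necessarily transfer to Llama-3's tokenizer.
    text = "USER: hello ASSISTANT: hi there</s>"
    for name in ("meta-llama/Llama-2-7b-hf", "meta-llama/Meta-Llama-3-8B"):
        tok = AutoTokenizer.from_pretrained(name)
        ids = tok(text).input_ids
        print(f"{name}: {len(ids)} tokens, first id = {ids[0]}")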

What else should I change? Many thanks to all.

RDouglasSharp commented 4 months ago

I ended up switching to Axolotl to train. I find it runs significantly faster and supports Llama3 properly.

Dandelionym commented 4 months ago

Axolotl

Good to know, thanks @RDouglasSharp, let me try it!