lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.

WARNING: tokenization mismatch: 185 vs. 186. (ignored) #1290

Closed zxzhijia closed 1 year ago

zxzhijia commented 1 year ago

I get the consecutive warnings below when I fine-tune llama-7b-hf as the pretrained model on my own data. Is this a problem? Could anyone please guide me on how to fix it?

WARNING: tokenization mismatch: 185 vs. 186. (ignored)
WARNING: tokenization mismatch: 130 vs. 131. (ignored)
WARNING: tokenization mismatch: 139 vs. 140. (ignored)
WARNING: tokenization mismatch: 124 vs. 125. (ignored)
WARNING: tokenization mismatch: 185 vs. 186. (ignored)
WARNING: tokenization mismatch: 124 vs. 125. (ignored)
WARNING: tokenization mismatch: 72 vs. 73. (ignored)
WARNING: tokenization mismatch: 124 vs. 125. (ignored)
WARNING: tokenization mismatch: 130 vs. 131. (ignored)

Below is my llama model's config.json

{"architectures": ["LLaMAForCausalLM"], "bos_token_id": 0, "eos_token_id": 1, "hidden_act": "silu", "hidden_size": 4096, "intermediate_size": 11008, "initializer_range": 0.02, "max_sequence_length": 2048, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 32, "pad_token_id": -1, "rms_norm_eps": 1e-06, "torch_dtype": "float16", "transformers_version": "4.27.0.dev0", "use_cache": true, "vocab_size": 32000}

Below is my tokenizer_config.json

{"bos_token": "", "eos_token": "", "model_max_length": 1000000000000000019884624838656, "tokenizer_class": "LlamaTokenizer", "unk_token": ""}

Below is my special_tokens_map.json

{}

firqaaa commented 1 year ago

How did you solve this problem?

Ted8000 commented 1 year ago

Same problem here. How did you solve it?

ericzhou571 commented 1 year ago

Although I receive the same warning, I am relieved to see that the training can still be completed successfully. However, I do have concerns about the potential impact of these warnings on the performance of our finetuned model.

wcy1122 commented 1 year ago

Same problem here. Does this warning have any negative effect?

ajinkya123-robo commented 1 year ago

I am also getting the same warning. Is it anything to worry about?

lucasjinreal commented 1 year ago

Same here when training Baichuan2 with FastChat. Is there anything wrong with this warning message?

mianzhang commented 11 months ago

same here.

yuefengz commented 11 months ago

When this message appears, the entire training example is masked out, so it contributes nothing to the loss: https://github.com/lm-sys/FastChat/blob/a754c48bc74368a042e6bd24e808086ed6040fb4/fastchat/train/train.py#L161
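Roughly, the check behind that warning looks like the sketch below (paraphrased from the linked fastchat/train/train.py; the standalone helper and its name are mine, so see the pinned commit for the exact code):

```python
# Sketch of the masking logic behind the warning, paraphrased from the linked
# fastchat/train/train.py. The helper name and standalone form are assumptions.
IGNORE_TOKEN_ID = -100  # label value that Hugging Face loss functions skip


def mask_if_mismatched(target, cur_len, total_len, model_max_length):
    # cur_len: tokens accounted for by walking the conversation turn by turn
    # total_len: length of the tokenized full conversation (padding excluded)
    if cur_len < model_max_length and cur_len != total_len:
        # The per-turn bookkeeping disagrees with the full tokenization, so
        # every label is set to the ignore index and the example contributes
        # nothing to the loss.
        target[:] = IGNORE_TOKEN_ID
        print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}. (ignored)")
    return target
```

In other words, the warning itself is harmless, but every example that triggers it is effectively dropped from the loss.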

lucasjinreal commented 11 months ago

Why does it happen?

TjoyLiu commented 10 months ago

It happens because the same text can tokenize differently depending on its context. For example, if a turn starts with a word like "Solution", tokenizing that turn on its own produces the single piece "_Solution", but when the same "Solution" appears in the middle of the full conversation it can be split into two pieces, "Sol" and "ution". The per-turn token counts then no longer add up to the tokenization of the whole conversation. It is hard to fix.
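Here is a minimal sketch of that context-dependent split, assuming a SentencePiece-based Llama tokenizer (the checkpoint name is only an example, and the exact pieces can differ between tokenizer versions):

```python
# Illustrative only: shows how the same word can tokenize differently on its
# own vs. glued to preceding text. Checkpoint name and example strings are
# assumptions, not the exact FastChat template.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=False)

# Tokenized on its own, the word may come out as a single piece.
print(tok.tokenize("Solution"))            # e.g. ['▁Solution']

# Glued to the previous separator with no leading space, it may be split.
print(tok.tokenize("ASSISTANT:Solution"))  # e.g. [..., ':', 'Sol', 'ution']
```

If the turn-by-turn count in the training script assumes the first splitting while the full conversation actually produces the second, the totals drift by a token per affected turn, which matches the off-by-one numbers in the warnings above.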

lucasjinreal commented 8 months ago

@TjoyLiu will it affect training, and how can I ignore it?

TjoyLiu commented 8 months ago

> @TjoyLiu will it affect training, and how can I ignore it?

When you see this mismatch warning, just ignore it; the mismatched sample is not used at the training stage.

cainiaoup commented 1 month ago

> @TjoyLiu will it affect training, and how can I ignore it?
>
> When you see this mismatch warning, just ignore it; the mismatched sample is not used at the training stage.

Maybe all samples have the same warning, though.

cainiaoup commented 1 month ago

I got the same problem when using Llama 3 as the LLM part of LLaVA. I set '<|finetune_right_pad_id|>' as the pad token and then hit this warning:

tokenizer.pad_token = '<|finetune_right_pad_id|>'
tokenizer.pad_token_id = tokenizer.encode('<|finetune_right_pad_id|>', add_special_tokens=False)[0]
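One way to check whether the conversation template (rather than the pad token) is what triggers the mismatch is to compare how many tokens a turn occupies on its own versus inside the full prompt. This is only a hedged sketch: the checkpoint name and the prompt format here are assumptions, not the exact LLaVA/FastChat template.

```python
# Diagnostic sketch: checkpoint name and prompt format are illustrative only.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

turn = "Solution: add the two numbers."
full = f"USER: What is 1+1? ASSISTANT: {turn}"
prefix = full[: full.index(turn)]

# Tokens the turn gets when tokenized in isolation.
alone = len(tok(turn, add_special_tokens=False).input_ids)
# Tokens the same turn occupies inside the full prompt.
in_context = len(tok(full, add_special_tokens=False).input_ids) - len(
    tok(prefix, add_special_tokens=False).input_ids
)

# If these differ, per-turn bookkeeping will drift from the full tokenization,
# which is exactly the off-by-one reported in the warning.
print(alone, in_context)
```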