How did you solve this problem?
Same problem here. How did you solve it?
Although I receive the same warning, I am relieved to see that the training can still be completed successfully. However, I do have concerns about the potential impact of these warnings on the performance of our finetuned model.
Same problem here. Does this warning have any negative effect?
I am also getting the same warning. Is there anything to worry about?
Same here when training Baichuan2 with FastChat. Does this warning message indicate something is wrong?
same here.
When this message appears, the entire training example will be masked: https://github.com/lm-sys/FastChat/blob/a754c48bc74368a042e6bd24e808086ed6040fb4/fastchat/train/train.py#L161
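For context, the logic around that line works roughly as follows (a simplified paraphrase of the check in fastchat/train/train.py, not the verbatim source; the variable names mirror the ones used there):

```python
import torch

IGNORE_TOKEN_ID = -100  # transformers' default ignore index for the LM loss

def mask_if_mismatched(target: torch.Tensor, cur_len: int, total_len: int,
                       model_max_length: int) -> torch.Tensor:
    """Paraphrase of FastChat's check: if the label-building walk (cur_len)
    disagrees with the tokenized length of the conversation (total_len),
    mask the whole example."""
    if cur_len < model_max_length and cur_len != total_len:
        # Every label becomes the ignore index, so the sample contributes
        # nothing to the loss -- it is effectively dropped from training.
        target[:] = IGNORE_TOKEN_ID
        print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}. (ignored)")
    return target
```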
Why does it happen?
It happens because the same round of text can tokenize differently depending on its context. If a round starts with a word like "Solution", tokenizing that round on its own produces the single token "_Solution" (with the SentencePiece word-boundary prefix), but when the same text appears inside the full conversation it is split into two tokens, "Sol" and "ution". The per-round token counts then no longer match the full-conversation count, and this is hard to fix.
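A quick way to see this effect, assuming a SentencePiece-based LLaMA tokenizer (the checkpoint name and the prefix text below are placeholders; the exact splits depend on your tokenizer):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; substitute the model you are actually fine-tuning.
tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b", use_fast=False)

text = "Solution: first do X, then do Y."

# Tokenized as its own round, the leading word may become a single piece,
# e.g. "_Solution".
alone = tok.tokenize(text)

# Tokenized as a continuation of the conversation (no space before it),
# the same word may split differently, e.g. "Sol" + "ution", so the
# per-round token count no longer matches the full-conversation count.
in_context = tok.tokenize("ASSISTANT:" + text)

print(alone)
print(in_context)
```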
@TjoyLiu will it affect training, and how can I ignore it?
When you see this mismatch warning, you can just ignore it; the sample is not used in the training stage.
Maybe all samples have the same warning.
I got the same problem when using Llama 3 as LLaVA's LLM part. I set '<|finetune_right_pad_id|>' as the pad token and then got the problem:

tokenizer.pad_token = '<|finetune_right_pad_id|>'
tokenizer.pad_token_id = tokenizer.encode('<|finetune_right_pad_id|>', add_special_tokens=False)[0]
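As a side note, it may be worth checking that this pad token resolves to a single existing id; this is only a sanity check under the assumption that you are using a Llama 3.1 tokenizer, not a fix for the mismatch itself (the checkpoint name below is a placeholder):

```python
from transformers import AutoTokenizer

# Placeholder path; load the same Llama 3 tokenizer you use for training.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

pad = "<|finetune_right_pad_id|>"

# If the token exists in the vocabulary, encode() returns exactly one id and
# convert_tokens_to_ids() gives the same value (rather than falling back to UNK).
print(tokenizer.encode(pad, add_special_tokens=False))
print(tokenizer.convert_tokens_to_ids(pad))
```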
I get the consecutive warnings below when I fine-tune llama-7b-hf on my own data. Is this a problem? Could anyone please guide me on how to fix it?
WARNING: tokenization mismatch: 185 vs. 186. (ignored)
WARNING: tokenization mismatch: 130 vs. 131. (ignored)
WARNING: tokenization mismatch: 139 vs. 140. (ignored)
WARNING: tokenization mismatch: 124 vs. 125. (ignored)
WARNING: tokenization mismatch: 185 vs. 186. (ignored)
WARNING: tokenization mismatch: 124 vs. 125. (ignored)
WARNING: tokenization mismatch: 72 vs. 73. (ignored)
WARNING: tokenization mismatch: 124 vs. 125. (ignored)
WARNING: tokenization mismatch: 130 vs. 131. (ignored)
Below is my llama model's config.json
{"architectures": ["LLaMAForCausalLM"], "bos_token_id": 0, "eos_token_id": 1, "hidden_act": "silu", "hidden_size": 4096, "intermediate_size": 11008, "initializer_range": 0.02, "max_sequence_length": 2048, "model_type": "llama", "num_attention_heads": 32, "num_hidden_layers": 32, "pad_token_id": -1, "rms_norm_eps": 1e-06, "torch_dtype": "float16", "transformers_version": "4.27.0.dev0", "use_cache": true, "vocab_size": 32000}
Below is my tokenizer_config.json
{"bos_token": "", "eos_token": "", "model_max_length": 1000000000000000019884624838656, "tokenizer_class": "LlamaTokenizer", "unk_token": ""}
Below is my special_tokens_map.json
{}
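When every sample triggers the same off-by-one mismatch like this, one thing worth checking is whether the tokenizer and config carry the special tokens the training script expects: standard HF LLaMA checkpoints use bos_token_id=1 and eos_token_id=2, while the config above shows 0 and 1 and empty token strings. A small inspection sketch (the path is a placeholder for your local checkpoint):

```python
from transformers import AutoConfig, AutoTokenizer

path = "/path/to/llama-7b-hf"  # placeholder

tok = AutoTokenizer.from_pretrained(path, use_fast=False)
cfg = AutoConfig.from_pretrained(path)

print("tokenizer tokens :", repr(tok.bos_token), repr(tok.eos_token),
      repr(tok.unk_token), repr(tok.pad_token))
print("tokenizer ids    :", tok.bos_token_id, tok.eos_token_id)
print("config ids       :", cfg.bos_token_id, cfg.eos_token_id, cfg.pad_token_id)

# A constant off-by-one across all samples usually points at one extra or
# missing special token (e.g. BOS/EOS) rather than at the data itself, so
# comparing these values against what the preprocessing code assumes is a
# quick first check.
```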