mbzuai-oryx / LLaVA-pp

🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)

Tokenization mismatch in Phi-3 during fine-tuning #17

Closed hellangleZ closed 6 months ago

hellangleZ commented 6 months ago

```
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using conversation format: phi3
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Using conversation format: phi3
[2024-05-03 23:34:36,587] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 586, num_elems = 4.12B
Formatting inputs...Skip in lazy mode
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Parameter Offload: Total persistent parameters: 530432 in 312 params
  0%| | 0/5198 [00:00<?, ?it/s]
WARNING: tokenization mismatch: 565 vs. 569. (ignored)
WARNING: tokenization mismatch: 505 vs. 514. (ignored)
WARNING: tokenization mismatch: 505 vs. 509. (ignored)
WARNING: tokenization mismatch: 505 vs. 514. (ignored)
WARNING: tokenization mismatch: 510 vs. 519. (ignored)
WARNING: tokenization mismatch: 465 vs. 485. (ignored)
WARNING: tokenization mismatch: 336 vs. 340. (ignored)
WARNING: tokenization mismatch: 494 vs. 497. (ignored)
WARNING: tokenization mismatch: 471 vs. 480. (ignored)
WARNING: tokenization mismatch: 524 vs. 533. (ignored)
WARNING: tokenization mismatch: 477 vs. 485. (ignored)
WARNING: tokenization mismatch: 509 vs. 518. (ignored)
WARNING: tokenization mismatch: 514 vs. 523. (ignored)
WARNING: tokenization mismatch: 539 vs. 566. (ignored)
WARNING: tokenization mismatch: 672 vs. 703. (ignored)
WARNING: tokenization mismatch: 322 vs. 336. (ignored)
WARNING: tokenization mismatch: 516 vs. 525. (ignored)
WARNING: tokenization mismatch: 508 vs. 517. (ignored)
WARNING: tokenization mismatch: 501 vs. 510. (ignored)
WARNING: tokenization mismatch: 503 vs. 528. (ignored)
WARNING: tokenization mismatch: 529 vs. 538. (ignored)
WARNING: tokenization mismatch: 477 vs. 485. (ignored)
WARNING: tokenization mismatch: 502 vs. 511. (ignored)
WARNING: tokenization mismatch: 467 vs. 475. (ignored)
WARNING: tokenization mismatch: 536 vs. 545. (ignored)
WARNING: tokenization mismatch: 512 vs. 521. (ignored)
WARNING: tokenization mismatch: 302 vs. 307. (ignored)
WARNING: tokenization mismatch: 365 vs. 371. (ignored)
WARNING: tokenization mismatch: 337 vs. 354. (ignored)
WARNING: tokenization mismatch: 152 vs. 158. (ignored)
WARNING: tokenization mismatch: 526 vs. 535. (ignored)
WARNING: tokenization mismatch: 371 vs. 374. (ignored)
WARNING: tokenization mismatch: 325 vs. 341. (ignored)
WARNING: tokenization mismatch: 372 vs. 390. (ignored)
WARNING: tokenization mismatch: 480 vs. 483. (ignored)
WARNING: tokenization mismatch: 544 vs. 548. (ignored)
WARNING: tokenization mismatch: 138 vs. 142. (ignored)
WARNING: tokenization mismatch: 630 vs. 633. (ignored)
WARNING: tokenization mismatch: 200 vs. 203. (ignored)
WARNING: tokenization mismatch: 227 vs. 236. (ignored)
WARNING: tokenization mismatch: 221 vs. 225. (ignored)
WARNING: tokenization mismatch: 494 vs. 503. (ignored)
WARNING: tokenization mismatch: 398 vs. 405. (ignored)
WARNING: tokenization mismatch: 121 vs. 125. (ignored)
WARNING: tokenization mismatch: 516 vs. 525. (ignored)
WARNING: tokenization mismatch: 404 vs. 411. (ignored)
WARNING: tokenization mismatch: 511 vs. 520. (ignored)
WARNING: tokenization mismatch: 135 vs. 139. (ignored)
WARNING: tokenization mismatch: 339 vs. 343. (ignored)
WARNING: tokenization mismatch: 353 vs. 357. (ignored)
WARNING: tokenization mismatch: 172 vs. 175. (ignored)
WARNING: tokenization mismatch: 332 vs. 338. (ignored)
WARNING: tokenization mismatch: 128 vs. 132. (ignored)
WARNING: tokenization mismatch: 153 vs. 157. (ignored)
WARNING: tokenization mismatch: 249 vs. 259. (ignored)
WARNING: tokenization mismatch: 356 vs. 371. (ignored)
WARNING: tokenization mismatch: 509 vs. 518. (ignored)
WARNING: tokenization mismatch: 516 vs. 525. (ignored)
WARNING: tokenization mismatch: 118 vs. 121. (ignored)
WARNING: tokenization mismatch: 150 vs. 154. (ignored)
WARNING: tokenization mismatch: 155 vs. 159. (ignored)
WARNING: tokenization mismatch: 172 vs. 179. (ignored)
WARNING: tokenization mismatch: 123 vs. 126. (ignored)
WARNING: tokenization mismatch: 102 vs. 105. (ignored)
WARNING: tokenization mismatch: 123 vs. 127. (ignored)
WARNING: tokenization mismatch: 106 vs. 109. (ignored)
WARNING: tokenization mismatch: 505 vs. 514. (ignored)
WARNING: tokenization mismatch: 387 vs. 393. (ignored)
WARNING: tokenization mismatch: 128 vs. 132. (ignored)
WARNING: tokenization mismatch: 498 vs. 507. (ignored)
WARNING: tokenization mismatch: 91 vs. 94. (ignored)
WARNING: tokenization mismatch: 117 vs. 121. (ignored)
WARNING: tokenization mismatch: 151 vs. 155. (ignored)
WARNING: tokenization mismatch: 194 vs. 203. (ignored)
WARNING: tokenization mismatch: 126 vs. 130. (ignored)
WARNING: tokenization mismatch: 91 vs. 94. (ignored)
```

hellangleZ commented 6 months ago


I use the Phi-3-instruct May 1st version. Pretraining works well, but this error occurs during the fine-tuning process.

I found what may be a conflict in the code.

In train.py:

[screenshot]

Should it be changed to match the LLaMA-3 version?

[screenshot]
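For reference, the warning in the log comes from a length consistency check during supervised preprocessing. The sketch below is a simplified illustration of that check, not the repository's exact code; the function name, the `rounds` splitting, and the BOS handling are assumptions:

```python
# Simplified sketch of the check behind "WARNING: tokenization mismatch" (assumed,
# not the exact LLaVA-pp code). The full conversation is tokenized once to get
# total_len, then each round is tokenized again while building the label mask.
# If the accumulated per-round count drifts from total_len, the whole sample is
# masked out and the warning is printed.

IGNORE_INDEX = -100  # HuggingFace convention: labels with this value are skipped by the loss


def mask_instruction_tokens(conversation, rounds, tokenizer, target):
    # target: tensor of label ids for this sample
    total_len = len(tokenizer(conversation).input_ids)

    cur_len = 1  # account for the BOS token (assumption; depends on the tokenizer)
    for rou in rounds:
        round_len = len(tokenizer(rou).input_ids) - 1  # minus the BOS added to each piece
        # ... here the instruction part of this round would be set to IGNORE_INDEX ...
        cur_len += round_len

    if cur_len != total_len:
        target[:] = IGNORE_INDEX  # the whole sample contributes nothing to the loss
        print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}. (ignored)")
    return target
```

Because every mismatched sample is fully masked, a long stream of these warnings means most of the fine-tuning data contributes no gradient; the run keeps going but effectively learns nothing from those samples. That is why the per-round arithmetic (separator tokens, BOS handling) has to match the tokenizer exactly.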

hellangleZ commented 6 months ago

After testing, this issue does not occur with LLaMA-3 fine-tuning.

mmaaz60 commented 6 months ago

Hi @hellangleZ

Thank you for your interest in our work. One possible reason for this issue is a wrong --version value. Can you please confirm that you are using --version phi3_instruct in your experiment?

If the issue is not resolved, please provide the detailed steps you followed to run the training so that I can reproduce the error and help you better. Thank you :)
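For anyone hitting the same warning, the --version flag selects the conversation template used to format and tokenize each sample. A rough sketch of how that selection typically looks in LLaVA-style training code (module path and error handling assumed):

```python
# Rough sketch of what --version controls (assumed; mirrors the usual LLaVA train.py pattern).
# The chosen template defines the role markers and separators that the preprocessing
# arithmetic relies on, so a template that does not match the model's tokenizer
# produces the "tokenization mismatch" warning for nearly every sample.

from llava import conversation as conversation_lib  # import path assumed


def select_conversation_template(version: str) -> None:
    if version not in conversation_lib.conv_templates:
        raise ValueError(f"Unknown --version value: {version}")
    conversation_lib.default_conversation = conversation_lib.conv_templates[version]


# e.g. select_conversation_template("phi3_instruct") for the Phi-3 fine-tuning run
```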

hellangleZ commented 6 months ago

> Hi @hellangleZ
>
> Thank you for your interest in our work. One possible reason for this issue is a wrong --version value. Can you please confirm that you are using --version phi3_instruct in your experiment?
>
> If the issue is not resolved, please provide the detailed steps you followed to run the training so that I can reproduce the error and help you better. Thank you :)

Yes, I use phi3_instruct.

Hi @mmaaz60

I found some issues in the pretraining process.

Like this:

[screenshot]

And:

[screenshot]

I think the problem may be due to these two fields. Could you help check this against the Phi-3 repo?

My fine-tuning script is copied entirely from your template, but after changing the field it still shows the tokenization mismatch:

```bash
deepspeed llava/train/train_mem.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path /data2/phi3-A11 \
    --version phi3_instruct \
    --data_path /data2/llavaft/llava_v1_5_mix665k.json \
    --image_folder /data2/LLaVA-main/playground/data \
    --vision_tower /data2/openai/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter ./checkpoints/llava-v1.5-phi3-mini-pretrain/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/llava-v1.5-phi3-mini-lora \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
```
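A small diagnostic that may help here (a hypothetical check, not part of LLaVA-pp): compare how the local Phi-3 snapshot used above and the current upstream checkpoint tokenize the phi3_instruct separators. If a marker such as `<|end|>` splits into multiple pieces in one snapshot but is a single special token in the other, the per-round length arithmetic shifts by a few tokens per turn, which would match the small offsets in the warnings.

```python
# Hypothetical sanity check, not part of the repository: print the token ids that a
# given Phi-3 snapshot assigns to the chat separator markers. Differences between
# the May 1st snapshot and the updated upstream checkpoint would explain why only
# the older tokenizer triggers the mismatch warnings.

from transformers import AutoTokenizer


def show_separator_ids(model_path: str) -> None:
    tok = AutoTokenizer.from_pretrained(model_path)
    for marker in ("<|user|>", "<|assistant|>", "<|end|>"):
        ids = tok(marker, add_special_tokens=False).input_ids
        print(f"{model_path}: {marker!r} -> {ids}")


show_separator_ids("/data2/phi3-A11")                   # local snapshot from the script above
show_separator_ids("microsoft/Phi-3-mini-4k-instruct")  # current upstream checkpoint
```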

hellangleZ commented 6 months ago

Please close it. Just use the newest Phi-3 model; it solves the problem.