Hi @Luo-Z13,

Thank you for your interest in our work. Could you please confirm which base LLM and --version value you are using?

This issue arises when you are either using the wrong base LLM or have set a different value for --version. Please make sure to use meta-llama/Meta-Llama-3-8B-Instruct as the base model for LLaMA-3 based trainings, and microsoft/Phi-3-mini-4k-instruct as the base model for Phi-3 based trainings.

Let me know if the issue persists. Thank you!
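For illustration, a quick sanity check along these lines can catch a mismatched pairing before launching a run. This is a minimal hypothetical sketch, not code from the repo: the helper name and the mapping are assumptions based on this thread, where only llama3 is confirmed as a --version value.

```python
# Hypothetical helper: verify that the base checkpoint matches the --version
# template before training. Only the "llama3" value is confirmed in this
# thread; the "phi3" key is an assumption for the Phi-3 recipe.
EXPECTED_BASE = {
    "llama3": "meta-llama/Meta-Llama-3-8B-Instruct",
    "phi3": "microsoft/Phi-3-mini-4k-instruct",
}

def check_base_model(version: str, model_name_or_path: str) -> None:
    expected = EXPECTED_BASE.get(version)
    if expected is None:
        raise ValueError(f"Unknown --version value: {version!r}")
    if model_name_or_path != expected:
        raise ValueError(
            f"--version {version!r} expects base model {expected!r}, "
            f"but got {model_name_or_path!r}"
        )

check_base_model("llama3", "meta-llama/Meta-Llama-3-8B-Instruct")  # passes
```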
Thank you for the reminder, I have found the reason: LLaMA-3 once updated its tokenizer_config.json file. I had downloaded the version from April 15th, but I have now updated it to the latest version and everything is working fine. Once again, I really appreciate your patient response!
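For anyone who hits the same stale-cache problem, one way to refresh the files is to force a re-download past the local Hugging Face cache. This is a minimal sketch of a generic transformers pattern, not something prescribed by the repo:

```python
from transformers import AutoTokenizer

# Bypass the local cache so an outdated tokenizer_config.json (e.g. the
# April 15th snapshot) is replaced by the latest files from the Hub.
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    force_download=True,
)
print(type(tokenizer).__name__)
```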
Hi @Luo-Z13,

The pip dependency warning can be ignored. The TypeError: pad_sequence(): argument 'padding_value' (position 3) must be float, not NoneType error occurs during LLaMA-3 based model training. LLaMA-3 does not actually use any pad token; however, during LLaVA-LLaMA-3 training we need one. The workaround is to add a special token and resize the embeddings. This is done at https://github.com/mbzuai-oryx/LLaVA-pp/blob/b93d9c8d8539e794fc79a867aae08c4d7b3b6de7/LLaMA-3-V/train.py#L1015.

Please make sure that the baseline official LLaVA code is working properly, and then copy all the LLaMA-3 related files into the corresponding directory. Lastly, please note that to run LLaMA-3 based training you need to pass --version llama3.

I hope this helps solve the issue. Good luck.
Originally posted by @mmaaz60 in https://github.com/mbzuai-oryx/LLaVA-pp/issues/8#issuecomment-2088868773
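For reference, the add-a-pad-token-and-resize pattern described above typically looks like the following. This is a minimal sketch of the general idea, not the repo's exact code: the `<pad>` token string is an assumption, and train.py may use a different token and initialize the new embedding rows differently (LLaVA-style code often mean-initializes them).

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

seqs = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]

# LLaMA-3's tokenizer ships with pad_token_id == None, so this would raise:
# TypeError: pad_sequence(): argument 'padding_value' (position 3) must be
# float, not NoneType
# pad_sequence(seqs, batch_first=True, padding_value=tokenizer.pad_token_id)

# Workaround: register a pad token, then grow the embedding matrix so the
# new token id has a corresponding row.
num_added = tokenizer.add_special_tokens({"pad_token": "<pad>"})  # assumed token string
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))

# Now pad_token_id is a valid int and batching works.
padded = pad_sequence(seqs, batch_first=True, padding_value=tokenizer.pad_token_id)
```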
Thank you very much! My previously reported TypeError: pad_sequence(): argument 'padding_value' (position 3) must be float, not NoneType issue has been resolved after copying the right train.py file; thanks for your advice on that matter.

However, I still encounter a tokenization mismatch issue during training. My current environment:

And the beginning of the training output is as follows: