haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Usage] TypeError: LlavaLlamaForCausalLM.__init__() got an unexpected keyword argument 'attn_implementation' #1103

Open ThugJudy opened 7 months ago

ThugJudy commented 7 months ago

Describe the issue

Issue: Getting an error when trying to fine-tune LLaVA-v1.6-34b.

Command:


```bash
#!/bin/bash

deepspeed LLaVA/llava/train/train_mem.py \
    --deepspeed LLaVA/scripts/zero3.json \
    --model_name_or_path liuhaotian/llava-v1.6-34b \
    --version v1 \
    --data_path datasets/bargraph_data.json \
    --image_folder datasets \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/llava-v1.6-34b-task \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb \
    --cache_dir <>
```

Log:


```
[2024-02-08 11:13:01,934] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-02-08 11:13:40,223] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=0,1: setting --include=localhost:0,1
[2024-02-08 11:13:40,223] [INFO] [runner.py:555:main] cmd = /u/psg4/.conda/envs/llava/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None LLaVA/llava/train/train_mem.py --deepspeed LLaVA/scripts/zero3.json --model_name_or_path liuhaotian/llava-v1.6-34b --version v1 --data_path datasets/bargraph_data.json --image_folder datasets --vision_tower openai/clip-vit-large-patch14-336 --mm_projector_type mlp2x_gelu --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --image_aspect_ratio pad --group_by_modality_length True --bf16 True --output_dir ./checkpoints/llava-v1.6-34b-task --num_train_epochs 1 --per_device_train_batch_size 16 --per_device_eval_batch_size 4 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 50000 --save_total_limit 1 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --tf32 True --model_max_length 2048 --gradient_checkpointing True --dataloader_num_workers 4 --lazy_preprocess True --report_to wandb --cache_dir /projects/bbpr/psg4
[2024-02-08 11:13:41,852] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-02-08 11:13:44,216] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2024-02-08 11:13:44,216] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2024-02-08 11:13:44,216] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2024-02-08 11:13:44,216] [INFO] [launch.py:163:main] dist_world_size=2
[2024-02-08 11:13:44,216] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2024-02-08 11:13:47,502] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-02-08 11:13:47,505] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-02-08 11:13:52,414] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-02-08 11:13:52,414] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-02-08 11:13:52,414] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-02-08 11:13:52,414] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-02-08 11:13:52,414] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
Traceback (most recent call last):
  File "/projects/bbpr/psg4/LLaVA_dataviz/LLaVA/llava/train/train_mem.py", line 4, in <module>
    train(attn_implementation="flash_attention_2")
  File "/projects/bbpr/psg4/LLaVA_dataviz/LLaVA/llava/train/train.py", line 827, in train
[2024-02-08 11:13:52,907] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 0.00B parameters
Traceback (most recent call last):
  File "/projects/bbpr/psg4/LLaVA_dataviz/LLaVA/llava/train/train_mem.py", line 4, in <module>
    model = LlavaLlamaForCausalLM.from_pretrained(
  File "/u/psg4/.conda/envs/llava/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2700, in from_pretrained
    train(attn_implementation="flash_attention_2")
  File "/projects/bbpr/psg4/LLaVA_dataviz/LLaVA/llava/train/train.py", line 827, in train
    model = LlavaLlamaForCausalLM.from_pretrained(
  File "/u/psg4/.conda/envs/llava/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2700, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/u/psg4/.conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 385, in wrapper
    model = cls(config, *model_args, **model_kwargs)
  File "/u/psg4/.conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 385, in wrapper
    f(module, *args, **kwargs)
TypeError: LlavaLlamaForCausalLM.__init__() got an unexpected keyword argument 'attn_implementation'
    f(module, *args, **kwargs)
TypeError: LlavaLlamaForCausalLM.__init__() got an unexpected keyword argument 'attn_implementation'
[2024-02-08 11:14:00,233] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1260646
[2024-02-08 11:14:00,247] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1260647
[2024-02-08 11:14:00,247] [ERROR] [launch.py:321:sigkill_handler] ['/u/psg4/.conda/envs/llava/bin/python', '-u', 'LLaVA/llava/train/train_mem.py', '--local_rank=1', '--deepspeed', 'LLaVA/scripts/zero3.json', '--model_name_or_path', 'liuhaotian/llava-v1.6-34b', '--version', 'v1', '--data_path', 'datasets/bargraph_data.json', '--image_folder', 'datasets', '--vision_tower', 'openai/clip-vit-large-patch14-336', '--mm_projector_type', 'mlp2x_gelu', '--mm_vision_select_layer', '-2', '--mm_use_im_start_end', 'False', '--mm_use_im_patch_token', 'False', '--image_aspect_ratio', 'pad', '--group_by_modality_length', 'True', '--bf16', 'True', '--output_dir', './checkpoints/llava-v1.6-34b-task', '--num_train_epochs', '1', '--per_device_train_batch_size', '16', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '1', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '50000', '--save_total_limit', '1', '--learning_rate', '2e-5', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '2048', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '4', '--lazy_preprocess', 'True', '--report_to', 'wandb', '--cache_dir', '/projects/bbpr/psg4'] exits with return code = 1
```

CrossLee1 commented 7 months ago

have you solved this problem?

ThugJudy commented 7 months ago

have you solved this problem?

No, I wasn't able to fix the bug itself, but I deleted the repo and cloned it again.

zhangboshen commented 7 months ago

same problem here.

clairej12 commented 6 months ago

I am also having this issue.

nopanderer commented 6 months ago

I removed the attn_implementation argument in train_mem.py and just call train().
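
For anyone applying the same workaround, here is a minimal sketch, assuming train_mem.py matches the version shown in the traceback above (where line 4 is the `train(attn_implementation="flash_attention_2")` call):

```python
# LLaVA/llava/train/train_mem.py -- workaround sketch, not an official fix
from llava.train.train import train

if __name__ == "__main__":
    # Call train() without attn_implementation="flash_attention_2", as
    # described above, so the kwarg is not forwarded to
    # LlavaLlamaForCausalLM.from_pretrained() on older transformers builds,
    # which pass unrecognized kwargs straight to the model __init__.
    train()
```

Note this likely skips the Flash Attention 2 code path; upgrading transformers (as suggested below) avoids editing the script at all.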

pritamqu commented 5 months ago

it should be resolved if you update HF, e.g., 4.38

xianghan864 commented 5 months ago

it should be resolved if you update HF, e.g., 4.38

what does HF mean?

pritamqu commented 4 months ago

HuggingFace!

howardgriffin commented 4 months ago

same error

yhZhai commented 3 months ago

it should be resolved if you update HF, e.g., 4.38

I believe you were referring to transformers, not HF 😂

pritamqu commented 3 months ago

yeah correct!

xay2001 commented 1 month ago

Upgrade transformers to 4.35.0 using `pip install transformers==4.35.0`.

yonghoonkwon commented 1 week ago

pip install transformers==4.38.0 works fine
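
If the error persists after upgrading, it is worth confirming that the environment DeepSpeed launches (the logs above use /u/psg4/.conda/envs/llava/bin/python) actually has the new version. A trivial check, nothing LLaVA-specific:

```python
# Run with the same interpreter that deepspeed uses, e.g.
#   /path/to/conda/envs/llava/bin/python -c "import transformers; print(transformers.__version__)"
import transformers

# The comments above report success with 4.35.0 and 4.38.0.
print(transformers.__version__)
```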