Open · harrytea opened this issue 1 year ago
How much RAM do you have?
Why do I get this error during pre-training? Thank you very much.
[2023-10-21 19:41:04,065] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-21 19:41:06,429] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-10-21 19:41:06,430] [INFO] [runner.py:555:main] cmd = /home/nj/.conda/envs/llava/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None llava/train/train_mem.py --deepspeed ./scripts/zero2.json --model_name_or_path lmsys/vicuna-7b-v1.5 --version plain --data_path ./playground/data/LLaVA-Pretrain/blip_laion_cc_sbu_558k_first500.json --image_folder ./playground/data/LLaVA-Pretrain/images --vision_tower openai/clip-vit-large-patch14-336 --mm_projector_type mlp2x_gelu --tune_mm_mlp_adapter True --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --fp16 True --output_dir ./liuhaotian2/llava-v1.5-7b-pretrain --num_train_epochs 1 --per_device_train_batch_size 32 --per_device_eval_batch_size 4 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 24000 --save_total_limit 1 --learning_rate 1e-3 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --tf32 False --model_max_length 2048 --gradient_checkpointing True --dataloader_num_workers 4 --lazy_preprocess True --report_to wandb
[2023-10-21 19:41:07,902] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-21 19:41:09,817] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-10-21 19:41:09,817] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-10-21 19:41:09,818] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-10-21 19:41:09,818] [INFO] [launch.py:163:main] dist_world_size=4
[2023-10-21 19:41:09,818] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-10-21 19:41:12,902] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-21 19:41:12,952] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-21 19:41:13,003] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-21 19:41:13,021] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/media/nj/data2/nj/Models/LLaVA/llava/train/llama_flash_attn_monkey_patch.py:108: UserWarning: Flash attention is only supported on A100 or H100 GPU during training due to head dim > 64 backward.ref: https://github.com/HazyResearch/flash-attention/issues/190#issuecomment-1523359593
  warnings.warn(
/media/nj/data2/nj/Models/LLaVA/llava/train/llama_flash_attn_monkey_patch.py:108: UserWarning: Flash attention is only supported on A100 or H100 GPU during training due to head dim > 64 backward.ref: https://github.com/HazyResearch/flash-attention/issues/190#issuecomment-1523359593
  warnings.warn(
/media/nj/data2/nj/Models/LLaVA/llava/train/llama_flash_attn_monkey_patch.py:108: UserWarning: Flash attention is only supported on A100 or H100 GPU during training due to head dim > 64 backward.ref: https://github.com/HazyResearch/flash-attention/issues/190#issuecomment-1523359593
  warnings.warn(
/media/nj/data2/nj/Models/LLaVA/llava/train/llama_flash_attn_monkey_patch.py:108: UserWarning: Flash attention is only supported on A100 or H100 GPU during training due to head dim > 64 backward.ref: https://github.com/HazyResearch/flash-attention/issues/190#issuecomment-1523359593
  warnings.warn(
[2023-10-21 19:41:13,693] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-10-21 19:41:13,694] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-10-21 19:41:13,699] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-10-21 19:41:13,699] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-10-21 19:41:13,700] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-10-21 19:41:13,782] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-10-21 19:41:13,782] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-10-21 19:41:13,789] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-10-21 19:41:13,789] [INFO] [comm.py:594:init_distributed] cdb=None
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
[2023-10-21 19:41:56,780] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 10468
[2023-10-21 19:41:56,809] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 10469
[2023-10-21 19:41:57,776] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 10470
[2023-10-21 19:41:58,770] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 10471
[2023-10-21 19:41:59,773] [ERROR] [launch.py:321:sigkill_handler] ['/home/nj/.conda/envs/llava/bin/python', '-u', 'llava/train/train_mem.py', '--local_rank=3', '--deepspeed', './scripts/zero2.json', '--model_name_or_path', 'lmsys/vicuna-7b-v1.5', '--version', 'plain', '--data_path', './playground/data/LLaVA-Pretrain/blip_laion_cc_sbu_558k_first500.json', '--image_folder', './playground/data/LLaVA-Pretrain/images', '--vision_tower', 'openai/clip-vit-large-patch14-336', '--mm_projector_type', 'mlp2x_gelu', '--tune_mm_mlp_adapter', 'True', '--mm_vision_select_layer', '-2', '--mm_use_im_start_end', 'False', '--mm_use_im_patch_token', 'False', '--fp16', 'True', '--output_dir', './liuhaotian2/llava-v1.5-7b-pretrain', '--num_train_epochs', '1', '--per_device_train_batch_size', '32', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '1', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '24000', '--save_total_limit', '1', '--learning_rate', '1e-3', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'False', '--model_max_length', '2048', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '4', '--lazy_preprocess', 'True', '--report_to', 'wandb'] exits with return code = -9
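A return code of -9 means the workers were killed with SIGKILL, which on Linux most often comes from the kernel OOM killer when host RAM (not GPU memory) runs out, apparently here while the four ranks were each loading the Vicuna weights; that is presumably why the question above asks about RAM. A minimal sketch for checking this, assuming a standard Linux machine where you can read the kernel log:

```bash
# Check whether the kernel OOM killer sent the SIGKILL (return code = -9).
# Run this on the training machine right after the crash.
sudo dmesg -T | grep -iE "out of memory|oom-kill|killed process" | tail -n 20

# Equivalent check via the systemd journal, if available:
journalctl -k --since "1 hour ago" | grep -iE "oom|killed process"

# Watch host RAM while relaunching the pretraining script; the OOM killer
# reacts to CPU memory pressure, not GPU memory.
watch -n 2 free -h
```

If the OOM killer does show up there, the usual mitigations are adding swap, lowering --dataloader_num_workers, or reducing the number of ranks; treat these as general suggestions rather than a confirmed fix for this particular run.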
Question
I have successfully completed the pretrain stage, but for finetuning I encounter the following issue.
(llava2) wangyh@A16:/data/wangyh/mllms/LLaVA$ bash finetune2.sh
[2023-08-12 15:39:43,510] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:45,078] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-08-12 15:39:45,078] [INFO] [runner.py:555:main] cmd = /home/wangyh/miniconda3/envs/llava2/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None llava/train/train_mem.py --deepspeed /data/wangyh/mllms/LLaVA/finetune.json --model_name_or_path ./checkpoints/vicuna-7b-v1.5 --version v1 --data_path /data/wangyh/mllms/LLaVA/datasets/LLaVA-Instruct-150K/llava_instruct_150k.json --image_folder /data/wangyh/mllms/LLaVA/datasets/coco/train2017 --vision_tower openai/clip-vit-large-patch14 --pretrain_mm_mlp_adapter ./checkpoints/llava-7b-pretrain/mm_projector.bin --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --bf16 True --output_dir /data/wangyh/mllms/LLaVA/checkpoints/llava-7b-finetune --num_train_epochs 3 --per_device_train_batch_size 8 --per_device_eval_batch_size 4 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 50000 --save_total_limit 1 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --tf32 True --model_max_length 2048 --gradient_checkpointing True --dataloader_num_workers 4 --lazy_preprocess True --report_to wandb
[2023-08-12 15:39:46,224] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:47,788] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-08-12 15:39:47,788] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-08-12 15:39:47,788] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-08-12 15:39:47,788] [INFO] [launch.py:163:main] dist_world_size=8
[2023-08-12 15:39:47,788] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-08-12 15:39:50,339] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:50,390] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:50,425] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:50,505] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:50,557] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:50,764] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:50,820] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-12 15:39:50,821] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-08-12 15:39:50,842] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:50,865] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:50,868] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-12 15:39:50,868] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-08-12 15:39:50,905] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-12 15:39:50,905] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-08-12 15:39:50,984] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-12 15:39:50,985] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-08-12 15:39:51,085] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-12 15:39:51,085] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-08-12 15:39:51,296] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-12 15:39:51,296] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-08-12 15:39:51,296] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-08-12 15:39:51,339] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-12 15:39:51,339] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-08-12 15:39:51,353] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-12 15:39:51,353] [INFO] [comm.py:594:init_distributed] cdb=None
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
[2023-08-12 15:40:02,706] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 6.74B parameters
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:18<00:00, 9.29s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:18<00:00, 9.29s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:18<00:00, 9.30s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:18<00:00, 9.32s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:18<00:00, 9.32s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:18<00:00, 9.33s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:18<00:00, 9.33s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:18<00:00, 9.31s/it]
[2023-08-12 15:40:24,164] [WARNING] [partition_parameters.py:836:_post_init_method] param `class_embedding` in CLIPVisionEmbeddings not on GPU so was not broadcasted from rank 0
[2023-08-12 15:40:29,745] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 7.04B parameters
Formatting inputs...Skip in lazy mode
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/wangyh/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/wangyh/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.464034080505371 seconds
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/wangyh/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/wangyh/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/wangyh/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.4421682357788086 seconds
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/wangyh/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/wangyh/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.4753994941711426 seconds
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/wangyh/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/wangyh/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/wangyh/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/wangyh/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/wangyh/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.6163582801818848 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.6462419033050537 seconds
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.588582754135132 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.5909383296966553 seconds
Time to load cpu_adam op: 2.562427520751953 seconds
Parameter Offload: Total persistent parameters: 594944 in 311 params
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.15.8
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
{'loss': 6.0156, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}
{'loss': 6.0703, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}
{'loss': 5.9375, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}
{'loss': 5.9609, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}
{'loss': 6.0195, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}
{'loss': 5.9531, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}
{'loss': 6.0273, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}
{'loss': 5.9805, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}
{'loss': 5.9805, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}
{'loss': 6.207, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}
{'loss': 6.1289, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}
{'loss': 5.9102, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}
{'loss': 5.918, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}
{'loss': 5.9258, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}
{'loss': 6.0391, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}
{'loss': 5.9531, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}
{'loss': 5.8164, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}
{'loss': 5.8789, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}
{'loss': 5.957, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}
{'loss': 6.0977, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}
{'loss': 6.1484, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}
{'loss': 5.9609, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}
{'loss': 5.9453, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}
{'loss': 5.8945, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}
{'loss': 6.1094, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}
{'loss': 5.9219, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}
{'loss': 5.8203, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}
{'loss': 5.8984, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}
{'loss': 5.9375, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}
{'loss': 5.9531, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}
{'loss': 5.9648, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}
{'loss': 5.8711, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}
{'loss': 5.9141, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}
{'loss': 5.9961, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}
{'loss': 6.0977, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}
{'loss': 5.9531, 'learning_rate': 1.801801801801802e-07, 'epoch': 0.01}
{'loss': 5.9844, 'learning_rate': 1.801801801801802e-07, 'epoch': 0.02}
{'loss': 5.9648, 'learning_rate': 1.801801801801802e-07, 'epoch': 0.02}
{'loss': 5.8164, 'learning_rate': 1.801801801801802e-07, 'epoch': 0.02}
{'loss': 5.9414, 'learning_rate': 1.801801801801802e-07, 'epoch': 0.02}
{'loss': 6.0664, 'learning_rate': 1.801801801801802e-07, 'epoch': 0.02}
{'loss': 6.0625, 'learning_rate': 1.801801801801802e-07, 'epoch': 0.02}
1%|▋ | 42/7395 [09:48<27:33:57, 13.50s/it]Traceback (most recent call last):
  File "/data/wangyh/mllms/LLaVA/llava/train/train_mem.py", line 21, in <module>
    train()
  File "/data/wangyh/mllms/LLaVA/./llava/train/train.py", line 909, in train
    trainer.train()
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/transformers/trainer.py", line 2665, in training_step
    self.accelerator.backward(loss)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/accelerate/accelerator.py", line 1847, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1861, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1993, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1006, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1286, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1041, in reduce_independent_p_g_buckets_and_remove_grads
    self.__reduce_and_partition_ipg_grads()
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1091, in __reduce_and_partition_ipg_grads
    self.partition_grads(self.params_in_ipg_bucket, grad_partitions)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1271, in partition_grads
    fp32_grad_tensor.copy_(grad_buffer)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[2023-08-12 15:51:09,569] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3090130
[2023-08-12 15:51:14,623] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3090131
[2023-08-12 15:51:18,682] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3090132
[2023-08-12 15:51:22,988] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3090133
[2023-08-12 15:51:27,297] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3090134
[2023-08-12 15:51:27,298] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3090135
[2023-08-12 15:51:32,219] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3090136
[2023-08-12 15:51:36,482] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3090137
[2023-08-12 15:51:41,105] [ERROR] [launch.py:321:sigkill_handler] ['/home/wangyh/miniconda3/envs/llava2/bin/python', '-u', 'llava/train/train_mem.py', '--local_rank=7', '--deepspeed', '/data/wangyh/mllms/LLaVA/finetune.json', '--model_name_or_path', './checkpoints/vicuna-7b-v1.5', '--version', 'v1', '--data_path', '/data/wangyh/mllms/LLaVA/datasets/LLaVA-Instruct-150K/llava_instruct_150k.json', '--image_folder', '/data/wangyh/mllms/LLaVA/datasets/coco/train2017', '--vision_tower', 'openai/clip-vit-large-patch14', '--pretrain_mm_mlp_adapter', './checkpoints/llava-7b-pretrain/mm_projector.bin', '--mm_vision_select_layer', '-2', '--mm_use_im_start_end', 'False', '--mm_use_im_patch_token', 'False', '--bf16', 'True', '--output_dir', '/data/wangyh/mllms/LLaVA/checkpoints/llava-7b-finetune', '--num_train_epochs', '3', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '1', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '50000', '--save_total_limit', '1', '--learning_rate', '2e-5', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '2048', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '4', '--lazy_preprocess', 'True', '--report_to', 'wandb'] exits with return code = -6
It ran successfully for a while and then reported this error. What's wrong?
This is my shell file:
#!/bin/bash

# Uncomment and set the following variables correspondingly to run this script:

################## VICUNA ##################
PROMPT_VERSION=v1
MODEL_VERSION="vicuna-7b-v1.5"
################## VICUNA ##################

################## LLaMA-2 ##################
# PROMPT_VERSION="llava_llama_2"
# MODEL_VERSION="llama-2-7b-chat"
################## LLaMA-2 ##################

deepspeed llava/train/train_mem.py \
    --deepspeed /data/wangyh/mllms/LLaVA/finetune.json \
    --model_name_or_path ./checkpoints/$MODEL_VERSION \
    --version $PROMPT_VERSION \
    --data_path /data/wangyh/mllms/LLaVA/datasets/LLaVA-Instruct-150K/llava_instruct_150k.json \
    --image_folder /data/wangyh/mllms/LLaVA/datasets/coco/train2017 \
    --vision_tower openai/clip-vit-large-patch14 \
    --pretrain_mm_mlp_adapter ./checkpoints/llava-7b-pretrain/mm_projector.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 True \
    --output_dir /data/wangyh/mllms/LLaVA/checkpoints/llava-7b-finetune \
    --num_train_epochs 3 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb
Thanks
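The traceback above already hints at the first debugging step: CUDA errors are reported asynchronously, so the frame it blames (`fp32_grad_tensor.copy_` inside ZeRO stage 3's `partition_grads`) is not necessarily where the illegal memory access actually happened. A hedged sketch of how to reproduce the crash with a trustworthy stack trace, assuming `finetune2.sh` is the script shown above:

```bash
# Force synchronous CUDA kernel launches so the Python stack trace points at the
# kernel that actually faulted. This slows training down a lot, so use it only
# to reproduce the crash once.
CUDA_LAUNCH_BLOCKING=1 bash finetune2.sh

# Optionally also ask NCCL to log what it is doing, in case the fault happens
# during a cross-rank reduction rather than inside the model itself.
NCCL_DEBUG=INFO CUDA_LAUNCH_BLOCKING=1 bash finetune2.sh
```

The `TORCH_USE_CUDA_DSA` suggestion in the error message only applies if you build PyTorch from source with that flag enabled, so it is usually not the quickest route.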
Similar issue here. Have you solved it? Thanks.