haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Usage] Program exits abnormally with no useful information in the output #449

Closed: payne4handsome closed this issue 1 year ago

payne4handsome commented 1 year ago

Describe the issue

Issue: Hi @haotian-liu, please help me. I have downloaded the LLaMA-2 weights, the LLaVA-Instruct-150K data, and the pretrain_mm_mlp_adapter. I just want to verify that the program runs correctly, but it exits abnormally with no useful information in the output.

I am training LLaVA on 8 NVIDIA 3090 (24 GB) GPUs.

Command:

PYTHONPATH=. sh scripts/finetune_full_schedule.sh

Log:

root@node37:/home/zhangpan/workspace/LLaVA# PYTHONPATH=. sh scripts/finetune_full_schedule.sh
[2023-09-21 11:43:52,127] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-21 11:43:53,598] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-09-21 11:43:53,652] [INFO] [runner.py:555:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None llava/train/train_mem.py --deepspeed ./scripts/zero2.json --model_name_or_path ./checkpoints/llama2-7b-hf --version llava_llama_2 --data_path ./playground/data/llava_instruct_158k.json --image_folder ./playground/data/mscoco/train2017 --vision_tower openai/clip-vit-large-patch14 --pretrain_mm_mlp_adapter ./checkpoints/llava-llama2-7b-hf-pretrain/mm_projector.bin --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --bf16 True --output_dir ./checkpoints/llava-llama2-7b-hf-finetune --num_train_epochs 3 --per_device_train_batch_size 16 --per_device_eval_batch_size 4 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 50000 --save_total_limit 1 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --tf32 True --model_max_length 2048 --gradient_checkpointing True --dataloader_num_workers 4 --lazy_preprocess True --report_to wandb
[2023-09-21 11:43:55,799] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-21 11:43:56,413] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.13.4-1
[2023-09-21 11:43:56,413] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.13.4-1
[2023-09-21 11:43:56,413] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2
[2023-09-21 11:43:56,414] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
[2023-09-21 11:43:56,414] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.13.4-1+cuda11.7
[2023-09-21 11:43:56,414] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.13.4-1+cuda11.7
[2023-09-21 11:43:56,414] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.13.4-1
[2023-09-21 11:43:56,414] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-09-21 11:43:56,414] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-09-21 11:43:56,414] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-09-21 11:43:56,414] [INFO] [launch.py:163:main] dist_world_size=8
[2023-09-21 11:43:56,414] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-09-21 11:43:58,932] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-21 11:43:59,029] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-21 11:43:59,149] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-21 11:43:59,206] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-21 11:43:59,211] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-21 11:43:59,237] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-21 11:43:59,251] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-21 11:43:59,287] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-21 11:44:01,163] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-09-21 11:44:01,163] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-09-21 11:44:01,164] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-09-21 11:44:01,164] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-09-21 11:44:01,166] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-09-21 11:44:01,166] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-09-21 11:44:01,173] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-09-21 11:44:01,173] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-09-21 11:44:01,178] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-09-21 11:44:01,178] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-09-21 11:44:01,182] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-09-21 11:44:01,182] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-09-21 11:44:01,182] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-09-21 11:44:01,197] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-09-21 11:44:01,197] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-09-21 11:44:01,198] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-09-21 11:44:01,198] [INFO] [comm.py:616:init_distributed] cdb=None
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards:   0%|                                                                                                                        | 0/2 [00:00<?, ?it/s][2023-09-21 11:45:10,350] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 43166
[2023-09-21 11:45:13,658] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 43167
[2023-09-21 11:45:17,284] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 43168
[2023-09-21 11:45:17,285] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 43169
[2023-09-21 11:45:22,391] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 43170
Loading checkpoint shards:  50%|████████████████████████████████████████████████████████                                                        | 1/2 [00:22<00:22, 22.76s/it][2023-09-21 11:45:25,652] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 43171
[2023-09-21 11:45:27,714] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 43172
[2023-09-21 11:45:29,921] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 43173
[2023-09-21 11:45:32,930] [ERROR] [launch.py:321:sigkill_handler] ['/opt/conda/bin/python', '-u', 'llava/train/train_mem.py', '--local_rank=7', '--deepspeed', './scripts/zero2.json', '--model_name_or_path', './checkpoints/llama2-7b-hf', '--version', 'llava_llama_2', '--data_path', './playground/data/llava_instruct_158k.json', '--image_folder', './playground/data/mscoco/train2017', '--vision_tower', 'openai/clip-vit-large-patch14', '--pretrain_mm_mlp_adapter', './checkpoints/llava-llama2-7b-hf-pretrain/mm_projector.bin', '--mm_vision_select_layer', '-2', '--mm_use_im_start_end', 'False', '--mm_use_im_patch_token', 'False', '--bf16', 'True', '--output_dir', './checkpoints/llava-llama2-7b-hf-finetune', '--num_train_epochs', '3', '--per_device_train_batch_size', '16', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '1', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '50000', '--save_total_limit', '1', '--learning_rate', '2e-5', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '2048', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '4', '--lazy_preprocess', 'True', '--report_to', 'wandb'] exits with return code = -9
root@node37:/home/zhangpan/workspace/LLaVA#

Screenshots: my finetune_full_schedule.sh looks like the screenshot below. [image]

payne4handsome commented 1 year ago

The cause of this exception is running out of CPU memory. Setting the parameter low_cpu_mem_usage=True in from_pretrained() resolved it. I am closing this issue.

yytzsy commented 1 year ago

> The cause of this exception is running out of CPU memory. Setting the parameter low_cpu_mem_usage=True in from_pretrained() resolved it. I am closing this issue.

How do you set low_cpu_mem_usage=True? Can you show the code? Thank you very much!

payne4handsome commented 1 year ago

@yytzsy If you train with DeepSpeed ZeRO stage 2, the code looks like this:

            model = LlavaLlamaForCausalLM.from_pretrained(
                model_args.model_name_or_path,
                cache_dir=training_args.cache_dir,
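                # low_cpu_mem_usage=True avoids materializing a second full copy of the
                # weights in host RAM while the checkpoint shards load, which is what
                # was exhausting CPU memory here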
                low_cpu_mem_usage=True,
                **bnb_model_from_pretrained_args
            )

If you use DeepSpeed with ZeRO stage 3, you don't need to do this and training will work correctly.
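For reference, the only launch-side difference between the ZeRO-2 and ZeRO-3 setups is which DeepSpeed config --deepspeed points at. A minimal sketch, assuming the repo's ./scripts/zero3.json sits next to the zero2.json used above, with every other flag taken verbatim from the logged command:

# sketch only: same flags as the logged ZeRO-2 run; only the --deepspeed config differs
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path ./checkpoints/llama2-7b-hf \
    --version llava_llama_2 \
    --data_path ./playground/data/llava_instruct_158k.json \
    --image_folder ./playground/data/mscoco/train2017 \
    --vision_tower openai/clip-vit-large-patch14 \
    --pretrain_mm_mlp_adapter ./checkpoints/llava-llama2-7b-hf-pretrain/mm_projector.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 True \
    --output_dir ./checkpoints/llava-llama2-7b-hf-finetune \
    --num_train_epochs 3 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy no \
    --save_strategy steps \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb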

459737087 commented 10 months ago

There is a contradiction here: when I use zero3_offload.json, it reports

 exits with return code = -9

and when I use zero2.json, it reports OOM. I don't know how to solve this.
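A return code of -9 means the training process received SIGKILL, which on Linux usually comes from the kernel OOM killer when host memory runs out, i.e. the same root cause identified above. As an optional check (not something reported in this thread), the kernel log right after the crash should show whether that happened:

dmesg | grep -i -E "out of memory|killed process"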

ybsu commented 5 months ago

> The cause of this exception is running out of CPU memory. Setting the parameter low_cpu_mem_usage=True in from_pretrained() resolved it. I am closing this issue.

I have set low_cpu_mem_usage=True, but the issue still exists. What should I do next? Thanks.