[Usage] Fine-tuning 7B hangs during training and never progresses, 8x A100 40G

Describe the issue

Issue: Fine-tuning the 7B model hangs during training and never makes progress, on 8x A100 40G.

Command:

```
GPU=$((WORLD_SIZE * 8))
echo GPU=${GPU}
echo WORLD_SIZE=$WORLD_SIZE
echo RANK=$RANK
echo MASTER_ADDR=$MASTER_ADDR
echo MASTER_PORT=$MASTER_PORT

WANDB_MODE=online deepspeed llava/train/train_mem_CoS.py \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path ./checkpoints/llava-v1.5-7b-ckpt \
    --version v1 \
    --data_path ./playground/data/llava_v1_5_mix665k.json \
    --cos_data_path /PATH/TO/YOUR/DATA \
    --image_folder ./playground/data \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter ./checkpoints/llava-v1.5-7b-ckpt/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/llava-v1.5-7b-CoS \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb
```

Log:

(cos) root@ubuntu22:/home/vipuser/Chain-of-Spot# bash ./scripts/v1_5/finetune_CoS_7b.sh
GPU=0
WORLD_SIZE=
RANK=
MASTER_ADDR=
MASTER_PORT=
[2024-09-21 17:35:39,385] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/vipuser/miniconda3/envs/cos/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
  def forward(ctx, input, weight, bias=None):
/home/vipuser/miniconda3/envs/cos/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
  def backward(ctx, grad_output):
[2024-09-21 17:35:42,605] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-09-21 17:35:42,605] [INFO] [runner.py:555:main] cmd = /home/vipuser/miniconda3/envs/cos/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None llava/train/train_mem_CoS.py --deepspeed ./scripts/zero3.json --model_name_or_path ./checkpoints/llava-v1.5-7b-ckpt --version v1 --data_path ./playground/data/llava_v1_5_mix665k.json --cos_data_path /PATH/TO/YOUR/DATA --image_folder ./playground/data --vision_tower openai/clip-vit-large-patch14-336 --pretrain_mm_mlp_adapter ./checkpoints/llava-v1.5-7b-ckpt/mm_projector.bin --mm_projector_type mlp2x_gelu --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --image_aspect_ratio pad --group_by_modality_length True --bf16 True --output_dir ./checkpoints/llava-v1.5-7b-CoS --num_train_epochs 1 --per_device_train_batch_size 16 --per_device_eval_batch_size 4 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 50000 --save_total_limit 1 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --tf32 True --model_max_length 2048 --gradient_checkpointing True --dataloader_num_workers 4 --lazy_preprocess True --report_to wandb
[2024-09-21 17:35:43,941] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[the two linear.py FutureWarnings above are printed again here]
[2024-09-21 17:35:46,275] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2024-09-21 17:35:46,275] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0
[2024-09-21 17:35:46,275] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2024-09-21 17:35:46,275] [INFO] [launch.py:163:main] dist_world_size=8
[2024-09-21 17:35:46,275] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2024-09-21 17:35:51,193] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-21 17:35:51,315] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-21 17:35:51,358] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-21 17:35:51,373] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-21 17:35:51,415] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-21 17:35:51,420] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-21 17:35:51,420] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-21 17:35:51,482] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[the two linear.py FutureWarnings above are printed eight more times, interleaved with the eight lines above]
[2024-09-21 17:35:52,051] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-09-21 17:35:52,051] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-09-21 17:35:52,117] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-09-21 17:35:52,117] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-09-21 17:35:52,165] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-09-21 17:35:52,165] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-09-21 17:35:52,172] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-09-21 17:35:52,172] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-09-21 17:35:52,172] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-09-21 17:35:52,172] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-09-21 17:35:52,173] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-09-21 17:35:52,185] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-09-21 17:35:52,185] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-09-21 17:35:52,231] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-09-21 17:35:52,231] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-09-21 17:35:52,282] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-09-21 17:35:52,282] [INFO] [comm.py:594:init_distributed] cdb=None
/home/vipuser/miniconda3/envs/cos/lib/python3.10/site-packages/huggingface_hub/file_download.py:1142: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
  warnings.warn(
[the resume_download FutureWarning above is printed eight times in total]
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
/home/vipuser/miniconda3/envs/cos/lib/python3.10/site-packages/transformers/modeling_utils.py:460: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  return torch.load(checkpoint_file, map_location="cpu")
[the modeling_utils.py:460 FutureWarning above is printed eight times in total, interleaved with further "Loading checkpoint shards: 0%" lines]
[2024-09-21 17:35:58,155] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 6.76B parameters
Loading checkpoint shards: 100%|██████████| 2/2 [00:23<00:00, 11.64s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:23<00:00, 11.65s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:23<00:00, 11.65s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:23<00:00, 11.66s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:23<00:00, 11.66s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:23<00:00, 11.67s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:23<00:00, 11.68s/it]
[2024-09-21 17:36:24,133] [WARNING] [partition_parameters.py:836:_post_init_method] param class_embedding in CLIPVisionEmbeddings not on GPU so was not broadcasted from rank 0
[2024-09-21 17:36:24,442] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 7.06B parameters
/home/vipuser/Chain-of-Spot/llava/model/llava_arch.py:74: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  mm_projector_weights = torch.load(pretrain_mm_mlp_adapter, map_location='cpu')
[the llava_arch.py:74 FutureWarning above is printed eight times in total]
pwd/home/vipuser/Chain-of-Spot
[the pwd line above is printed eight times in total]

Formatting inputs...Skip in lazy mode
Parameter Offload: Total persistent parameters: 599040 in 312 params
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: 43217594 (43217594-huazhong-university-of-science-and-technology). Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.18.1
wandb: Run data is saved locally in /home/vipuser/Chain-of-Spot/wandb/run-20240921_173744-8g7rq791
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run dazzling-dust-1
wandb: ⭐️ View project at https://wandb.ai/43217594-huazhong-university-of-science-and-technology/huggingface
wandb: 🚀 View run at https://wandb.ai/43217594-huazhong-university-of-science-and-technology/huggingface/runs/8g7rq791
0%| | 0/5198 [00:00<?, ?it/s]
/home/vipuser/miniconda3/envs/cos/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:600: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
  return fn(*args, **kwargs)
[the use_reentrant UserWarning above is printed twice]

[rank6]:[E921 17:55:52.146960461 ProcessGroupNCCL.cpp:1375] [PG 0 (default_pg) Rank 6] First PG on this rank that detected no heartbeat of its watchdog.
[rank6]:[E921 17:55:52.147105451 ProcessGroupNCCL.cpp:1413] [PG 0 (default_pg) Rank 6] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=3
[rank7]:[E921 17:55:52.256762614 ProcessGroupNCCL.cpp:1375] [PG 0 (default_pg) Rank 7] First PG on this rank that detected no heartbeat of its watchdog.
[rank7]:[E921 17:55:52.256849497 ProcessGroupNCCL.cpp:1413] [PG 0 (default_pg) Rank 7] Heartbeat monitor timed out! Process will be terminated after dumping debug info. workMetaList_.size()=3
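
One detail visible at the top of the log: the script echoes `GPU=0` and empty values for `WORLD_SIZE`, `RANK`, `MASTER_ADDR` and `MASTER_PORT`, because none of these variables are set in this shell and bash arithmetic treats an empty `WORLD_SIZE` as 0. The DeepSpeed launcher still starts 8 local processes, so this may well be unrelated to the hang. For reference, a minimal sketch of giving the header sensible single-node defaults is below; the default values are assumptions (the address and port are the ones the launcher reports in the log), and they are not part of the original script.

```shell
#!/usr/bin/env bash
# Assumed single-node defaults; the original script leaves these unset.
WORLD_SIZE=${WORLD_SIZE:-1}             # number of nodes, assumed 1 here
RANK=${RANK:-0}                         # node rank, assumed 0 here
MASTER_ADDR=${MASTER_ADDR:-127.0.0.1}   # address the launcher reports in the log
MASTER_PORT=${MASTER_PORT:-29500}       # port the launcher reports in the log

# With WORLD_SIZE defaulting to 1 this evaluates to 8 instead of 0.
GPU=$((WORLD_SIZE * 8))

echo GPU=${GPU}
echo WORLD_SIZE=$WORLD_SIZE
echo RANK=$RANK
echo MASTER_ADDR=$MASTER_ADDR
echo MASTER_PORT=$MASTER_PORT
```

Values already exported in the environment take precedence over these defaults, so a multi-node setup that sets them externally would be unaffected.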


haotian-liu / LLaVA, issue #1711