Describe the issue

Issue: pretrain.sh trains successfully, but finetune_full_schedule.sh exceeds the process memory limit on V100 GPUs. Is there any way to solve this problem?

Command:

Log:

Traceback (most recent call last):
    trainer.train()
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/trainer.py", line 1656, in _inner_training_loop
    model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/accelerate/accelerator.py", line 1198, in prepare
    result = self._prepare_deepspeed(*args)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/accelerate/accelerator.py", line 1537, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 309, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1184, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1419, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 372, in __init__
    dist.barrier()
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 116, in log_wrapper
    return func(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 394, in barrier
    return cdb.barrier(group=group, async_op=async_op)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 225, in barrier
    return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3330, in barrier
    work = group.barrier(opts=opts)
RuntimeError: [7] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Broken pipe. This may indicate a possible application crash on rank 0 or a network set up issue.
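For context: the NCCL broken-pipe error on rank [7] looks like a symptom rather than the cause. It shows up because rank 0 has already gone away, most likely killed when memory was exceeded while DeepSpeed was building the ZeRO optimizer state. One thing I am considering is offloading optimizer and parameter state to CPU. Below is a minimal sketch of a ZeRO-3 CPU-offload DeepSpeed config; the file name (zero3_offload.json) and the specific values are my assumptions, not taken from this repo's scripts:

    {
      "fp16": { "enabled": "auto" },
      "bf16": { "enabled": "auto" },
      "train_batch_size": "auto",
      "train_micro_batch_size_per_gpu": "auto",
      "gradient_accumulation_steps": "auto",
      "zero_optimization": {
        "stage": 3,
        "offload_optimizer": { "device": "cpu", "pin_memory": true },
        "offload_param": { "device": "cpu", "pin_memory": true },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "stage3_gather_16bit_weights_on_model_save": true
      }
    }

The fine-tune script would then be pointed at it via --deepspeed ./scripts/zero3_offload.json (path assumed). Since V100 has no bf16 support, fp16 is what would actually be enabled; lowering --per_device_train_batch_size and keeping --gradient_checkpointing on are the other standard levers. Is something like this the recommended fix here?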