haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0
20.16k stars · 2.22k forks

error about finetuning #367

Open harrytea opened 1 year ago

harrytea commented 1 year ago

Question

I have successfully completed the pretraining stage, but during finetuning I encounter the following issue.

(llava2) wangyh@A16:/data/wangyh/mllms/LLaVA$ bash finetune2.sh 
[2023-08-12 15:39:43,510] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:45,078] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-08-12 15:39:45,078] [INFO] [runner.py:555:main] cmd = /home/wangyh/miniconda3/envs/llava2/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None llava/train/train_mem.py --deepspeed /data/wangyh/mllms/LLaVA/finetune.json --model_name_or_path ./checkpoints/vicuna-7b-v1.5 --version v1 --data_path /data/wangyh/mllms/LLaVA/datasets/LLaVA-Instruct-150K/llava_instruct_150k.json --image_folder /data/wangyh/mllms/LLaVA/datasets/coco/train2017 --vision_tower openai/clip-vit-large-patch14 --pretrain_mm_mlp_adapter ./checkpoints/llava-7b-pretrain/mm_projector.bin --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --bf16 True --output_dir /data/wangyh/mllms/LLaVA/checkpoints/llava-7b-finetune --num_train_epochs 3 --per_device_train_batch_size 8 --per_device_eval_batch_size 4 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 50000 --save_total_limit 1 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --tf32 True --model_max_length 2048 --gradient_checkpointing True --dataloader_num_workers 4 --lazy_preprocess True --report_to wandb
[2023-08-12 15:39:46,224] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:47,788] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-08-12 15:39:47,788] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-08-12 15:39:47,788] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-08-12 15:39:47,788] [INFO] [launch.py:163:main] dist_world_size=8
[2023-08-12 15:39:47,788] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-08-12 15:39:50,339] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:50,390] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:50,425] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:50,505] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:50,557] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:50,764] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:50,820] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-12 15:39:50,821] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-08-12 15:39:50,842] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:50,865] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:50,868] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-12 15:39:50,868] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-08-12 15:39:50,905] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-12 15:39:50,905] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-08-12 15:39:50,984] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-12 15:39:50,985] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-08-12 15:39:51,085] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-12 15:39:51,085] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-08-12 15:39:51,296] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-12 15:39:51,296] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-08-12 15:39:51,296] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-08-12 15:39:51,339] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-12 15:39:51,339] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-08-12 15:39:51,353] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-12 15:39:51,353] [INFO] [comm.py:594:init_distributed] cdb=None
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
[2023-08-12 15:40:02,706] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 6.74B parameters
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:18<00:00,  9.29s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:18<00:00,  9.29s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:18<00:00,  9.30s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:18<00:00,  9.32s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:18<00:00,  9.32s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:18<00:00,  9.33s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:18<00:00,  9.33s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:18<00:00,  9.31s/it]
[2023-08-12 15:40:24,164] [WARNING] [partition_parameters.py:836:_post_init_method] param `class_embedding` in CLIPVisionEmbeddings not on GPU so was not broadcasted from rank 0
[2023-08-12 15:40:29,745] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 7.04B parameters
Formatting inputs...Skip in lazy mode
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/wangyh/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/wangyh/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.464034080505371 seconds
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/wangyh/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/wangyh/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/wangyh/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.4421682357788086 seconds
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/wangyh/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/wangyh/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.4753994941711426 seconds
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/wangyh/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/wangyh/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/wangyh/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/wangyh/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/wangyh/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.6163582801818848 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.6462419033050537 seconds
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.588582754135132 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.5909383296966553 seconds
Time to load cpu_adam op: 2.562427520751953 seconds
Parameter Offload: Total persistent parameters: 594944 in 311 params
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.15.8
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
{'loss': 6.0156, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}                                                                                                    
{'loss': 6.0703, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}                                                                                                    
{'loss': 5.9375, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}                                                                                                    
{'loss': 5.9609, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}                                                                                                    
{'loss': 6.0195, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}                                                                                                    
{'loss': 5.9531, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}                                                                                                    
{'loss': 6.0273, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}                                                                                                    
{'loss': 5.9805, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}                                                                                                    
{'loss': 5.9805, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}                                                                                                    
{'loss': 6.207, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}                                                                                                     
{'loss': 6.1289, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}                                                                                                    
{'loss': 5.9102, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}                                                                                                    
{'loss': 5.918, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                    
{'loss': 5.9258, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 6.0391, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.9531, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.8164, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.8789, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.957, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                    
{'loss': 6.0977, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 6.1484, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.9609, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.9453, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.8945, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 6.1094, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.9219, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.8203, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.8984, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.9375, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.9531, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.9648, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.8711, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.9141, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.9961, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 6.0977, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.9531, 'learning_rate': 1.801801801801802e-07, 'epoch': 0.01}                                                                                                  
{'loss': 5.9844, 'learning_rate': 1.801801801801802e-07, 'epoch': 0.02}                                                                                                  
{'loss': 5.9648, 'learning_rate': 1.801801801801802e-07, 'epoch': 0.02}                                                                                                  
{'loss': 5.8164, 'learning_rate': 1.801801801801802e-07, 'epoch': 0.02}                                                                                                  
{'loss': 5.9414, 'learning_rate': 1.801801801801802e-07, 'epoch': 0.02}                                                                                                  
{'loss': 6.0664, 'learning_rate': 1.801801801801802e-07, 'epoch': 0.02}                                                                                                  
{'loss': 6.0625, 'learning_rate': 1.801801801801802e-07, 'epoch': 0.02}                                                                                                  
  1%|▋                                                                                                                              | 42/7395 [09:48<27:33:57, 13.50s/it]Traceback (most recent call last):
  File "/data/wangyh/mllms/LLaVA/llava/train/train_mem.py", line 21, in <module>
    train()
  File "/data/wangyh/mllms/LLaVA/./llava/train/train.py", line 909, in train
    trainer.train()
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/transformers/trainer.py", line 2665, in training_step
    self.accelerator.backward(loss)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/accelerate/accelerator.py", line 1847, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1861, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1993, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1006, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1286, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1041, in reduce_independent_p_g_buckets_and_remove_grads
    self.__reduce_and_partition_ipg_grads()
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1091, in __reduce_and_partition_ipg_grads
    self.partition_grads(self.params_in_ipg_bucket, grad_partitions)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1271, in partition_grads
    fp32_grad_tensor.copy_(grad_buffer)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[2023-08-12 15:51:09,569] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3090130
[2023-08-12 15:51:14,623] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3090131
[2023-08-12 15:51:18,682] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3090132
[2023-08-12 15:51:22,988] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3090133
[2023-08-12 15:51:27,297] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3090134
[2023-08-12 15:51:27,298] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3090135
[2023-08-12 15:51:32,219] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3090136
[2023-08-12 15:51:36,482] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3090137
[2023-08-12 15:51:41,105] [ERROR] [launch.py:321:sigkill_handler] ['/home/wangyh/miniconda3/envs/llava2/bin/python', '-u', 'llava/train/train_mem.py', '--local_rank=7', '--deepspeed', '/data/wangyh/mllms/LLaVA/finetune.json', '--model_name_or_path', './checkpoints/vicuna-7b-v1.5', '--version', 'v1', '--data_path', '/data/wangyh/mllms/LLaVA/datasets/LLaVA-Instruct-150K/llava_instruct_150k.json', '--image_folder', '/data/wangyh/mllms/LLaVA/datasets/coco/train2017', '--vision_tower', 'openai/clip-vit-large-patch14', '--pretrain_mm_mlp_adapter', './checkpoints/llava-7b-pretrain/mm_projector.bin', '--mm_vision_select_layer', '-2', '--mm_use_im_start_end', 'False', '--mm_use_im_patch_token', 'False', '--bf16', 'True', '--output_dir', '/data/wangyh/mllms/LLaVA/checkpoints/llava-7b-finetune', '--num_train_epochs', '3', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '1', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '50000', '--save_total_limit', '1', '--learning_rate', '2e-5', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '2048', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '4', '--lazy_preprocess', 'True', '--report_to', 'wandb'] exits with return code = -6

It seems to run successfully for a while before reporting this error. What's wrong?
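The CUDA message itself suggests rerunning with CUDA_LAUNCH_BLOCKING=1, which makes kernel launches synchronous so the traceback points at the kernel that actually faulted instead of a later, unrelated call. A minimal sketch of such a rerun, assuming the same finetune2.sh launcher and that the variable is inherited by the DeepSpeed worker processes:

# Debugging rerun (slower): synchronous kernel launches give an accurate stack trace
CUDA_LAUNCH_BLOCKING=1 bash finetune2.sh 2>&1 | tee finetune_debug.log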

This is my shell script:

#!/bin/bash

# Uncomment and set the following variables correspondingly to run this script:

################## VICUNA ##################
PROMPT_VERSION=v1
MODEL_VERSION="vicuna-7b-v1.5"
################## VICUNA ##################

################## LLaMA-2 ##################
# PROMPT_VERSION="llava_llama_2"
# MODEL_VERSION="llama-2-7b-chat"
################## LLaMA-2 ##################

deepspeed llava/train/train_mem.py \
    --deepspeed /data/wangyh/mllms/LLaVA/finetune.json \
    --model_name_or_path ./checkpoints/$MODEL_VERSION \
    --version $PROMPT_VERSION \
    --data_path /data/wangyh/mllms/LLaVA/datasets/LLaVA-Instruct-150K/llava_instruct_150k.json \
    --image_folder /data/wangyh/mllms/LLaVA/datasets/coco/train2017 \
    --vision_tower openai/clip-vit-large-patch14 \
    --pretrain_mm_mlp_adapter ./checkpoints/llava-7b-pretrain/mm_projector.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 True \
    --output_dir /data/wangyh/mllms/LLaVA/checkpoints/llava-7b-finetune \
    --num_train_epochs 3 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb

Thanks

wanghao-cst commented 1 year ago

How much RAM do you have?
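Asking because the log shows the cpu_adam extension being built, which usually means DeepSpeed is offloading optimizer states to host memory, so available RAM matters here. A quick way to check host RAM and per-GPU memory, assuming a Linux machine with the NVIDIA tools installed:

# Host RAM and swap currently available
free -h
# Per-GPU total and used memory
nvidia-smi --query-gpu=index,memory.total,memory.used --format=csv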

nj159 commented 1 year ago

Why do I get this error during pre-training? Thank you very much.

[2023-10-21 19:41:04,065] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-21 19:41:06,429] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-10-21 19:41:06,430] [INFO] [runner.py:555:main] cmd = /home/nj/.conda/envs/llava/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None llava/train/train_mem.py --deepspeed ./scripts/zero2.json --model_name_or_path lmsys/vicuna-7b-v1.5 --version plain --data_path ./playground/data/LLaVA-Pretrain/blip_laion_cc_sbu_558k_first500.json --image_folder ./playground/data/LLaVA-Pretrain/images --vision_tower openai/clip-vit-large-patch14-336 --mm_projector_type mlp2x_gelu --tune_mm_mlp_adapter True --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --fp16 True --output_dir ./liuhaotian2/llava-v1.5-7b-pretrain --num_train_epochs 1 --per_device_train_batch_size 32 --per_device_eval_batch_size 4 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 24000 --save_total_limit 1 --learning_rate 1e-3 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --tf32 False --model_max_length 2048 --gradient_checkpointing True --dataloader_num_workers 4 --lazy_preprocess True --report_to wandb
[2023-10-21 19:41:07,902] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-21 19:41:09,817] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-10-21 19:41:09,817] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-10-21 19:41:09,818] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-10-21 19:41:09,818] [INFO] [launch.py:163:main] dist_world_size=4
[2023-10-21 19:41:09,818] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2023-10-21 19:41:12,902] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-21 19:41:12,952] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-21 19:41:13,003] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-10-21 19:41:13,021] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/media/nj/data2/nj/Models/LLaVA/llava/train/llama_flash_attn_monkey_patch.py:108: UserWarning: Flash attention is only supported on A100 or H100 GPU during training due to head dim > 64 backward.ref: https://github.com/HazyResearch/flash-attention/issues/190#issuecomment-1523359593
warnings.warn(
/media/nj/data2/nj/Models/LLaVA/llava/train/llama_flash_attn_monkey_patch.py:108: UserWarning: Flash attention is only supported on A100 or H100 GPU during training due to head dim > 64 backward.ref: https://github.com/HazyResearch/flash-attention/issues/190#issuecomment-1523359593
warnings.warn(
/media/nj/data2/nj/Models/LLaVA/llava/train/llama_flash_attn_monkey_patch.py:108: UserWarning: Flash attention is only supported on A100 or H100 GPU during training due to head dim > 64 backward.ref: https://github.com/HazyResearch/flash-attention/issues/190#issuecomment-1523359593
warnings.warn(
/media/nj/data2/nj/Models/LLaVA/llava/train/llama_flash_attn_monkey_patch.py:108: UserWarning: Flash attention is only supported on A100 or H100 GPU during training due to head dim > 64 backward.ref: https://github.com/HazyResearch/flash-attention/issues/190#issuecomment-1523359593
warnings.warn(
[2023-10-21 19:41:13,693] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-10-21 19:41:13,694] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-10-21 19:41:13,699] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-10-21 19:41:13,699] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-10-21 19:41:13,700] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-10-21 19:41:13,782] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-10-21 19:41:13,782] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-10-21 19:41:13,789] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-10-21 19:41:13,789] [INFO] [comm.py:594:init_distributed] cdb=None
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
[2023-10-21 19:41:56,780] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 10468
[2023-10-21 19:41:56,809] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 10469
[2023-10-21 19:41:57,776] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 10470
[2023-10-21 19:41:58,770] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 10471
[2023-10-21 19:41:59,773] [ERROR] [launch.py:321:sigkill_handler] ['/home/nj/.conda/envs/llava/bin/python', '-u', 'llava/train/train_mem.py', '--local_rank=3', '--deepspeed', './scripts/zero2.json', '--model_name_or_path', 'lmsys/vicuna-7b-v1.5', '--version', 'plain', '--data_path', './playground/data/LLaVA-Pretrain/blip_laion_cc_sbu_558k_first500.json', '--image_folder', './playground/data/LLaVA-Pretrain/images', '--vision_tower', 'openai/clip-vit-large-patch14-336', '--mm_projector_type', 'mlp2x_gelu', '--tune_mm_mlp_adapter', 'True', '--mm_vision_select_layer', '-2', '--mm_use_im_start_end', 'False', '--mm_use_im_patch_token', 'False', '--fp16', 'True', '--output_dir', './liuhaotian2/llava-v1.5-7b-pretrain', '--num_train_epochs', '1', '--per_device_train_batch_size', '32', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '1', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '24000', '--save_total_limit', '1', '--learning_rate', '1e-3', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'False', '--model_max_length', '2048', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '4', '--lazy_preprocess', 'True', '--report_to', 'wandb'] exits with return code = -9
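
For what it's worth, `exits with return code = -9` means the worker processes received SIGKILL, which during model loading is most often the Linux OOM killer reclaiming host memory. A hedged way to check, assuming access to the kernel log (the dmesg call may need sudo):

# Look for OOM-killer activity around the crash time
sudo dmesg -T | grep -iE "out of memory|killed process" | tail -n 20
# Check available host RAM before relaunching
free -h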

ybsu commented 5 months ago

Question

I have successfully done the pretrain stage, while for fintuning, i encounter following issues.

(llava2) wangyh@A16:/data/wangyh/mllms/LLaVA$ bash finetune2.sh 
[2023-08-12 15:39:43,510] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:45,078] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-08-12 15:39:45,078] [INFO] [runner.py:555:main] cmd = /home/wangyh/miniconda3/envs/llava2/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None llava/train/train_mem.py --deepspeed /data/wangyh/mllms/LLaVA/finetune.json --model_name_or_path ./checkpoints/vicuna-7b-v1.5 --version v1 --data_path /data/wangyh/mllms/LLaVA/datasets/LLaVA-Instruct-150K/llava_instruct_150k.json --image_folder /data/wangyh/mllms/LLaVA/datasets/coco/train2017 --vision_tower openai/clip-vit-large-patch14 --pretrain_mm_mlp_adapter ./checkpoints/llava-7b-pretrain/mm_projector.bin --mm_vision_select_layer -2 --mm_use_im_start_end False --mm_use_im_patch_token False --bf16 True --output_dir /data/wangyh/mllms/LLaVA/checkpoints/llava-7b-finetune --num_train_epochs 3 --per_device_train_batch_size 8 --per_device_eval_batch_size 4 --gradient_accumulation_steps 1 --evaluation_strategy no --save_strategy steps --save_steps 50000 --save_total_limit 1 --learning_rate 2e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --tf32 True --model_max_length 2048 --gradient_checkpointing True --dataloader_num_workers 4 --lazy_preprocess True --report_to wandb
[2023-08-12 15:39:46,224] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:47,788] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-08-12 15:39:47,788] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0
[2023-08-12 15:39:47,788] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-08-12 15:39:47,788] [INFO] [launch.py:163:main] dist_world_size=8
[2023-08-12 15:39:47,788] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-08-12 15:39:50,339] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:50,390] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:50,425] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:50,505] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:50,557] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:50,764] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:50,820] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-12 15:39:50,821] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-08-12 15:39:50,842] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:50,865] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-08-12 15:39:50,868] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-12 15:39:50,868] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-08-12 15:39:50,905] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-12 15:39:50,905] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-08-12 15:39:50,984] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-12 15:39:50,985] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-08-12 15:39:51,085] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-12 15:39:51,085] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-08-12 15:39:51,296] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-12 15:39:51,296] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-08-12 15:39:51,296] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-08-12 15:39:51,339] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-12 15:39:51,339] [INFO] [comm.py:594:init_distributed] cdb=None
[2023-08-12 15:39:51,353] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-12 15:39:51,353] [INFO] [comm.py:594:init_distributed] cdb=None
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
You are using a model of type llama to instantiate a model of type llava. This is not supported for all configurations of models and can yield errors.
[2023-08-12 15:40:02,706] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 6.74B parameters
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:18<00:00,  9.29s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:18<00:00,  9.29s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:18<00:00,  9.30s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:18<00:00,  9.32s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:18<00:00,  9.32s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:18<00:00,  9.33s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:18<00:00,  9.33s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:18<00:00,  9.31s/it]
[2023-08-12 15:40:24,164] [WARNING] [partition_parameters.py:836:_post_init_method] param `class_embedding` in CLIPVisionEmbeddings not on GPU so was not broadcasted from rank 0
[2023-08-12 15:40:29,745] [INFO] [partition_parameters.py:453:__exit__] finished initializing model with 7.04B parameters
Formatting inputs...Skip in lazy mode
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/wangyh/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/wangyh/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.464034080505371 seconds
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/wangyh/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/wangyh/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/wangyh/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.4421682357788086 seconds
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/wangyh/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/wangyh/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.4753994941711426 seconds
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/wangyh/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/wangyh/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/wangyh/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Installed CUDA version 11.3 does not match the version torch was compiled with 11.7 but since the APIs are compatible, accepting this combination
Using /home/wangyh/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/wangyh/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.6163582801818848 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.6462419033050537 seconds
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.588582754135132 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 2.5909383296966553 seconds
Time to load cpu_adam op: 2.562427520751953 seconds
Parameter Offload: Total persistent parameters: 594944 in 311 params
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Tracking run with wandb version 0.15.8
wandb: W&B syncing is set to `offline` in this directory.  
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
{'loss': 6.0156, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}                                                                                                    
{'loss': 6.0703, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}                                                                                                    
{'loss': 5.9375, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}                                                                                                    
{'loss': 5.9609, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}                                                                                                    
{'loss': 6.0195, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}                                                                                                    
{'loss': 5.9531, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}                                                                                                    
{'loss': 6.0273, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}                                                                                                    
{'loss': 5.9805, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}                                                                                                    
{'loss': 5.9805, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}                                                                                                    
{'loss': 6.207, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}                                                                                                     
{'loss': 6.1289, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}                                                                                                    
{'loss': 5.9102, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.0}                                                                                                    
{'loss': 5.918, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                    
{'loss': 5.9258, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 6.0391, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.9531, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.8164, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.8789, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.957, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                    
{'loss': 6.0977, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 6.1484, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.9609, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.9453, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.8945, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 6.1094, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.9219, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.8203, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.8984, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.9375, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.9531, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.9648, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.8711, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.9141, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.9961, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 6.0977, 'learning_rate': 9.00900900900901e-08, 'epoch': 0.01}                                                                                                   
{'loss': 5.9531, 'learning_rate': 1.801801801801802e-07, 'epoch': 0.01}                                                                                                  
{'loss': 5.9844, 'learning_rate': 1.801801801801802e-07, 'epoch': 0.02}                                                                                                  
{'loss': 5.9648, 'learning_rate': 1.801801801801802e-07, 'epoch': 0.02}                                                                                                  
{'loss': 5.8164, 'learning_rate': 1.801801801801802e-07, 'epoch': 0.02}                                                                                                  
{'loss': 5.9414, 'learning_rate': 1.801801801801802e-07, 'epoch': 0.02}                                                                                                  
{'loss': 6.0664, 'learning_rate': 1.801801801801802e-07, 'epoch': 0.02}                                                                                                  
{'loss': 6.0625, 'learning_rate': 1.801801801801802e-07, 'epoch': 0.02}                                                                                                  
  1%|▋                                                                                                                              | 42/7395 [09:48<27:33:57, 13.50s/it]
Traceback (most recent call last):
  File "/data/wangyh/mllms/LLaVA/llava/train/train_mem.py", line 21, in <module>
    train()
  File "/data/wangyh/mllms/LLaVA/./llava/train/train.py", line 909, in train
    trainer.train()
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/transformers/trainer.py", line 2665, in training_step
    self.accelerator.backward(loss)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/accelerate/accelerator.py", line 1847, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1861, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1993, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1006, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1286, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1041, in reduce_independent_p_g_buckets_and_remove_grads
    self.__reduce_and_partition_ipg_grads()
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1091, in __reduce_and_partition_ipg_grads
    self.partition_grads(self.params_in_ipg_bucket, grad_partitions)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/wangyh/miniconda3/envs/llava2/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1271, in partition_grads
    fp32_grad_tensor.copy_(grad_buffer)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[2023-08-12 15:51:09,569] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3090130
[2023-08-12 15:51:14,623] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3090131
[2023-08-12 15:51:18,682] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3090132
[2023-08-12 15:51:22,988] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3090133
[2023-08-12 15:51:27,297] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3090134
[2023-08-12 15:51:27,298] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3090135
[2023-08-12 15:51:32,219] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3090136
[2023-08-12 15:51:36,482] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3090137
[2023-08-12 15:51:41,105] [ERROR] [launch.py:321:sigkill_handler] ['/home/wangyh/miniconda3/envs/llava2/bin/python', '-u', 'llava/train/train_mem.py', '--local_rank=7', '--deepspeed', '/data/wangyh/mllms/LLaVA/finetune.json', '--model_name_or_path', './checkpoints/vicuna-7b-v1.5', '--version', 'v1', '--data_path', '/data/wangyh/mllms/LLaVA/datasets/LLaVA-Instruct-150K/llava_instruct_150k.json', '--image_folder', '/data/wangyh/mllms/LLaVA/datasets/coco/train2017', '--vision_tower', 'openai/clip-vit-large-patch14', '--pretrain_mm_mlp_adapter', './checkpoints/llava-7b-pretrain/mm_projector.bin', '--mm_vision_select_layer', '-2', '--mm_use_im_start_end', 'False', '--mm_use_im_patch_token', 'False', '--bf16', 'True', '--output_dir', '/data/wangyh/mllms/LLaVA/checkpoints/llava-7b-finetune', '--num_train_epochs', '3', '--per_device_train_batch_size', '8', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '1', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '50000', '--save_total_limit', '1', '--learning_rate', '2e-5', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '2048', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '4', '--lazy_preprocess', 'True', '--report_to', 'wandb'] exits with return code = -6

Training seems to run fine for a while (about 42 of 7395 steps) and then fails with this error. What's wrong?

This is my shell script:

#!/bin/bash

# Uncomment and set the following variables correspondingly to run this script:

################## VICUNA ##################
PROMPT_VERSION=v1
MODEL_VERSION="vicuna-7b-v1.5"
################## VICUNA ##################

################## LLaMA-2 ##################
# PROMPT_VERSION="llava_llama_2"
# MODEL_VERSION="llama-2-7b-chat"
################## LLaMA-2 ##################

deepspeed llava/train/train_mem.py \
    --deepspeed /data/wangyh/mllms/LLaVA/finetune.json \
    --model_name_or_path ./checkpoints/$MODEL_VERSION \
    --version $PROMPT_VERSION \
    --data_path /data/wangyh/mllms/LLaVA/datasets/LLaVA-Instruct-150K/llava_instruct_150k.json \
    --image_folder /data/wangyh/mllms/LLaVA/datasets/coco/train2017 \
    --vision_tower openai/clip-vit-large-patch14 \
    --pretrain_mm_mlp_adapter ./checkpoints/llava-7b-pretrain/mm_projector.bin \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 True \
    --output_dir /data/wangyh/mllms/LLaVA/checkpoints/llava-7b-finetune \
    --num_train_epochs 3 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb
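
As a next debugging step, the error message itself suggests setting CUDA_LAUNCH_BLOCKING=1 so CUDA errors are reported synchronously and the traceback points at the kernel that actually hit the illegal memory access, rather than a later call such as fp32_grad_tensor.copy_(). A minimal sketch, assuming the script above is saved as finetune_debug.sh (a placeholder name):

# finetune_debug.sh is a placeholder; substitute the actual finetune script above.
# With a single-node deepspeed launch, the locally spawned worker processes inherit
# this environment variable. Kernel launches become synchronous and much slower, so
# this is only for reproducing the error once and capturing an accurate stack trace.
CUDA_LAUNCH_BLOCKING=1 bash finetune_debug.sh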

Thanks

I am running into a similar issue. Have you solved it? Thanks.