Closed: AlexJJJChen closed this issue 2 months ago
Could you paste the stack trace? There shouldn't be anywhere in our code that explicitly loads the full model this way.
Set batch_size to 1.
Setting it to 1 doesn't help either; it still suddenly runs out of GPU memory.
File "/home/jianc/miniconda3/envs/benchmark-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jianc/miniconda3/envs/benchmark-llm/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 161, in forward
return self.model.forward(*args, **kwargs)
File "/home/jianc/.cache/modelscope/hub/_github/LLaVA.git/llava/model/language_model/llava_mistral.py", line 91, in forward
return super().forward(
File "/home/jianc/miniconda3/envs/benchmark-llm/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 1158, in forward
outputs = self.model(
File "/home/jianc/miniconda3/envs/benchmark-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/jianc/miniconda3/envs/benchmark-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/jianc/miniconda3/envs/benchmark-llm/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 1033, in forward
layer_outputs = self._gradient_checkpointing_func(
File "/home/jianc/miniconda3/envs/benchmark-llm/lib/python3.10/site-packages/swift/llm/utils/model.py", line 3803, in
You can try export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 to reduce memory fragmentation.
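If it is more convenient to keep this inside the training script, the same option can also be set from Python before the GPU is touched. This is a rough sketch, not something specific to swift; the key point is that the variable must be set before the CUDA caching allocator is first initialized:

# Sketch: set the allocator option in-process instead of via `export`.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"  # reduce fragmentation

import torch  # import torch (and do any CUDA allocation) only after the variable is set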
Still doesn't work; it runs out of GPU memory after one training step.
Traceback (most recent call last):
File "/home/jianc/miniconda3/envs/benchmark-llm/lib/python3.10/site-packages/swift/cli/sft.py", line 5, in find_unused_parameters=True
to torch.nn.parallel.DistributedDataParallel
, and by
making sure all forward
function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward
function. Please include the loss function and the structure of the return value of forward
of your module when reporting this issue (e.g. list, dict, iterable).
Parameters which did not receive grad for rank 3: base_model.model.model.vision_tower.vision_tower.vision_model.encoder.layers.23.self_attn.q_proj.lora_B.default.weight, base_model.model.model.vision_tower.vision_tower.vision_model.encoder.layers.23.
DeepSpeed ZeRO-3 is already supported; try it with --deepspeed default-zero3.
nproc_per_node=4
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=$nproc_per_node \
swift sft \
    --model_id_or_path "AI-ModelScope/llava-v1.6-mistral-7b" \
    --template_type "llava-mistral-instruct" \
    --custom_train_dataset_path train_swift.json \
    --custom_val_dataset_path test_swift.json \
    --dataset_test_ratio "0.15" \
    --save_steps "20" \
    --lora_target_modules q_proj k_proj v_proj \
    --batch_size "8" \
    --learning_rate "1e-4" \
    --num_train_epochs "2" \
    --gradient_accumulation_steps "16" \
    --eval_batch_size "8" \
    --use_flash_attn "True" \
    --add_output_dir_suffix False \
    --output_dir finetune_output_epoch_100 \
    --logging_dir finetune_output_epoch_100 \
    --max_length -1 \
    --train_dataset_sample -1 \
    --sft_type lora \
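For reference, the sketch below shows roughly what a ZeRO stage-3 DeepSpeed config contains, written out from a Python dict. This is an assumption about typical stage-3 settings, not the actual contents of swift's built-in default-zero3 config, and it assumes --deepspeed also accepts a path to a JSON file, which I have not verified here; the "auto" values rely on the HuggingFace Trainer integration filling them in.

import json

# Sketch of a minimal ZeRO stage-3 config; swift's built-in default-zero3 may differ.
zero3_config = {
    "zero_optimization": {
        "stage": 3,  # shard parameters, gradients and optimizer states across GPUs
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

with open("zero3.json", "w") as f:  # hypothetical file; pass its path to --deepspeed if a custom config is wanted
    json.dump(zero3_config, f, indent=2)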
The solution found online is:
In the original code, the loaded weights are placed on the GPU:
pretrain_weight = torch.load(path)['model']
It should be changed to:
pretrain_weight = torch.load(path, map_location=torch.device('cpu'))['model']
model.load_state_dict(pretrain_weight)
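Putting those two lines together, here is a minimal sketch of that pattern; the 'model' key and path follow the snippet above, and the device argument is a placeholder:

import torch

def load_pretrained(model: torch.nn.Module, path: str, device: str = "cuda"):
    # Keep the checkpoint tensors on the CPU so loading does not spike GPU memory.
    checkpoint = torch.load(path, map_location=torch.device("cpu"))
    model.load_state_dict(checkpoint["model"])  # copy the CPU tensors into the model's parameters
    return model.to(device)                     # move to the GPU only once, at the end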