modelscope / ms-swift

Use PEFT or Full-parameter to finetune 400+ LLMs or 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

Problem when LoRA-finetuning MiniCPM-V-2_6, merging, and then training with LoRA again #1778

Open · guihonghao opened this issue 3 months ago

guihonghao commented 3 months ago

I get the error below:

RuntimeError: weight should have at least three dimensions
Traceback (most recent call last):
  File "/mnt/bn/arnold-ghh-test/mlx/users/guihonghao/playground/ghh_swift/swift/examples/pytorch/llm/llm_sft.py", line 10, in <module>
    output = sft_main()
  File "/mnt/bn/arnold-ghh-test/mlx/users/guihonghao/playground/ghh_swift/swift/swift/utils/run_utils.py", line 32, in x_main
    result = llm_x(args, **kwargs)
  File "/mnt/bn/arnold-ghh-test/mlx/users/guihonghao/playground/ghh_swift/swift/swift/llm/sft.py", line 342, in llm_sft
    td0, tkwargs0 = template.encode(train_dataset[0])
  File "/mnt/bn/arnold-ghh-test/mlx/users/guihonghao/playground/ghh_swift/swift/swift/llm/utils/template.py", line 445, in encode
    return _encode(example) if not streaming else _encode(example)[0]
  File "/mnt/bn/arnold-ghh-test/mlx/users/guihonghao/playground/ghh_swift/swift/swift/llm/utils/template.py", line 2721, in _encode
    inputs_embeds, _ = self.model.get_vllm_embedding(data)
  File "/home/tiger/.cache/huggingface/modules/transformers_modules/MiniCPM-V-2_6/modeling_minicpmv.py", line 117, in get_vllm_embedding
    vision_embedding = self.vpm(all_pixel_values, patch_attention_mask=patch_attn_mask, tgt_sizes=tgt_sizes).last_hidden_state
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.cache/huggingface/modules/transformers_modules/MiniCPM-V-2_6/modeling_navit_siglip.py", line 903, in forward
    hidden_states = self.embeddings(pixel_values=pixel_values, patch_attention_mask=patch_attention_mask, tgt_sizes=tgt_sizes)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.cache/huggingface/modules/transformers_modules/MiniCPM-V-2_6/modeling_navit_siglip.py", line 320, in forward
    patch_embeds = self.patch_embedding(pixel_values)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: weight should have at least three dimensions
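For context, the error message itself is independent of MiniCPM-V: F.conv2d expects a weight shaped (out_channels, in_channels/groups, kH, kW), so any weight with fewer than three dimensions, such as the empty placeholder left behind by a sharded parameter, raises exactly this message. A minimal illustration in plain PyTorch, with no ms-swift involved:

import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 16, 16)   # a normal NCHW input batch
w = torch.empty(0)              # a 1-D placeholder instead of a 4-D conv weight
F.conv2d(x, w)                  # RuntimeError: weight should have at least three dimensions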

I trained with the command below; $BASE_PATH/playground/lora_results/MiniCPM-V-2_6-cupai/checkpoint-80000-merged is the model produced by merging the LoRA weights via the merge step in infer (a rough sketch of what that merge amounts to follows the command).

nproc_per_node=8
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
torchrun \
    --nproc_per_node=$nproc_per_node \
    --master_port 26565 \
    examples/pytorch/llm/llm_sft.py \
    --model_type 'minicpm-v-v2_6-chat' \
    --model_id_or_path $BASE_PATH/playground/lora_results/MiniCPM-V-2_6-cupai/checkpoint-80000-merged \
    --sft_type 'lora' \
    --tuner_backend 'peft' \
    --template_type 'AUTO' \
    --dtype 'AUTO' \
    --output_dir $BASE_PATH/playground/lora_results/MiniCPM-V-2_6-white_list \
    --custom_train_dataset_path $BASE_PATH/data/black_white_list/ins/white_list_outer_rough_not_pass_all_train.jsonl \
    --custom_val_dataset_path $BASE_PATH/data/black_white_list/ins/white_list_outer_rough_not_pass_all_dev.jsonl \
    --num_train_epochs 3 \
    --max_length 4000 \
    --check_dataset_strategy 'warning' \
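For reference, here is a rough sketch of what the merged checkpoint corresponds to, written against PEFT directly rather than the exact ms-swift code path. The base model id is the public openbmb/MiniCPM-V-2_6; the adapter path is an assumption (the pre-merge checkpoint directory from this thread):

import torch
from transformers import AutoModel
from peft import PeftModel

# Load the base model (remote code is required for MiniCPM-V-2_6).
base = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True, torch_dtype=torch.bfloat16)
# Attach the LoRA adapter (path assumed: the pre-merge checkpoint from this thread).
model = PeftModel.from_pretrained(base, 'playground/lora_results/MiniCPM-V-2_6-cupai/checkpoint-80000')
# Fold the LoRA deltas into the base weights and save a plain full checkpoint,
# which is what --model_id_or_path above points at.
merged = model.merge_and_unload()
merged.save_pretrained('playground/lora_results/MiniCPM-V-2_6-cupai/checkpoint-80000-merged')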
Jintao-Huang commented 3 months ago

Please pull the latest code.

guihonghao commented 3 months ago

Do I need to install the latest 2.3.1 release?

Jintao-Huang commented 3 months ago

Use the main branch. Are you using DeepSpeed ZeRO-3?

guihonghao commented 3 months ago

Yes, I am using DeepSpeed ZeRO-3.
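That would explain the first traceback: under ZeRO-3, every parameter is partitioned across ranks and each module locally holds only an empty placeholder tensor until DeepSpeed gathers it for the forward pass. The failing call happens in template.encode, i.e. while the dataset is being encoded and likely before the model is wrapped by the DeepSpeed engine, so patch_embedding's Conv2d still has its 0-element weight when F.conv2d runs. A minimal sketch of the behaviour, assuming a working (even single-process) DeepSpeed launch, e.g. via torchrun:

import torch
import deepspeed

ds_config = {'zero_optimization': {'stage': 3}, 'train_micro_batch_size_per_gpu': 1}

# Under zero.Init, parameters are partitioned at construction time; locally each
# module keeps only an empty placeholder in place of its real weight.
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    conv = torch.nn.Conv2d(3, 8, kernel_size=3)

print(conv.weight.shape)   # typically torch.Size([0]) instead of [8, 3, 3, 3]

x = torch.randn(1, 3, 16, 16)
# Calling conv(x) here reproduces "weight should have at least three dimensions",
# because F.conv2d sees the 0-element placeholder.

# Gathering the partitioned parameters first materializes the real weights:
with deepspeed.zero.GatheredParameters(list(conv.parameters()), modifier_rank=None):
    y = conv(x)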

guihonghao commented 3 months ago

After switching to DeepSpeed ZeRO-2, I now get the error below instead. I already filed an issue about this problem earlier. How can it be resolved?

[INFO:swift] Loading the model using model_dir: /mnt/bn/arnold-ghh-test/mlx/users/guihonghao/playground/lora_results/MiniCPM-V-2_6-cupai/checkpoint-80000-merged
Traceback (most recent call last):
  File "/mnt/bn/arnold-ghh-test/mlx/users/guihonghao/playground/ghh_swift/swift/examples/pytorch/llm/llm_sft.py", line 7, in <module>
    output = sft_main()
  File "/mnt/bn/arnold-ghh-test/mlx/users/guihonghao/playground/ghh_swift/swift/swift/utils/run_utils.py", line 32, in x_main
    result = llm_x(args, **kwargs)
  File "/mnt/bn/arnold-ghh-test/mlx/users/guihonghao/playground/ghh_swift/swift/swift/llm/sft.py", line 215, in llm_sft
    model, tokenizer = get_model_tokenizer(
  File "/mnt/bn/arnold-ghh-test/mlx/users/guihonghao/playground/ghh_swift/swift/swift/llm/utils/model.py", line 6300, in get_model_tokenizer
    model, tokenizer = get_function(model_dir, torch_dtype, model_kwargs, load_model, **kwargs)
  File "/mnt/bn/arnold-ghh-test/mlx/users/guihonghao/playground/ghh_swift/swift/swift/llm/utils/model.py", line 5741, in get_model_tokenizer_minicpm_v_2_x
    processor = AutoProcessor.from_pretrained(model_dir, trust_remote_code=True)
  File "/home/tiger/.local/lib/python3.9/site-packages/transformers/models/auto/processing_auto.py", line 310, in from_pretrained
    return processor_class.from_pretrained(
  File "/home/tiger/.local/lib/python3.9/site-packages/transformers/processing_utils.py", line 465, in from_pretrained
    args = cls._get_arguments_from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/transformers/processing_utils.py", line 511, in _get_arguments_from_pretrained
    args.append(attribute_class.from_pretrained(pretrained_model_name_or_path, **kwargs))
  File "/home/tiger/.local/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 843, in from_pretrained
    tokenizer_class = get_class_from_dynamic_module(class_ref, pretrained_model_name_or_path, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/transformers/dynamic_module_utils.py", line 501, in get_class_from_dynamic_module
    return get_class_in_module(class_name, final_module)
  File "/home/tiger/.local/lib/python3.9/site-packages/transformers/dynamic_module_utils.py", line 202, in get_class_in_module
    return getattr(module, class_name)
AttributeError: module 'transformers_modules.checkpoint-80000-merged.tokenization_minicpmv_fast' has no attribute 'MiniCPMVTokenizerFast'

Jintao-Huang commented 3 months ago

Try renaming the model folder.
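A plausible reading of that AttributeError: transformers caches a checkpoint's remote code under a module namespace derived from the folder name (transformers_modules.checkpoint-80000-merged here), so a stale or mismatched cached copy under ~/.cache/huggingface/modules/transformers_modules can shadow the file that actually defines MiniCPMVTokenizerFast. Renaming the folder gives it a fresh namespace. A sketch of the workaround, where the new directory name is arbitrary:

import shutil
from transformers import AutoProcessor, AutoTokenizer

old_dir = '/mnt/bn/arnold-ghh-test/mlx/users/guihonghao/playground/lora_results/MiniCPM-V-2_6-cupai/checkpoint-80000-merged'
new_dir = '/mnt/bn/arnold-ghh-test/mlx/users/guihonghao/playground/lora_results/MiniCPM-V-2_6-cupai/minicpmv26-80000-merged'  # arbitrary new name

shutil.move(old_dir, new_dir)   # rename the checkpoint directory

# Remote code is now cached under a new transformers_modules.<folder> namespace.
tokenizer = AutoTokenizer.from_pretrained(new_dir, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(new_dir, trust_remote_code=True)

Deleting the old cached module under ~/.cache/huggingface/modules/transformers_modules/checkpoint-80000-merged and rerunning the same command may have the same effect.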