modelscope / ms-swift

Use PEFT or Full-parameter to finetune 400+ LLMs or 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

LoRA fine-tuning a local chatglm3-6b-base model fails with ValueError: /root/autodl-tmp/chatglm3-6b-base not in MODEL_MAPPING #156

Closed — XieHaoTao closed this 11 months ago

XieHaoTao commented 11 months ago

Hi, when fine-tuning chatglm3-6b-base locally I get ValueError: /root/autodl-tmp/chatglm3-6b-base not in MODEL_MAPPING. The command used is bash swift-main/examples/pytorch/llm/scripts/chatglm3_6b_base/lora_ddp_ds/sft.sh. Is the local model path configured incorrectly, or is something else the cause? Thanks in advance. The configuration file is as follows:

# GPU: 1*A40*48G
nproc_per_node=1

PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
torchrun \
    --nproc_per_node=$nproc_per_node \
    --master_port 29500 \
    /root/swift-main/examples/pytorch/llm/llm_sft.py \
    --model_id_or_path /root/autodl-tmp/chatglm3-6b-base \
    --model_revision master \
    --sft_type lora \
    --tuner_backend swift \
    --template_type chatglm-generation \
    --dtype AUTO \
    --output_dir output \
    --ddp_backend nccl \
    --dataset /root/train.json \
    --custom_train_dataset_path /root/train.json \
    --custom_val_dataset_path /root/dev.json \
    --dataset_test_ratio 1.0 \
    --train_dataset_sample -1 \
    --num_train_epochs 1 \
    --max_length 1024 \
    --check_dataset_strategy warning \
    --lora_rank 8 \
    --lora_alpha 32 \
    --lora_dropout_p 0.05 \
    --lora_target_modules DEFAULT \
    --gradient_checkpointing true \
    --batch_size 4 \
    --weight_decay 0.01 \
    --learning_rate 1e-4 \
    --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \
    --max_grad_norm 0.5 \
    --warmup_ratio 0.03 \
    --eval_steps 100 \
    --save_steps 10 \
    --save_total_limit 2 \
    --logging_steps 10 \
    --push_to_hub false \
    --hub_model_id chatglm3-6b-base-lora \
    --hub_private_repo true \
    --hub_token 'your-sdk-token' \
    --deepspeed_config_path '/root/swift-main/examples/pytorch/llm/ds_config/zero2.json' \
    --only_save_model true

The error output is as follows:

(base) root@autodl-container-6a9d11bbae-d7a33f77:~# bash swift-main/examples/pytorch/llm/scripts/chatglm3_6b_base/lora_ddp_ds/sft.sh
2023-11-18 08:25:24,281 - modelscope - INFO - PyTorch version 2.0.1 Found.
2023-11-18 08:25:24,284 - modelscope - INFO - TensorFlow version 2.9.0 Found.
2023-11-18 08:25:24,285 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer
2023-11-18 08:25:24,348 - modelscope - INFO - Loading done! Current index file version is 1.9.3, with md5 eda829e6e9cb62a3ef86236fe00fc9ce and a total number of 943 components indexed
2023-11-18 08:25:27.015120: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
Traceback (most recent call last):
  File "/root/swift-main/examples/pytorch/llm/llm_sft.py", line 7, in <module>
    best_ckpt_dir = sft_main()
  File "/root/miniconda3/lib/python3.8/site-packages/swift/llm/utils/utils.py", line 188, in x_main
    args, remaining_argv = parse_args(args_class, argv)
  File "/root/miniconda3/lib/python3.8/site-packages/swift/utils/utils.py", line 63, in parse_args
    args, remaining_args = parser.parse_args_into_dataclasses(
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 71, in __init__
  File "/root/miniconda3/lib/python3.8/site-packages/swift/llm/utils/argument.py", line 153, in __post_init__
    set_model_type(self)
  File "/root/miniconda3/lib/python3.8/site-packages/swift/llm/utils/argument.py", line 439, in set_model_type
    raise ValueError(f'{model_id_or_path} not in MODEL_MAPPING')
ValueError: /root/autodl-tmp/chatglm3-6b-base not in MODEL_MAPPING
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 960) of binary: /root/miniconda3/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/root/swift-main/examples/pytorch/llm/llm_sft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-18_08:25:31
  host      : autodl-container-6a9d11bbae-d7a33f77
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 960)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
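
For context, the traceback points at set_model_type in swift/llm/utils/argument.py: swift resolves model_type by looking the model_id_or_path argument up in a registry, so a raw local path has no entry there. A minimal Python sketch of that lookup, using a hypothetical trimmed-down MODEL_MAPPING (the real registry differs by swift version):

# Illustrative sketch only; the entries and matching rules in the real
# MODEL_MAPPING are version-dependent.
MODEL_MAPPING = {
    'chatglm3-6b-base': {'model_id_or_path': 'ZhipuAI/chatglm3-6b-base'},
}

def resolve_model_type(model_id_or_path: str) -> str:
    """Map the CLI argument to a registered model_type, as swift does at startup."""
    for model_type, info in MODEL_MAPPING.items():
        if model_id_or_path in (model_type, info['model_id_or_path']):
            return model_type
    # A local filesystem path matches no registered id, hence the error above.
    raise ValueError(f'{model_id_or_path} not in MODEL_MAPPING')

resolve_model_type('/root/autodl-tmp/chatglm3-6b-base')  # raises ValueError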
Jintao-Huang commented 11 months ago

Hi, take a look at the meaning of the model_cache_dir parameter. You can also join the WeChat group linked in the README, which makes it easier for us to communicate 😊.
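
A minimal sketch of that fix, assuming this swift version accepts --model_type together with --model_cache_dir (check the flag names against llm_sft.py --help for your install): pass a registered model type and point model_cache_dir at the local checkpoint, instead of passing the raw path as model_id_or_path.

# In sft.sh, replace these two lines (keep every other flag unchanged):
#     --model_id_or_path /root/autodl-tmp/chatglm3-6b-base \
#     --model_revision master \
# with a registered model_type plus the local checkpoint directory:
    --model_type chatglm3-6b-base \
    --model_cache_dir /root/autodl-tmp/chatglm3-6b-base \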

Feiruuuu commented 11 months ago

Could you post the WeChat group again? I've been using this project for fine-tuning recently and have hit a series of problems that I'd like to discuss further.

Jintao-Huang commented 11 months ago

The WeChat group is in the README.