modelscope / ms-swift

Use PEFT or Full-parameter to finetune 400+ LLMs or 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0
4.43k stars, 389 forks

Official qwen2-vl fine-tuning example fails: cannot load dataset latex-ocr-print #2509

Open 312shan opened 4 days ago

312shan commented 4 days ago

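My network environment is fine. Following the official documentation, fine-tuning fails because the dataset latex-ocr-print cannot be loaded. Documentation I followed: https://github.com/modelscope/ms-swift/blob/main/docs/source/Multi-Modal/qwen2-vl%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.md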

The fine-tuning launch command that triggers the error:

# Runs on a single A10/3090
# GPU Memory: 20GB
SIZE_FACTOR=8 MAX_PIXELS=602112 CUDA_VISIBLE_DEVICES=0 swift sft \
  --model_type qwen2-vl-7b-instruct \
  --model_id_or_path qwen/Qwen2-VL-7B-Instruct \
  --sft_type lora \
  --dataset latex-ocr-print#20000

After the script starts, the model loads correctly, but the following error is raised while fetching and processing the data:

[INFO:swift] PeftModelForCausalLM: 8311.5607M Params (20.1851M Trainable [0.2429%]), 0.0019M Buffers.
[INFO:swift] system: You are a helpful assistant.
[INFO:swift] args.lazy_tokenize: True
[INFO:swift] Downloading the dataset from ModelScope, dataset_id: AI-ModelScope/LaTeX_OCR
[ERROR:modelscope] >> Error loading AI-ModelScope/LaTeX_OCR: 'full'
[ERROR:swift] Dataset AI-ModelScope/LaTeX_OCR load failed: subset_name=full,split=validation with error: 'full'
[ERROR:modelscope] >> Error loading AI-ModelScope/LaTeX_OCR: 'full'
[ERROR:swift] Dataset AI-ModelScope/LaTeX_OCR load failed: subset_name=full,split=validation with error: 'full'
[ERROR:modelscope] >> Error loading AI-ModelScope/LaTeX_OCR: 'full'
[ERROR:swift] Dataset AI-ModelScope/LaTeX_OCR load failed: subset_name=full,split=validation with error: 'full'
[ERROR:modelscope] >> Error loading AI-ModelScope/LaTeX_OCR: 'full'
[ERROR:swift] Dataset AI-ModelScope/LaTeX_OCR load failed: subset_name=full,split=validation with error: 'full'
[ERROR:modelscope] >> Error loading AI-ModelScope/LaTeX_OCR: 'full'
[ERROR:swift] Dataset AI-ModelScope/LaTeX_OCR load failed: subset_name=full,split=validation with error: 'full'
[ERROR:modelscope] >> Error loading AI-ModelScope/LaTeX_OCR: 'full'
[ERROR:swift] Dataset AI-ModelScope/LaTeX_OCR load failed: subset_name=full,split=test with error: 'full'
[ERROR:modelscope] >> Error loading AI-ModelScope/LaTeX_OCR: 'full'
[ERROR:swift] Dataset AI-ModelScope/LaTeX_OCR load failed: subset_name=full,split=test with error: 'full'
[ERROR:modelscope] >> Error loading AI-ModelScope/LaTeX_OCR: 'full'
[ERROR:swift] Dataset AI-ModelScope/LaTeX_OCR load failed: subset_name=full,split=test with error: 'full'
[ERROR:modelscope] >> Error loading AI-ModelScope/LaTeX_OCR: 'full'
[ERROR:swift] Dataset AI-ModelScope/LaTeX_OCR load failed: subset_name=full,split=test with error: 'full'
[ERROR:modelscope] >> Error loading AI-ModelScope/LaTeX_OCR: 'full'
[ERROR:swift] Dataset AI-ModelScope/LaTeX_OCR load failed: subset_name=full,split=test with error: 'full'
Traceback (most recent call last):
  File "/SD_NAS/cys/vllm_proj/ms-swift/swift/cli/sft.py", line 5, in <module>
    sft_main()
  File "/SD_NAS/cys/vllm_proj/ms-swift/swift/utils/run_utils.py", line 32, in x_main
    result = llm_x(args, **kwargs)
  File "/SD_NAS/cys/vllm_proj/ms-swift/swift/llm/sft.py", line 545, in llm_sft
    train_dataset, val_dataset = prepare_dataset(args, template, msg)
  File "/SD_NAS/cys/vllm_proj/ms-swift/swift/llm/sft.py", line 354, in prepare_dataset
    train_dataset, val_dataset = _get_train_val_dataset(args)
  File "/SD_NAS/cys/vllm_proj/ms-swift/swift/llm/sft.py", line 32, in _get_train_val_dataset
    train_dataset, val_dataset = get_dataset(
  File "/SD_NAS/cys/vllm_proj/ms-swift/swift/llm/utils/dataset.py", line 2924, in get_dataset
    dataset = get_function(
  File "/SD_NAS/cys/vllm_proj/ms-swift/swift/llm/utils/dataset.py", line 491, in get_dataset_from_repo
    dataset = load_ms_dataset(
  File "/SD_NAS/cys/vllm_proj/ms-swift/swift/llm/utils/dataset.py", line 394, in load_ms_dataset
    return concatenate_datasets(dataset_list)
  File "/home/trnuser/anaconda3/envs/swift_llm/lib/python3.10/site-packages/datasets/combine.py", line 188, in concatenate_datasets
    raise ValueError("Unable to concatenate an empty list of datasets.")
ValueError: Unable to concatenate an empty list of datasets.
(swift_llm) [trnuser@iZj6cetcbae2m6339k2e2rZ vllm_proj]$
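
Reading the log and traceback together, it looks as though every per-split load attempt failed and was only logged, so the final concatenate_datasets call received an empty list. The repeated message 'full' is also what str() of a Python KeyError('full') looks like, so a failed lookup on the subset name is a plausible cause, although the log does not confirm it. A minimal sketch of that pattern follows, with the errors surfaced before concatenation. The function load_all_splits and its load_fn parameter are hypothetical names for illustration and are not ms-swift's actual code.

```python
from typing import Any, Callable, List, Sequence


def load_all_splits(
    load_fn: Callable[..., Any],
    dataset_id: str,
    subset_name: str,
    splits: Sequence[str],
) -> List[Any]:
    """Try to load every split; fail loudly if none of them loads.

    load_fn is a hypothetical stand-in for whatever loader is used; it is
    called as load_fn(dataset_id, subset_name=..., split=...).
    """
    loaded: List[Any] = []
    errors: List[str] = []
    for split in splits:
        try:
            loaded.append(load_fn(dataset_id, subset_name=subset_name, split=split))
        except Exception as exc:  # remember the failure instead of only logging it
            errors.append(f"subset_name={subset_name},split={split}: {exc!r}")
    if not loaded:
        # Report the underlying causes here, rather than letting a later
        # concatenation step complain about an empty list.
        raise RuntimeError(
            f"No split of {dataset_id} could be loaded: " + "; ".join(errors)
        )
    return loaded
```

With this shape, the final exception names the failing subset and splits directly, instead of the less informative "Unable to concatenate an empty list of datasets."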
312shan commented 4 days ago

Following the dataset card: https://modelscope.cn/datasets/AI-ModelScope/LaTeX_OCR

In [1]: from modelscope import MsDataset

In [2]: xtrain_dataset = MsDataset.load("AI-ModelScope/LaTeX_OCR", subset_name="small", split="train")

The small subset loads successfully; changing it to full fails here too.
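
To make the comparison between subsets reproducible, the check above can be wrapped in a small helper that tries each subset and records the outcome. This is a sketch under assumptions: probe_subsets is a hypothetical helper, and the loader is passed in as an argument so that the only call made on it is the one already shown in this thread (dataset name, subset_name, split).

```python
from typing import Any, Callable, Dict, Iterable, Optional


def probe_subsets(
    load_fn: Callable[..., Any],
    dataset_id: str,
    subsets: Iterable[str],
    split: str = "train",
) -> Dict[str, Optional[str]]:
    """Return {subset: None} on success or {subset: "ErrorType: message"} on failure."""
    results: Dict[str, Optional[str]] = {}
    for name in subsets:
        try:
            load_fn(dataset_id, subset_name=name, split=split)
            results[name] = None
        except Exception as exc:  # keep going so every subset gets a verdict
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


# Usage against the real hub (needs modelscope installed and network access):
#   from modelscope import MsDataset
#   print(probe_subsets(MsDataset.load, "AI-ModelScope/LaTeX_OCR", ["small", "full"]))
```

Attaching the resulting dictionary to the report shows which subset names the installed modelscope version accepts and what exception type the failing one raises.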