modelscope / ms-swift

Use PEFT or Full-parameter to finetune 350+ LLMs or 90+ MLLMs. (Qwen2.5, GLM4v, Internlm2.5, Yi, Llama3.1, Llava-Video, Internvl2, MiniCPM-V-2.6, Deepseek, Baichuan2, Gemma2, Phi3-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0
3.49k stars 299 forks

codefuse-codellama-34b-chat SFT reports a dataset error #179

Closed bravelll closed 9 months ago

bravelll commented 9 months ago

PYTHONPATH=../../.. \
CUDA_VISIBLE_DEVICES=0 \
python llm_sft.py \
    --model_type codefuse-codellama-34b-chat \
    --sft_type lora \
    --tuner_backend swift \
    --template_type codefuse-codellama \
    --dtype fp16 \
    --output_dir output \
    --custom_train_dataset_path /u01/liuys/work/datasets/data/java-1k.jsonl \
    --custom_val_dataset_path /u01/liuys/work/datasets/data/java-100.jsonl \
    --train_dataset_sample -1 \
    --num_train_epochs 1 \
    --max_length 4096 \
    --check_dataset_strategy warning \
    --lora_rank 8 \
    --lora_alpha 32 \
    --lora_dropout_p 0.05 \
    --lora_target_modules DEFAULT \
    --gradient_checkpointing true \
    --batch_size 1 \
    --weight_decay 0.01 \
    --learning_rate 1e-4 \
    --gradient_accumulation_steps 16 \
    --max_grad_norm 0.5 \
    --warmup_ratio 0.03 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 10 \
    --use_flash_attn true \
    --push_to_hub false \
    --hub_model_id codefuse-codellama-34b-chat-lora \
    --hub_private_repo true \
    --hub_token 'your-sdk-token'

The error is as follows:

Traceback (most recent call last):
  File "/u01/liuys/swift/examples/pytorch/llm/llm_sft.py", line 7, in <module>
    output = sft_main()
  File "/u01/liuys/swift/swift/llm/utils/utils.py", line 194, in x_main
    return llm_x(args, **kwargs)
  File "/u01/liuys/swift/swift/llm/sft.py", line 253, in llm_sft
    trainer = Seq2SeqTrainer(
  File "/u01/liuys/swift/swift/trainers/trainers.py", line 29, in __init__
    super().__init__(*args, **kwargs)
  File "/u01/liuys/swift/swift/trainers/mixin.py", line 283, in __init__
    super().__init__(model, args, data_collator, train_dataset,
  File "/u01/liuys/anaconda3/envs/ms-swift/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 56, in __init__
    super().__init__(
  File "/u01/liuys/anaconda3/envs/ms-swift/lib/python3.10/site-packages/transformers/trainer.py", line 481, in __init__
    self._move_model_to_device(model, args.device)
  File "/u01/liuys/anaconda3/envs/ms-swift/lib/python3.10/site-packages/transformers/trainer.py", line 716, in _move_model_to_device
    model = model.to(device)
  File "/u01/liuys/anaconda3/envs/ms-swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1160, in to
    return self._apply(convert)
  File "/u01/liuys/anaconda3/envs/ms-swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/u01/liuys/anaconda3/envs/ms-swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/u01/liuys/anaconda3/envs/ms-swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/u01/liuys/anaconda3/envs/ms-swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 833, in _apply
    param_applied = fn(param)
  File "/u01/liuys/anaconda3/envs/ms-swift/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1158, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!

The JSONL data format is as follows:

{"query": "// language: Java\n// 日志一条信息", "response": "public synchronized void info(String msg){\n LogRecord record=new LogRecord(Level.INFO,msg);\n log(record);\n}"}
{"query": "// language: Java\n// 处理 gateway 接收器创建", "response": "public void handleGatewayReceiverCreate(GatewayReceiver recv) throws ManagementException {\n if (!isServiceInitialised(\"handleGatewayReceiverCreate\")) {\n return;\n }\n if (!recv.isManualStart()) {\n return;\n }\n createGatewayReceiverMBean(recv);\n}"}
{"query": "// language: Java\n// 这个方法将收到数据提供者的无论何时数据更改的通知", "response": "public void dataChanged(IDataProvider dataProvider);"}

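The issue title blames the dataset, while the traceback ends inside `model.to(device)`. One quick way to rule the data out is to check that every record parses as JSON and carries string `query` and `response` fields, as in the samples above. The following is only a minimal sketch: the helper name `check_jsonl_lines` and the sample records are hypothetical, and it does not reproduce whatever validation swift itself performs.

```python
import json

# Field names taken from the sample records shown in the issue.
REQUIRED_KEYS = ("query", "response")


def check_jsonl_lines(lines):
    """Return a list of (line_number, problem) tuples for malformed records."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        for key in REQUIRED_KEYS:
            if not isinstance(record.get(key), str):
                problems.append((number, f"missing or non-string field: {key}"))
    return problems


# Hypothetical sample: one good record, one missing "response", one non-JSON line.
sample = [
    '{"query": "// language: Java\\n// log a message", "response": "public void info(String msg);"}',
    '{"query": "no response field"}',
    "not json at all",
]
for number, problem in check_jsonl_lines(sample):
    print(f"line {number}: {problem}")
```

To check a real file, pass an open file handle instead of the list, for example `check_jsonl_lines(open(path, encoding="utf-8"))`. An empty result suggests the records are at least structurally sound.
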
Jintao-Huang commented 9 months ago

This error means you have run out of GPU memory. What hardware are you running on?

bravelll commented 9 months ago

I have four RTX 3090 cards with 24 GB of memory each. Could I fine-tune the codefuse-ai/CodeFuse-CodeLlama-34B-4bits model instead?

Jintao-Huang commented 9 months ago

codefuse-ai/CodeFuse-CodeLlama-34B-4bits does not seem to support fine-tuning. With lora_mp, however, you can fine-tune codefuse-ai/CodeFuse-CodeLlama-34B.
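
A rough back-of-the-envelope check of the memory figures in this thread may help explain why a single card fails. The sketch below assumes that "34b" in the model name means about 34 billion parameters and counts only the fp16 weights; the function name `weight_memory_gib` is made up for illustration.

```python
import math


def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed to hold the raw weights, in GiB."""
    return num_params * bytes_per_param / 1024**3


NUM_PARAMS = 34e9        # assumption: "34b" means roughly 34 billion parameters
FP16_BYTES = 2           # --dtype fp16 stores each weight in 2 bytes
GPU_MEMORY_GIB = 24      # one RTX 3090, treated as 24 GiB for simplicity
NUM_GPUS = 4             # the four cards mentioned in the thread

needed = weight_memory_gib(NUM_PARAMS, FP16_BYTES)
min_gpus = math.ceil(needed / GPU_MEMORY_GIB)

print(f"fp16 weights: {needed:.1f} GiB")
print(f"fits on one card: {needed <= GPU_MEMORY_GIB}")
print(f"cards needed for the weights alone: {min_gpus} of {NUM_GPUS}")
```

Under these assumptions the weights alone come to roughly 63 GiB, well beyond one 24 GB card, which is consistent with the command above exposing only `CUDA_VISIBLE_DEVICES=0`. The estimate ignores activations, LoRA gradients, optimizer state, and framework overhead, so it is a lower bound and not a guarantee that four cards are enough.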

bravelll commented 9 months ago

I haven't used lora_mp before. Is there a concrete example you could share? Thanks!

Jintao-Huang commented 9 months ago

https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_72b_chat/lora_mp You can use this example as a reference.

bravelll commented 9 months ago

https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_72b_chat/lora_mp Thanks, I'll give it a try.

bravelll commented 9 months ago

Regarding the example at https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/qwen_72b_chat/lora_mp: I can't see which parameter in the shell scripts there is different. Which parameter mainly controls lora_mp?

Jintao-Huang commented 9 months ago

When the number of GPUs listed in CUDA_VISIBLE_DEVICES is an integer multiple of world_size, mp (model parallelism) or mp_ddp is enabled automatically.
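
The rule above can be sketched as a small function. This is one reading of the comment, not swift's actual implementation: the function name `parallel_mode`, the labels "single", "ddp", and "invalid", and the mapping of one process to mp and several processes to mp_ddp are all assumptions made for illustration.

```python
def parallel_mode(cuda_visible_devices, world_size):
    """Guess the parallel mode from the visible GPUs and the process count.

    Hypothetical helper: it mirrors the rule stated in the thread and
    fills the unstated cases with assumed labels.
    """
    devices = [d for d in cuda_visible_devices.split(",") if d.strip()]
    n_gpus = len(devices)
    if world_size < 1 or n_gpus == 0 or n_gpus % world_size != 0:
        return "invalid"  # assumed label; the thread does not cover this case
    gpus_per_process = n_gpus // world_size
    if gpus_per_process == 1:
        # One GPU per process leaves nothing to split a model across.
        return "ddp" if world_size > 1 else "single"
    # Several GPUs per process: the model is spread over them.
    return "mp_ddp" if world_size > 1 else "mp"


for devices, world_size in [("0", 1), ("0,1,2,3", 1), ("0,1,2,3", 2), ("0,1,2,3", 4)]:
    print(devices, world_size, parallel_mode(devices, world_size))
```

By this reading, the original command with `CUDA_VISIBLE_DEVICES=0` gives the single process only one card, whereas listing all four cards for one process would fall under mp.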

bravelll commented 9 months ago

I'll give it a try, thanks!