hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0
30.3k stars · 3.74k forks

Same dataset, but training via llamafactory-cli and training via deepspeed are inconsistent #4725

Closed — SafeCool closed this issue 2 months ago

SafeCool commented 2 months ago

Reminder

System Info

Reproduction

1. Training was started with the following two commands. The first:

```shell
nohup deepspeed --include localhost:1,2,3,4,5,6,7 --master_port 35200 src/train.py \
    --stage sft \
    --do_train \
    --model_name_or_path /mnt/data/pre_model/Qwen2-7B-Instruct/ \
    --dataset fun_call_v1 \
    --val_size 0.2 \
    --dataset_dir /mnt/LLaMA-Factory/data \
    --template qwen \
    --finetuning_type full \
    --output_dir /mnt/output/models/ \
    --overwrite_cache \
    --overwrite_output_dir \
    --max_sample 100000 \
    --cutoff_len 4096 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --lr_scheduler_type cosine \
    --logging_steps 20 \
    --save_steps 20 \
    --learning_rate 1e-5 \
    --max_steps 300 \
    --evaluation_strategy steps \
    --eval_steps 2 \
    --plot_loss \
    --bf16 \
    --warmup_ratio 0.1 \
    --preprocessing_num_workers 64 \
    --deepspeed /mnt/LLaMA-Factory/examples/deepspeed/ds_z3_config.json \
    > ../../output/logs/sft_qwen2_7B_Instruct_v1.log 2>&1 &
```

2. The second training command (YAML config):

```yaml
### model
model_name_or_path: /mnt/data/pre_model/Qwen2-7B-Instruct/

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: fun_call_v1
template: qwen
cutoff_len: 4096
max_samples: 100000
overwrite_cache: true
preprocessing_num_workers: 32

### output
output_dir: /mnt/output/models/
logging_steps: 5
save_steps: 30
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.05
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.2
per_device_eval_batch_size: 2
eval_strategy: steps
eval_steps: 5
```

The first method fails at runtime; dataset parsing raises the error below:

```
[rank1]: Traceback (most recent call last):
[rank1]:   File "/root/anaconda3/lib/python3.11/site-packages/multiprocess/pool.py", line 125, in worker
[rank1]:     result = (True, func(*args, **kwds))
[rank1]:   File "/root/anaconda3/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 678, in _write_generator_to_queue
[rank1]:     for i, result in enumerate(func(**kwargs)):
[rank1]:   File "/root/anaconda3/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3547, in _map_single
[rank1]:     batch = apply_function_on_filtered_inputs(
[rank1]:   File "/root/anaconda3/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3416, in apply_function_on_filtered_inputs
[rank1]:     processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
[rank1]:   File "/mnt/LLaMA-Factory/src/llamafactory/data/processors/supervised.py", line 102, in preprocess_supervised_dataset
[rank1]:     input_ids, labels = _encode_supervised_example(
[rank1]:   File "/mnt/LLaMA-Factory/src/llamafactory/data/processors/supervised.py", line 54, in _encode_supervised_example
[rank1]:     encoded_pairs = template.encode_multiturn(tokenizer, messages, system, tools)
[rank1]:   File "/mnt/LLaMA-Factory/src/llamafactory/data/template.py", line 76, in encode_multiturn
[rank1]:     encoded_messages = self._encode(tokenizer, messages, system, tools)
[rank1]:   File "/mnt/LLaMA-Factory/src/llamafactory/data/template.py", line 105, in _encode
[rank1]:     tool_text = self.format_tools.apply(content=tools)[0] if tools else ""
[rank1]:   File "/mnt/LLaMA-Factory/src/llamafactory/data/formatter.py", line 136, in apply
[rank1]:     return [self._tool_formatter(tools) if len(tools) != 0 else ""]
[rank1]:   File "/mnt/LLaMA-Factory/src/llamafactory/data/tool_utils.py", line 72, in tool_formatter
[rank1]:     if param.get("enum", None):
```

The second method runs without any problem.
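The traceback above ends inside `tool_formatter`, where each tool parameter is assumed to be a dict (`param.get("enum", None)`). A plausible cause, not confirmed by the truncated log, is that a sample's `tools` field is not in the expected schema, so iterating over the parameters yields strings instead of dicts and `str.get` raises `AttributeError`. The sketch below (hypothetical `format_params` helper, not LLaMA-Factory's actual code) illustrates the failure mode:

```python
# Hypothetical illustration of why formatting a tool spec can fail at
# `param.get("enum", None)` when a dataset's tool schema is malformed.

def format_params(parameters: dict) -> list[str]:
    """Collect enum annotations; assumes each property value is a dict."""
    lines = []
    for name, param in parameters.get("properties", {}).items():
        if param.get("enum", None):  # AttributeError if param is a str
            lines.append(f"{name}: one of {param['enum']}")
    return lines

# Well-formed OpenAI-style tool schema: property values are dicts.
good = {"properties": {"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}}}
print(format_params(good))  # → ["unit: one of ['celsius', 'fahrenheit']"]

# Malformed schema: a property value is a bare string, so `str.get` blows up.
bad = {"properties": {"unit": "string"}}
try:
    format_params(bad)
except AttributeError as exc:
    print("fails:", exc)  # 'str' object has no attribute 'get'
```

If this is the cause, the two launch paths would only differ because of how the dataset arguments are parsed in each, e.g. a mistyped flag silently changing which samples or columns are read.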

Expected behavior

No response

Others

No response

hiyouga commented 2 months ago

Please use the latter.

SafeCool commented 2 months ago

> Please use the latter.

Shouldn't the first and second methods run the same code? Why do they behave differently? Also, I usually debug through src/train.py; if I switch to llamafactory-cli, how should I debug it?

hiyouga commented 1 month ago

src/llamafactory/cli.py

SafeCool commented 1 month ago


When I debug with the configuration below, I hit:

```
Exception has occurred: ImportError
attempted relative import with no known parent package
  File "/mnt/LLaMA-Factory/src/llamafactory/cli.py", line 21, in <module>
    from . import launcher
ImportError: attempted relative import with no known parent package
```

My VS Code launch configuration:

```json
{
    "name": "llama_factory_deepspeed_sft",
    "type": "python",
    "request": "launch",
    "program": "/root/anaconda3/bin/deepspeed",
    "console": "integratedTerminal",
    "justMyCode": false,
    "args": [
        "--include=localhost:0",
        "--master_port", "35212",
        "/mnt/LLaMA-Factory/src/llamafactory/cli.py",
        "--train"
    ]
}
```

Could you tell me how exactly the VS Code debug configuration should be written?
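The ImportError comes from running `cli.py` directly as a script, which breaks its relative imports (`from . import launcher`); launching it as a module avoids that. A possible single-process debug configuration is sketched below. This is an assumption, not a config from this thread: the YAML path and the `CUDA_VISIBLE_DEVICES` value are placeholders, and it debugs one local process rather than the multi-worker deepspeed launcher (for that you would still need to attach to the spawned ranks).

```json
{
    "name": "llamafactory-cli train (debug)",
    "type": "python",
    "request": "launch",
    "module": "llamafactory.cli",
    "console": "integratedTerminal",
    "justMyCode": false,
    "cwd": "/mnt/LLaMA-Factory",
    "env": {
        "PYTHONPATH": "/mnt/LLaMA-Factory/src",
        "CUDA_VISIBLE_DEVICES": "0"
    },
    "args": ["train", "/path/to/your_sft_config.yaml"]
}
```

Using `"module"` instead of `"program"` makes debugpy run the equivalent of `python -m llamafactory.cli`, so the package context exists and the relative import resolves.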
