dataset bug help - Githubissues

YJSoooooo commented 11 months ago

I ran into this problem during the SFT stage: datasets.arrow_writer.SchemaInferenceError: Please pass features or at least one example when writing data

datasets==2.14.6

transformers==4.33.2

SFT fine-tuning arguments:

CUDA_VISIBLE_DEVICES=4,5,6,7 torchrun --nproc_per_node 4 supervised_finetuning.py \
    --deepspeed deepspeed_zero_stage2_config.json \
    --model_type baichuan \
    --model_name_or_path /data1/llm_base/baichuan7b \
    --train_file_dir ./data/finetune/gpt_train_data.jsonl \
    --validation_file_dir ./data/finetune/gpt_val_data.jsonl \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --do_train \
    --do_eval \
    --use_peft True \
    --fp16 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.05 \
    --weight_decay 0.05 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 1000 \
    --save_strategy steps \
    --save_total_limit 3 \
    --gradient_accumulation_steps 16 \
    --preprocessing_num_workers 4 \
    --output_dir outputs-sft-baichuan-v1 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype float16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True \
    --cache_dir ./cache

Dataset format:

Could you help me figure out what is going wrong here? Any help is appreciated.

YJSoooooo commented 11 months ago

The dataset format looks like this:

YJSoooooo commented 11 months ago

The folder looks like this:

shibing624 commented 11 months ago

--train_file_dir ./data/finetune/

shibing624 commented 11 months ago

See run_sft.sh for reference.
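One plausible reading of this fix, sketched below under the assumption that the training script collects its data by globbing for .json/.jsonl files underneath the value of --train_file_dir: if that value is a file path rather than a directory, the search finds nothing, and an empty dataset could then surface as the SchemaInferenceError above. The helper collect_data_files is hypothetical and only illustrates the idea; it is not taken from supervised_finetuning.py.

```python
import glob
import os
import tempfile


def collect_data_files(data_dir):
    """Return every .json/.jsonl file found under data_dir, searched recursively.

    Hypothetical helper: it assumes the script treats the argument as a directory.
    """
    files = []
    for pattern in ("*.json", "*.jsonl"):
        files.extend(glob.glob(os.path.join(data_dir, "**", pattern), recursive=True))
    return sorted(files)


def demo():
    """Compare passing a directory with passing a single file path."""
    with tempfile.TemporaryDirectory() as root:
        finetune_dir = os.path.join(root, "finetune")
        os.makedirs(finetune_dir)
        train_path = os.path.join(finetune_dir, "gpt_train_data.jsonl")
        with open(train_path, "w", encoding="utf-8") as f:
            f.write('{"text": "hello"}\n')

        # Directory argument: the file inside it is discovered.
        from_dir = [os.path.basename(p) for p in collect_data_files(finetune_dir)]
        # File argument: the pattern looks *inside* the file path, so nothing matches.
        from_file = collect_data_files(train_path)
    return from_dir, from_file


files_from_dir, files_from_file = demo()
print("directory argument:", files_from_dir)
print("file argument:", files_from_file)
```

Under that assumption, the directory form picks up the data file, while the file form yields no matches at all.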

YJSoooooo commented 11 months ago

Thanks, that solved it. Does this mean all the data under finetune gets merged together? If so, will it still be split into a train set and a val set?

shibing624 commented 11 months ago

Yes, it will be split.
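The thread does not say how the split is performed, so the following is only a minimal sketch of the general idea: shuffle the merged examples with a fixed seed and hold out a fraction for validation. The function name split_train_val and the parameters val_fraction and seed are hypothetical and do not come from the MedicalGPT scripts.

```python
import random


def split_train_val(examples, val_fraction=0.1, seed=42):
    """Shuffle examples deterministically and hold out a fraction for validation.

    Hypothetical illustration; val_fraction and seed are assumed values.
    """
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1")
    if len(examples) < 2:
        raise ValueError("need at least two examples to split")

    shuffled = list(examples)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)

    # Hold out at least one example for validation.
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


merged = [{"id": i} for i in range(100)]  # stand-in for the merged data
train_set, val_set = split_train_val(merged)
print(len(train_set), len(val_set))
```

A fixed seed keeps the held-out set the same across runs, which makes evaluation numbers comparable between experiments.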

YJSoooooo commented 11 months ago

Got it, thanks a lot!

shibing624 / MedicalGPT

dataset bug help #242