KeyError: 'messages' - Githubissues

rickeyhhh commented 5 days ago

System Info

When I use Lora finetune Llama2-7b-chat-hf, a bug appears. rank1: Traceback (most recent call last): rank1: File "/mnt/data/maruiqi/finetune/train.py", line 155, in rank1: main(model_args, data_args, training_args) rank1: File "/mnt/data/maruiqi/finetune/train.py", line 115, in main rank1: train_dataset, eval_dataset = create_datasets( rank1: File "/mnt/data/maruiqi/finetune/utils.py", line 73, in create_datasets rank1: raw_datasets = raw_datasets.map( rank1: File "/mnt/data/maruiqi/anaconda3/envs/exp/lib/python3.10/site-packages/datasets/dataset_dict.py", line 886, in map

rank1: File "/mnt/data/maruiqi/anaconda3/envs/exp/lib/python3.10/site-packages/datasets/dataset_dict.py", line 887, in rank1: k: dataset.map( rank1: File "/mnt/data/maruiqi/anaconda3/envs/exp/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 560, in wrapper rank1: out: Union["Dataset", "DatasetDict"] = func(self, args, kwargs) rank1: File "/mnt/data/maruiqi/anaconda3/envs/exp/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3055, in map rank1: for rank, done, content in Dataset._map_single(dataset_kwargs): rank1: File "/mnt/data/maruiqi/anaconda3/envs/exp/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3458, in _map_single rank1: batch = apply_function_on_filtered_inputs( rank1: File "/mnt/data/maruiqi/anaconda3/envs/exp/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3320, in apply_function_on_filtered_inputs rank1: processed_inputs = function(fn_args, *additional_args, **fn_kwargs) rank1: File "/mnt/data/maruiqi/finetune/utils.py", line 52, in preprocess rank1: for conversation in samples["messages"]: rank1: File "/mnt/data/maruiqi/anaconda3/envs/exp/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 277, in getitem rank1: value = self.data[key]

Who can help?

No response

Information

[ ] The official example scripts
[ ] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder
[ ] My own task or dataset (give details below)

Reproduction

rank1: Traceback (most recent call last): rank1: File "/mnt/data/maruiqi/finetune/train.py", line 155, in rank1: main(model_args, data_args, training_args) rank1: File "/mnt/data/maruiqi/finetune/train.py", line 115, in main rank1: train_dataset, eval_dataset = create_datasets( rank1: File "/mnt/data/maruiqi/finetune/utils.py", line 73, in create_datasets rank1: raw_datasets = raw_datasets.map( rank1: File "/mnt/data/maruiqi/anaconda3/envs/exp/lib/python3.10/site-packages/datasets/dataset_dict.py", line 886, in map

rank1: File "/mnt/data/maruiqi/anaconda3/envs/exp/lib/python3.10/site-packages/datasets/dataset_dict.py", line 887, in rank1: k: dataset.map( rank1: File "/mnt/data/maruiqi/anaconda3/envs/exp/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 560, in wrapper rank1: out: Union["Dataset", "DatasetDict"] = func(self, args, kwargs) rank1: File "/mnt/data/maruiqi/anaconda3/envs/exp/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3055, in map rank1: for rank, done, content in Dataset._map_single(dataset_kwargs): rank1: File "/mnt/data/maruiqi/anaconda3/envs/exp/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3458, in _map_single rank1: batch = apply_function_on_filtered_inputs( rank1: File "/mnt/data/maruiqi/anaconda3/envs/exp/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3320, in apply_function_on_filtered_inputs rank1: processed_inputs = function(fn_args, *additional_args, **fn_kwargs) rank1: File "/mnt/data/maruiqi/finetune/utils.py", line 52, in preprocess rank1: for conversation in samples["messages"]: rank1: File "/mnt/data/maruiqi/anaconda3/envs/exp/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 277, in getitem rank1: value = self.data[key]

Expected behavior

none

JINO-ROHIT commented 5 days ago

can you share a sample code snippet?

BenjaminBossan commented 4 days ago

It looks like you try to access a non existing key from the dataset. Check if your script contains an argument to indicate the columns containing the text. You can try jumping into a debugger to inspect the data and figure out the real column names.

rickeyhhh commented 3 days ago

can you share a sample code snippet? I just use the run_peftmultigpu.sh script in example with suggested environment. I think there's something wrong with my data structure. **{ "instruction": "Solve the math problem.", "input": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?", "output": "How many eggs does Janet sell? Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\nHow much does Janet make at the farmers' market? She makes 9 2 = $<<92=18>>18 every day at the farmer’s market.\n#### 18" },_** Is the index name incorrect？Can you tell me which standard is used for data format? Thank you very much!

rickeyhhh commented 3 days ago

It looks like you try to access a non existing key from the dataset. Check if your script contains an argument to indicate the columns containing the text. You can try jumping into a debugger to inspect the data and figure out the real column names.

I just use the run_peftmultigpu.sh script in example with suggested environment. I think there's something wrong with my data structure. **{ "instruction": "Solve the math problem.", "input": "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?", "output": "How many eggs does Janet sell? Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\nHow much does Janet make at the farmers' market? She makes 9 2 = $<<92=18>>18 every day at the farmer’s market.\n#### 18" },_** Is the index name incorrect？Can you tell me which standard is used for data format? Thank you very much!

JINO-ROHIT commented 3 days ago

@rickeyhhh can you show the flags that you pass for the train script

rickeyhhh commented 3 days ago

@rickeyhhh can you show the flags that you pass for the train script

@JINO-ROHIT Here are the details. torchrun --nproc_per_node 2 --nnodes 1 train.py \ --seed 100 \ --model_name_or_path "/mnt/data/maruiqi/Llama-2-7b-chat-hf" \ --dataset_name "/mnt/data/maruiqi/data/gsm8k_socratic" \ --chat_template_format "json" \ --add_special_tokens False \ --append_concat_token False \ --splits "train,test" \ --max_seq_len 2048 \ --num_train_epochs 1 \ --logging_steps 5 \ --log_level "info" \ --logging_strategy "steps" \ --eval_strategy "epoch" \ --save_strategy "epoch" \ --push_to_hub \ --hub_private_repo True \ --hub_strategy "every_save" \ --bf16 True \ --packing True \ --learning_rate 1e-4 \ --lr_scheduler_type "cosine" \ --weight_decay 1e-4 \ --warmup_ratio 0.0 \ --max_grad_norm 1.0 \ --output_dir "llama2-sft-lora-multigpu" \ --per_device_train_batch_size 8 \ --per_device_eval_batch_size 8 \ --gradient_accumulation_steps 8 \ --gradient_checkpointing True \ --use_reentrant False \ --dataset_text_field "content" \ --use_peft_lora True \ --lora_r 8 \ --lora_alpha 16 \ --lora_dropout 0.1 \ --lora_target_modules "all-linear" \ --use_4bit_quantization True \ --use_nested_quant True \ --bnb_4bit_compute_dtype "bfloat16" \ --use_flash_attn True Thanks for your reply!

JINO-ROHIT commented 3 days ago

@rickeyhhh im guessing its this flag

--dataset_text_field "content"

can you replace it with your dataset field. heres a reference for you - https://huggingface.co/datasets/smangrul/ultrachat-10k-chatml?row=0

lmk if youre able to figure it out

huggingface / peft

KeyError: 'messages' #2204

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior