hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

ValueError: Failed to convert pandas DataFrame to Arrow Table from file #4650

Open fzp0424 opened 2 months ago

fzp0424 commented 2 months ago

Reminder

System Info

Generating train split: 0 examples [00:00, ? examples/s]
Failed to convert pandas DataFrame to Arrow Table from file '/data/zhaopengfeng/LLaMA-Factory/data/kddcup/openai_track4_0702.json' with error <class 'pyarrow.lib.ArrowInvalid'>: ('cannot mix list and non-list, non-null values', 'Conversion failed for column messages with type object')
Generating train split: 0 examples [00:00, ? examples/s]
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/zhaopengfeng/anaconda3/envs/llama_factory/lib/python3.10/site-packages/datasets/builder.py", line 1997, in _prepare_split_single
[rank3]:     for _, table in generator:
[rank3]:   File "/home/zhaopengfeng/anaconda3/envs/llama_factory/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 165, in _generate_tables
[rank3]:     raise ValueError(
[rank3]: ValueError: Failed to convert pandas DataFrame to Arrow Table from file /data/zhaopengfeng/LLaMA-Factory/data/kddcup/openai_track4_0702.json.

[rank3]: The above exception was the direct cause of the following exception:

[rank3]: Traceback (most recent call last):
[rank3]:   File "/data/zhaopengfeng/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank3]:     launch()
[rank3]:   File "/data/zhaopengfeng/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank3]:     run_exp()
[rank3]:   File "/data/zhaopengfeng/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
[rank3]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank3]:   File "/data/zhaopengfeng/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 48, in run_sft
[rank3]:     dataset = get_dataset(model_args, data_args, training_args, stage="sft", **tokenizer_module)
[rank3]:   File "/data/zhaopengfeng/LLaMA-Factory/src/llamafactory/data/loader.py", line 174, in get_dataset
[rank3]:     all_datasets.append(load_single_dataset(dataset_attr, model_args, data_args, training_args))
[rank3]:   File "/data/zhaopengfeng/LLaMA-Factory/src/llamafactory/data/loader.py", line 109, in load_single_dataset
[rank3]:     dataset = load_dataset(
[rank3]:   File "/home/zhaopengfeng/anaconda3/envs/llama_factory/lib/python3.10/site-packages/datasets/load.py", line 2616, in load_dataset
[rank3]:     builder_instance.download_and_prepare(
[rank3]:   File "/home/zhaopengfeng/anaconda3/envs/llama_factory/lib/python3.10/site-packages/datasets/builder.py", line 1029, in download_and_prepare
[rank3]:     self._download_and_prepare(
[rank3]:   File "/home/zhaopengfeng/anaconda3/envs/llama_factory/lib/python3.10/site-packages/datasets/builder.py", line 1124, in _download_and_prepare
[rank3]:     self._prepare_split(split_generator, **prepare_split_kwargs)
[rank3]:   File "/home/zhaopengfeng/anaconda3/envs/llama_factory/lib/python3.10/site-packages/datasets/builder.py", line 1884, in _prepare_split
[rank3]:     for job_id, done, content in self._prepare_split_single(
[rank3]:   File "/home/zhaopengfeng/anaconda3/envs/llama_factory/lib/python3.10/site-packages/datasets/builder.py", line 2040, in _prepare_split_single
[rank3]:     raise DatasetGenerationError("An error occurred while generating the dataset") from e...

Reproduction

Dataset info

  "openai_track4_0702": {
    "file_name": "/data/zhaopengfeng/LLaMA-Factory/data/kddcup/openai_track4_0702.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages"
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant",
      "system_tag": "system"
    }
  }

Data example

[
  {
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful online shopping assistant.Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n"
      },
      {
        "role": "user",
        "content": "Instructions: Explain the product category name\nInput: iPhone\n"
      },
      {
        "role": "assistant",
        "content": "An iPhone is a smartphone designed and manufactured by Apple Inc. that combines mobile phone capabilities with features such as internet access, multimedia playback, and a touch screen interface."
      }
    ]
  },
  {
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful online shopping assistant.Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n"
      },
      {
        "role": "user",
        "content": "Input: Explain the product type Serving Bowl\n"
      },
      {
        "role": "assistant",
        "content": "A serving bowl is a large bowl used for presenting and serving food, such as salads, vegetables, or side dishes, at the dining table.\n"
      },
      {
        "role": "user",
        "content": "Input: Explain the product type Apron\n"
      },
      {
        "role": "assistant",
        "content": "An apron is a protective garment worn over clothing to shield the wearer from spills, stains, and other kitchen or work-related messes. Aprons are commonly used in cooking, baking, and various crafts or occupations."
      }
    ]
  },...
]
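
The pyarrow message in the log ('cannot mix list and non-list, non-null values' for column messages) suggests that somewhere in the file, possibly outside the two records shown here, a value has a different shape from the rest: for example a "messages" entry that is not a list, or a "content" field that is a list instead of a string. This is only a guess from the error text. The following is a minimal sketch for locating such records, assuming the file is a single JSON array of objects as in the excerpt above; `find_inconsistent_records` and `check_file` are hypothetical helper names, not part of LLaMA-Factory or datasets.

```python
import json
from typing import Any, Iterator, List, Tuple


def find_inconsistent_records(records: List[Any]) -> Iterator[Tuple[int, str]]:
    """Yield (record index, description) for records whose shape differs
    from the expected {"messages": [{"role": str, "content": str}, ...]}."""
    for idx, record in enumerate(records):
        if not isinstance(record, dict):
            yield idx, f"record is {type(record).__name__}, expected dict"
            continue
        messages = record.get("messages")
        if not isinstance(messages, list):
            yield idx, f"'messages' is {type(messages).__name__}, expected list"
            continue
        for pos, msg in enumerate(messages):
            if not isinstance(msg, dict):
                yield idx, f"messages[{pos}] is {type(msg).__name__}, expected dict"
                continue
            for key in ("role", "content"):
                value = msg.get(key)
                if not isinstance(value, str):
                    yield idx, (
                        f"messages[{pos}][{key!r}] is "
                        f"{type(value).__name__}, expected str"
                    )


def check_file(path: str) -> None:
    """Load a JSON array from `path` and print every inconsistent record."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    for idx, problem in find_inconsistent_records(data):
        print(f"record {idx}: {problem}")


# Small in-memory demonstration with made-up records (not taken from the dataset).
sample = [
    {"messages": [{"role": "user", "content": "hi"}]},    # well-formed
    {"messages": {"role": "user", "content": "hi"}},      # dict instead of list
    {"messages": [{"role": "user", "content": ["hi"]}]},  # list instead of str
]
for idx, problem in find_inconsistent_records(sample):
    print(f"record {idx}: {problem}")
```

Calling `check_file` with the path from "file_name" in the dataset info would list the offending record indices, if any exist. If it reports nothing, the cause is likely something other than record shape.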

Train script

### model
model_name_or_path: /data/zhaopengfeng/models/glm-4-9b-chat

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
deepspeed: examples/deepspeed/ds_z2_config.json

### dataset
dataset: openai_track4_0702
template: glm4
cutoff_len: 1024
max_samples: 5000
overwrite_cache: true
preprocessing_num_workers: 16
...

My environment

transformers                      4.42.3
triton                            2.3.0
llamafactory                      0.8.3.dev0   /data/zhaopengfeng/LLaMA-Factory   
...
CUDA Driver 12.5

BTW, the alpaca template works well.

Expected behavior

LoRA SFT should run on this dataset.

Others

No response

Winston-Yuan commented 2 weeks ago

I also encountered this problem. Have you solved it?