Wrong dataset format for conversational colab in wiki?

unslothai / unsloth

Wrong dataset format for conversational colab in wiki? #321

Open erebous opened 2 months ago

erebous commented 2 months ago

I'm getting an error while using the conversational colab: `<class 'AttributeError'>: 'list' object has no attribute 'keys'`

The only change I made was to load my own file with `load_dataset("json", data_files = {"train" : "my_data.json"})`.

The file uses the format recommended in the wiki:

[
  [
    {"from": "human", "value": "foo"},
    {"from": "gpt", "value": "bar"}
  ]
]

danielhanchen commented 2 months ago

@erebous We support the ShareGPT style dataset like https://huggingface.co/datasets/philschmid/guanaco-sharegpt-style

erebous commented 2 months ago

> @erebous We support the ShareGPT style dataset like https://huggingface.co/datasets/philschmid/guanaco-sharegpt-style

Isn't that ShareGPT already?

From the wiki: "Assuming your dataset is a list of list of dictionaries" (https://github.com/unslothai/unsloth/wiki#chat-templates)

Qualzz commented 4 weeks ago

For those having the same issue, you can try formatting your dataset like this:


```
[
    {
        "conversations": [
            {
                "from": "human",
                "value": "Lorem ipsum"
            },
            {
                "from": "gpt",
                "value": "Lorem ipsum"
            }
        ]
    },
    {
        "conversations": [
            {
                "from": "human",
                "value": "Lorem ipsum"
            },
            {
                "from": "gpt",
                "value": "Lorem ipsum"
            },
            {
                "from": "human",
                "value": "Lorem ipsum"
            },
            {
                "from": "gpt",
                "value": "Lorem ipsum"
            }
        ]
    }
]
```

I don't understand why, but the error went away, and the load_dataset function was able to load the file.

danielhanchen commented 3 weeks ago

Yep, that formatting is great!
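
A likely explanation, though nobody in this thread confirms it: the JSON loader treats each top-level item as one row and reads its keys as column names. A bare list of turns has no keys, which would match the `'list' object has no attribute 'keys'` error. Wrapping each conversation in an object with a `conversations` field gives every row a named column. The sketch below converts a file from the list-of-lists layout to the layout that worked above. The helper names `wrap_conversations` and `convert_file` and the output filename `my_data_wrapped.json` are made up for illustration.

```python
import json


def wrap_conversations(raw):
    """Turn a list of conversations (each a list of turn dicts) into a
    list of records, one {"conversations": [...]} object per conversation."""
    return [{"conversations": turns} for turns in raw]


def convert_file(src_path, dst_path):
    """Read a list-of-lists JSON file and write the wrapped version."""
    with open(src_path, encoding="utf-8") as f:
        raw = json.load(f)
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(wrap_conversations(raw), f, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    # Hypothetical filenames; adjust to your own data.
    convert_file("my_data.json", "my_data_wrapped.json")
```

After converting, the call from the original post should work with the new file, for example `load_dataset("json", data_files = {"train" : "my_data_wrapped.json"})`.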