how do i use my own dataset?

unslothai / unsloth

Finetune Llama 3, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory

https://unsloth.ai

Apache License 2.0

12.35k stars 799 forks source link

how do i use my own dataset? #694

Open ares0027 opened 1 week ago

ares0027 commented 1 week ago

i used augmentoolkit to generate sharegpt format dataset and trying to use "Llama-3 8b Instruct Unsloth 2x faster finetuning.ipynb". i am not good with coding so i dont know what to change in

from datasets import load_dataset

dataset = load_dataset("philschmid/guanaco-sharegpt-style", split = "train")

dataset = dataset.map(formatting_prompts_func, batched = True,)

or anything else really. i watched 8613 different "use your custom data" videos but they all use this exact code without changing anything. my data is confidential, it is not online, it is not on hugging face, i cannot share it with anyone. what do i do?

brthor commented 1 week ago

I used a combination of the code from this blog, and the recommendation in the TRL docs for unsloth.

This is the sort of thing that requires some coding though.

timothelaborie commented 1 week ago

I do this to load a csv into a dataset object:

data = pd.read_csv(output_dir + "data_cleaned.csv")

from sklearn.model_selection import train_test_split

# keep a subset
data_sample = data.sample(n=8000, random_state=42)

# find frac so val size is 5000
train_df, val_df = train_test_split(data_sample, test_size=5000/len(data_sample), random_state=42)
# save to output_dir
train_df.to_csv(output_dir + "train.csv", index=False)
print(len(train_df))

dataset = load_dataset(output_dir,data_files="train.csv", split="train")

brthor commented 1 week ago

You don't need to use the hf datasets library, a pytorch dataset instance will work as well, but you will need to do the tokenization and such like in the blog article I linked.

danielhanchen commented 1 week ago

Apologies on the late reply - I just relocated to SF, so just got to this! On loading a dataset from not the HF hub, you can upload datasets from your own PC - see https://huggingface.co/docs/datasets/en/loading#csv.

Another option is to download Unsloth on your local PC, and you'll need to follow the install instructions on the home page