pytorch / torchtune

A Native-PyTorch Library for LLM Fine-tuning
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License

Clarify in using local dataset #1021

Closed alirezag closed 2 months ago

alirezag commented 4 months ago

The tutorial here suggests we can set source to 'txt' and use the 'data_files' key to load local files. But when I do that, I get an error saying the dataset does not exist on Hugging Face. Even after creating a dummy HF dataset, it still doesn't work.

Using the component directly in code works:


from torchtune.datasets import text_completion_dataset

# Load in tokenizer
tokenizer = ...
dataset = text_completion_dataset(
    tokenizer,
    source="txt",
    data_files="path/to/my_data.txt",
)
RdoubleA commented 4 months ago

Hi @alirezag, can you share the exact error you are getting? And is this from specifying a local dataset in the config or in code?
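
For reference, the two routes being asked about (config versus code) should end up passing the same arguments to the dataset builder. The sketch below illustrates that, assuming the `dataset` section of a config is forwarded as keyword arguments to whatever `_component_` names. Both `fake_text_completion_dataset` and `instantiate_from_config` are hypothetical stand-ins written for this illustration, not torchtune APIs, so the snippet runs without torchtune installed.

```python
from typing import Any, Callable, Dict


def fake_text_completion_dataset(
    tokenizer: Any, source: str, **load_dataset_kwargs: Any
) -> Dict[str, Any]:
    """Hypothetical stand-in for the real builder: just records its arguments."""
    return {
        "tokenizer": tokenizer,
        "source": source,
        "load_dataset_kwargs": load_dataset_kwargs,
    }


def instantiate_from_config(
    component: Callable[..., Any], cfg: Dict[str, Any], *args: Any
) -> Any:
    """Hypothetical helper: forward every key except `_component_` as a kwarg."""
    kwargs = {key: value for key, value in cfg.items() if key != "_component_"}
    return component(*args, **kwargs)


# What the dataset section of a config would look like once parsed into a dict.
dataset_cfg = {
    "_component_": "torchtune.datasets.text_completion_dataset",
    "source": "txt",
    "data_files": "path/to/my_data.txt",
}

tokenizer = object()  # placeholder; the thread elides the real tokenizer

from_config = instantiate_from_config(
    fake_text_completion_dataset, dataset_cfg, tokenizer
)
from_code = fake_text_completion_dataset(
    tokenizer, source="txt", data_files="path/to/my_data.txt"
)
```

If the config route fails while the direct call works, comparing the arguments that actually reach the builder in each case is a reasonable first check.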