pytorch / torchtune

A Native-PyTorch Library for LLM Fine-tuning
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License

Improve documentation on custom datasets finetune #1369

Open dpalmasan opened 3 weeks ago

dpalmasan commented 3 weeks ago

I am trying to fine-tune a model on a custom dataset, specifically: https://huggingface.co/datasets/truthfulqa/truthful_qa

I haven't found any clear documentation, only partial docs that explain bits of the process: https://pytorch.org/torchtune/stable/tutorials/datasets.html

I am following the instructions, and this is my custom class:

class TruthfulQATemplate(InstructDataset):
    template = "Instruction:\n{instruction}\n\nInput:\n{input}\n\nResponse: "

    @classmethod
    def format(cls, sample, column_map):
        return cls.template.format(**sample)

Here is my config:

dataset:
  _component_: torchtune.datasets.instruct_dataset
  template: truthful_qa.TruthfulQATemplate
  max_seq_len: 4096
  source: truthfulqa/truthful_qa
  split: train
  data_dir: truthful_qa

However, when I try to tune a model I am getting:

File "/home/dpalmasan/local/miniconda3/lib/python3.12/site-packages/torchtune/config/_instantiate.py", line 20, in _create_component
    return _component_(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dpalmasan/local/miniconda3/lib/python3.12/site-packages/torchtune/datasets/_instruct.py", line 183, in instruct_dataset
    ds = InstructDataset(
         ^^^^^^^^^^^^^^^^
  File "/home/dpalmasan/local/miniconda3/lib/python3.12/site-packages/torchtune/datasets/_instruct.py", line 76, in __init__
    if not isinstance(template(), InstructTemplate):
                      ^^^^^^^^^^
TypeError: InstructDataset.__init__() missing 3 required positional arguments: 'tokenizer', 'source', and 'template'

What are the steps to use a custom dataset from Hugging Face for an instruct task?

joecummings commented 3 weeks ago

Hi @dpalmasan! It looks like your template inherits from the wrong base class. It should subclass InstructTemplate, not InstructDataset, like so:

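from torchtune.data import InstructTemplate


# A template only formats a sample into a prompt string; it is not a dataset.
class TruthfulQATemplate(InstructTemplate):
    template = "Instruction:\n{instruction}\n\nInput:\n{input}\n\nResponse: "

    @classmethod
    def format(cls, sample, column_map):
        return cls.template.format(**sample)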
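
For anyone hitting the same traceback: the failing line is `isinstance(template(), InstructTemplate)`, which instantiates the configured class with no arguments. A class that subclasses the dataset inherits a constructor with required arguments, so the call fails before the type check even runs. The sketch below reproduces this with stand-in classes; `check_template`, `WrongTemplate`, `RightTemplate`, and the `ValueError` message are hypothetical names for illustration, not torchtune's real implementation.

```python
# Stand-in classes that only mimic the shapes seen in the traceback.
# They are NOT torchtune's real implementations.
class InstructTemplate:
    """Stand-in template base class: instantiable with no arguments."""

    template = ""

    @classmethod
    def format(cls, sample, column_map=None):
        return cls.template.format(**sample)


class InstructDataset:
    """Stand-in dataset class: its constructor requires three arguments."""

    def __init__(self, tokenizer, source, template):
        self.tokenizer = tokenizer
        self.source = source
        self.template = template


def check_template(template):
    # Same kind of check as in the traceback: the class is called with no arguments.
    if not isinstance(template(), InstructTemplate):
        raise ValueError("template must be an InstructTemplate")  # hypothetical message
    return template


class WrongTemplate(InstructDataset):  # inherits the dataset constructor
    template = "Instruction:\n{instruction}\n\nResponse: "


class RightTemplate(InstructTemplate):  # inherits the no-argument constructor
    template = "Instruction:\n{instruction}\n\nResponse: "


try:
    check_template(WrongTemplate)
    error_message = None
except TypeError as exc:
    # WrongTemplate() fails because tokenizer, source, and template are missing.
    error_message = str(exc)

accepted = check_template(RightTemplate)
```

Here `error_message` carries the same "missing 3 required positional arguments" text as the original traceback, while `RightTemplate` passes the check.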

dpalmasan commented 3 weeks ago

Ohh, that was the issue. I still had to install the script as a package or add it to PYTHONPATH to make it work, but it worked. The docs are a little confusing; to set the column mapping I also had to look at the code:

dataset:
  _component_: torchtune.datasets.instruct_dataset
  template: truthful_qa.TruthfulQATemplate
  max_seq_len: 4096
  source: truthfulqa/truthful_qa
  split: validation
  data_dir: generation
  column_map:
    instruction: question
    input: type
    output: best_answer
seed: null
shuffle: True
batch_size: 2
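
To illustrate what the `column_map` above is meant to do, here is a minimal standalone sketch. It assumes the raw dataset row is handed to the template's `format` together with the mapping, in which case unpacking `**sample` directly would look for keys literally named `instruction` and `input`. The helper `format_with_column_map` is hypothetical and not part of torchtune, and the sample row is illustrative rather than copied from the dataset.

```python
TEMPLATE = "Instruction:\n{instruction}\n\nInput:\n{input}\n\nResponse: "


def format_with_column_map(sample, column_map=None):
    """Fill TEMPLATE, looking up each placeholder under its mapped column name.

    Hypothetical helper for illustration only; not a torchtune function.
    """
    column_map = column_map or {}
    # Fall back to the placeholder's own name when no mapping is given.
    instruction = sample[column_map.get("instruction", "instruction")]
    input_text = sample[column_map.get("input", "input")]
    return TEMPLATE.format(instruction=instruction, input=input_text)


# Illustrative row using the column names from the config above.
sample = {
    "type": "Adversarial",
    "question": "Is this an example question?",
    "best_answer": "Yes, it is only an example.",
}
column_map = {"instruction": "question", "input": "type", "output": "best_answer"}

prompt = format_with_column_map(sample, column_map)
print(prompt)
```

The `output` entry is not used when building the prompt; in this sketch it only records which column holds the target response.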

Some examples in the docs would help. Thanks for the quick answer.
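
On the PYTHONPATH point: a dotted string such as `truthful_qa.TruthfulQATemplate` can only be resolved if the module is importable from the interpreter's search path. The sketch below shows the general Python mechanism with `importlib`; it is an assumption about how such strings are typically resolved, not torchtune's actual code, and `resolve_dotted_path`, `my_templates_demo`, and `DemoTemplate` are made-up names.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def resolve_dotted_path(path):
    """Import 'package.module.Name' and return the named attribute."""
    module_name, _, attr = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


with tempfile.TemporaryDirectory() as tmp:
    # Write a throwaway module into a directory that is not on sys.path yet.
    Path(tmp, "my_templates_demo.py").write_text(
        "class DemoTemplate:\n    template = 'Q: {question}'\n"
    )

    try:
        resolve_dotted_path("my_templates_demo.DemoTemplate")
        found_before = True
    except ModuleNotFoundError:
        found_before = False

    # Equivalent in effect to adding the directory to PYTHONPATH.
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    demo_cls = resolve_dotted_path("my_templates_demo.DemoTemplate")
    found_after = demo_cls.__name__ == "DemoTemplate"
    sys.path.remove(tmp)
```

Installing the script as a package has the same effect as the `sys.path` change here: it places the module somewhere the import system already searches.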

joecummings commented 3 weeks ago

Yep, we've definitely got to make our docs clearer. Your input is much appreciated!

cc @RdoubleA