unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

How to fine-tune using a PyTorch dataset instead of HF's dataset #958

Open xugy16 opened 2 months ago

xugy16 commented 2 months ago

How can I use a PyTorch dataset to fine-tune Llama 3.1?

When I try to use a PyTorch dataset, I keep getting the following collator-related error:

File ~/anaconda3/envs/llama/lib/python3.10/site-packages/transformers/data/data_collator.py:589, in
...
labels = [feature[label_name] for feature in features] if label_name in features[0].keys() else None
# reconvert list[None] to None if necessary
# this might occur when we pass {..., "labels": None}

AttributeError: 'str' object has no attribute 'keys'
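As far as I can tell, the failing line iterates over features[0].keys(), so every item the dataset yields must be a dict of tokenized fields rather than a raw string. Roughly what DataCollatorForSeq2Seq expects per item (illustrative values only):

feature = {
    "input_ids": [128000, 882, 72],   # token ids from the tokenizer
    "attention_mask": [1, 1, 1],
    "labels": [128000, 882, 72],
}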

The reason I need a PyTorch dataset is that I want to add noise to the words (data augmentation), so the dataset is dynamic, as below.

def __getitem__(self, idx):
    # only add noise to the input text
    true_qry = self.data['true_qry'][idx]
    if random.random() < self.noise_prob:
        sample_edit_distance = random.randint(1, self.max_edit_distance)
        input_qry = self.add_noise(true_qry, sample_edit_distance)
    else:
        input_qry = true_qry
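For reference, a minimal sketch (untested) of how __getitem__ could tokenize the ChatML text itself and return a dict, so DataCollatorForSeq2Seq has something with .keys() to work on; the class name is hypothetical and add_noise is the method from the snippet above:

import random
from torch.utils.data import Dataset

class NoisyQueryDataset(Dataset):
    def __init__(self, data, tokenizer, noise_prob, max_edit_distance, max_seq_length):
        self.data = data
        self.tokenizer = tokenizer
        self.noise_prob = noise_prob
        self.max_edit_distance = max_edit_distance
        self.max_seq_length = max_seq_length

    def __len__(self):
        return len(self.data['true_qry'])

    def __getitem__(self, idx):
        true_qry = self.data['true_qry'][idx]
        if random.random() < self.noise_prob:
            sample_edit_distance = random.randint(1, self.max_edit_distance)
            input_qry = self.add_noise(true_qry, sample_edit_distance)  # add_noise as above
        else:
            input_qry = true_qry
        # build the ChatML text and tokenize here, so each item is already a dict
        text = (f"<|im_start|>user\n{input_qry}<|im_end|>\n"
                f"<|im_start|>assistant\n{true_qry}<|im_end|>\n")
        enc = self.tokenizer(text, truncation=True, max_length=self.max_seq_length)
        enc["labels"] = enc["input_ids"].copy()
        return enc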

I then follow the fine-tuning script and use the ChatML template:

<|im_start|>user
iobwin<|im_end|>
<|im_start|>assistant
ibowin<|im_end|>

The trainer is set up as below:

def my_formatting_func(example):
    # returns the example unchanged
    return example

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    # dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    args=train_args,
    formatting_func=my_formatting_func,  # added this line
)
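For completeness: my_formatting_func above just returns the example untouched, so if the dataset yields raw strings the collator still receives raw strings. My understanding is that, with an HF dataset, formatting_func is where the training text should be built; with hypothetical input_qry/true_qry columns (and, depending on the TRL version, batched inputs returning a list of strings) that would look roughly like:

def my_formatting_func(examples):
    # hypothetical column names; SFTTrainer tokenizes the returned texts itself
    texts = []
    for inp, tgt in zip(examples["input_qry"], examples["true_qry"]):
        texts.append(f"<|im_start|>user\n{inp}<|im_end|>\n"
                     f"<|im_start|>assistant\n{tgt}<|im_end|>\n")
    return texts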
danielhanchen commented 2 months ago

I think HF's datasets has a converter - unsure though - maybe https://github.com/huggingface/datasets/issues/4983?
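Something like this might work (untested sketch; note that Dataset.from_generator caches the generated rows, so the random noise would be frozen at conversion time rather than resampled every epoch):

from datasets import Dataset

def gen():
    # torch_train_dataset is your PyTorch dataset from above (hypothetical name),
    # assuming its __getitem__ returns the formatted text
    for i in range(len(torch_train_dataset)):
        yield {"text": torch_train_dataset[i]}

hf_train_dataset = Dataset.from_generator(gen)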

xugy16 commented 2 months ago

@danielhanchen Really appreciate your reply. Supposing we skip the conversion, is it possible to fine-tune Llama 3.1 with SFTTrainer using: 1) a PyTorch dataset with data augmentation; 2) the ChatML format?

I tried several methods, but it seems that SFTTrainer does not tokenize my ChatML input and throws the "'str' object has no attribute 'keys'" error.
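One thing I am considering (treat as a sketch; I am not sure dataset_kwargs={"skip_prepare_dataset": True} exists in every TRL version): have the Dataset return pre-tokenized dicts as in the sketch above, and tell SFTTrainer to skip its own dataset preparation so nothing expects a text column:

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,  # yields dicts with input_ids / attention_mask / labels
    max_seq_length=max_seq_length,
    packing=False,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    dataset_kwargs={"skip_prepare_dataset": True},  # version-dependent TRL option
    args=train_args,
)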