jzhang38 / TinyLlama

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
Apache License 2.0
7.3k stars 425 forks

Is there any simple demo of fine-tuning TinyLlama #175

Closed Bill-Cai closed 2 months ago

Bill-Cai commented 3 months ago

I'm a newbie and don't quite understand the code in the finetune.py script, so I wonder if it would be possible to provide a simple demo of fine-tuning TinyLlama.

For example, I have a dataset with just two columns (input, output). How can I preprocess it correctly so that it can be passed to the trainer and run properly?

My dataset preprocessing looks like this:

from datasets import Dataset, load_dataset

def preprocess_function(examples):
    return {"input_ids": examples["text"].split("\t")[0], "labels": examples["text"].split("\t")[1]}

dataset = load_dataset('text', data_files=data_name_or_path)['train']
dataset = dataset.map(preprocess_function).remove_columns('text')
print(dataset)
print(dataset[0])

output:

Dataset({
    features: ['input_ids', 'labels'],
    num_rows: 27773
})
{'input_ids': 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
 'labels': 'xxxxxxxxxxxxxxx'}

When I run the trainer:

from transformers import Trainer, TrainingArguments

# Define the training arguments
training_args = TrainingArguments(
    output_dir="/tmp/pycharm_project_787/src/result/test_ft",  # output directory
    num_train_epochs=3,  # number of training epochs
    per_device_train_batch_size=16,  # train batch size per device
    per_device_eval_batch_size=64,  # eval batch size per device
    warmup_steps=500,  # number of warmup steps
    weight_decay=0.01,  # weight decay
    logging_dir="/tmp/pycharm_project_787/src/logs/test_ft",  # logging directory
    # remove_unused_columns=False, # https://discuss.huggingface.co/t/indexerror-invalid-key-16-is-out-of-bounds-for-size-0/14298
)

# Define the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer
    # data_collator=DataCollatorForSeq2Seq(
    #     tokenizer, return_tensors="pt"
    # ),
)

# Train the model
trainer.train()

It fails with this error:

ValueError: type of xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx <class 'str'>. Should be one of a python, numpy, pytorch or tensorflow object.

I'm wondering whether something is wrong with my dataset construction and preprocessing, or whether I'm running the trainer the wrong way.

I'd be grateful if someone could answer this as soon as possible!

Bill-Cai commented 3 months ago

And when I change the dataset to the following format:

def preprocess_function(examples):
    return {"input": examples["text"].split("\t")[0], "output": examples["text"].split("\t")[1]}

I get this error:

IndexError: Invalid key: 63 is out of bounds for size 0

(Here I used 64 records for testing.)

RonanKMcGovern commented 3 months ago

You can use one of the notebooks from Unsloth: look up their GitHub and find a Colab notebook, then swap out Llama for TinyLlama in the model name, roughly as sketched below.
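
A rough sketch of the first cells of such a notebook with the swap applied (the checkpoint name and arguments here are illustrative, so check the actual notebook):

from unsloth import FastLanguageModel

# Load the base model in 4-bit; the Llama checkpoint is swapped for a TinyLlama one
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative TinyLlama checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)
# ...then keep the notebook's LoRA and SFTTrainer cells unchanged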

Bill-Cai commented 2 months ago

Thanks, I'll take a look.

Bill-Cai commented 2 months ago

OK, I figured out how to solve it. I'll write up the demo below so that others who come after me can learn from it.

The dataset should be organized as:

{'input': 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
 'output': 'xxxxxxxxxxxxxxxxxxxxxxxxxxxx'}
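
For reference, the preprocessing that produces this format is essentially the one from my earlier comment, keeping both fields as plain strings (my file is tab-separated, as before):

from datasets import load_dataset

def preprocess_function(example):
    # Split each tab-separated line into a source/target pair
    source, target = example["text"].split("\t", 1)
    return {"input": source, "output": target}

dataset = load_dataset("text", data_files=data_name_or_path)["train"]
dataset = dataset.map(preprocess_function, remove_columns=["text"])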

This is the format the code of finetune.py expects; its docstring says:

def make_data_module(tokenizer: transformers.PreTrainedTokenizer, args) -> Dict:
    """
    Make dataset and collator for supervised fine-tuning.
    Datasets are expected to have the following columns: { `input`, `output` }
    ...
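
Before defining the trainer, `model` and `tokenizer` are loaded in the standard Hugging Face way; a minimal sketch (the checkpoint name is only an example, and the pad-token line is my own workaround for tokenizers that ship without one, since the collator pads with pad_token_id):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.unk_token  # the collator needs a valid pad_token_id
model = AutoModelForCausalLM.from_pretrained(model_name)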

The core steps for defining the trainer are as follows:

from transformers import Seq2SeqTrainer, TrainingArguments

# Define the training arguments
training_args = TrainingArguments(
    output_dir="/tmp/pycharm_project_787/src/result/test_ft",  # output directory
    num_train_epochs=1,  # number of training epochs
    per_device_train_batch_size=16,  # train batch size per device
    per_device_eval_batch_size=64,  # eval batch size per device
    weight_decay=0.005,  # weight decay
    logging_dir="/tmp/pycharm_project_787/src/logs/test_ft",  # logging directory
    remove_unused_columns=False,  # keep the raw `input`/`output` columns for the collator
)

training_args.generation_config = None  # Seq2SeqTrainer expects args.generation_config, which plain TrainingArguments does not define

# Define the trainer
trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset,
    data_collator=DataCollatorForCausalLM(
        tokenizer=tokenizer,
        source_max_len=128,
        target_max_len=128,
        train_on_source=True,
        predict_with_generate=None,
    ),
)

# Train the model
trainer.train()

The DataCollatorForCausalLM class is also defined in finetune.py:

# Imports needed to run this excerpt standalone (IGNORE_INDEX matches finetune.py; -100 is ignored by the loss)
import copy
from dataclasses import dataclass
from typing import Dict, Sequence

import torch
import transformers
from torch.nn.utils.rnn import pad_sequence

IGNORE_INDEX = -100

@dataclass
class DataCollatorForCausalLM(object):
    tokenizer: transformers.PreTrainedTokenizer
    source_max_len: int
    target_max_len: int
    train_on_source: bool
    predict_with_generate: bool

    def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
        # Extract elements
        sources = [f"{self.tokenizer.bos_token}{example['input']}" for example in instances]
        targets = [f"{example['output']}{self.tokenizer.eos_token}" for example in instances]
        # Tokenize
        tokenized_sources_with_prompt = self.tokenizer(
            sources,
            max_length=self.source_max_len,
            truncation=True,
            add_special_tokens=False,
        )
        tokenized_targets = self.tokenizer(
            targets,
            max_length=self.target_max_len,
            truncation=True,
            add_special_tokens=False,
        )
        # Build the input and labels for causal LM
        input_ids = []
        labels = []
        for tokenized_source, tokenized_target in zip(
                tokenized_sources_with_prompt['input_ids'],
                tokenized_targets['input_ids']
        ):
            if not self.predict_with_generate:
                input_ids.append(torch.tensor(tokenized_source + tokenized_target))
                if not self.train_on_source:
                    labels.append(
                        torch.tensor(
                            [IGNORE_INDEX for _ in range(len(tokenized_source))] + copy.deepcopy(tokenized_target))
                    )
                else:
                    labels.append(torch.tensor(copy.deepcopy(tokenized_source + tokenized_target)))
            else:
                input_ids.append(torch.tensor(tokenized_source))
        # Apply padding
        input_ids = pad_sequence(input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id)
        labels = pad_sequence(labels, batch_first=True,
                              padding_value=IGNORE_INDEX) if not self.predict_with_generate else None
        data_dict = {
            'input_ids': input_ids,
            'attention_mask': input_ids.ne(self.tokenizer.pad_token_id),
        }
        if labels is not None:
            data_dict['labels'] = labels
        return data_dict

It tokenizes the input and output columns into input_ids and labels, i.e. it handles the vectorization.
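
As a quick sanity check, calling the collator on a couple of toy records (reusing the `tokenizer` loaded above; the strings are just placeholders) shows the padded tensors it produces:

collator = DataCollatorForCausalLM(
    tokenizer=tokenizer,
    source_max_len=128,
    target_max_len=128,
    train_on_source=True,
    predict_with_generate=None,
)
batch = collator([
    {"input": "some source text", "output": "some target text"},
    {"input": "another source", "output": "another target"},
])
print(batch["input_ids"].shape, batch["labels"].shape, batch["attention_mask"].shape)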

Training runs successfully with the trainer defined above.
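
Finally, saving the weights and running a quick generation check looks roughly like this (the prompt string is just a placeholder; the collator prepends the BOS token to each source, so I mirror that here):

# Save the fine-tuned weights to the output directory
trainer.save_model(training_args.output_dir)

# Quick generation check
prompt = tokenizer.bos_token + "some test input"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))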