And when I change the dataset with the following preprocessing function:

def preprocess_function(examples):
    return {"input": examples["text"].split("\t")[0], "output": examples["text"].split("\t")[1]}

it fails with this error:

IndexError: Invalid key: 63 is out of bounds for size 0

(Here I use 64 records to test.)
You can use one of the notebooks from Unsloth: look up their GitHub and find a Colab notebook, then swap out Llama for TinyLlama in the model name.
Thanks, I'll take a look.
OK, I figured out how to solve it. I will write the demo below so that those who come after me can learn from it.
The dataset should be organized as:
{'input': 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
'output': 'xxxxxxxxxxxxxxxxxxxxxxxxxxxx'}
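For example, a tab-separated text column (like the one in the preprocess_function above) can be converted into this layout with the Hugging Face datasets library. This is only a minimal sketch; the file path train.tsv is a placeholder:

from datasets import load_dataset

# Placeholder path: each line of train.tsv is assumed to be "<input>\t<output>".
raw = load_dataset("text", data_files={"train": "train.tsv"})["train"]

def to_input_output(example):
    # Non-batched map: example["text"] is a single string here, so split() works per record.
    source, target = example["text"].split("\t", 1)
    return {"input": source, "output": target}

# Drop the original "text" column so only the `input` and `output` columns remain.
dataset = raw.map(to_input_output, remove_columns=["text"])
print(dataset[0])  # {'input': '...', 'output': '...'}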
This is because I read the code of finetune.py, and it says:
def make_data_module(tokenizer: transformers.PreTrainedTokenizer, args) -> Dict:
    """
    Make dataset and collator for supervised fine-tuning.
    Datasets are expected to have the following columns: { `input`, `output` }
    ...
The core process of defining the trainer is as follows:
from transformers import Seq2SeqTrainer, TrainingArguments
# DataCollatorForCausalLM is copied from finetune.py (it is shown further below).

# Define the training arguments
training_args = TrainingArguments(
    output_dir="/tmp/pycharm_project_787/src/result/test_ft",  # output directory
    num_train_epochs=1,                                        # number of training epochs
    per_device_train_batch_size=16,                            # training batch size per device
    per_device_eval_batch_size=64,                             # evaluation batch size per device
    weight_decay=0.005,                                        # weight decay
    logging_dir="/tmp/pycharm_project_787/src/logs/test_ft",   # logging directory
    remove_unused_columns=False,
)
training_args.generation_config = None

# Define the trainer
trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset,
    data_collator=DataCollatorForCausalLM(
        tokenizer=tokenizer,
        source_max_len=128,
        target_max_len=128,
        train_on_source=True,
        predict_with_generate=None,
    ),
)

# Train the model
trainer.train()
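The model, tokenizer, and dataset variables used above still need to be created first. Here is a minimal sketch, assuming a TinyLlama chat checkpoint and a toy two-record dataset (both are just assumptions for illustration):

from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint; any TinyLlama checkpoint should work the same way.
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# The Llama tokenizer ships without a pad token; reuse EOS so the collator can pad batches.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# A toy dataset in the {`input`, `output`} format expected by the collator below.
dataset = Dataset.from_dict({
    "input": ["Translate to French: Hello", "Translate to French: Goodbye"],
    "output": ["Bonjour", "Au revoir"],
})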
The function DataCollatorForCausalLM() is also defined in finetune.py:
# Imports and the IGNORE_INDEX constant needed to run this snippet stand-alone:
import copy
from dataclasses import dataclass
from typing import Dict, Sequence

import torch
import transformers
from torch.nn.utils.rnn import pad_sequence

IGNORE_INDEX = -100  # tokens with this label are ignored by the cross-entropy loss

@dataclass
class DataCollatorForCausalLM(object):
    tokenizer: transformers.PreTrainedTokenizer
    source_max_len: int
    target_max_len: int
    train_on_source: bool
    predict_with_generate: bool

    def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
        # Extract elements
        sources = [f"{self.tokenizer.bos_token}{example['input']}" for example in instances]
        targets = [f"{example['output']}{self.tokenizer.eos_token}" for example in instances]
        # Tokenize
        tokenized_sources_with_prompt = self.tokenizer(
            sources,
            max_length=self.source_max_len,
            truncation=True,
            add_special_tokens=False,
        )
        tokenized_targets = self.tokenizer(
            targets,
            max_length=self.target_max_len,
            truncation=True,
            add_special_tokens=False,
        )
        # Build the input and labels for causal LM
        input_ids = []
        labels = []
        for tokenized_source, tokenized_target in zip(
            tokenized_sources_with_prompt['input_ids'],
            tokenized_targets['input_ids']
        ):
            if not self.predict_with_generate:
                input_ids.append(torch.tensor(tokenized_source + tokenized_target))
                if not self.train_on_source:
                    labels.append(
                        torch.tensor(
                            [IGNORE_INDEX for _ in range(len(tokenized_source))] + copy.deepcopy(tokenized_target))
                    )
                else:
                    labels.append(torch.tensor(copy.deepcopy(tokenized_source + tokenized_target)))
            else:
                input_ids.append(torch.tensor(tokenized_source))
        # Apply padding
        input_ids = pad_sequence(input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id)
        labels = pad_sequence(labels, batch_first=True,
                              padding_value=IGNORE_INDEX) if not self.predict_with_generate else None
        data_dict = {
            'input_ids': input_ids,
            'attention_mask': input_ids.ne(self.tokenizer.pad_token_id),
        }
        if labels is not None:
            data_dict['labels'] = labels
        return data_dict
It tokenizes the input and output columns into input_ids and labels, which is the vectorization step. With the trainer defined above, training runs successfully.
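To check what the collator produces, it can also be called directly on a couple of records. This is a small sketch that assumes the tokenizer has a pad token set (for example, reusing the EOS token as above) and uses made-up sample texts:

# Assumes `tokenizer` is the TinyLlama/Llama tokenizer loaded earlier, with tokenizer.pad_token set.
collator = DataCollatorForCausalLM(
    tokenizer=tokenizer,
    source_max_len=128,
    target_max_len=128,
    train_on_source=True,
    predict_with_generate=False,
)

batch = collator([
    {"input": "Translate to French: Hello", "output": "Bonjour"},
    {"input": "Translate to French: Goodbye", "output": "Au revoir"},
])

# input_ids, attention_mask, and labels are padded to the longest sequence in the batch.
print(batch["input_ids"].shape, batch["labels"].shape)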
I'm a newbie and don't quite understand the code in the finetune.py script, so I wonder if it's possible to provide a simple demo of fine-tuning TinyLlama.
For example, I now have a dataset with just two columns (input, output). How can I preprocess the dataset correctly so that it can be fed into the trainer and run properly?
My preprocessed dataset is like below:
output:
When I run the trainer:
It got an error:
I'm wondering if there's something wrong with my dataset construction and preprocessing, or if the trainer is being run in the wrong way.
I'd be grateful if someone could answer this as soon as possible!