meta-llama / llama-recipes

Scripts for fine-tuning Meta Llama with composable FSDP & PEFT methods to cover single/multi-node GPUs. Supports default & custom datasets for applications such as summarization and Q&A. Supports a number of candidate inference solutions, such as HF TGI and vLLM, for local or cloud deployment. Includes demo apps to showcase Meta Llama for WhatsApp & Messenger.
15.19k stars 2.2k forks

finetuning using notebook on custom dataset #788

Open amoghskanda opened 4 days ago

amoghskanda commented 4 days ago

System Info

python 3.10.15 torch 2.5.1 transformers 4.46.2 tokenizers 0.20.3

🐛 Describe the bug

I need to fine-tune Llama 3.2 11B Vision Instruct, which I downloaded from Hugging Face (https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct). I'm trying to fine-tune the model on a custom dataset of mine by following the fine-tuning notebook. When I start fine-tuning, I run into a list-to-tensor conversion issue, which I'm guessing is because the dataset is not in the right format. Could anybody suggest the right dataset format? I have ~4k images, a metadata.csv with 20 columns containing all the information about the images, and a prompt for fine-tuning. The code I used to generate the dataset:


import os
import pandas as pd
from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer
from PIL import Image
from torchvision import transforms
import torch

image_folder = 'path to images folder'
csv_file = 'path to metadata.csv'
prompt = "The prompt used for FT"
metadata = pd.read_csv(csv_file)
metadata['image_path'] = metadata['file_name'].apply(lambda x: os.path.join(image_folder, x))

def load_image(image_path):
    image = Image.open(image_path).convert("RGB")
    return image

def preprocess_image(image):
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
    return transform(image)

def tokenize_prompt(prompt, tokenizer):
    return tokenizer(prompt, return_tensors="pt", padding="max_length", truncation=True, max_length=512)

tokenizer = AutoTokenizer.from_pretrained("path to llama model")

data = []
for idx, row in metadata.iterrows():
    image_path = row["image_path"]  # already joined with image_folder above
    image = load_image(image_path)
    image = preprocess_image(image)

    tokenized_prompt = tokenize_prompt(prompt, tokenizer)

    data_entry = {
        "image": image,
        "text": prompt,
        "input_ids": tokenized_prompt["input_ids"].squeeze().tolist(),
        "attention_mask": tokenized_prompt["attention_mask"].squeeze().tolist(),
        "metadata": row.to_dict()
    }
    data.append(data_entry)

dataset = Dataset.from_pandas(pd.DataFrame(data))

dataset_dict = DatasetDict({
    "train": dataset
})

dataset_dict.save_to_disk("train_dataset")

Error logs

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[9], line 15
     12 scheduler = StepLR(optimizer, step_size=1, gamma=train_config.gamma)
     14 # Start the training process
---> 15 results = train(
     16     model,
     17     train_dataloader['train'],
     18     eval_dataloader['test'],
     19     tokenizer,
     20     optimizer,
     21     scheduler,
     22     train_config.gradient_accumulation_steps,
     23     train_config,
     24     None,
     25     None,
     26     None,
     27     wandb_run=None,
     28 )

File ~/anaconda3/envs/llama/lib/python3.10/site-packages/llama_recipes/utils/train_utils.py:151, in train(model, train_dataloader, eval_dataloader, tokenizer, optimizer, lr_scheduler, gradient_accumulation_steps, train_config, fsdp_config, local_rank, rank, wandb_run)
    149     batch[key] = batch[key].to('xpu:0')
    150 elif torch.cuda.is_available():
--> 151     batch[key] = batch[key].to('cuda:0')
    152 with autocast():
    153     loss = model(**batch).loss

AttributeError: 'list' object has no attribute 'to'

I have tried keeping input_ids and attention_mask as PyTorch tensors, but converting the tensors to Arrow objects failed during dataset creation.
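
One possible way to sidestep the Arrow conversion problem, sketched here under assumptions rather than taken from the notebook: keep only plain strings and numbers in each record (image path, prompt text, metadata) and defer image loading and tensor creation to batch time. The `build_records` helper, the inline sample CSV, and its `label` column are hypothetical and only illustrate the record shape.

```python
import csv
import io
import os


def build_records(csv_text, image_folder, prompt):
    """Return one plain-Python record per CSV row (no tensors, no PIL images)."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        records.append({
            # Store the path only; the image is opened later, at batch time.
            "image_path": os.path.join(image_folder, row["file_name"]),
            "text": prompt,
            "metadata": dict(row),
        })
    return records


# Hypothetical two-row stand-in for metadata.csv.
sample_csv = "file_name,label\nimg_001.jpg,cat\nimg_002.jpg,dog\n"
records = build_records(sample_csv, "images", "The prompt used for FT")
print(records[0]["image_path"])
```

Records of this shape contain only values that serialize cleanly, so the conversion to tensors can happen in one place when a batch is assembled.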

Expected behavior

Any guide on how to create a dataset compatible with Llama 3.2 11B Vision Instruct from images, metadata, and a prompt.

HamidShojanazeri commented 3 hours ago

cc: @wukaixingxp

wukaixingxp commented 12 minutes ago

@amoghskanda You need to convert the list into a tensor, something like batch["labels"] = torch.tensor(label_list). Please check this example of how to convert the dialogs into tokens.
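
The suggestion above can be sketched roughly as follows. This is a minimal illustration, not the code from llama-recipes: `lists_to_tensors` is a hypothetical helper, the token values are made up, and the lists are assumed to be already padded to equal length (ragged lists would make `torch.tensor` fail). The `-100` label is the default ignore index of PyTorch's cross-entropy loss.

```python
import torch


def lists_to_tensors(batch):
    """Convert list-valued entries of a batch dict to tensors; keep the rest as-is."""
    converted = {}
    for key, value in batch.items():
        if isinstance(value, list):
            converted[key] = torch.tensor(value)
        else:
            converted[key] = value
    return converted


# Made-up batch of two already-padded sequences.
batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6]],
    "attention_mask": [[1, 1, 1], [1, 1, 0]],
    "labels": [[1, 2, 3], [4, 5, -100]],
}
batch = lists_to_tensors(batch)

# Every value is now a tensor, so .to(device) no longer raises AttributeError.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
batch = {key: value.to(device) for key, value in batch.items()}
```

A conversion like this would typically live in the dataset's collate function, so that the training loop only ever sees tensors.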