huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0
132.05k stars 26.3k forks

The llava processor does not appear to support batch process. #33233

Open zhangzef opened 2 weeks ago

zhangzef commented 2 weeks ago

System Info

Who can help?

@ArthurZucker No response

Information

Tasks

Reproduction

When I use the LLaVa processor to preprocess my dataset with multiple processes, the program appears to hang: it stalls right at the start of the mapping phase. When I switch to the CLIP processor, the mapping runs normally.

Code that works properly:

from transformers import AutoModel, AutoProcessor, LlavaProcessor, LlavaForConditionalGeneration
from datasets import load_dataset
import torch
import os

model_name = "./model_para/clip-vit-large-patch14-336"
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoProcessor.from_pretrained(model_name)

def preprocess_function(examples):
    images = [img.convert("RGB") for img in examples["image"]]
    prompts = [f"<image>\nQuestion: {q}\nAnswer:" for q in examples["question"]]
    chosen = examples["chosen"]
    rejected = examples["rejected"]

    inputs = tokenizer(text=prompts, images=images)

    return {
        "images": images,
        "prompt": prompts,
        "chosen": chosen,
        "rejected": rejected
    }

dataset = load_dataset('./datasets/RLAIF-V-Dataset-question')["train"].select(range(256))
train_dataset = dataset.map(
    preprocess_function,
    batched=True,
    batch_size=128,
    num_proc=os.cpu_count(),
    )

Code that doesn't work:

from transformers import AutoModel, AutoProcessor, LlavaProcessor, LlavaForConditionalGeneration
from datasets import load_dataset
import torch
import os

model_name = "./model_para/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = LlavaProcessor.from_pretrained(model_name)

def preprocess_function(examples):
    images = [img.convert("RGB") for img in examples["image"]]
    prompts = [f"<image>\nQuestion: {q}\nAnswer:" for q in examples["question"]]
    chosen = examples["chosen"]
    rejected = examples["rejected"]

    inputs = tokenizer(text=prompts, images=images)

    return {
        "images": images,
        "prompt": prompts,
        "chosen": chosen,
        "rejected": rejected
    }

dataset = load_dataset('./datasets/RLAIF-V-Dataset-question')["train"].select(range(256))
train_dataset = dataset.map(
    preprocess_function,
    batched=True,
    batch_size=128,
    num_proc=os.cpu_count(),
    )

For a more detailed discussion, see this issue: https://github.com/huggingface/trl/issues/1964#issue-2484568153

Expected behavior

...

NielsRogge commented 2 weeks ago

Hi,

LLaVa does support batched generation, see here and here for example code snippets.

I also wonder why you are passing the images + text through the processor but then not using the inputs created?

zhangzef commented 2 weeks ago

> Hi,
>
> LLaVa does support batched generation, see here and here for example code snippets.
>
> I also wonder why you are passing the images + text through the processor but then not using the inputs created?

Thank you for your reply! This is just test code, but the second snippet still hangs during the mapping step.

zucchini-nlp commented 1 week ago

@zhangzef indeed, with transformers version 4.41.2 the code doesn't run for me either, yet after updating to the latest version, 4.44.2, it works. I will look into what was wrong with the older version. It doesn't seem to be related to the LLaVa code per se, as that code hasn't changed drastically for a long time.
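
A minimal sketch for checking which side of that report an environment falls on. The only facts taken from this thread are that 4.41.2 hung and 4.44.2 worked for one commenter; nothing here says which release in between changed the behavior. The helper names (`parse_release`, `is_at_least`, `installed_version`) are hypothetical and exist only for illustration.

```python
from importlib import metadata


def parse_release(version: str) -> tuple:
    """Turn '4.44.2' or '4.45.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            # Stop at the first non-numeric segment such as 'dev0'.
            break
        parts.append(int(digits))
    return tuple(parts)


def is_at_least(installed: str, minimum: str) -> bool:
    """Compare two version strings by their numeric release segments only."""
    return parse_release(installed) >= parse_release(minimum)


def installed_version(package: str = "transformers"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("transformers is not installed")
    elif is_at_least(found, "4.44.2"):
        print(f"transformers {found}: at or above the version reported to work")
    else:
        print(f"transformers {found}: older than 4.44.2, the hang may reproduce")
```

The comparison ignores pre-release suffixes, so it is only a rough guide for development builds.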

ArthurZucker commented 1 week ago

@zucchini-nlp it might actually be worth adding multiprocessing tests, WDYT?

zucchini-nlp commented 1 week ago

@ArthurZucker not sure I got you. Do you mean adding a test with datasets? Aren't we supposed to test that within the datasets repo?

FYI, I tried to reproduce the same error with a simple multiprocessing.Pool, but it works fine, and I haven't had time to dive into the issue yet.

ArthurZucker commented 1 week ago

I meant making sure our processors work with multiprocessing pools! Not necessarily with datasets. Cool if that worked!
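
A rough sketch of what such a pool-based check could look like, assuming the goal is to turn a hang into a test failure rather than an endless wait. This is not an existing transformers test. `fake_processor` is a hypothetical stand-in so the snippet runs without any model files; a real check would call the actual processor there instead. `split_batches`, `run_in_pool`, and `main` are likewise illustrative names.

```python
import multiprocessing as mp


def split_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def fake_processor(prompts):
    """Hypothetical stand-in for a processor call such as processor(text=..., images=...)."""
    return {"input_lengths": [len(p.split()) for p in prompts]}


def run_in_pool(fn, batches, num_proc=2, timeout=60.0):
    """Apply fn to every batch in worker processes, failing instead of hanging.

    AsyncResult.get raises multiprocessing.TimeoutError if a worker does not
    return within `timeout` seconds, so a stuck worker surfaces as an error.
    """
    with mp.Pool(processes=num_proc) as pool:
        pending = [pool.apply_async(fn, (batch,)) for batch in batches]
        return [job.get(timeout=timeout) for job in pending]


def main():
    prompts = [f"<image>\nQuestion: item {i}\nAnswer:" for i in range(8)]
    results = run_in_pool(fake_processor, split_batches(prompts, 4))
    print(results)
```

To try it, call `main()` from under an `if __name__ == "__main__":` guard, which the spawn start method requires. The worker function must be defined at module level so it can be pickled and sent to the worker processes.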