The LLaVa processor does not appear to support batch processing.

zhangzef commented 2 weeks ago

System Info

transformers version: 4.41.2
Platform: Linux-5.10.0-1.0.0.28-x86_64-with-glibc2.31
Python version: 3.10.14
Huggingface_hub version: 0.24.6
Safetensors version: 0.4.4
Accelerate version: 0.33.0
Accelerate config: not found
PyTorch version (GPU?): 2.4.0+cu121 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?:
Using distributed or parallel set-up in script?:

Who can help?

@ArthurZucker No response

Information

[ ] The official example scripts
[ ] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

When I use the LLaVa processor to preprocess my dataset with multiple processes, the program appears to hang: it stalls at the beginning of the mapping phase. When I switch the processor to CLIP, the mapping completes normally.

Code that works properly:

from transformers import AutoModel, AutoProcessor, LlavaProcessor, LlavaForConditionalGeneration
from datasets import load_dataset
import torch
import os

model_name = "./model_para/clip-vit-large-patch14-336"
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoProcessor.from_pretrained(model_name)

def preprocess_function(examples):
    images = [img.convert("RGB") for img in examples["image"]]
    prompts = [f"<image>\nQuestion: {q}\nAnswer:" for q in examples["question"]]
    chosen = examples["chosen"]
    rejected = examples["rejected"]

    inputs = tokenizer(text=prompts, images=images)

    return {
        "images": images,
        "prompt": prompts,
        "chosen": chosen,
        "rejected": rejected
    }

dataset = load_dataset('./datasets/RLAIF-V-Dataset-question')["train"].select(range(256))
train_dataset = dataset.map(
    preprocess_function,
    batched=True,
    batch_size=128,
    num_proc=os.cpu_count(),
)

Code that doesn't work:

from transformers import AutoModel, AutoProcessor, LlavaProcessor, LlavaForConditionalGeneration
from datasets import load_dataset
import torch
import os

model_name = "./model_para/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = LlavaProcessor.from_pretrained(model_name)

def preprocess_function(examples):
    images = [img.convert("RGB") for img in examples["image"]]
    prompts = [f"<image>\nQuestion: {q}\nAnswer:" for q in examples["question"]]
    chosen = examples["chosen"]
    rejected = examples["rejected"]

    inputs = tokenizer(text=prompts, images=images)

    return {
        "images": images,
        "prompt": prompts,
        "chosen": chosen,
        "rejected": rejected
    }

dataset = load_dataset('./datasets/RLAIF-V-Dataset-question')["train"].select(range(256))
train_dataset = dataset.map(
    preprocess_function,
    batched=True,
    batch_size=128,
    num_proc=os.cpu_count(),
)

For a detailed discussion, see this issue: https://github.com/huggingface/trl/issues/1964#issue-2484568153

Expected behavior

...

NielsRogge commented 2 weeks ago

Hi,

LLaVa does support batched generation, see here and here for example code snippets.

I also wonder why you are passing the images + text through the processor but then not using the inputs created?

zhangzef commented 2 weeks ago

Thank you for your reply! It is just test code, but the second snippet still blocks during the mapping step.

zucchini-nlp commented 1 week ago

@zhangzef indeed, the code also doesn't run for me on transformers 4.41.2, but it works after updating to the latest version, 4.44.2. I will look into what was wrong with the older version. It doesn't seem to be related to the LLaVa code per se, as that code hasn't changed drastically for a long time.
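
For anyone who wants to check which side of that version gap they are on, here is a minimal sketch using only the standard library. The helper names (`release_tuple`, `is_at_least`, `check_installed`) are hypothetical, and 4.44.2 is simply the version reported to work in this thread, not a confirmed minimum.

```python
from importlib.metadata import PackageNotFoundError, version


def release_tuple(version_string):
    """Turn '4.44.2' into (4, 44, 2).

    Simplification: parsing stops at the first non-numeric component,
    so suffixes such as '.dev0' or '+cu121' are ignored.
    """
    parts = []
    for piece in version_string.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_at_least(installed, required):
    """Compare two version strings by their numeric release components."""
    return release_tuple(installed) >= release_tuple(required)


def check_installed(package="transformers", required="4.44.2"):
    """Return True/False for an installed package, or None if it is missing."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return None
    return is_at_least(installed, required)


if __name__ == "__main__":
    # "4.41.2" is the version from the report, "4.44.2" the one that worked.
    print(is_at_least("4.41.2", "4.44.2"))
    print(check_installed())
```

The tuple comparison is deliberately crude; a dedicated version parser would be a better fit if pre-release tags matter.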

ArthurZucker commented 1 week ago

@zucchini-nlp it's actually worth adding multiprocessing tests, WDYT?

zucchini-nlp commented 1 week ago

@ArthurZucker not sure I got you, do you mean adding a test with datasets? Aren't we supposed to test that within datasets repo?

FYI, I tried to reproduce the same error with a simple multiprocessing.Pool, but it works fine. I haven't had time to dive into the issue yet.

ArthurZucker commented 1 week ago

I meant to make sure our processors work with multiprocessing pools! Not necessarily dataset. Cool if that worked!
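
A rough sketch of what such a pool test could look like, using only the standard library. This is not an existing transformers test: `map_with_timeout` and `fake_processor` are hypothetical names, and `fake_processor` is a stand-in that would be swapped for a real processor call. The timeout turns a silent hang like the one reported above into an explicit `multiprocessing.TimeoutError`. The `"fork"` default is an assumption that matches the Linux platform in the report.

```python
import multiprocessing as mp


def map_with_timeout(fn, batches, num_workers=2, timeout=60.0, start_method="fork"):
    """Apply fn to every batch in a process pool, failing loudly on a stall.

    start_method="fork" is an assumption (Linux, as in the report);
    pass "spawn" on platforms where fork is unavailable.
    """
    ctx = mp.get_context(start_method)
    with ctx.Pool(processes=num_workers) as pool:
        async_result = pool.map_async(fn, batches)
        # get(timeout=...) raises multiprocessing.TimeoutError instead of
        # blocking forever; leaving the "with" block terminates the workers.
        return async_result.get(timeout=timeout)


def fake_processor(batch):
    # Hypothetical stand-in for a real processor call; it only measures
    # the length of each text so the sketch runs without any model files.
    return [len(text) for text in batch]


if __name__ == "__main__":
    batches = [["a", "bb"], ["ccc"]]
    print(map_with_timeout(fake_processor, batches, num_workers=2, timeout=30.0))
```

A real test would presumably load the processor once at module level, call it inside the worker function, and assert on the returned keys rather than on lengths.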
