huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

llava-next does not support batched processing/generation when batched images are not of same size #29832

Closed aliencaocao closed 5 months ago

aliencaocao commented 7 months ago

System Info

Who can help?

@ArthurZucker @younesbelkada

Information

Tasks

Reproduction

Use 2 images that have different resolutions:

        inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt").to(model.device)

where images is a list of PIL.Image objects and prompts is a list of strings of the same length.
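
For context, a minimal self-contained reproduction (on the transformers version current at the time of this issue) might look like the following; the checkpoint and image sizes are only illustrative, the point being that the two images map to different anyres grids and therefore to a different number of extracted patches:

from PIL import Image
from transformers import LlavaNextProcessor

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

# Two blank images whose aspect ratios select different anyres grids,
# so a different number of patches is extracted for each image.
images = [Image.new("RGB", (500, 500)), Image.new("RGB", (1600, 300))]
prompts = ["[INST] <image>\nDescribe the image. [/INST]"] * 2

inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt")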

Error:

  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llava_next/processing_llava_next.py", line 105, in __call__
    image_inputs = self.image_processor(images, return_tensors=return_tensors)
  File "/usr/local/lib/python3.10/dist-packages/transformers/image_processing_utils.py", line 551, in __call__
    return self.preprocess(images, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llava_next/image_processing_llava_next.py", line 608, in preprocess
    return BatchFeature(data=data, tensor_type=return_tensors)
  File "/usr/local/lib/python3.10/dist-packages/transformers/feature_extraction_utils.py", line 78, in __init__
    self.convert_to_tensors(tensor_type=tensor_type)
  File "/usr/local/lib/python3.10/dist-packages/transformers/feature_extraction_utils.py", line 188, in convert_to_tensors
    raise ValueError(
ValueError: Unable to create tensor, you should probably activate padding with 'padding=True' to have batched tensors with the same length.

This issue does not appear in the original implementation.

I see that https://github.com/huggingface/transformers/blob/cbe58b4269457a6ca66a556224b23f9ef246f905/src/transformers/models/llava_next/convert_llava_next_weights_to_hf.py#L296 exists because of this issue, and I hope that is not the intended solution, since overriding the image sizes out of nowhere would surely cause issues with the model.

Expected behavior

Be able to use images with different resolutions in the same batch.

amyeroberts commented 7 months ago

cc @NielsRogge

NielsRogge commented 7 months ago

Hi,

This was my conclusion as well after checking how batched generation would work. I think batched inference is only possible when you make sure all images have the same resolution (either by padding them to the same size, or by setting image_sizes to the same value, as done in the conversion script). This is also not supported in the original LLaVA inference script, so I'm curious whether the author (cc @haotian-liu) could shed some light on this.

Edit: batched inference can also work with images of different resolutions; this can be achieved by first calculating the maximum size, creating a tensor filled with zeros, and then filling it in.
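
One way to realize the image-level option mentioned above (pre-padding every image onto a canvas of the largest width and height, so the processor selects the same anyres grid and extracts the same number of patches for each) could be sketched as follows; the helper name is illustrative and not part of the library:

from PIL import Image

def pad_images_to_max(images: list[Image.Image]) -> list[Image.Image]:
    # Paste each image onto a black canvas of the largest width/height in the batch.
    max_w = max(im.width for im in images)
    max_h = max(im.height for im in images)
    padded = []
    for im in images:
        canvas = Image.new("RGB", (max_w, max_h))
        canvas.paste(im, (0, 0))
        padded.append(canvas)
    return padded

Note that this changes what the model actually sees (black borders around the smaller images), so it is a workaround rather than true variable-resolution batching.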

aliencaocao commented 7 months ago

I did make it work in the original implementation. While I cannot share the whole code here, as it is a private project, I can share the part that makes it work:

from typing import Union

import torch
from PIL import Image

from llava.constants import DEFAULT_IMAGE_TOKEN, DEFAULT_IM_END_TOKEN, DEFAULT_IM_START_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init

# tokenizer, model and image_processor come from load_pretrained_model();
# prompt_prefix, conv_mode, top_p, num_beams, max_new_tokens and
# process_responses are defined elsewhere in our project.

def infer_batch(im_paths: list[str], images: list[Image.Image], prompts: list[str], sample: bool) -> list[dict[str, Union[str, bool, float]]]:
    # Keep each image's original resolution; LLaVA needs it to handle the anyres patches.
    image_sizes = [x.size for x in images]
    image_tensor = process_images(images, image_processor, model.config)
    image_tensor = [image.to(model.device, dtype=torch.float16) for image in image_tensor]

    # Build one conversation prompt per image.
    convs = []
    for prompt in prompts:
        prompt = f'{prompt_prefix}\n{prompt}'
        conv = conv_templates[conv_mode].copy()
        conv.append_message(conv.roles[0], prompt)
        conv.append_message(conv.roles[1], None)
        convs.append(conv.get_prompt())

    # Tokenize each prompt (inserting the image token) and right-pad to the longest sequence.
    input_tokens = [tokenizer_image_token(prompt, tokenizer, return_tensors='pt') for prompt in convs]
    input_tokens_padded = torch.nn.utils.rnn.pad_sequence(input_tokens, batch_first=True, padding_value=tokenizer.pad_token_id).to(model.device)
    attention_mask = input_tokens_padded != tokenizer.pad_token_id

    with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=False, enable_mem_efficient=True):
        output = model.generate(
            input_tokens_padded,
            attention_mask=attention_mask,
            images=image_tensor,
            image_sizes=image_sizes,
            do_sample=sample,
            temperature=0.9 if sample else None,
            top_p=top_p if sample else None,
            num_beams=num_beams,
            max_new_tokens=max_new_tokens,
            use_cache=True,
            output_logits=True,
            return_dict_in_generate=True,
            pad_token_id=tokenizer.pad_token_id,
        )
    return process_responses(im_paths, output)

However, we used right padding, which isn't optimal, but I believe that can easily be changed to left padding (we didn't, as we switched to the HF implementation).

aliencaocao commented 7 months ago

They are able to pass in image_sizes explicitly, which I cannot do with the HF implementation (it says it is already passed in).

NielsRogge commented 7 months ago

Hi,

Ok, I've reproduced that the original implementation supports batched inference (branch is here).

For this to be supported in Transformers, two things will need to be added.

Processor

The processor class currently supports passing multiple images and corresponding prompts; however, this only works when the number of patches extracted per image is the same. For instance, the code snippet below works since 5 patches are extracted from each image.

from transformers import LlavaNextProcessor
from PIL import Image
import requests

processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image1 = Image.open(requests.get(url, stream=True).raw)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image2 = Image.open(requests.get(url, stream=True).raw)

images = [image1, image2]
prompts = ["[INST] <image>\nWhat is shown in this image? [/INST]", "[INST] <image>\nHow many cats are there? [/INST]"]

inputs = processor(images=images, text=prompts, padding=True, return_tensors="pt")

for k,v in inputs.items():
    print(k, v.shape)

However, when the number of patches differs, you'll get the ValueError: Unable to create tensor, you should probably activate padding with 'padding=True' to have batched tensors with the same length. As models in the Transformers library always expect tensors (rather than lists of tensors, as the original implementation uses), I assume we will have to pad the tensors with zeros in order to batch them together.
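
A rough sketch of that padding, assuming pixel_values_list holds the per-image tensors of shape (num_patches, channels, height, width) produced by the image processor (the function and variable names are illustrative):

import torch

def pad_to_max_patches(pixel_values_list: list[torch.Tensor]) -> torch.Tensor:
    # Zero-pad each image's patches along the first dimension up to the largest
    # patch count in the batch, then stack everything into one tensor of shape
    # (batch_size, max_num_patches, channels, height, width).
    max_patches = max(pv.shape[0] for pv in pixel_values_list)
    padded = []
    for pv in pixel_values_list:
        pad_len = max_patches - pv.shape[0]
        if pad_len > 0:
            pv = torch.cat([pv, pv.new_zeros((pad_len, *pv.shape[1:]))], dim=0)
        padded.append(pv)
    return torch.stack(padded, dim=0)

The image_sizes entry for each example would then let the model recover how many of those patches are real.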

Model

The original implementation supports batched inference as it pads the embeddings with zeros. This functionality would have to be added to the merge_input_ids_with_image_features method.
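
A conceptual sketch of that step, assuming image_features_list holds one tensor of shape (num_image_tokens, hidden_size) per image with varying lengths (the names are illustrative, not the actual method):

import torch

def pad_image_features(image_features_list: list[torch.Tensor]) -> tuple[torch.Tensor, torch.Tensor]:
    # Zero-pad every feature sequence to the longest one and return a mask
    # marking which positions contain real image features.
    max_len = max(feat.shape[0] for feat in image_features_list)
    hidden_size = image_features_list[0].shape[-1]
    padded = image_features_list[0].new_zeros((len(image_features_list), max_len, hidden_size))
    mask = torch.zeros(len(image_features_list), max_len, dtype=torch.bool)
    for i, feat in enumerate(image_features_list):
        padded[i, : feat.shape[0]] = feat
        mask[i, : feat.shape[0]] = True
    return padded, mask

The mask (or an attention mask derived from it) would ensure the zero padding does not influence generation.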

aliencaocao commented 6 months ago

Comment to keep open

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

aliencaocao commented 5 months ago

Closing as solved by https://github.com/huggingface/transformers/pull/29850