cc @NielsRogge
Hi,
This was my conclusion as well after checking how batched generation would work. I think batched inference is only possible when you make sure all images have the same resolution (either by padding them to the same size, or by setting image_sizes to the same value, as done in the conversion script). This is also not supported in the original LLaVa inference script, so I'm curious whether the author (cc @haotian-liu) could shed some light on this.
Edit: batched inference can also work when you have images of different resolutions; this can be achieved by first calculating the maximum size, creating a tensor filled with zeros, and then filling it in.
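For illustration, a minimal sketch of that idea (the helper below is hypothetical, not code from LLaVa or Transformers, and assumes channels-first (C, H, W) tensors):

import torch

def zero_pad_to_max_size(images: list[torch.Tensor]) -> torch.Tensor:
    # Hypothetical helper: batch (C, H, W) image tensors of different resolutions
    # by zero-padding each one to the largest height and width in the list.
    max_h = max(img.shape[1] for img in images)
    max_w = max(img.shape[2] for img in images)
    batch = images[0].new_zeros((len(images), images[0].shape[0], max_h, max_w))
    for i, img in enumerate(images):
        batch[i, :, : img.shape[1], : img.shape[2]] = img  # area outside the image stays zero
    return batch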
I did make it work in the original implementation. While I cannot share the whole code here, as it is part of a private project, I can share the part that makes it work:
from typing import Union

import torch
from PIL import Image

from llava.constants import DEFAULT_IMAGE_TOKEN, DEFAULT_IM_END_TOKEN, DEFAULT_IM_START_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init

# model, tokenizer, image_processor and the generation settings (prompt_prefix,
# conv_mode, top_p, num_beams, max_new_tokens) are set up elsewhere in the project.

def infer_batch(im_paths: list[str], images: list[Image.Image], prompts: list[str], sample: bool) -> list[dict[str, Union[str, bool, float]]]:
    # Preprocess all images in one go; each image can have a different resolution.
    image_sizes = [x.size for x in images]
    image_tensor = process_images(images, image_processor, model.config)
    image_tensor = [image.to(model.device, dtype=torch.float16) for image in image_tensor]

    # Build one conversation prompt per input.
    convs = []
    for prompt in prompts:
        prompt = f'{prompt_prefix}\n{prompt}'
        conv = conv_templates[conv_mode].copy()
        conv.append_message(conv.roles[0], prompt)
        conv.append_message(conv.roles[1], None)
        convs.append(conv.get_prompt())

    # Tokenize and (right-)pad the prompts to the same length.
    input_tokens = [tokenizer_image_token(prompt, tokenizer, return_tensors='pt') for prompt in convs]
    input_tokens_padded = torch.nn.utils.rnn.pad_sequence(input_tokens, batch_first=True, padding_value=tokenizer.pad_token_id).to(model.device)
    attention_mask = input_tokens_padded != tokenizer.pad_token_id

    with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=False, enable_mem_efficient=True):
        output = model.generate(
            input_tokens_padded,
            attention_mask=attention_mask,
            images=image_tensor,
            image_sizes=image_sizes,
            do_sample=sample,
            temperature=0.9 if sample else None,
            top_p=top_p if sample else None,
            num_beams=num_beams,
            max_new_tokens=max_new_tokens,
            use_cache=True,
            output_logits=True,
            return_dict_in_generate=True,
            pad_token_id=tokenizer.pad_token_id,
        )
    return process_responses(im_paths, output)
However, we used right padding, which isn't optimal for generation, but I believe that can easily be changed to left padding (we didn't, as we switched to the HF implementation).
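As a reference, left padding of the tokenized prompts could look roughly like this (hypothetical helper, not part of the snippet above):

import torch

def left_pad(sequences: list[torch.Tensor], pad_token_id: int) -> torch.Tensor:
    # Hypothetical helper: left-pad 1D token tensors to the same length so that
    # generation continues right after the last real token instead of after padding.
    max_len = max(seq.shape[0] for seq in sequences)
    padded = torch.full((len(sequences), max_len), pad_token_id, dtype=sequences[0].dtype)
    for i, seq in enumerate(sequences):
        padded[i, max_len - seq.shape[0]:] = seq
    return padded

The attention mask can then be derived the same way as above (padded != pad_token_id).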
They are also able to pass in image_sizes, which I cannot do with the HF implementation (it says it is already passed in).
Hi,
Ok I've reproduced that the original implementation supports batched inference (branch is here).
For this to be supported in Transformers, two things will need to be added.

The processor class currently supports passing multiple images and corresponding prompts; however, this only works when the number of patches extracted per image is the same. For instance, the code snippet below works since each image has 5 patches extracted.
from transformers import LlavaNextProcessor
from PIL import Image
import requests
processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
image1 = Image.open(requests.get(url, stream=True).raw)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image2 = Image.open(requests.get(url, stream=True).raw)
images = [image1, image2]
prompts = ["[INST] <image>\nWhat is shown in this image? [/INST]", "[INST] <image>\nHow many cats are there? [/INST]"]
inputs = processor(images=images, text=prompts, padding=True, return_tensors="pt")
for k, v in inputs.items():
    print(k, v.shape)
However, when the number of patches differs, you'll get ValueError: Unable to create tensor, you should probably activate padding with 'padding=True' to have batched tensors with the same length. As models in the Transformers library always expect tensors (rather than lists of tensors, as the original implementation uses), I assume we will have to pad the tensors with zeros in order to batch them together.
The original implementation supports batched inference as it pads the embeddings with zeros. This functionality would have to be added to the merge_input_ids_with_image_features method.
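Roughly, the zero padding on the pixel-values side could look like the sketch below (an assumption for illustration, not the actual Transformers code; the padded patches and their embeddings would still have to be ignored downstream, e.g. in merge_input_ids_with_image_features):

import torch

def pad_pixel_values(pixel_values: list[torch.Tensor]) -> torch.Tensor:
    # Hypothetical helper: pixel_values holds one (num_patches, C, H, W) tensor per
    # image, where num_patches differs when the images have different resolutions.
    max_patches = max(pv.shape[0] for pv in pixel_values)
    batch = pixel_values[0].new_zeros((len(pixel_values), max_patches, *pixel_values[0].shape[1:]))
    for i, pv in enumerate(pixel_values):
        batch[i, : pv.shape[0]] = pv  # remaining patch slots stay all-zero
    return batch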
Comment to keep open
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Closing as solved by https://github.com/huggingface/transformers/pull/29850
System Info

transformers version: 4.39.1

Who can help?

@ArthurZucker @younesbelkada
Reproduction
Use 2 images that have different resolutions, where images is a list of PIL.Image and prompts is a list of strings of the same length.

Error:
This issue does not appear in the original implementation.

I see that https://github.com/huggingface/transformers/blob/cbe58b4269457a6ca66a556224b23f9ef246f905/src/transformers/models/llava_next/convert_llava_next_weights_to_hf.py#L296 exists because of this issue, and I hope that is not the intended solution, since changing it out of nowhere will surely cause issues with the model.
Expected behavior
Be able to use images with different resolutions.