Closed: haotian-liu closed this issue 9 months ago.
Besides batch inference for VQA tasks, can batch inference be performed on image caption tasks?
@shams2023 Yes we will also support that.
I really need batch captioning to generate text descriptions for my low-resolution image dataset. I have tried BLIP (after fine-tuning), but the results are not good, so I would like to try LLaVA for this.
This would be really great to have! thank you for working on it.
@haotian-liu Hi, batch evaluation is really important for VQAv2, which takes too much time. I think issue #675 has provided a solution for batch image captioning without evaluation. It would be great if you could also support batch image captioning evaluation (e.g., CIDEr) for some traditional tasks (e.g., COCO and NoCaps).
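(In the meantime, captions written out in COCO result format can be scored offline with pycocoevalcap; a small sketch, where the annotation and result file names are placeholders:)

from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# Placeholder paths: a COCO-style annotation file and a results file of
# [{"image_id": int, "caption": str}, ...] produced by the captioning run.
annotation_file = "annotations/captions_val2014.json"
results_file = "llava_captions_val2014.json"

coco = COCO(annotation_file)
coco_result = coco.loadRes(results_file)

coco_eval = COCOEvalCap(coco, coco_result)
coco_eval.params["image_id"] = coco_result.getImgIds()  # score only the captioned images
coco_eval.evaluate()

print("CIDEr:", coco_eval.eval["CIDEr"])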
I understand that the generate() function is not behaving as intended! I did some profiling a few weeks ago.
Hello, is there any new progress? Besides batch inference for VQA tasks, can batch inference be performed on image captioning tasks?
I found in my experiments that when using the same question for every image (e.g. "describe the image"), the answers are basically consistent across different batch sizes. However, when each image uses a different question, the model outputs strange answers. May I ask why this problem occurs? Thank you so much!
{"question_id": 0, "prompt": "What is the color of the two suitcases in the image?", "text": "The color of the two suitcases in the image is black.", "answer_id": "2sdWELixEN6LQ7BRen73b6", "model_id": "llava-v1.5-13b", "metadata": {}} {"question_id": 1, "prompt": "Analyze the image in a comprehensive and detailed manner.", "text": "The image features a close-up of a young man's face, with a focus on his eyes and lips. The man appears to be looking at the camera, and his eyes are slightly open. The image is a digital drawing or illustration, capturing the man's facial features in detail. The background is white, which further emphasizes the subject's face and expression.", "answer_id": "LcavJhTSXJLvaCaXUBvk9f", "model_id": "llava-v1.5-13b", "metadata": {}} {"question_id": 2, "prompt": "What potential factors could make these suitcases valuable?", "text": "The two-pure-tubes-and-barges-in-the-cows-and-barges-in-the-cows-and-barges-in the-cows-and-barges-in the-cows-and-barges-in the-cows-and-barges-in the-cows-and-barges-in the-cows-and-barges-in the-cows-and-barges-in the-cows-and-barges-in the", "answer_id": "RfX2ZosPS5npuMxrGszWkV", "model_id": "llava-v1.5-13b", "metadata": {}} {"question_id": 3, "prompt": "What are the main objects on the table in the image?", "text": "The, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and,", "answer_id": "k9Lzv2yd5cj5684fRM9EDF", "model_id": "llava-v1.5-13b", "metadata": {}} {"question_id": 4, "prompt": "Describe the following image.", "text": "A black and white photo of a person wearing a black and white shirt. The person is wearing a black and white shirt.", "answer_id": "mJVswmZ4ctREnnwp6nnaUB", "model_id": "llava-v1.5-13b", "metadata": {}} {"question_id": 5, "prompt": "What activity might have recently taken place around this table and what could be its aftermath?", "text": "It is likely that a recent activity involving the use of the two broken wooden chopsticks took place around the table. The chopsticks are now broken and lying on the table, which suggests that they were used for eating or cooking and have since broken during the process. The aftermath of this situation could be that the person using the chopsticks might need to find an alternative method to eat or cook, as the broken chopsticks are no longer functional. Additionally, the broken chopsticks may pose a safety hazard if not properly disposed of, as they could cause injury if", "answer_id": "NaP9Dq8n2NRSMQrg69YUmN", "model_id": "llava-v1.5-13b", "metadata": {}} {"question_id": 6, "prompt": "What is the main focus of the image?", "text": "The main focus of the image is the two-pure-white-fie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie", "answer_id": "FvY25Pqw6zued6Abeirwoh", "model_id": "llava-v1.5-13b", "metadata": {}} {"question_id": 7, "prompt": "What is this photo about'?", "text": "The image is a photo of a person's feet, which are the most visible part of the person's body. 
The person's feet are the main focus of the photo, and the person's body is the most visible part of the photo.", "answer_id": "jGL4pqaVXXYL3driouLDrW", "model_id": "llava-v1.5-13b", "metadata": {}} {"question_id": 8, "prompt": "What could be a reason for the cat's interest in the laptop?", "text": "The cat's interest in the laptop, which is a part of the laptop, is a result of the cat's natural, and the cat's natural, and the cat's natural, and the cat's natural, and the cat's natural, and the cat's natural, and the cat's natural, and the cat's natural, and the cat's natural, and the cat's natural, and the cat's natural, and the cat's natural, and the cat's natural, and the cat's natural, and the cat's natural, and the cat", "answer_id": "jawMBaRhvyz98K93QRqPap", "model_id": "llava-v1.5-13b", "metadata": {}} {"question_id": 9, "prompt": "What color is the dog in the image?", "text": "The dog in the image is a black dog, and the person is a white person.", "answer_id": "BHNjYWe8sVuPGLj4ikBG78", "model_id": "llava-v1.5-13b", "metadata": {}} """
Have you completed the task for image captions?
Hi, when I use multiple GPUs for multi-batch inference, I encounter the following error. However, when performing multi-batch inference on a single GPU (develop branch), or single-batch inference on multiple GPUs (main branch), I do not encounter this problem. I look forward to your reply.
Traceback (most recent call last):
File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-perception/baiyan02/conda_env/llava_env/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-perception/baiyan02/conda_env/llava_env/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/workdir/baiyan02/2071702/70b4c7f21b349f3a43f519b683820c89/llava/eval/model_vqa_batch.py", line 177, in
![2023-11-27 13-16-37 screenshot](https://github.com/haotian-liu/LLaVA/assets/15922438/2cc2da43-f874-42e4-a02d-820f12d5c165)
For me, batch evaluation also gave NaN values after fine-tuning. Performing evaluation with bfloat16 instead of float16 solved this for me (I also fine-tuned with bf16 True). This way, batch size 1 and larger batch sizes give very similar results.
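For reference, a rough sketch of one way to run in bfloat16 without editing the loader (load_pretrained_model loads in fp16 by default, so this simply casts the model afterwards; the model path is a placeholder):

import torch
from llava.mm_utils import get_model_name_from_path
from llava.model.builder import load_pretrained_model

# Assumption: cast the whole model to bf16 after loading; the image tensor
# fed to generate() must then also be cast to torch.bfloat16.
model_path = "liuhaotian/llava-v1.5-13b"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
)
model = model.to(dtype=torch.bfloat16)
# later, when building inputs:
# images_tensor = images_tensor.to(model.device, dtype=torch.bfloat16)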
@haotian-liu thank you for this amazing work. I just started getting familiar with this repository recently. I would like to point out a few things and also ask a question. You provide an example here, where you load the model once
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
)
and then again when calling eval_model(args) (the model is loaded twice into memory).
I haven't dug deeper into batch inference but came here instead, and I see it is not supported yet? What I don't understand: there is a function to load multiple images, which are passed to the model with a single prompt, so I expected multiple outputs from one prompt and multiple images. However, the model generates one output even if I remove the index here.
I read through your papers but haven't found a case of multiple images passed to the model at once. Could you clarify and maybe comment on the batch inference issue?
For anyone interested in this issue, we collaborated with @haotian-liu and implemented a high-throughput inference server. You can find examples of batch processing here https://github.com/sgl-project/sglang/tree/main/benchmark/llava_bench, which can be 5x faster on LLaVA bench.
@merrymercy Thanks. I was about to try sglang, but the documentation in this repo mentions that the tokenizer needs to come from llava-hf on HF. Right now there is only the 7B for 1.6, none of the others. Is that intentional? How should one proceed? Thanks!
@haotian-liu Related, I notice if gradio server is hit with (say) 3 concurrent requests, the generation is about 3x slower. It would be nice to try sglang, but from your docs it seems we need to have those other tokenizers in llava-hf?
@pseudotensor Sorry for the confusion.
Tokenizers (temporary): llava-hf/llava-1.5-7b-hf, llava-hf/llava-1.5-13b-hf, liuhaotian/llava-v1.6-34b-tokenizer.
We'll update the full repo to remove the need for tokenizers soon.
Also, you would need to use SGLang for continuous batching; it does not show visible degradation in generation speed when batching.
Thanks, so just the 34B tokenizer for 1.6 for now. I expect you'd rather do the planned removal of the need for tokenizers than add the other ones.
@pseudotensor
Yep, we'll remove the need for doing so. Btw, the 7B/13B tokenizers are valid for 1.6 :)
Hi, I think for SGLang we need support for multi-round conversations for ICL / few-shot prompting. Can you provide a template for ICL in SGLang? Currently the example only covers single-turn inference.
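A rough sketch of what a multi-round prompt can look like with SGLang's frontend primitives, in case it helps; this is not an official template, and the endpoint, image path, questions, and token budget are placeholders:

import sglang as sgl

@sgl.function
def multi_round_qa(s, image_file, questions):
    # The first round carries the image; later rounds continue the same conversation.
    s += sgl.user(sgl.image(image_file) + questions[0])
    s += sgl.assistant(sgl.gen("round_0", max_tokens=256))
    for i, question in enumerate(questions[1:], start=1):
        s += sgl.user(question)
        s += sgl.assistant(sgl.gen(f"round_{i}", max_tokens=256))

# Assumes an SGLang LLaVA worker is already running on port 30000.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

state = multi_round_qa.run(
    image_file="example.jpg",
    questions=["Describe the image.", "What objects are on the table?"],
)
print(state["round_0"])
print(state["round_1"])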
Need someone real hardcore to answer my question haha 👍
Question
I saw in the README that those models are supported; I have two questions: (1) What if I want to use llava-v1.6-vicuna-13b or any other LLaVA model, is that possible? Thanks! (2) After I fine-tune or LoRA the existing models, how can I do batch inference with them, since in SGLang it looks like I need --model-path and --tokenizer-path:
python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000
python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.6-vicuna-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000
python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.6-34b --tokenizer-path liuhaotian/llava-v1.6-34b-tokenizer --port 3000
Is there any plan to make batch inference available with standard Hugging Face code? I am trying to use serverless GPUs, so running sgl's inference server is not going to work.
Right now when I attempt to batch inference, it only runs inference over the first image in the batch repeatedly. For example, in the code below, the output is the same every time:
async def completion_stream(self, user_questions, images_data):
    hf_logging.set_verbosity_info()
    # Prepare batch inputs
    batch_inputs = []
    for user_question, image_data in zip(user_questions, images_data):
        image = Image.open(BytesIO(image_data))
        prompt = f"system\nAnswer the questions.user\n<image>\n{user_question}assistant\n"
        inputs = self.processor(prompt, image, return_tensors="pt")
        batch_inputs.append(inputs['input_ids'])
    # Concatenate all input_ids in a batch
    batch_input_ids = torch.cat(batch_inputs, dim=0).to("cuda:0")
    # Perform batch inference
    output = self.model.generate(input_ids=batch_input_ids, max_new_tokens=1536)
    # Decode each output in the batch and yield word by word
    for o in output:
        answer = self.processor.decode(o, skip_special_tokens=True)
        words = answer.split()
        for word in words:
            yield word + ' '
        yield '\n'
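For what it's worth, a rough sketch of batched generation with the plain transformers API and an llava-hf checkpoint (the checkpoint name, prompt template, image paths, and token budget are assumptions; note that pixel_values have to be passed along with the padded input_ids, which the snippet above does not do):

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda:0")

# Left padding so generated tokens line up at the end of each row.
processor.tokenizer.padding_side = "left"

images = [Image.open(p) for p in ["cat.jpg", "dog.jpg"]]  # placeholder paths
prompts = [
    "USER: <image>\nWhat is in this image? ASSISTANT:",
    "USER: <image>\nWhat color is the dog? ASSISTANT:",
]

# The processor builds input_ids, attention_mask, and pixel_values for the whole batch.
inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt").to(
    model.device, torch.float16
)

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

for text in processor.batch_decode(output_ids, skip_special_tokens=True):
    print(text)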
At the bare minimum, a much more detailed and well-explained example of how to run batching on sglang would be extremely helpful for running and reverse-engineering the code.
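In that spirit, a minimal sketch of batch captioning against a running SGLang LLaVA worker, following the pattern in the llava_bench benchmark linked above (endpoint URL, image paths, questions, and token budget are placeholders):

import sglang as sgl

@sgl.function
def image_qa(s, image_file, question):
    # One image plus one question per request; SGLang batches them server-side.
    s += sgl.user(sgl.image(image_file) + question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=128))

# Assumes a worker started with: python3 -m sglang.launch_server --model-path ... --port 30000
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

arguments = [
    {"image_file": "images/0001.jpg", "question": "Describe the image in detail."},
    {"image_file": "images/0002.jpg", "question": "Describe the image in detail."},
]
states = image_qa.run_batch(arguments, temperature=0, progress_bar=True)
for state in states:
    print(state["answer"])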
Has anyone attempted batch inference without sglang? I am noticing that batch size affects the output. It looks like batch size impacts the preprocessing and padding of the input tokenized sequences. When you use a batch size > 1, all the token sequences are padded with 0 to the same length. LLaVA doesn't understand this padding; has anyone tried a workaround?
@Vignesh-Valaboju Hi, same problem, have you solved this issue?
@Vignesh-Valaboju @dacian7 #269 provides a feasible solution for this: change the padding side with tokenizer.padding_side = "left", and modify KeywordsStoppingCriteria so that it supports batch inference.
@XuGW-Kevin I'd like to know in which file and on which line to set tokenizer.padding_side = "left", and in which file to modify KeywordsStoppingCriteria. Thank you.
Add model.llm.config.tokenizer_padding_side = "left" on any line. KeywordsStoppingCriteria is in llava/mm_utils.py.
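For anyone following along, a rough sketch of a batch-aware keyword stopping criterion in the spirit of KeywordsStoppingCriteria from llava/mm_utils.py; written as an illustration, not the repo's exact code:

import torch
from transformers import StoppingCriteria

class BatchKeywordsStoppingCriteria(StoppingCriteria):
    """Stop when every sequence in the batch has emitted one of the keywords."""

    def __init__(self, keywords, tokenizer, input_ids):
        self.keywords = keywords
        self.tokenizer = tokenizer
        self.start_len = input_ids.shape[1]  # prompt length (the batch is padded to this)
        self.keyword_ids = []
        self.max_keyword_len = 0
        for keyword in keywords:
            ids = tokenizer(keyword).input_ids
            # Drop a leading BOS token if the tokenizer added one.
            if len(ids) > 1 and ids[0] == tokenizer.bos_token_id:
                ids = ids[1:]
            self.max_keyword_len = max(self.max_keyword_len, len(ids))
            self.keyword_ids.append(torch.tensor(ids))

    def _sequence_done(self, output_ids):
        # output_ids: (1, seq_len) for one sequence of the batch.
        offset = min(output_ids.shape[1] - self.start_len, self.max_keyword_len)
        for keyword_id in self.keyword_ids:
            keyword_id = keyword_id.to(output_ids.device)
            if (output_ids[0, -keyword_id.shape[0]:] == keyword_id).all():
                return True
        decoded = self.tokenizer.batch_decode(output_ids[:, -offset:], skip_special_tokens=True)[0]
        return any(keyword in decoded for keyword in self.keywords)

    def __call__(self, output_ids, scores, **kwargs):
        # generate() stops only when all sequences in the batch are finished.
        return all(
            self._sequence_done(output_ids[i].unsqueeze(0)) for i in range(output_ids.shape[0])
        )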
@XuGW-Kevin How did you support batch inference? Even though I updated KeywordsStoppingCriteria using the issue you linked, I can't add model.llm.config.tokenizer_padding_side = "left"; it says AttributeError: 'LlavaLlamaForCausalLM' object has no attribute 'llm'. I then simply added tokenizer.padding_side = "left" as you suggested @CongYep.
This is my code:
for prompt in prompts_batch:
    # Set args.query to the specific prompt in the batch
    args.query = prompt
    # Generate the prompt for each input in the batch, with the correct image handling
    qs = get_prompt(args, model)
    # Create a new conversation template for each prompt in the batch
    conv = conv_templates[args.conv_mode].copy()
    conv.append_message(conv.roles[0], qs)
    conv.append_message(conv.roles[1], None)
    # Add the complete prompt for this instance to the batch
    batched_prompts.append(conv.get_prompt())

tokenizer.padding_side = "left"

# Tokenize the batch of prompts
tokenized_prompts = [
    tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0)
    for prompt in batched_prompts
]
input_ids = torch.cat(tokenized_prompts, dim=0).cuda()

# Process images if provided (batch image loading and processing)
if img_files_batch:
    # For each batch, parse image files, load them, and process
    image_files_batch = [image_parser(img_files, args.sep) for img_files in img_files_batch]
    images = [load_images(image_files) for image_files in image_files_batch]
    flat_images = [item for sublist in images for item in sublist]
    images_tensor = process_images(flat_images, image_processor, model.config).to(model.device, dtype=torch.float16)
    image_sizes = [img.size for img in flat_images]
else:
    images_tensor = None
    image_sizes = None

attention_mask = torch.ones_like(input_ids)

with torch.inference_mode(), torch.cuda.amp.autocast():
    outputs = model.forward(
        input_ids=input_ids,
        images=None if images_tensor is None else images_tensor,
        image_sizes=image_sizes,
        attention_mask=attention_mask,
    )

logits = outputs.logits[:, -1, :]  # Get the logits for the last token position
probabilities = F.softmax(logits, dim=-1).squeeze()
But it fails at the concatenation input_ids = torch.cat(tokenized_prompts, dim=0).cuda() with:
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 158 but got size 179 for tensor number 2 in the list.
I know the input_ids have different lengths, but how do I pad them equally so that inference can happen in a batch?
@Vignesh-Valaboju I did left padding like this
# left padding
def left_pad_sequence_to_max_length(sequence, max_length, padding_value=0):
    """Pad a sequence to the desired max length."""
    if len(sequence) >= max_length:
        return sequence
    return torch.cat([torch.full((max_length - len(sequence),), padding_value, dtype=sequence.dtype), sequence])

# Tokenize the batch of prompts
tokenized_prompts = [
    tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0)
    for prompt in batched_prompts
]

# Determine the maximum length of input_ids in the batch
max_len = max([len(tokenized_prompt.squeeze()) for tokenized_prompt in tokenized_prompts])

# Pad the input_ids to the maximum length
padded_tokenized_ids = [left_pad_sequence_to_max_length(tokenized_prompt.squeeze(), max_len) for tokenized_prompt in tokenized_prompts]
batched_input_ids = torch.stack(padded_tokenized_ids).to(model.device)
attention_mask = torch.ones_like(batched_input_ids)
and pass attention_mask to model.generate(), but results look completely wrong. Did it work for you?
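One likely culprit in the snippet above (a guess from the code shown): attention_mask is all ones, so the padded positions are attended to, and padding_value=0 can also collide with a real token id. A sketch of building the mask from the true sequence lengths instead, reusing the variable names from that snippet:

import torch

# Pad with the tokenizer's pad id where available (assumption: fall back to 0 otherwise).
pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 0
padded_tokenized_ids = [
    left_pad_sequence_to_max_length(t.squeeze(), max_len, padding_value=pad_id)
    for t in tokenized_prompts
]
batched_input_ids = torch.stack(padded_tokenized_ids).to(model.device)

# Zero out attention on the left padding; real tokens sit at the right after left padding.
attention_mask = torch.zeros_like(batched_input_ids)
for i, t in enumerate(tokenized_prompts):
    attention_mask[i, -t.squeeze().shape[0]:] = 1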
Update: Batch evaluation is supported with SGLang.
Batch eval example: https://github.com/sgl-project/sglang/tree/main/benchmark/llava_bench, which can be 5x faster on LLaVA bench.
Continuous batching for serving: https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#launch-a-sglang-worker
Batch eval has been one of the most wanted features, and I have tried to create one here in the dev branch. Currently, we have identified an issue with the script above that can cause NaN in the generation. We'll use this issue to track the status of batch evaluation.