haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

Batch evaluation #754

Closed haotian-liu closed 9 months ago

haotian-liu commented 1 year ago

Update: Batch evaluation is supported with SGLang.

Batch eval example: https://github.com/sgl-project/sglang/tree/main/benchmark/llava_bench, which can be 5x faster on LLaVA bench.

Continuous batching for serving: https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#launch-a-sglang-worker
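For reference, a minimal sketch of batched querying against a launched SGLang worker, adapted from the examples linked above (the image paths and questions are placeholders, and the exact SGLang API may differ between versions):

import sglang as sgl

# Assumes an SGLang worker for LLaVA is already running on port 30000
# (see the "Launch a SGLang Worker" section of the README linked above).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def image_qa(s, image_path, question):
    s += sgl.user(sgl.image(image_path) + question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=256))

# run_batch submits all requests at once and lets the server batch them.
states = image_qa.run_batch(
    [
        {"image_path": "images/example1.jpg", "question": "Describe this image."},
        {"image_path": "images/example2.jpg", "question": "What is the main object?"},
    ]
)
for state in states:
    print(state["answer"])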


Batch eval has been one of the most requested features, and I have tried to create one here in the dev branch. Currently, we have identified an issue with the script above that can cause NaN values during generation.

We'll use this issue to track the status of the batch evaluation.

shams2023 commented 1 year ago

Besides batch inference for VQA tasks, can batch inference be performed on image caption tasks?

haotian-liu commented 1 year ago

@shams2023 Yes we will also support that.

shams2023 commented 1 year ago

> @shams2023 Yes we will also support that.

I really need this feature to generate text descriptions for my low-resolution image dataset. I have used BLIP (after fine-tuning), but the results are not good, so I need to try this.

tweeter0830 commented 1 year ago

This would be really great to have! Thank you for working on it.

HenryHZY commented 1 year ago

@haotian-liu Hi, batch evaluation is really important for VQAv2, which takes too much time. I think issue #675 has provided a solution for batch image captioning without evaluation. It would be great if you could support batch image captioning evaluation (e.g., CIDEr) for some traditional tasks (e.g., COCO and NoCaps).

rabiulcste commented 1 year ago

I understand that the generate() function is not behaving as intended! I did some profiling a few weeks ago.

dongzhiwu commented 1 year ago

Hello, is there any new progress? Besides batch inference for VQA tasks, can batch inference be performed on image caption tasks?

BrainWWW commented 12 months ago

I found in my experiments that with the same question for every image, e.g. "describe the image", the answers are basically consistent across different batch sizes. However, when each image uses a different question, the model outputs strange answers. May I ask why this problem occurs? Thank you so much!

{"question_id": 0, "prompt": "What is the color of the two suitcases in the image?", "text": "The color of the two suitcases in the image is black.", "answer_id": "2sdWELixEN6LQ7BRen73b6", "model_id": "llava-v1.5-13b", "metadata": {}} {"question_id": 1, "prompt": "Analyze the image in a comprehensive and detailed manner.", "text": "The image features a close-up of a young man's face, with a focus on his eyes and lips. The man appears to be looking at the camera, and his eyes are slightly open. The image is a digital drawing or illustration, capturing the man's facial features in detail. The background is white, which further emphasizes the subject's face and expression.", "answer_id": "LcavJhTSXJLvaCaXUBvk9f", "model_id": "llava-v1.5-13b", "metadata": {}} {"question_id": 2, "prompt": "What potential factors could make these suitcases valuable?", "text": "The two-pure-tubes-and-barges-in-the-cows-and-barges-in-the-cows-and-barges-in the-cows-and-barges-in the-cows-and-barges-in the-cows-and-barges-in the-cows-and-barges-in the-cows-and-barges-in the-cows-and-barges-in the-cows-and-barges-in the", "answer_id": "RfX2ZosPS5npuMxrGszWkV", "model_id": "llava-v1.5-13b", "metadata": {}} {"question_id": 3, "prompt": "What are the main objects on the table in the image?", "text": "The, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and, the, and,", "answer_id": "k9Lzv2yd5cj5684fRM9EDF", "model_id": "llava-v1.5-13b", "metadata": {}} {"question_id": 4, "prompt": "Describe the following image.", "text": "A black and white photo of a person wearing a black and white shirt. The person is wearing a black and white shirt.", "answer_id": "mJVswmZ4ctREnnwp6nnaUB", "model_id": "llava-v1.5-13b", "metadata": {}} {"question_id": 5, "prompt": "What activity might have recently taken place around this table and what could be its aftermath?", "text": "It is likely that a recent activity involving the use of the two broken wooden chopsticks took place around the table. The chopsticks are now broken and lying on the table, which suggests that they were used for eating or cooking and have since broken during the process. The aftermath of this situation could be that the person using the chopsticks might need to find an alternative method to eat or cook, as the broken chopsticks are no longer functional. Additionally, the broken chopsticks may pose a safety hazard if not properly disposed of, as they could cause injury if", "answer_id": "NaP9Dq8n2NRSMQrg69YUmN", "model_id": "llava-v1.5-13b", "metadata": {}} {"question_id": 6, "prompt": "What is the main focus of the image?", "text": "The main focus of the image is the two-pure-white-fie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie-pie-tie", "answer_id": "FvY25Pqw6zued6Abeirwoh", "model_id": "llava-v1.5-13b", "metadata": {}} {"question_id": 7, "prompt": "What is this photo about'?", "text": "The image is a photo of a person's feet, which are the most visible part of the person's body. 
The person's feet are the main focus of the photo, and the person's body is the most visible part of the photo.", "answer_id": "jGL4pqaVXXYL3driouLDrW", "model_id": "llava-v1.5-13b", "metadata": {}} {"question_id": 8, "prompt": "What could be a reason for the cat's interest in the laptop?", "text": "The cat's interest in the laptop, which is a part of the laptop, is a result of the cat's natural, and the cat's natural, and the cat's natural, and the cat's natural, and the cat's natural, and the cat's natural, and the cat's natural, and the cat's natural, and the cat's natural, and the cat's natural, and the cat's natural, and the cat's natural, and the cat's natural, and the cat's natural, and the cat's natural, and the cat", "answer_id": "jawMBaRhvyz98K93QRqPap", "model_id": "llava-v1.5-13b", "metadata": {}} {"question_id": 9, "prompt": "What color is the dog in the image?", "text": "The dog in the image is a black dog, and the person is a white person.", "answer_id": "BHNjYWe8sVuPGLj4ikBG78", "model_id": "llava-v1.5-13b", "metadata": {}} """

shams2023 commented 12 months ago

> Hello, is there any new progress? Besides batch inference for VQA tasks, can batch inference be performed on image caption tasks?

Have you completed the task for image captions?

yanbai1993 commented 11 months ago

Hi, when I use multiple GPUs for multi-batch inference, I encounter the following error. However, when performing multi-batch inference on a single GPU (develop branch), or single-batch inference on multiple GPUs (main branch), I do not encounter this problem. I look forward to your reply.

Traceback (most recent call last):
  File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-perception/baiyan02/conda_env/llava_env/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-perception/baiyan02/conda_env/llava_env/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workdir/baiyan02/2071702/70b4c7f21b349f3a43f519b683820c89/llava/eval/model_vqa_batch.py", line 177, in <module>
    eval_model(args)
  File "/workdir/baiyan02/2071702/70b4c7f21b349f3a43f519b683820c89/llava/eval/model_vqa_batch.py", line 133, in eval_model
    output_ids = model.generate(
  File "/mnt/dolphinfs/hdd_pool/docker/user/hadoop-perception/baiyan02/conda_env/llava_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workdir/baiyan02/2071702/70b4c7f21b349f3a43f519b683820c89/llava/model/language_model/llava_llama.py", line 121, in generate
    ) = self.prepare_inputs_labels_for_multimodal(
  File "/workdir/baiyan02/2071702/70b4c7f21b349f3a43f519b683820c89/llava/model/llava_arch.py", line 183, in prepare_inputs_labels_for_multimodal
    cur_new_input_embeds = torch.cat(cur_new_input_embeds)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument tensors in method wrapper_CUDA_cat)

![2023-11-27 13-16-37 screenshot](https://github.com/haotian-liu/LLaVA/assets/15922438/2cc2da43-f874-42e4-a02d-820f12d5c165)

ChantalMP commented 10 months ago

For me, batch evaluation also gave NaN values after fine-tuning. Running the evaluation with bfloat16 instead of float16 solved this for me (I also fine-tuned with bf16 True). With this change, batch size 1 and larger batch sizes give very similar results.
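Not from the repo, just a sketch of one way to run the evaluation in bfloat16 (the builder loads in float16 by default, so casting after loading is the simplest route; the checkpoint path is a placeholder):

import torch
from llava.mm_utils import get_model_name_from_path, process_images
from llava.model.builder import load_pretrained_model

model_path = "liuhaotian/llava-v1.5-13b"  # placeholder checkpoint
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
)

# Cast the whole model to bfloat16 for evaluation.
model = model.to(dtype=torch.bfloat16)

# Image tensors fed to generate() must use the matching dtype, e.g.:
# images_tensor = process_images(images, image_processor, model.config).to(
#     model.device, dtype=torch.bfloat16)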

hkristof03 commented 10 months ago

@haotian-liu thank you for this amazing work. I just started getting familiar with this repository recently. I would like to point out a few things and also ask a question. You provide an example here, where you load the model once

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)

and then load it again when calling eval_model(args), so the model is loaded twice into memory.

I haven't dug deeper into batch inference but came here instead, and I see it is not supported yet? What I don't understand is that there is a function to load multiple images, which are then passed to the model with a single prompt. I expected this to generate multiple outputs from one prompt and multiple images; however, the model generates only one output, even if I remove the index here.

I read through your papers but haven't found a case of multiple images passed to the model at once. Could you clarify and maybe comment on the batch inference issue?

merrymercy commented 9 months ago

For anyone interested in this issue, we collaborated with @haotian-liu and implemented a high-throughput inference server. You can find examples of batch processing here https://github.com/sgl-project/sglang/tree/main/benchmark/llava_bench, which can be 5x faster on LLaVA bench.

pseudotensor commented 9 months ago

@merrymercy Thanks. I was about to try SGLang, but the documentation in this repo mentions that the tokenizer needs to come from llava-hf on HF. Right now there is only the 7B, and none of the others for 1.6. Is that intentional? How should one proceed? Thanks!

pseudotensor commented 9 months ago

@haotian-liu Related: I notice that if the Gradio server is hit with (say) 3 concurrent requests, generation is about 3x slower. It would be nice to try SGLang, but from your docs it seems we need those other tokenizers in llava-hf?

haotian-liu commented 9 months ago

@pseudotensor Sorry for the confusion.

Tokenizers (temporary): llava-hf/llava-1.5-7b-hf, llava-hf/llava-1.5-13b-hf, liuhaotian/llava-v1.6-34b-tokenizer.

We'll update the full repo to remove the need for tokenizers soon.

Also, you would need to use SGLang for continuous batching, and it does not have visible degradation in generation speed when batching.

pseudotensor commented 9 months ago

> @pseudotensor Sorry for the confusion.
>
> Tokenizers (temporary): llava-hf/llava-1.5-7b-hf, llava-hf/llava-1.5-13b-hf, liuhaotian/llava-v1.6-34b-tokenizer.
>
> We'll update the full repo to remove the need for tokenizers soon.
>
> Also, you would need to use SGLang for continuous batching, and it does not have visible degradation in generation speed when batching.

Thanks. So just the 34B tokenizer for 1.6 for now. I expect you'd rather do the planned removal of the need for tokenizers than add the other tokenizers.

haotian-liu commented 9 months ago

@pseudotensor

Yep, we'll remove the need to do so. Btw, the 7B/13B tokenizers are valid for 1.6 :)

fisher75 commented 7 months ago

Hi, I think for SGLang we need support for multi-round conversations for ICL, or for few-shot prompting. Can you provide a template for ICL in SGLang? Currently the example only shows single-turn inference.

fisher75 commented 7 months ago

Need someone real hardcore to answer my question haha 👍

Question

I saw the README says the models below are supported, and I have two questions: (1) What if I want to use llava-v1.6-vicuna-13b or any other LLaVA model, is that possible? Thanks! (2) After I fine-tune or LoRA-tune the existing models, how can I do batch inference with them, since in SGLang it looks like I need --model-path and --tokenizer-path to do batch inference?

LLaVA
python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000
python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.6-vicuna-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000
python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.6-34b --tokenizer-path liuhaotian/llava-v1.6-34b-tokenizer --port 3000

dgarnitz commented 7 months ago

Is there any plan to make batch inferencing available with standard Hugging Face code? I am trying to use serverless GPUs, so running SGLang's inference server is not going to work.

Right now when I attempt batch inference, it only runs inference over the first image in the batch repeatedly. For example, in the code below, the output is the same every time:

from io import BytesIO

import torch
from PIL import Image
from transformers.utils import logging as hf_logging


async def completion_stream(self, user_questions, images_data):
    hf_logging.set_verbosity_info()

    # Prepare batch inputs
    batch_inputs = []
    for user_question, image_data in zip(user_questions, images_data):
        image = Image.open(BytesIO(image_data))
        prompt = f"system\nAnswer the questions.user\n<image>\n{user_question}assistant\n"
        inputs = self.processor(prompt, image, return_tensors="pt")
        # NOTE: only input_ids are kept here; the pixel_values returned by the
        # processor are never passed to generate().
        batch_inputs.append(inputs['input_ids'])

    # Concatenate all input_ids in a batch
    # (this only works when every prompt tokenizes to the same length)
    batch_input_ids = torch.cat(batch_inputs, dim=0).to("cuda:0")

    # Perform batch inference
    output = self.model.generate(input_ids=batch_input_ids, max_new_tokens=1536)

    # Decode each output in the batch and yield word by word
    for o in output:
        answer = self.processor.decode(o, skip_special_tokens=True)
        words = answer.split()
        for word in words:
            yield word + ' '
        yield '\n'

At the bare minimum, a much more detailed and well-explained example of how to run batching on SGLang would be extremely helpful for running and reverse-engineering the code.
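For what it's worth, here is a hedged sketch of what padded batching with the HF-ported llava-hf checkpoints can look like (not this repo's code; the image files are placeholders, and the prompt template and processor details may differ across transformers versions):

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # HF-ported checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda:0")

# Left padding so every sequence ends right where generation should start.
processor.tokenizer.padding_side = "left"

prompts = [
    "USER: <image>\nWhat is shown in this image? ASSISTANT:",
    "USER: <image>\nDescribe the image in detail. ASSISTANT:",
]
images = [Image.open("a.jpg"), Image.open("b.jpg")]  # placeholder files

# padding=True produces batched input_ids/attention_mask; pixel_values are
# stacked for all images and passed to generate() together.
inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt").to(
    model.device, torch.float16
)

output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
answers = processor.batch_decode(output_ids, skip_special_tokens=True)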

Vignesh-Valaboju commented 5 months ago

Has anyone attempted batch inference without SGLang? I am noticing that batch size affects the output. It looks like batch size impacts the preprocessing and padding of the tokenized input sequences. When you use a batch size > 1, all the token sequences are padded with 0 to the same length. LLaVA doesn't understand this padding. Has anyone tried a workaround?

dacian7 commented 4 months ago

> Has anyone attempted batch inference without SGLang? I am noticing that batch size affects the output. It looks like batch size impacts the preprocessing and padding of the tokenized input sequences. When you use a batch size > 1, all the token sequences are padded with 0 to the same length. LLaVA doesn't understand this padding. Has anyone tried a workaround?

@Vignesh-Valaboju Hi, same problem, have you solved this issue?

XuGW-Kevin commented 3 months ago

@Vignesh-Valaboju @dacian7 #269 provides a feasible solution for this: change the padding side with tokenizer.padding_side = "left", and modify KeywordsStoppingCriteria so that it supports batch inference.
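A rough sketch of what those two changes could look like, a batch-aware variant of KeywordsStoppingCriteria that only stops once every sequence in the batch has produced a keyword (the class name and details here are illustrative, not the repo's actual implementation):

import torch
from transformers import StoppingCriteria

# Left padding as suggested above; LLaMA tokenizers often lack a pad token.
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.unk_token

class BatchKeywordsStoppingCriteria(StoppingCriteria):
    """Illustrative batch-aware variant of LLaVA's KeywordsStoppingCriteria."""

    def __init__(self, keywords, tokenizer, input_ids):
        self.keywords = keywords
        self.tokenizer = tokenizer
        self.start_len = input_ids.shape[1]  # prompt length of the (left-padded) batch

    def __call__(self, output_ids: torch.LongTensor, scores, **kwargs) -> bool:
        # Decode only the newly generated tokens of every sequence and stop
        # once each of them contains one of the keywords.
        new_texts = self.tokenizer.batch_decode(
            output_ids[:, self.start_len:], skip_special_tokens=True
        )
        return all(any(kw in text for kw in self.keywords) for text in new_texts)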

CongYep commented 3 months ago

@XuGW-Kevin I'd like to know in which file and on which line to set tokenizer.padding_side = "left", and in which file to modify KeywordsStoppingCriteria? Thank you.

XuGW-Kevin commented 3 months ago

Add on any line: model.llm.config.tokenizer_padding_side = "left". KeywordsStoppingCriteria is in llava/mm_utils.py.

copperwiring commented 1 month ago

@XuGW-Kevin How did you support batch inference? Even though I updated KeywordsStoppingCriteria using the issue you linked, I can't add model.llm.config.tokenizer_padding_side = "left".

It says AttributeError: 'LlavaLlamaForCausalLM' object has no attribute 'llm'. I then simply added tokenizer.padding_side = "left" as you suggested, @CongYep.

This is my code:

    batched_prompts = []  # complete prompts collected for the whole batch
    for prompt in prompts_batch:
        # Set args.query to the specific prompt in the batch
        args.query = prompt

        # Generate the prompt for each input in the batch, with the correct image handling
        qs = get_prompt(args, model)

        # Create a new conversation template for each prompt in the batch
        conv = conv_templates[args.conv_mode].copy()
        conv.append_message(conv.roles[0], qs)
        conv.append_message(conv.roles[1], None)

        # Add the complete prompt for this instance to the batch
        batched_prompts.append(conv.get_prompt())

    tokenizer.padding_side = "left"
    # Tokenize the batch of prompts
    tokenized_prompts = [
        tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0)
        for prompt in batched_prompts
    ]

    input_ids = torch.cat(tokenized_prompts, dim=0).cuda()

    # Process images if provided (batch image loading and processing)
    if img_files_batch:
        # For each batch, parse image files, load them, and process
        image_files_batch = [image_parser(img_files, args.sep) for img_files in img_files_batch]
        images = [load_images(image_files) for image_files in image_files_batch]
        flat_images = [item for sublist in images for item in sublist]
        images_tensor = process_images(flat_images, image_processor, model.config).to(model.device, dtype=torch.float16)
        image_sizes = [img.size for img in flat_images]
    else:
        images_tensor = None
        image_sizes = None

    attention_mask = torch.ones_like(input_ids)

    with torch.inference_mode(), torch.cuda.amp.autocast():
        outputs = model.forward(
            input_ids=input_ids, 
            images=None if images_tensor is None else images_tensor,
            image_sizes=image_sizes,
            attention_mask=attention_mask
            )

    logits = outputs.logits[:, -1, :]  # Get the logits for the last token position
    probabilities = F.softmax(logits, dim=-1).squeeze()

But it won't do the concatenation at input_ids = torch.cat(tokenized_prompts, dim=0).cuda()

Error: RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 158 but got size 179 for tensor number 2 in the list. I know the input ids have different lengths, but then how do I pad them equally so that inference can happen in a batch?

copperwiring commented 1 month ago

@Vignesh-Valaboju I did left padding like this:

# left padding
def left_pad_sequence_to_max_length(sequence, max_length, padding_value=0):
    """Pad a sequence to the desired max length."""
    if len(sequence) >= max_length:
        return sequence
    return torch.cat([torch.full((max_length - len(sequence),), padding_value, dtype=sequence.dtype), sequence])

    # Tokenize the batch of prompts
    tokenized_prompts = [
        tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0)
        for prompt in batched_prompts
    ]

    # Determine the maximum length of input_ids in the batch
    max_len = max([len(tokenized_prompt.squeeze()) for tokenized_prompt in tokenized_prompts])

    # Pad the input_ids to the maximum length
    padded_tokenized_ids= [left_pad_sequence_to_max_length(tokenized_prompt.squeeze(), max_len) for tokenized_prompt in tokenized_prompts]
    batched_input_ids = torch.stack(padded_tokenized_ids).to(model.device)
    attention_mask = torch.ones_like(batched_input_ids)

and pass attention_mask to model.generate(), but the results look completely wrong. Did it work for you?
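One possible culprit in the snippet above (just a guess): the pad value is hard-coded to 0 and the attention mask is all ones, so the model attends to the padded positions. A sketch that pads with the tokenizer's pad id and masks the padding instead (variable names follow the snippet above; illustrative only):

import torch

pad_id = tokenizer.pad_token_id
if pad_id is None:  # LLaMA tokenizers often have no pad token
    pad_id = tokenizer.unk_token_id

seqs = [t.squeeze(0) for t in tokenized_prompts]
max_len = max(len(s) for s in seqs)

padded, masks = [], []
for s in seqs:
    n_pad = max_len - len(s)
    # Left-pad the ids with pad_id and mark padded positions with 0 in the mask.
    padded.append(torch.cat([torch.full((n_pad,), pad_id, dtype=s.dtype), s]))
    masks.append(torch.cat([torch.zeros(n_pad, dtype=torch.long),
                            torch.ones(len(s), dtype=torch.long)]))

batched_input_ids = torch.stack(padded).to(model.device)
attention_mask = torch.stack(masks).to(model.device)

# Pass both batched_input_ids and attention_mask (plus the image tensors)
# to model.generate() / model.forward().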