haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0
19.29k stars 2.12k forks

[Usage] Batch inference does not work with llava-v1.6-vicuna-7b #1149

Open Luciennnnnnn opened 7 months ago

Luciennnnnnn commented 7 months ago

Describe the issue

Below is a snippet of my code, I want to generate captions for my images.

def gen_image_caption(self, imgs, temperature=0.2, top_p=0.7, num_beams=1, qs=None, max_new_tokens=512, batch_size=8, image_aspect_ratio=None):
    '''
    imgs: [PIL.Image, ...]
    '''
    image_sizes = [x.size for x in imgs]

    images_tensor = process_images(
        imgs,
        self.image_processor,
        self.model.config,
        image_aspect_ratio=image_aspect_ratio)

    # With anyres, process_images returns a list of per-image tensors;
    # otherwise it returns one stacked tensor.
    if isinstance(images_tensor, list):
        images_tensor = [x.to(self.device, dtype=torch.float16) for x in images_tensor]
        num_img = len(images_tensor)
    else:
        images_tensor = images_tensor.to(self.device, dtype=torch.float16)
        num_img = images_tensor.shape[0]

    if batch_size == -1:
        batch_size = num_img  # .shape[0] would fail when images_tensor is a list

    with torch.inference_mode():
        outputs = []
        for i in range(0, num_img, batch_size):
            # len() gives the batch size for both a list slice and a tensor slice
            bs = len(images_tensor[i : i + batch_size])
            input_ids = self.input_ids.repeat(bs, 1)
            output_ids = self.model.generate(
                input_ids,
                images=images_tensor[i : i + batch_size],
                image_sizes=image_sizes[i : i + batch_size],
                do_sample=temperature > 0,
                temperature=temperature,
                top_p=top_p,
                num_beams=num_beams,
                max_new_tokens=max_new_tokens,
                use_cache=True)
            outputs += self.tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    return outputs

The imgs argument I provide is a list of PIL Images, like: images = [Image, Image, Image, ...]

When I call gen_image_caption(images, batch_size=1), everything is OK and the captions are generated correctly. However, when I call gen_image_caption(images, batch_size=8), the resulting captions look incorrect and strange:

outputs=['nobody, 1, 1,. nobody,.\n nobody,.,,,.,.,.,.,.....,.,..\n,.,.,.,.,.,..,.,.,.,.,...\n,....,..,.,..,.\n,., 1..,0, 1. 1. 1. 1. 1. 1. 1. 1. 1, 1, 1, 1. 1. 1. 1. 1, 1,0,0, 1. 1. 1. 1, 1,0, 1, 1. 1. 1, 1. 1. 1,0,0,0, 1. 1. 1. 1. 1,0, 1,0, 1,0,0,0,0, 1.0,. 1. 1. 1,0,0. 1, 1.0, 1.0, 1, 1. 1. 1, 1. 1,.0, 1, 1,. 1,0,.0,, 1,,, 1, 1, 1, 1, 1,0, 1. 1, 1.0, 1, 1, 1.0, 1. 1, 1, 1, 1, 1. 1, 1. 1, 1, 1, 1. 1, 1. 1, 1, 1. 1, 1, 1 1, 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ', 'nobody', 'nobody', 'nobody', ',', 'The image presents a vibrant and colorful poster, divided into two distinct halves. The left half is dominated by a cartoon character, a cheerful figure with a blue hat and a green shirt, waving enthusiastically. The character is surrounded by a variety of objects and scenes, including a car, a truck, a construction site, and a factory, all rendered in a playful and cartoonish style.\n\nThe right half of the poster is a stark contrast, featuring a realistic depiction of a city skyline. The buildings, rendered in shades of gray and brown, stand tall against a backdrop of a clear blue sky. The cityscape is punctuated by a red and white warning sign, a stark reminder of safety precautions.\n\nThe poster is rich in text, with Chinese characters scattered throughout, adding a layer of complexity to the image. The characters are likely related to the content of the poster, possibly providing information or instructions.\n\nThe overall style of the poster is a blend of cartoon and realism, with the left half being a whimsical cartoon and the right half a more realistic depiction of a city. The use of color and the inclusion of both cartoon and realistic elements create a visually engaging and informative piece.', 'everybody', ',0,']

Note that there is one correct caption in the batch, while the others are wrong.

Additionally, if I use gen_image_caption(images, batch_size=8, image_aspect_ratio='pad'), the results are correct. The problem seems related to the anyres image aspect ratio mode.
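My hedged guess at why (the tiling rule below is invented for illustration and is not taken from LLaVA's code): under anyres, each image is split into a base view plus a variable number of tiles depending on its aspect ratio, while 'pad' turns every image into exactly one square crop. Something like:

```python
import math

# Toy illustration: 'pad' yields a fixed crop count per image, while an
# anyres-style scheme yields a count that depends on the image size.
# The tiling rule here is invented for illustration only.
def crops_per_image(width, height, mode, tile=336):
    if mode == "pad":
        return 1  # every image becomes one square crop
    # anyres-style: one base view plus one tile per grid cell
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    return 1 + cols * rows

sizes = [(672, 336), (336, 672), (1008, 336)]
print([crops_per_image(w, h, "pad") for w, h in sizes])     # [1, 1, 1]
print([crops_per_image(w, h, "anyres") for w, h in sizes])  # [3, 3, 4]
```

With equal crop counts the batch stacks into a single tensor; with unequal counts it stays a list, and naive batching could misalign which visual tokens belong to which prompt.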

Luciennnnnnn commented 7 months ago

@haotian-liu cc

lixiaotong97 commented 7 months ago

I also ran into this problem: with anyres, only one caption per batch is correct.

annopackage commented 6 months ago

How did you set padding_side and the attention_mask? These may affect batch inference.
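For illustration, a minimal sketch of left padding with an explicit attention mask (the token ids and pad id below are made up):

```python
# Sketch: left-pad variable-length prompts and build the matching attention
# mask. With right padding (the default in many tokenizers), pad tokens end
# up between the prompt and the newly generated tokens during generate(),
# which can corrupt batched decoding.
def left_pad_batch(seqs, pad_id):
    max_len = max(len(s) for s in seqs)
    input_ids, attention_mask = [], []
    for s in seqs:
        n_pad = max_len - len(s)
        input_ids.append([pad_id] * n_pad + list(s))
        attention_mask.append([0] * n_pad + [1] * len(s))
    return input_ids, attention_mask

ids, mask = left_pad_batch([[5, 6, 7], [8, 9]], pad_id=0)
print(ids)   # [[5, 6, 7], [0, 8, 9]]
print(mask)  # [[1, 1, 1], [0, 1, 1]]
# Convert both with torch.tensor(...) and pass them to
# model.generate(input_ids, attention_mask=mask, ...) so the model
# ignores the pad positions.
```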

Luciennnnnnn commented 6 months ago

Hi @annopackage, what do you mean by padding_side and attention_mask? In the code above I do not specify these two arguments, and I do not know what they refer to.

rohit-gupta commented 6 months ago

This feature would be useful

gehong-coder commented 4 months ago


Is this code correct? I want to achieve this function too. If you update the code, please tell me as well. Thanks!

bryanwong17 commented 1 month ago

Hi, could you kindly provide the full code for batch inference, given a batch of images and the question? Thank you!