haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Usage] LLaVA-v1.6 Generates Empty/Truncated Response #1097

Open WesleyHsieh0806 opened 7 months ago

WesleyHsieh0806 commented 7 months ago

Describe the issue

Issue:

Hi, I tried to evaluate LLaVA-v1.6 on Science-QA, but the model keeps generating empty responses as shown in the log. Did I miss something?

Prompt and Response (Empty String) [screenshot: science-qa-example]

=======prompt=======
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image>
Question: Which of the following organisms is the primary consumer in this food web?
Choices:
(A) sea otter
(B) kelp
(C) plainfin midshipman
(D) phytoplankton
Explain your answer in detail, putting the correct option letter in (), e.g., (A), (B), (C), (D), at the end of your response.

Context: ```Below is a food web from an ocean ecosystem in Monterey Bay, off the coast of California. A food web models how the matter eaten by organisms moves through an ecosystem. The arrows in a food web represent how matter moves between organisms in an ecosystem.```

ASSISTANT:
=======Answer=======

Code

from typing import Union

import torch
from PIL import Image

from llava.constants import IMAGE_TOKEN_INDEX
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.model.builder import load_pretrained_model

# CACHE_DIR is a user-defined cache directory set elsewhere in this script.


class LLaVA_v1_6_13B:
    def __init__(self) -> None:
        self.model_path = "liuhaotian/llava-v1.6-vicuna-13b"

        self.tokenizer, self.model, self.image_processor, self.context_len = load_pretrained_model(
            model_path=self.model_path,
            model_base=None,
            model_name=get_model_name_from_path(self.model_path),
            cache_dir=CACHE_DIR,
            load_8bit=True,
            device='cuda',
        )

    def generate(self, prompt: str, img_path: Union[str, list]) -> str:
        print('{:=^20}\n{}'.format('prompt', prompt))
        if isinstance(img_path, str):
            img_path = [img_path]

        images = [Image.open(img) for img in img_path]
        image_sizes = [x.size for x in images]
        image_tensor = process_images(
            images,
            self.image_processor,
            self.model.config
        )
        if isinstance(image_tensor, list):
            image_tensor = [image.to(self.model.device, dtype=torch.float16)
                            for image in image_tensor]
        else:
            image_tensor = image_tensor.to(
                self.model.device, dtype=torch.float16)

        input_ids = (
            tokenizer_image_token(
                prompt,
                self.tokenizer,
                IMAGE_TOKEN_INDEX,
                return_tensors="pt")
            .unsqueeze(0)
            .to(self.model.device)
        )

        with torch.inference_mode():
            output_ids = self.model.generate(
                input_ids,
                images=image_tensor,
                image_sizes=image_sizes,
                do_sample=True,
                use_cache=True,
                temperature=0.2,
                max_new_tokens=1000,
            )

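        # Slice off the prompt tokens before decoding (assumes output_ids
        # starts with an echo of the input prompt).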
        outputs = self.tokenizer.decode(
            output_ids[0, input_ids.shape[1]:], skip_special_tokens=True)
        print('{:=^20}\n{}'.format('Answer', outputs))
        return outputs

model = LLaVA_v1_6_13B()
outputs = model.generate(prompt, img_path)

WesleyHsieh0806 commented 7 months ago

[Issue solved] It looks like the output_ids returned by LLaVA-v1.6 do not include the input prompt. We should therefore change `outputs = self.tokenizer.decode(output_ids[0, input_ids.shape[1]:])` to `outputs = self.tokenizer.decode(output_ids[0])` to avoid truncation.
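For reference, a minimal sketch of the corrected last step of the `generate` method above, assuming LLaVA-v1.6's `generate` already returns only the newly generated tokens:

        # v1.6 output does not echo the prompt, so decode the full output.
        outputs = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)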

I'm not sure whether we also need to update cli.py for llava-v1.6. @haotian-liu, could you confirm?

LumenYoung commented 7 months ago

That is the problem I observed several days ago, but I didn't make a PR since it was not clear how we should distinguish the different behaviors between models. I hope @haotian-liu can either create a unified behavior across models or decide which criterion to use to distinguish between the two kinds of behavior.
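One possible criterion, purely as an illustrative sketch and not the repo's actual logic: strip the prompt from the output only when the model actually echoes it back.

# Illustrative sketch only: handle both behaviors by checking whether the
# output begins with the prompt tokens before stripping them.
prompt_len = input_ids.shape[1]
if output_ids.shape[1] > prompt_len and torch.equal(
        output_ids[0, :prompt_len], input_ids[0]):
    new_tokens = output_ids[0, prompt_len:]  # older behavior: prompt is echoed
else:
    new_tokens = output_ids[0]               # v1.6 behavior: new tokens only
outputs = tokenizer.decode(new_tokens, skip_special_tokens=True)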

haotian-liu commented 7 months ago

Thanks for reporting. Yes, I missed that file, and we should make that change. Just pushed the fix to main.

haotian-liu commented 7 months ago

> That is the problem I observed several days ago, but I didn't make a PR since it was not clear how we should distinguish the different behaviors between models. I hope @haotian-liu can either create a unified behavior across models or decide which criterion to use to distinguish between the two kinds of behavior.

Definitely. We're working on a major refactor to make these behaviors more consistent. Thank you!