haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

python inference demo #57

Open sssssshf opened 1 year ago

sssssshf commented 1 year ago

Do I have to use a browser to run the demo when the model is running locally? Is there a Python demo that feeds images and text into the model directly?

haotian-liu commented 1 year ago

Hi, thank you for your interest in our work.

This is a great suggestion! We have added an example script for CLI inference (single-turn Q-A session). An interactive CLI interface is WIP.

Please see the instructions here: https://github.com/haotian-liu/LLaVA#cli-inference.

sssssshf commented 1 year ago

When I run this command:

    python -m llava.eval.run_llava \
        --model-name /LLaVA-13B-v0 \
        --image-file "https://llava-vl.github.io/static/images/view.jpg" \
        --query "What are the things I should be cautious about when I visit here?"

it errors with:

    HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: '/LLaVA-13B-v0'.

sssssshf commented 1 year ago

Why can't the model downloaded from here be used directly? https://huggingface.co/liuhaotian/LLaVA-13b-delta-v0

MaxFBurg commented 1 year ago

Cool, thank you very much @haotian-liu! Do you have plans to provide a CLI that allows feeding multiple images and text prompts turn by turn anytime soon? This would be super cool for using your model on new downstream tasks.

vishaal27 commented 1 year ago

Yes, I agree with @MaxFBurg, are there any such implementation plans?

haotian-liu commented 1 year ago

@MaxFBurg @vishaal27

Yes, that's a great suggestion, and as mentioned in my previous reply, interactive CLI support is planned. We are planning to upgrade to Vicuna v1.1 soon, as it has better support for this. Stay tuned! And if you are interested in contributing, please let me know!

haotian-liu commented 1 year ago

Why can't the model downloaded from here be used directly? https://huggingface.co/liuhaotian/LLaVA-13b-delta-v0

We are not allowed to share the full model weights due to the LLaMA license; please see here for weight conversion.
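The conversion applies our released delta to the original LLaMA weights. Roughly, it looks like this (check the README for the exact script name and flags; the local paths below are placeholders):

    python -m llava.model.apply_delta \
        --base /path/to/llama-13b \
        --target /path/to/LLaVA-13B-v0 \
        --delta liuhaotian/LLaVA-13b-delta-v0

The --model-name argument of run_llava should then point to the resulting local directory; the HFValidationError above is what transformers raises when the given path does not exist locally and it falls back to treating it as a Hub repo id.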

vishaal27 commented 1 year ago

Thanks for your response @haotian-liu. I tried replacing these lines in your eval script llava/eval/run_llava.py:

    # Original single-image prompt construction: append one image-token block to the query
    qs = args.query
    if mm_use_im_start_end:
        qs = qs + '\n' + DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len + DEFAULT_IM_END_TOKEN
    else:
        qs = qs + '\n' + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len

with

    # Naive two-image extension: append two image-token blocks to the query
    qs = args.query
    if mm_use_im_start_end:
        qs = qs + "\n" + DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len + DEFAULT_IM_END_TOKEN + "\n" + DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len + DEFAULT_IM_END_TOKEN
    else:
        qs = qs + "\n" + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len + "\n" + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len

I think this is a naive extension of the current single-image, single-turn inference procedure to a single-turn procedure that takes two images as input in the prompt. Do you think something as straightforward as this will work out of the box?

However, this doesn't seem to work well in practice for multi-image comparisons; a few examples follow:

For all of the examples below, I used the following prompt with the modified code above: {<img_1> <img_2> <"Describe the change applied to the first image to get to the second image">}.

[Screenshots: three example generations with the two-image prompt]

As you can see, the generations completely ignore the first image and give a detailed description of the second image. However, the model does understand that there are two images in the third case, given that its description contains "In the second image".

For comparison, this is the model's response when prompted with the same images and prompt on the web demo:

[Screenshot: web demo response to the same prompt]

This response is more coherent, and describes the difference between the two images fairly reasonably.

I am wondering if this is an inherent limitation of the single-turn, multi-image prompting style I've used above, since it could be out-of-distribution for the model (your visual instruction tuning dataset only contains a single image per sample). Do you have any suggestions for a better evaluation strategy for this multi-image comparison, either through single-turn or multi-turn prompting?

penghe2021 commented 1 year ago

(quoting @vishaal27's two-image modification and examples above)

Did you also change the image-loading part?

vishaal27 commented 1 year ago

Yes, this is the code I updated:

    # Original: load and preprocess a single image
    image = load_image(args.image_file)
    image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0]

with

    # Preprocess each comma-separated image path and stack into one tensor of shape (num_images, C, H, W)
    image_tensor = torch.stack(
        [
            image_processor.preprocess(load_image(image_file), return_tensors="pt")["pixel_values"][0]
            for image_file in args.image_file.split(",")
        ]
    )

    # unchanged from the original script
    input_ids = torch.as_tensor(inputs.input_ids).cuda()

I just pass in comma-separated image file paths. Please let me know if there is an issue with this implementation.
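For concreteness, a hypothetical invocation of the modified script looks like this (the model path and image files are placeholders):

    python -m llava.eval.run_llava \
        --model-name /path/to/LLaVA-13B-v0 \
        --image-file "image_before.jpg,image_after.jpg" \
        --query "Describe the change applied to the first image to get to the second image"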

haotian-liu commented 1 year ago

@penghe2021 Due to the current way of training, we do not observe the model having a very good capability at referring to / comparing multiple images. We are working on improving this aspect as well, stay tuned!

vishaal27 commented 1 year ago

Thanks @haotian-liu. So I assume the above implementation for single-turn multi-image inference is correct, but it's an out-of-distribution problem due to the current training setup of the model (it sees only one image per sample during visual instruction tuning). However, I still see the model performing well on multiple images when used in the multi-turn setup, so I'm looking forward to your demo implementation of that. Do you have a plan for when it can be released?

Marcusntnu commented 1 year ago

It might be a superfluous or large request, but if the model could be integrated into a Hugging Face AutoModel or pipeline setup, I think it would be very accessible, especially for experimenting with different use cases.
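Something along these lines is what I have in mind; this is purely illustrative, and the class name, checkpoint id, and prompt format are assumptions on my part rather than an existing integration:

    # Illustrative sketch of Hugging Face-style usage; the class name, checkpoint id,
    # and prompt format here are assumptions, not an existing integration.
    import requests
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
    processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

    image = Image.open(requests.get("https://llava-vl.github.io/static/images/view.jpg", stream=True).raw)
    prompt = "USER: <image>\nWhat are the things I should be cautious about when I visit here?\nASSISTANT:"

    inputs = processor(images=image, text=prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=200)
    print(processor.decode(output[0], skip_special_tokens=True))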

haotian-liu commented 1 year ago

Hi @Marcusntnu, thank you for your interest in our work, and thank you for the great suggestion. This is WIP, and our first step was to move the LLaVA model implementation into this repo, which has been completed. It should be implemented very soon, thanks and stay tuned!

wjjlisa commented 1 year ago

@haotian-liu Have you considered releasing the multi-turn inference code?

haotian-liu commented 1 year ago

@wjjlisa, do you mean multi-turn conversation in the CLI, as in our Gradio demo? This is planned for release by the end of this month. I was busy working on NeurIPS recently...
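Once it is released, the usage should look roughly like the sketch below (the module path and flags may change before release; the model path is a placeholder):

    python -m llava.serve.cli \
        --model-path /path/to/LLaVA-13B-v0 \
        --image-file "https://llava-vl.github.io/static/images/view.jpg"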

SeungyounShin commented 1 year ago
[Screenshot: two-image comparison output from a prompt-tuning experiment]

These are my experiments with prompt tuning. Not perfect, but pretty amazing.

It seems like the img1, img2, text ordering performs better.
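Concretely, by img1, img2, text I mean placing both image-token blocks before the question when building the prompt, roughly like this (a sketch of the ordering, not my exact code):

    # Sketch of the img1, img2, text ordering: both image-token blocks first, then the question.
    qs = (
        DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len + DEFAULT_IM_END_TOKEN
        + "\n"
        + DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len + DEFAULT_IM_END_TOKEN
        + "\n"
        + args.query
    )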

cyril-mino commented 1 year ago

@vishaal27 I would like to know what the structure of the data input looks like. I am trying to do a similar thing.

vishaal27 commented 1 year ago

@cyril-mino Sorry I don't get your question -- what do you mean by structure of data input? I just pass in two images to the model as a list of tensors (with the updated code above) and pass in the prompt that asks to compare the two images.

cyril-mino commented 1 year ago

@vishaal27 apologies, I thought you were finetuning.

adrielkuek commented 1 year ago
(quoting @SeungyounShin's prompt-tuning experiment above)

Hi, is it possible to share the input query prompts for this output? Thanks!

adrielkuek commented 1 year ago

(quoting @haotian-liu's reply above about the multi-turn CLI)

Can I check whether the multi-turn framework will be added to the repo anytime soon? Thanks for the great work.

adrielkuek commented 1 year ago

(quoting @vishaal27's image-stacking code above)

Hi Vishaal, by stacking the tensors we create an extra input dimension for the model, which throws an exception. How did you overcome this issue?

vishaal27 commented 1 year ago

Hi Adriel, as far as I recall (I must admit I haven't looked at this script in over a month), the model.generate function was able to take in multiple input images as a concatenated tensor. For full clarity, here is the script I used; hope it helps (disclaimer: this script uses a fork of the repository that is quite old, so it is possible a few things have changed since then): https://github.com/MaxFBurg/LLaVA/blob/main/llava/eval/run_llava_two_images.py#L51
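From memory, the generate call in that script passes the stacked images roughly like this (a sketch; see the linked file for the exact arguments):

    # Sketch of the generate call with the stacked image tensor; see the linked
    # run_llava_two_images.py for the exact arguments used.
    output_ids = model.generate(
        input_ids,
        images=image_tensor.half().cuda(),
        do_sample=True,
        temperature=0.2,
        max_new_tokens=1024,
    )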

adrielkuek commented 1 year ago

Hi Vishaal, thanks for sharing the code. Indeed the fork has changed quite a bit. It seems the mm-projector has been removed, as well as the pretrained model. I can confirm that the current fork with the modified image tensor input does not work, due to a dimensionality error in one of the nn.Modules during the forward pass. Can I quickly check with you: did you use the llama-13b model or the facebook/opt model for your testing back then?

vishaal27 commented 1 year ago

Great, thanks for letting me know -- I will, however, need to get back to this script at some point and get it to work, so I can let you know if I figure something out for this use case. Please do let me know if you are able to as well :) Re. your question -- we used the llama-13b model back then; I think the opt model was not available at that stage, if I recall correctly.

codybum commented 1 year ago

(quoting @adrielkuek's comment above)

@adrielkuek @vishaal27 We are also very interested in using multi-image input. Our interest is less in comparison and more in using multiple images to represent the same thing, as described here: https://github.com/haotian-liu/LLaVA/issues/197#issuecomment-1567164371

HireTheHero commented 10 months ago

Interested in multiple-image input as well. We're wondering whether we could perform multimodal few-shot classification (on the fly, without fine-tuning). Will test Vishaal's solution and maybe create a PR when I have time.

LumenYoung commented 10 months ago

Hi everyone. I've been browsing the LLaVA code base for a while and find it hard to locate the exact generate() implementation for the llama-based LLaVA. I'm trying to find the generate() for LLaMA; it would be helpful since I want to find a way to work in multi-image mode. Any help would be appreciated!

LumenYoung commented 10 months ago
(quoting @SeungyounShin's prompt-tuning experiment above)

Hi @SeungyounShin, would you mind sharing how you managed to embed both images into one query? It would be really helpful, as I'm currently not able to find a way to do this.

HireTheHero commented 10 months ago

Something like #432 ? Would appreciate any suggestions.

CreativeBuilds commented 9 months ago
(quoting @SeungyounShin's prompt-tuning experiment above)

Would you be able to upload this model to Hugging Face or share it some other way? Very interested in getting this to run with image comparison.

aldoz-mila commented 6 months ago

Hello, I am also interested in inputting more than one image for some experiments. I am trying to find the right template for this, considering that the base template is USER: <image>\n<prompt>\nASSISTANT:.
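The most natural extension I can think of is simply repeating the image token, something like the sketch below, though I don't know whether the model was actually trained to handle this:

    # A guess at a two-image template, extending the single-image template above;
    # whether the checkpoint actually attends to both images is unclear.
    prompt = (
        "USER: <image>\n<image>\n"
        "Describe the change applied to the first image to get to the second image.\n"
        "ASSISTANT:"
    )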

fisher75 commented 4 months ago

I am in great need of a multi-dialogue feature for batch inference with SGLang.
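For reference, the single-turn batched pattern I'm starting from looks roughly like this, based on SGLang's documented image-QA example (the exact API, port, and file names are assumptions and may differ by version); what I need is this extended over multiple dialogue turns:

    import sglang as sgl

    # Based on SGLang's documented image-QA example; API details may differ by version.
    @sgl.function
    def image_qa(s, image_path, question):
        s += sgl.user(sgl.image(image_path) + question)
        s += sgl.assistant(sgl.gen("answer", max_tokens=256))

    # Point at a running SGLang server that serves a LLaVA checkpoint (port is a placeholder).
    sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

    # Batch inference over many (image, question) pairs.
    states = image_qa.run_batch(
        [
            {"image_path": "image1.jpg", "question": "What is in this image?"},
            {"image_path": "image2.jpg", "question": "What is in this image?"},
        ]
    )
    print([s["answer"] for s in states])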

Sprinter1999 commented 3 months ago

(replying to @cyril-mino's comment above)

Hi @cyril-mino, did you manage to get fine-tuning of LLaVA with multiple images & text working? I wonder if there are extra steps besides the code mentioned by @vishaal27. Thanks~

yc-cui commented 3 months ago

Is there any way we can embed other modalities, such as bounding boxes, class labels, etc.?