Open sssssshf opened 1 year ago
Hi, thank you for your interest in our work.
This is a great suggestion! We have added an example script for CLI inference (single-turn Q-A session). An interactive CLI interface is WIP.
Please see instruction here: https://github.com/haotian-liu/LLaVA#cli-inference.
When I run this command:

```shell
python -m llava.eval.run_llava \
    --model-name /LLaVA-13B-v0 \
    --image-file "https://llava-vl.github.io/static/images/view.jpg" \
    --query "What are the things I should be cautious about when I visit here?"
```

it errors with:

```
HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: '/LLaVA-13B-v0'.
```
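The error comes from how `transformers` resolves `--model-name`: if the string is not an existing local directory, it is treated as a Hugging Face Hub repo id, and repo ids may not start with `/`. Pointing `--model-name` at the actual local checkpoint directory (e.g. `./LLaVA-13B-v0`) avoids the validator. A rough sketch of the resolution logic; the regex is a simplified assumption, not the real validator:

```python
import os
import re

# '/LLaVA-13B-v0' is not an existing local directory, so transformers forwards
# it to the Hub, whose repo-id rules reject the leading '/'. This regex is a
# simplified stand-in for the real validator, used only for illustration.
def valid_repo_id(name: str) -> bool:
    return re.fullmatch(r"[A-Za-z0-9][\w.\-]*(/[A-Za-z0-9][\w.\-]*)?", name) is not None

print(valid_repo_id("liuhaotian/LLaVA-13b-delta-v0"))  # True: well-formed Hub id
print(valid_repo_id("/LLaVA-13B-v0"))                  # False: leading '/'
print(os.path.isdir("/LLaVA-13B-v0"))                  # False unless the weights really live there
```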
Why can't the model downloaded directly from here be used as-is? https://huggingface.co/liuhaotian/LLaVA-13b-delta-v0
Cool, thank you very much @haotian-liu! Do you have plans to provide a CLI that allows feeding multiple images and text prompts turn by turn anytime soon? This would be super useful for applying your model to new downstream tasks.
Yes, I agree with @MaxFBurg, are there any such implementation plans?
@MaxFBurg @vishaal27
Yes, that's a great suggestion, and as mentioned in my previous reply, interactive CLI support is planned. We are planning to upgrade to Vicuna v1.1 soon, as it has better support for this. Stay tuned! And if you are interested in contributing, please let me know!
Why can't the model downloaded directly from here be used as-is? https://huggingface.co/liuhaotian/LLaVA-13b-delta-v0
We are not allowed to share the full model weights due to the LLaMA license; please see here for the weight conversion instructions.
Thanks for your response @haotian-liu
I tried replacing these lines in your eval script `llava/eval/run_llava.py`:

```python
qs = args.query
if mm_use_im_start_end:
    qs = qs + '\n' + DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len + DEFAULT_IM_END_TOKEN
else:
    qs = qs + '\n' + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len
```
with
```python
qs = args.query
if mm_use_im_start_end:
    qs = qs + "\n" + DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len + DEFAULT_IM_END_TOKEN + "\n" + DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len + DEFAULT_IM_END_TOKEN
else:
    qs = qs + "\n" + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len + "\n" + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len
```
I think this is a naive extension of the current single-image, single-turn inference procedure to one that can take two images as input in the prompt. Do you think something this straightforward will work out of the box?
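To make the effect of this replacement concrete, here is a toy rendering of the prompt string it builds. The token strings and `image_token_len` below are placeholder values for illustration, not LLaVA's real constants (the actual patch count is much larger, e.g. 256):

```python
# Placeholder values, chosen small so the printed prompt stays readable.
DEFAULT_IM_START_TOKEN = "<im_start>"
DEFAULT_IM_END_TOKEN = "<im_end>"
DEFAULT_IMAGE_PATCH_TOKEN = "<im_patch>"
image_token_len = 3

qs = "Describe the change applied to the first image to get to the second image"
image_span = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len + DEFAULT_IM_END_TOKEN
qs = qs + "\n" + image_span + "\n" + image_span  # one placeholder span per image
print(qs)
```

Each image gets its own delimited span of patch tokens, which the model then fills with the corresponding visual features.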
However, this doesn't seem to work well in practice for multi-image comparisons, a few examples follow:
For all the examples below, I used the following prompt with the modified code above: `{<img_1> <img_2> <"Describe the change applied to the first image to get to the second image">}`.
As you can see, the generations completely ignore the first image and give a detailed description of the second image. However, the model does understand that there are two images in the third case, given that its description contains "In the second image".
For a comparison, these are the model's responses when prompting with the same images and prompt on the web demo:
This response is more coherent, and describes the difference between the two images fairly reasonably.
I am wondering if this is an inherent limitation of the single-turn multi-image prompting style I've used above, since it could be out-of-distribution for the model (your visual instruction tuning dataset only contains a single image per sample). Do you have any suggestions for a better evaluation strategy for this multi-image comparison, either through single-turn or multi-turn prompting?
Did you also change the image-loading part?
Yes, this is the code I updated:
```python
image = load_image(args.image_file)
image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
```
with
```python
image_tensor = torch.stack(
    [
        image_processor.preprocess(load_image(image_file), return_tensors="pt")["pixel_values"][0]
        for image_file in args.image_file.split(",")
    ]
)
input_ids = torch.as_tensor(inputs.input_ids).cuda()
```
I just pass in comma-separated image file paths. Please let me know if there is an issue in this implementation.
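For anyone following along, the shape effect of the `torch.stack` change can be sketched without loading a model. Here `numpy` stands in for `torch`, and `fake_preprocess` is a placeholder for `image_processor.preprocess`, which in the real script returns a `(3, H, W)` tensor per image:

```python
import numpy as np

# Stand-in for image_processor.preprocess(...)['pixel_values'][0]:
# each image becomes a (3, 224, 224) array.
def fake_preprocess(_image_file):
    return np.zeros((3, 224, 224), dtype=np.float32)

image_file_arg = "left.jpg,right.jpg"  # comma-separated, as in the modified script
image_tensor = np.stack([fake_preprocess(f) for f in image_file_arg.split(",")])
print(image_tensor.shape)  # (2, 3, 224, 224): stacking adds a leading per-image dimension
```

That new leading dimension is exactly what a later commenter reports as a dimensionality error, so the model's forward pass has to be prepared to consume a batch of images per prompt.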
@penghe2021 Due to the current way of training, we do not observe the model having very good capability referring to / comparing with multiple images. We are working on improving this aspect as well, stay tuned!
Thanks @haotian-liu, so I assume the above implementation for single-turn multi-image inference is correct, but it's an OOD problem due to the model's current training setup (it sees only one image per sample during visual instruction tuning). However, I still see the model performing well on multiple images in the multi-turn setup, so I'm looking forward to your demo implementation of that. Do you have a plan for when it can be released?
It might be a superfluous or large request, but if the model could be integrated into a Hugging Face AutoModel or pipeline setup, I think it would be very accessible, especially for experimenting with different use cases.
Hi @Marcusntnu, thank you for your interest in our work, and thank you for the great suggestion. This is WIP, and our first step was to move the LLaVA model implementation to this repo, which has been completed. It should be implemented very soon. Thanks, and stay tuned!
@haotian-liu Have you considered releasing the multi-turn inference code?
@wjjlisa, do you mean the multi-turn conversation in CLI, as in our Gradio demo? This is planned for release by the end of this month. Was busy working on the NeurIPS recently...
These are my experiments with prompt tuning. Not perfect, but pretty amazing.
It seems the img1, img2, text ordering performs better.
@vishaal27 I would like to know what the structure of the data input looks like. I am trying to do a similar thing.
@cyril-mino Sorry, I don't quite get your question -- what do you mean by the structure of the data input? I just pass two images to the model as a list of tensors (with the updated code above), along with a prompt that asks to compare the two images.
@vishaal27 apologies, I thought you were finetuning.
Hi, would it be possible to share the input query prompts for this output? Thanks!
Can I check whether the multi-turn framework will be added to the repo anytime soon? Thanks for the great work.
Hi Vishaal, stacking the tensors adds a new leading dimension to the model input, which throws an exception. How did you overcome this issue?
Hi Adriel, as far as I recall (I must admit I haven't looked at this script in over a month), the `model.generate` function was able to take in multiple input images as a concatenated tensor. For full clarity, here is the script I used; hope it helps (disclaimer: this script uses a fork of the repository that is quite old, so a few things might have changed since then): https://github.com/MaxFBurg/LLaVA/blob/main/llava/eval/run_llava_two_images.py#L51
Hi Vishaal, thanks for sharing the code. Indeed, the fork has changed quite a bit: the mm-projector has been removed, and the pretrained model as well. I can confirm that the current fork with the modified image tensor input does not work, due to a dimensionality error in one of the nn.Modules during the forward pass. Can I do a quick check with you: did you use the llama-13b model or the facebook/opt model for your testing back then?
Great, thanks for letting me know -- I will however need to get back to this script at some point and get it to work, so I can let you know if I can figure something out for this use-case. Please do let me know if you are able to as well :) Re. your question -- We used the llama-13b model back then, I think at that stage the opt model was not available if I recall correctly.
@adrielkuek @vishaal27 We are also very interested in using multi-image input. Our interest is less in comparison, but rather using multiple images to represent the same thing, as described here: https://github.com/haotian-liu/LLaVA/issues/197#issuecomment-1567164371
Interested in multi-image input as well. We're wondering whether we could perform multimodal few-shot classification on the fly, without fine-tuning. I'll test Vishaal's solution and maybe create a PR when I have time.
Hi everyone. I've been browsing the LLaVA codebase for a while and find it hard to locate the exact `generate()` implementation for the LLaMA-based LLaVA. It would be helpful, since I want to find a way to work in multi-image mode. Any help would be appreciated!
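If it helps: in the Hugging Face `transformers` stack, `generate()` is usually not defined in the model repo at all. `LlavaLlamaForCausalLM` subclasses `LlamaForCausalLM`, which inherits `generate()` from `GenerationMixin`, so grepping the LLaVA sources won't find it. A toy sketch of that method-resolution path (the classes below are stand-ins, not the real ones):

```python
# Stand-in class hierarchy mirroring transformers' inheritance chain,
# showing where a generate() call actually resolves.
class GenerationMixin:
    def generate(self):
        return "generate() resolved on GenerationMixin"

class LlamaForCausalLM(GenerationMixin):
    pass

class LlavaLlamaForCausalLM(LlamaForCausalLM):
    pass

model = LlavaLlamaForCausalLM()
print(model.generate())                                   # inherited, not redefined
print([c.__name__ for c in LlavaLlamaForCausalLM.__mro__])  # the lookup path
```

In real code, `inspect.getsourcefile(type(model).generate)` on the loaded model will point you to the actual `generation` module inside `transformers`.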
Hi @SeungyounShin, would you mind sharing how you managed to embed both images into one query? It would be really helpful, as I currently cannot find a way to do this.
Something like #432 ? Would appreciate any suggestions.
@SeungyounShin, would you be able to upload this model to Hugging Face or share it some other way? Very interested in getting this to run for image comparison.
Hello, I am also interested in inputting more than one image for some experiments. I am trying to find the right template for this, considering that the base template is `USER: <image>\n<prompt>\nASSISTANT:`. Would it be `USER: <image1><image2>\n<prompt>\nASSISTANT:`?

```python
output = pipe(image, prompt=text_prompt, generate_kwargs={"max_new_tokens": 200})
```
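A minimal sketch of how one might generalize the template to N images. The numbered tag format is an assumption, not something the pipeline documents, and whether the model actually attends to every image is not guaranteed (see the OOD discussion earlier in this thread):

```python
# Hypothetical helper that builds an N-image variant of the chat template.
def build_prompt(num_images: int, user_text: str) -> str:
    image_tags = "".join(f"<image{i + 1}>" for i in range(num_images))
    return f"USER: {image_tags}\n{user_text}\nASSISTANT:"

prompt = build_prompt(2, "What differs between these two images?")
print(prompt)  # USER: <image1><image2>\nWhat differs between these two images?\nASSISTANT:
```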
I am in great need of a multi-dialog feature for batch-inference with SGLang.
Hi @cyril-mino, did you manage to fine-tune LLaVA with multiple images and text? I wonder if there are any extra steps beyond the code mentioned by @vishaal27. Thanks!
Is there any way we can embed other modalities, such as bounding boxes or class labels?
When running the model locally, do I have to use a browser for the demo? Is there a Python example that feeds images and text into the model directly?