hjsg1010 opened this issue 1 year ago
I ran into a similar issue with the llava-v1.5-7b model: the responses are far worse than yours, and the model repeats nonsense words or numbers.
What can be done to overcome that? @RicoWjr (I encountered this too.)
Hey, these are the projector weights that are only trained on image-text pairs and are NOT instruction tuned, which means they do NOT follow instructions as well as our official models and can output repetitive, lengthy, and garbled outputs.
You need to use LLaVA v1.5 models directly.
I just added these clarifications to the Model Zoo; hopefully that clears up some of the doubts.
These are projector weights we have pretrained. You can use these projector weights for visual instruction tuning. They are only pretrained on image-text pairs and are NOT instruction tuned, which means they do NOT follow instructions as well as our official models and can output repetitive, lengthy, and garbled outputs. If you want to have nice conversations with LLaVA, use the checkpoints above (LLaVA v1.5).
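For context, plugging these projector weights into the visual instruction tuning stage roughly amounts to passing them to the training script via --pretrain_mm_mlp_adapter. The command below is only an abridged, untested sketch: the data path, image folder, and output directory are placeholders, most hyperparameter flags are omitted, and the base LLM can be swapped for your own; see the finetuning scripts in the repo for the authoritative argument list.

deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path lmsys/vicuna-13b-v1.3 \
    --version v1 \
    --data_path ./path/to/instruction_data.json \
    --image_folder ./path/to/images \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter ./checkpoints/llava-336px-pretrain-vicuna-13b-v1.3/mm_projector.bin \
    --output_dir ./checkpoints/my-llava-13b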
Thanks for the reply. I'll read your clarifications.
I have another question. I have a Vicuna-13B model that has been finetuned with my own text data (language-only finetuning, not image-text pair finetuning using your script). Would it be possible to use it in your LLaVA framework?
Could I perhaps change the --model-base in this command, or modify my custom Vicuna config in some way?
python -m llava.serve.cli --model-path ./data-vol-1/model/llava/llava-336px-pretrain-vicuna-13b-v1.3 --model-base ./data-vol-1/model/llava/vicuna_13b_v1.3 --image-file "./llava/view.jpg"
@hjsg1010 the clarifications were added just now, after I saw this issue :(
If your finetuned Vicuna is based on Vicuna v1.3, you may try this one: https://huggingface.co/liuhaotian/llava-v1-0719-336px-lora-vicuna-13b-v1.3
It is LoRA-tuned, which means it may be compatible and could be plugged into a modified version of Vicuna v1.3 to give it visual capabilities, but I haven't tried anything like this, so there is no guarantee.
Check out the instructions here on how to launch a model worker with LoRA adapters; the CLI should be similar.
https://github.com/haotian-liu/LLaVA/blob/main/docs/LoRA.md#launch-a-model-worker
Are you suggesting that I should specify this model, https://huggingface.co/liuhaotian/llava-v1-0719-336px-lora-vicuna-13b-v1.3, as the --model-path and set my custom model as the --model-base?
Thanks for the reply. I'll read through it again carefully and give it a try.
Your understanding is correct.
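In practice, that would look roughly like the command below, adapted from the one earlier in the thread. This is an untested sketch: the local path to the finetuned Vicuna is a placeholder for your own checkpoint, and as noted above there is no guarantee the combination works.

python -m llava.serve.cli --model-path liuhaotian/llava-v1-0719-336px-lora-vicuna-13b-v1.3 --model-base ./data-vol-1/model/my_finetuned_vicuna_13b_v1.3 --image-file "./llava/view.jpg"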
@haotian-liu Thanks a lot. You've been a great help to me. I hope the multi-turn conversation feature will be available soon in the CLI environment or in Jupyter. For now, I'm also trying to implement it myself.
Wait, multi-turn conversation is already supported. See the gif (wait for around 10 seconds or more to see the second query): https://github.com/haotian-liu/LLaVA#cli-inference
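For anyone who, like the poster above, wants to drive a multi-turn exchange from a notebook instead of the CLI, the rough shape is sketched below. This is an untested sketch based on how the repo's llava/serve/cli.py keeps conversation state; helper signatures and output handling can differ between repo versions, and the checkpoint name and image path are just placeholders.

import torch
from PIL import Image
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.model.builder import load_pretrained_model

model_path = "liuhaotian/llava-v1.5-13b"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path, None, get_model_name_from_path(model_path))

image = Image.open("view.jpg").convert("RGB")
image_tensor = process_images([image], image_processor, model.config).to(model.device, dtype=torch.float16)

# The conversation template accumulates the full chat history, which is what
# makes later turns aware of earlier questions and answers.
conv = conv_templates["llava_v1"].copy()

def ask(question, first_turn=False):
    # The <image> placeholder is only inserted on the first turn.
    if first_turn:
        question = DEFAULT_IMAGE_TOKEN + "\n" + question
    conv.append_message(conv.roles[0], question)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX,
                                      return_tensors="pt").unsqueeze(0).to(model.device)
    with torch.inference_mode():
        output_ids = model.generate(input_ids, images=image_tensor,
                                    do_sample=False, max_new_tokens=512)
    # Depending on the repo version, the prompt tokens may need to be stripped
    # from output_ids before decoding.
    answer = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
    conv.messages[-1][-1] = answer  # store the answer so the next turn has context
    return answer

print(ask("What is shown in this image?", first_turn=True))
print(ask("Summarize that in one sentence."))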
Regarding how to overcome that: I just found that I had pulled a LLaVA Docker image that someone pushed months ago, and its code is not compatible with the latest llava-v1.5. You can check whether your code version is compatible with your model version.
Oh, about the multi-turn support: I meant several conversations with several images, so that I can test few-shot prompting with my own images.
Question
I served the model via the CLI using the following command.
The llava-336px-pretrain-vicuna-13b-v1.3 and vicuna_13b_v1.3 checkpoints were downloaded from your links.
However, as you can see below, the model produces excessively long responses. Would you happen to have any advice on this?
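(As answered above, lengthy and repetitive output is expected when serving the pretrain-only projector on top of a base Vicuna; the fix is to point the CLI at an instruction-tuned LLaVA v1.5 checkpoint directly, roughly as in the command below. The image path is just a placeholder.)

python -m llava.serve.cli --model-path liuhaotian/llava-v1.5-13b --image-file "./llava/view.jpg"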