Open · bdashore3 opened 1 week ago
Can I request text completion vision support too? Chat completion is much more difficult to control.
Code-wise, text completion with vision is not possible since chat completion separates images and text into defined payloads with roles. If there are examples proving otherwise, I can look into it.
This should be possible in theory, since images in Llava-style models are inserted into the context as essentially tokens, and the way it's implemented in ExLlama is flexible enough to allow it. All it would need is an extension to the completion API to accept images with a placeholder text for each image.
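For illustration only, here is one sketch of what such an extension could look like from the client side: the completion payload gains an image list, and a placeholder string in the prompt marks where each image's tokens should land. Every name here (the `images` field, the `<image_1>` placeholder, the endpoint shape) is invented for the example; nothing like this exists in the API yet.

```python
import base64

import requests  # plain HTTP client; assumes a local tabbyAPI-style server

# Encode the image as base64 so it can travel inside a JSON payload
with open("photo.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    # Placeholder text marks where the image embeddings would be spliced in
    "prompt": "Describe this picture: <image_1>\nDescription:",
    "max_tokens": 200,
    # Hypothetical field: one entry per placeholder used in the prompt
    "images": [{"id": "image_1", "data": image_b64}],
}

resp = requests.post("http://127.0.0.1:5000/v1/completions", json=payload)
print(resp.json())
```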
But I don't think there are any vision models trained on image inputs outside the context of instruct tuning. So I wouldn't expect reliable results.
Looking at implementing vision support right now. It's theoretically possible to implement exllamav2 vision on the text completion endpoint, but I'm not aware of any API standard that defines how to format such things.
If you had some kind of standardized string that specifies an image URL or base64 image in the prompt, in theory we could find all of those, create embedding/text alias pairs, and feed them into the generator the same way as a formatted chat completion prompt.
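A minimal sketch of that scan step, assuming an invented inline tag format like `<image:URL_OR_BASE64>` (no standard exists, as noted above): find every tag, keep the surrounding text intact, and hand each image reference off to be embedded and swapped for whatever alias text the backend assigns.

```python
import re

# Hypothetical inline tag: <image:https://...> or <image:base64,...>
IMAGE_TAG = re.compile(r"<image:([^>]+)>")

def split_prompt(prompt: str):
    """Split a raw completion prompt into ordered text and image parts.

    Each ("image", ref) part would then be resolved, embedded, and
    replaced by the backend's text alias before tokenization, the same
    way a formatted chat completion prompt is fed to the generator.
    """
    parts, last = [], 0
    for match in IMAGE_TAG.finditer(prompt):
        if match.start() > last:
            parts.append(("text", prompt[last:match.start()]))
        parts.append(("image", match.group(1)))
        last = match.end()
    if last < len(prompt):
        parts.append(("text", prompt[last:]))
    return parts

parts = split_prompt("Describe this: <image:https://example.com/cat.png> in one line.")
# [('text', 'Describe this: '), ('image', 'https://example.com/cat.png'), ('text', ' in one line.')]
```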
I don't think a standard has been established, so you have more or less free rein. Hopefully SillyTavern would implement such an addition to Tabby.
Problem
Tracking issue for getting vision models working, supersedes #229.
Timeline:
Solution
Contributions are welcome here, as I am not sure how vision prompting works in the first place; figuring that out will require research and a lot of time.
Please PR to the vision branch.
Alternatives
No response
Explanation
Tracking issue
Examples
No response
Additional context
No response
Acknowledgements