Open · bdashore3 opened 1 week ago
Can I request text completion vision support too? Chat completion is much more difficult to control.
Code-wise, text completion with vision is not possible since chat completion separates images and text into defined payloads with roles. If there are examples proving otherwise, I can look into it.
This should be possible in theory, since images in Llava-style models are inserted into the context as essentially tokens, and the way it's implemented in ExLlama is flexible enough to allow it. All it would need is an extension to the completion API to accept images with a placeholder text for each image.
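For illustration only, here is one sketch of what such an extension could look like from the client side: the completion payload gains an image list, and a placeholder string in the prompt marks where each image's tokens should land. Every name here (the `images` field, the `<image_1>` placeholder, the endpoint shape) is invented for the example; nothing like this exists in the API yet.

```python
import base64

import requests  # plain HTTP client; assumes a local tabbyAPI-style server

# Encode the image as base64 so it can travel inside a JSON payload
with open("photo.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    # Placeholder text marks where the image embeddings would be spliced in
    "prompt": "Describe this picture: <image_1>\nDescription:",
    "max_tokens": 200,
    # Hypothetical field: one entry per placeholder used in the prompt
    "images": [{"id": "image_1", "data": image_b64}],
}

resp = requests.post("http://127.0.0.1:5000/v1/completions", json=payload)
print(resp.json())
```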
But I don't think there are any vision models trained on image inputs outside the context of instruct tuning. So I wouldn't expect reliable results.
Looking at implementing vision support right now. It's theoretically possible to implement exllamav2 vision on the text completion endpoint, but I'm not aware of any API standard that defines how to format such things.
If you had some kind of standardized string that specifies an image URL or base64 image in the prompt, in theory we could find all of those, create embedding/text alias pairs, and feed them into the generator the same way as a formatted chat completion prompt.
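A minimal sketch of that scan step, assuming an invented inline tag format like `<image:URL_OR_BASE64>` (no standard exists, as noted above): find every tag, keep the surrounding text intact, and hand each image reference off to be embedded and swapped for whatever alias text the backend assigns.

```python
import re

# Hypothetical inline tag: <image:https://...> or <image:base64,...>
IMAGE_TAG = re.compile(r"<image:([^>]+)>")

def split_prompt(prompt: str):
    """Split a raw completion prompt into ordered text and image parts.

    Each ("image", ref) part would then be resolved, embedded, and
    replaced by the backend's text alias before tokenization, the same
    way a formatted chat completion prompt is fed to the generator.
    """
    parts, last = [], 0
    for match in IMAGE_TAG.finditer(prompt):
        if match.start() > last:
            parts.append(("text", prompt[last:match.start()]))
        parts.append(("image", match.group(1)))
        last = match.end()
    if last < len(prompt):
        parts.append(("text", prompt[last:]))
    return parts

parts = split_prompt("Describe this: <image:https://example.com/cat.png> in one line.")
# [('text', 'Describe this: '), ('image', 'https://example.com/cat.png'), ('text', ' in one line.')]
```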
I don't think a standard has been established, so you have more or less free rein. Hopefully SillyTavern would implement such an addition to Tabby.
Problem
Tracking issue for getting vision models working, supersedes #229.
Timeline:
Solution
Contributions are welcome here, as I am not sure how vision prompting works in the first place; figuring that out will require research and a lot of time.
Please PR to the vision branch.
Alternatives
No response
Explanation
Tracking issue
Examples
No response
Additional context
No response
Acknowledgements