guidance-ai / guidance

A guidance language for controlling large language models.

Input template for Transformers vision language models? #880

Open vpellegrain opened 1 month ago

vpellegrain commented 1 month ago

Hi,

I'm trying to constrain the generation of my VLMs using this repo; however, I can't figure out how to customize the pipeline for handling inputs (query + image). While the usage is documented as

from guidance import assistant, gen, image, models, user

gemini = models.VertexAI("gemini-pro-vision")

with user():
    lm = gemini + "What is this a picture of?" + image("longs_peak.jpg")

with assistant():
    lm += gen("answer")

for VertexAI models (here, Gemini), it does not carry over to Transformers models. Hence:

model = models.Transformers("openbmb/MiniCPM-Llama3-V-2_5")

with user():
    lm = model + "What is this a picture of?" + image("longs_peak.jpg")

with assistant():
    lm += gen("answer")

results in:

TypeError: MiniCPMV.forward() missing 1 required positional argument: 'data'

Trying "microsoft/Phi-3-vision-128k-instruct" instead results in:

ValueError: The tokenizer being used is unable to convert a special character in ’•¶∂ƒ˙∆£Ħ爨ൠᅘ∰፨.

(I also tried loading the model and the tokenizer manually and passing them to the models.Transformers call, but it does not change the error.)
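For reference, the manual-loading attempt was along these lines (a sketch; AutoModel/AutoTokenizer with trust_remote_code are my assumptions for loading this custom-code model, and passing the tokenizer as a keyword reflects my understanding of the models.Transformers signature):

from transformers import AutoModel, AutoTokenizer
from guidance import models

model_id = "openbmb/MiniCPM-Llama3-V-2_5"

# load the Hugging Face model and tokenizer by hand...
hf_model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
hf_tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# ...and hand the preloaded objects to guidance instead of a model id string
lm = models.Transformers(hf_model, tokenizer=hf_tokenizer)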

Is it possible to specify/customize the pipeline for reading inputs on such models?

Thanks

Harsha-Nori commented 1 month ago

Hi @vpellegrain -- we're in the process of revamping our support for image inputs, but @nking-1 is looking into this right now :). We should have updates on this front shortly!

liqul commented 3 weeks ago

I got this error from a non-vision model as well:

from guidance import models

model_id = 'THUDM/glm-4-9b-chat'
glm_model = models.Transformers(model_id, device_map='auto', trust_remote_code=True)
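# constructing the model is already enough to hit the error; no generation call follows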

The error message is the same as in the first post.

dittops commented 3 weeks ago

Same here; I hit the issue while using "microsoft/Phi-3-medium-4k-instruct".

Harsha-Nori commented 3 weeks ago

@dittops, are you trying to use a vision input for Phi-3, or just doing plain text generation? We're still working on multimodal support -- will update here when we have the image function working again :).
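
To clarify what I mean by plain text generation, something like this sketch (the prompt is illustrative; trust_remote_code mirrors @liqul's snippet above):

from guidance import models, gen

# text-only usage: no image() call anywhere in the prompt
lm = models.Transformers("microsoft/Phi-3-medium-4k-instruct", trust_remote_code=True)
lm += "The capital of France is " + gen("answer", max_tokens=5)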

@liqul -- Thanks for sharing this with us! Tagging @riedgar-ms who might be able to take a look