stanfordnlp / dspy

DSPy: The framework for programming—not prompting—language models
https://dspy.ai
MIT License

Image support inside complex types #1767

Open isaacbmiller opened 2 weeks ago

isaacbmiller commented 2 weeks ago

Currently, you can only pass a single image per input field in a signature.

E.g. this will work:

import dspy

class ImageSignature(dspy.Signature):
    image1: dspy.Image = dspy.InputField()
    image2: dspy.Image = dspy.InputField()

But any more complex types involving images won't:

from typing import Dict, List

import dspy

class ImageSignature(dspy.Signature):
    images: List[dspy.Image] = dspy.InputField()

class ImageSignature(dspy.Signature):
    labeled_images: Dict[str, dspy.Image] = dspy.InputField()

This is due to how images are compiled into OpenAI-compatible messages: inside chat_adapter.py we build one large list of content blocks, giving special treatment to fields that contain an image_url:

{
    "content": [
        {
            "type": "text",
            "text": "...",
        },
        {
            "type": "image_url",
            "image_url": {"url": "..."}  # url is either an actual url or the base64 data
        }
    ]
}

I do some fairly naive parsing inside ChatAdapter, and there is definitely a more elegant solution here.
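
One possible shape for a more general solution, as a rough sketch only: walk the field value recursively and emit a labeled text block followed by an image_url block for every image found. The `Image` class here is a hypothetical stand-in with a `url` attribute (not the real dspy.Image), and `to_content_blocks` is an illustrative name, not a function that exists in chat_adapter.py.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Image:
    """Hypothetical stand-in for dspy.Image; holds a URL or base64 data URI."""
    url: str


def to_content_blocks(name: str, value: Any) -> list[dict]:
    """Flatten a (possibly nested) field value into OpenAI-style content blocks."""
    if isinstance(value, Image):
        # Label the image with its path inside the field, then attach the image.
        return [
            {"type": "text", "text": f"{name}:"},
            {"type": "image_url", "image_url": {"url": value.url}},
        ]
    if isinstance(value, dict):
        blocks = []
        for key, item in value.items():
            blocks.extend(to_content_blocks(f"{name}[{key!r}]", item))
        return blocks
    if isinstance(value, (list, tuple)):
        blocks = []
        for index, item in enumerate(value):
            blocks.extend(to_content_blocks(f"{name}[{index}]", item))
        return blocks
    # Anything that is not an image or a container is rendered as plain text.
    return [{"type": "text", "text": f"{name}: {value}"}]


blocks = to_content_blocks(
    "labeled_images", {"cat": Image("https://example.com/cat.png")}
)
```

Because the recursion only dispatches on the runtime value, the same function would cover both the List and the Dict case without inspecting the type annotation.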

#1763 addresses the List case, but I want a more generalized solution.

cc @okhat

thomasahle commented 2 weeks ago

This is how I did it in fewshot:

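import json
from typing import Any

from pydantic import BaseModel

# Note: map_images and gpt_format_image are helpers that are not shown in this snippet.
def format_input_simple(pydantic_object: BaseModel, img_formatter=None) -> dict[str, Any]:
    if img_formatter is None:
        img_formatter = gpt_format_image

    image_map = {}  # placeholder ID -> base64 image data

    def replace_image_with_id(obj: Any) -> Any:
        # Register the image and return a textual placeholder such as "[image 1]".
        image_id = f"[image {len(image_map) + 1}]"
        image_map[image_id] = obj.base64()
        return image_id

    # Swap every image in the object for its placeholder, then serialize to JSON.
    dict_obj = map_images(pydantic_object, replace_image_with_id)
    processed = json.dumps(dict_obj)

    # The JSON text goes first, followed by one (ID, image) pair per image.
    content = [{"type": "text", "text": processed}]
    for image_id, image in image_map.items():
        content.append({"type": "text", "text": image_id + ":"})
        content.append(img_formatter(image))

    return {"role": "user", "content": content}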

Basically, when I turn the input object into JSON, I replace each image with an ID. Then at the end of the message I send the list of (ID, img) pairs.

Works reasonably well.
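
For readers who want to try the placeholder idea without the helpers from the snippet above, here is a self-contained sketch. It is a reimplementation under assumptions, not the fewshot code: `Image` is a hypothetical dataclass holding a base64 string, this `map_images` only handles plain dicts, lists and tuples rather than pydantic objects, and the PNG MIME type in the data URI is assumed.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Image:
    """Hypothetical image type; `data` holds a base64-encoded string."""
    data: str


def map_images(obj: Any, fn: Callable[[Image], Any]) -> Any:
    """Return a copy of obj with every Image replaced by fn(image)."""
    if isinstance(obj, Image):
        return fn(obj)
    if isinstance(obj, dict):
        return {key: map_images(value, fn) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [map_images(item, fn) for item in obj]
    return obj


def format_with_placeholders(obj: Any) -> dict[str, Any]:
    image_map: dict[str, str] = {}  # placeholder ID -> base64 data

    def replace(image: Image) -> str:
        image_id = f"[image {len(image_map) + 1}]"
        image_map[image_id] = image.data
        return image_id

    # JSON text with placeholders first, then one (ID, image) pair per image.
    content = [{"type": "text", "text": json.dumps(map_images(obj, replace))}]
    for image_id, data in image_map.items():
        content.append({"type": "text", "text": image_id + ":"})
        content.append({
            "type": "image_url",
            # Assumption: the image is a PNG; adjust the MIME type as needed.
            "image_url": {"url": f"data:image/png;base64,{data}"},
        })
    return {"role": "user", "content": content}


message = format_with_placeholders(
    {"question": "Which is larger?", "images": [Image("AAAA"), Image("BBBB")]}
)
```

The first content block then carries the JSON with "[image 1]" and "[image 2]" in place of the images, and the remaining blocks pair each ID with its image.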

rzr2kor commented 2 weeks ago

Hey, I was trying to perform VQA with an LLM, using dspy for optimized prompting, and I'm not able to pass the base64 image to the LLM via dspy. Could you let me know how you were able to do it? I tried dspy.Image, but I get an error saying No module called dspy.Image. Thanks

okhat commented 1 week ago

@rzr2kor Are you on the latest version of DSPy? pip install -U dspy

isaacbmiller commented 1 week ago

"Then at the end of the message I send the list of (ID, img) pairs."

@thomasahle Did you find that this worked better than interleaving the {"type": "image_url", "image_url": ...} blocks into your actual text content, or was it just a design decision?

glesperance commented 1 week ago

With images inside complex types, it seems like we could unlock MIPROv2 with few-shot awareness enabled, since DescribeProgram / DescribeModule could then be modified to receive a program_example that contains images.

thomasahle commented 1 week ago

I couldn't put it in "the actual context", since that was just one big JSON string.
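
For comparison, an interleaved layout could still be approximated from a single serialized string by splitting it on the placeholder IDs. This is only a sketch of that alternative, not something either project is known to do; `interleave` and the "[image N]" placeholder pattern are assumptions carried over from the ID scheme described earlier in the thread.

```python
import re

# Matches placeholders such as "[image 1]"; the capture group keeps them in the split.
PLACEHOLDER = re.compile(r"(\[image \d+\])")


def interleave(text: str, image_urls: dict[str, str]) -> list[dict]:
    """Split text on image placeholders and put each image where its ID appeared."""
    blocks = []
    for part in PLACEHOLDER.split(text):
        if part in image_urls:
            blocks.append({"type": "image_url", "image_url": {"url": image_urls[part]}})
        elif part:
            # Non-empty text fragments between placeholders are kept as text blocks.
            blocks.append({"type": "text", "text": part})
    return blocks


blocks = interleave(
    '{"images": ["[image 1]", "[image 2]"]}',
    {
        "[image 1]": "https://example.com/a.png",
        "[image 2]": "https://example.com/b.png",
    },
)
```

The trade-off is that the JSON gets cut into fragments around each image, so the model never sees one intact JSON document, and any user text that happens to contain a literal "[image 1]" would collide with the placeholders.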