simonw / llm

Access large language models from the command-line
https://llm.datasette.io
Apache License 2.0

Multi-modal support for vision models such as GPT-4 vision #331

Closed cmungall closed 2 weeks ago

cmungall commented 1 year ago

https://platform.openai.com/docs/guides/vision

I think this is best handled by command line options --image and --image-urls to either encode and pass as base64, or to pass a URL.

tomviner commented 1 year ago

Indeed this would be awesome. Does it require changes to llm or can it be done in a plugin?

cmungall commented 1 year ago

I suspect we'll be seeing more multimodal models so inclusion in core makes sense, but I defer to @simonw on this!

simonw commented 1 year ago

I've been thinking about this a lot.

The challenge here is that we need to be able to mix both text and images together in the same prompt - because you can call GPT-4 vision with this kind of thing:

Take a look at this image:

<image 1>

Now compare it to this:

<image 2>

My first instinct was to support syntax like this:

llm -m gpt-4-vision \
  "Take a look at this image:" \
  -i image1.jpeg \
  "Now compare it to this:" \
  -i https://example.com/image2.png

Note that the -i/--image option here takes a filename or a URL, detecting files by seeing if they correspond to files on disk.

But... I don't think I can implement this, because Click really, really doesn't want to provide a mechanism for storing and retrieving the order of different arguments and parameters relative to each other:

I spent some time trying to get this to work with a custom Click command class and parse_args() but determined that I'd effectively have to re-implement the whole Click argument parser from scratch to handle cases like --enable-logging boolean flags and -p key value multi-value parameters. This doesn't feel worthwhile to me.
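For illustration only, here is a minimal sketch of the ordering the first syntax would need, written as a hand-rolled scan of the argument list rather than with Click. The function name `interleave_parts` and the tuple layout are assumptions made for this example. It understands only `-i`/`--image` and plain text, so it does not handle boolean flags or multi-value options, which is the part that would require re-implementing the parser.

```python
from typing import List, Tuple


def interleave_parts(argv: List[str]) -> List[Tuple[str, str]]:
    """Return ("text", value) / ("image", value) pairs in command-line order.

    Hypothetical helper: it only understands -i/--image and treats every
    other argument as prompt text.
    """
    parts: List[Tuple[str, str]] = []
    args = iter(argv)
    for arg in args:
        if arg in ("-i", "--image"):
            # The option consumes the next argument as its value.
            value = next(args, None)
            if value is None:
                raise ValueError(f"{arg} requires a filename or URL")
            parts.append(("image", value))
        else:
            parts.append(("text", arg))
    return parts
```

Called with the arguments from the example above (minus `-m gpt-4-vision`), this returns the text and image parts in the order they were typed.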

So now I'm considering the following instead:

llm "look at this image" -i image.jpeg --tbc
llm -c "and compare it with" -i https://example.com/image.png

The trick here is that new --tbc flag, which stands for "to be continued". It causes the prompt to be stored but NOT executed against the model yet - instead, any following llm -c calls can be used to stack up more context in the prompt, which will be executed the first time --tbc is NOT used.
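A rough sketch of the "to be continued" idea, assuming the pending parts are simply accumulated until a call arrives without the flag. The `PendingPrompt` class and the `execute` callback are hypothetical names for this example, and the parts are held in memory here, whereas a real implementation would need to persist them between separate `llm` invocations.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Part = Tuple[str, str]  # ("text", "...") or ("image", "path-or-url")


@dataclass
class PendingPrompt:
    """Hypothetical accumulator for prompt parts across several calls."""

    parts: List[Part] = field(default_factory=list)

    def add(
        self,
        text: str,
        images: List[str],
        tbc: bool,
        execute: Callable[[List[Part]], str],
    ) -> Optional[str]:
        self.parts.append(("text", text))
        self.parts.extend(("image", image) for image in images)
        if tbc:
            # Stored but not sent to the model yet.
            return None
        # First call without --tbc: send everything stacked up so far.
        parts, self.parts = self.parts, []
        return execute(parts)
```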

On a related note: llm chat could also support this - maybe letting you do this kind of thing:

llm chat -m gpt-4-vision
look at this image
!image image.jpeg

For multi-lined chats you would use the existing !multi command:

llm chat -m gpt-4-vision
!multi
look at this image
!image image.jpeg
and compare it with
!image https://example.com/image.png
!end

simonw commented 1 year ago

Crucially, I want to leave the door open for other LLM models provided by plugins - like maybe https://github.com/SkunkworksAI/BakLLaVA - to also support multi-modal inputs like this.

So I think the model class would have a supports_images = True property it could set to tell LLM that images are supported - otherwise using -i/--image would return an error.
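As a sketch of how such a flag might be checked, assuming a simplified base class: the `Model`, `EchoVisionModel` and `TextOnlyModel` classes below are stand-ins invented for this example and are not LLM's actual plugin API. Only the `supports_images` attribute name comes from the comment above.

```python
from typing import Sequence


class Model:
    """Simplified stand-in for a model class (not LLM's real base class)."""

    supports_images = False  # plugins that accept images would set this to True

    def prompt(self, text: str, images: Sequence[str] = ()) -> str:
        if images and not self.supports_images:
            raise ValueError(
                f"{type(self).__name__} does not support -i/--image"
            )
        return self.execute(text, list(images))

    def execute(self, text: str, images: list) -> str:
        raise NotImplementedError


class EchoVisionModel(Model):
    supports_images = True

    def execute(self, text: str, images: list) -> str:
        return f"{text} [{len(images)} image(s)]"


class TextOnlyModel(Model):
    def execute(self, text: str, images: list) -> str:
        return text
```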

simonw commented 1 year ago

One note about the --tbc thing is that we can get basic image support working without it - we could implement this and say that support for multiple images in the same prompt is coming later:

llm -m gpt-4-vision "Caption for this image" -i image.jpeg

simonw commented 1 year ago

This work is blocked on:

simonw commented 1 year ago

Would be amazing to get this working with a Bakllava local model - relevant example code using llama.cpp here https://github.com/cocktailpeanut/mirror/blob/main/app.py

psychemedia commented 12 months ago

Another claimed bakllava example (not tried it yet), this one using llama-cpp-python: https://advanced-stack.com/resources/multi-modalities-inference-using-mistral-ai-llava-bakllava-and-llama-cpp.html

(It actually uses from llm_core.llm import LLaVACPPModel. I tried running the example code on my MacBook Pro M2 16GB and it just falls over; other chat models of a similar size seem to work okay.)

neomanic commented 11 months ago

@simonw how about f-strings/templating style?

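llm "look at this image {src_image} and compare it to {compare_image}" \
    --infile src_image=sample.jpeg --infile compare_image=known.jpeg

import click

def _infiles_to_dict(
    ctx: click.Context, param: click.Parameter, infiles: tuple[str, ...]
) -> dict[str, str]:
    # Split on the first "=" only, so filenames may themselves contain "=".
    return dict(f.split("=", 1) for f in infiles)

@click.command()
@click.argument("prompt")
@click.option(
    "-i",
    "--infile",
    multiple=True,
    callback=_infiles_to_dict,
    help="Input files in the form key=filename. Multiple files can be included.",
)
def cli(prompt: str, infile: dict[str, str]) -> None:
    # Illustrative only: show how the named files map into the prompt.
    click.echo(prompt.format(**infile))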
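One way the templating idea could preserve the relative order of text and images is to walk the template with the standard library's `string.Formatter`. This is a sketch under that assumption; `expand_template` and the `("text", ...)` / `("image", ...)` tuples are names made up for the example.

```python
from string import Formatter
from typing import Dict, List, Tuple


def expand_template(template: str, infiles: Dict[str, str]) -> List[Tuple[str, str]]:
    """Split a template into ordered text and image parts (hypothetical helper)."""
    parts: List[Tuple[str, str]] = []
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion).
    for literal, field_name, _spec, _conversion in Formatter().parse(template):
        if literal:
            parts.append(("text", literal))
        if field_name is not None:
            if field_name not in infiles:
                raise KeyError(f"no --infile given for {{{field_name}}}")
            parts.append(("image", infiles[field_name]))
    return parts
```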

Misc thoughts:

NightMachinery commented 10 months ago

https://github.com/tbckr/sgpt

SGPT additionally facilitates the utilization of the GPT-4 Vision API. Include input images using the -i or --input flag, supporting both URLs and local images.

$ sgpt -m "gpt-4-vision-preview" -i "https://upload.wikimedia.org/wikipedia/en/c/cb/Marvin_%28HHGG%29.jpg" "what can you see on the picture?"
The image shows a figure resembling a robot with a humanoid form. It has a
$ sgpt -m "gpt-4-vision-preview" -i pkg/fs/testdata/marvin.jpg "what can you see on the picture?"
The image shows a figure resembling a robot with a sleek, metallic surface. It

It is also possible to combine URLs and local images:

$ sgpt -m "gpt-4-vision-preview" -i "https://upload.wikimedia.org/wikipedia/en/c/cb/Marvin_%28HHGG%29.jpg" -i pkg/fs/testdata/marvin.jpg "what is the difference between those two pictures"
The two images provided appear to be identical. Both show the same depiction of a

simonw commented 8 months ago

I built a prototype of this today, in the image-experimental branch - just for OpenAI so far using docs on https://platform.openai.com/docs/guides/vision but I want to also ship support for Gemini and Claude (and eventually local models like LLaVA).

I gave it this image:

[image]

And ran this:

llm -m 4v 'describe this image' -i image.jpg -o max_tokens 200

And got back:

This image shows a young pig being held by a person. The pig has a light brown coat with some bristle-like hair and a prominent snout that is characteristic of pigs. It appears to be a juvenile, given its size. The pig's snout is a bit dirty, suggesting it may have been rooting around in the ground, which is common pig behavior. The person is out of frame with only their arm visible, dressed in a red garment with a seemingly soft texture. They are holding the pig securely against their body. The background indicates that this is an indoor setting with wooden structures, possibly inside a barn or a similar animal enclosure.

simonw commented 8 months ago

Lots still to do on this. I want it to accept a URL, a file path, or - (standard input) as the image source. Those should then be made available to the model in a way that lets models like GPT-4, which support URL images, pass the URL in directly, while models like Claude 3, which only support base64, have the URL fetched and its content sent base64 encoded instead.

Maybe have a thing with Pillow as an optional dependency which can resize the images before sending them?

Have to decide what to do about logs. I think I need to log the images to the SQLite database (maybe in a new BLOB table) because I need them in conversations so I can send follow-up prompts - but that could take a lot of space. So I need to add tooling that helps users clean up old images from their database if it gets too big.

simonw commented 8 months ago

I am going to pass around an image object that has a .url property that may or may not return a URL string (otherwise None), plus .bytes and .base64 properties that ALWAYS return the binary data or that data base64 encoded.

That way plugins like OpenAI that can be sent URLs can use .url first and fall back to .base64 if the URL is not available, and plugins like Claude 3 can use base64 every time.

I'm tempted to offer a .resized(max_width, max_height) method which returns a Pillow resized image for models that know there is a maximum or recommended size limit and want to send a smaller request.
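A minimal sketch of such an image object, assuming the source is either an http(s) URL or a path on disk. The class name `ImageInput` and the helper `reference_for` are hypothetical; only the `.url`, `.bytes` and `.base64` property names come from the description above. The `.resized()` idea is left out because it would need Pillow.

```python
from base64 import b64encode
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Tuple
from urllib.request import urlopen


@dataclass
class ImageInput:
    """Hypothetical wrapper; `source` is either a URL or a path on disk."""

    source: str

    @property
    def url(self) -> Optional[str]:
        # Only http(s) sources count as URLs; anything else is a file path.
        if self.source.startswith(("http://", "https://")):
            return self.source
        return None

    @property
    def bytes(self) -> bytes:
        # Always returns the raw image data, fetching the URL if necessary.
        if self.url is not None:
            with urlopen(self.url) as response:
                return response.read()
        return Path(self.source).read_bytes()

    @property
    def base64(self) -> str:
        return b64encode(self.bytes).decode("ascii")


def reference_for(image: ImageInput, accepts_urls: bool) -> Tuple[str, str]:
    """Prefer the URL when the model accepts one, else fall back to base64."""
    if accepts_urls and image.url is not None:
        return ("url", image.url)
    return ("base64", image.base64)
```

A plugin for a model that accepts URLs would call `reference_for(image, accepts_urls=True)`, while a base64-only plugin would pass `False` and always receive encoded data.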

simonw commented 8 months ago

Idea: rather than store the images in the database, I'll store the path to the files on disk.

If you attempt to continue a conversation where the file paths no longer resolve to existing images, you'll get an error.

tomviner commented 8 months ago

Would be nice if the API server gave you a reference for every uploaded image, that you could just refer back to

anarcat commented 8 months ago

came here looking for non-text API endpoints... i was hoping to have a direct view into the audio and text-to-speech API endpoints, in particular.

so while it would be nice to have llm have a chat-like interface to interleave images, maybe an easier first step would be to have just a simple "prompt-to-image", "prompt-to-audio", "audio-to-text" kind of commands?

simonw commented 7 months ago

Quick survey on Twitter: https://twitter.com/simonw/status/1768445876274635155

Consensus is loosely to do image and then text, rather than text then image:

[{"type":"image_url","image_url":{"url":"..."}}, {"type":"text","text":"Describe image"}]
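The image-then-text ordering can be sketched as a small builder for that content list. The function name `build_content` is an assumption for this example; the dictionary shapes follow the JSON shown above.

```python
from typing import Dict, List


def build_content(text: str, image_urls: List[str]) -> List[Dict]:
    """Build a content list with every image placed before the text."""
    content: List[Dict] = [
        {"type": "image_url", "image_url": {"url": url}} for url in image_urls
    ]
    content.append({"type": "text", "text": text})
    return content
```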

simonw commented 7 months ago

Claude 3 Haiku is cheaper than GPT-3.5 Turbo and supports image inputs - a great incentive to finally get this feature shipped!

simonw commented 7 months ago

https://twitter.com/invisiblecomma/status/1768561708090417603

The Claude Vision docs recommend image first

https://docs.anthropic.com/claude/docs/vision#image-best-practices

Image placement: Just as with document-query placement, Claude works best when images come before text. Images placed after text or interpolated with text will still perform well, but if your use case allows it, we recommend image-then-text structure. See vision prompting tips for more details.

simonw commented 7 months ago

the maximum allowed image file size is 5MB per image

Should I enforce this for the Claude model? Easiest to let Claude API return an error at first.

I'm not yet sure if LLM should depend on Pillow and use it to resize large images before sending them.

Maybe a plugin hook to allow things like resizing and HEIC conversion would be useful?
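If the limit were enforced client-side rather than left to the API, the check could be as small as the sketch below. The constant and function names are hypothetical, and reading "5MB" as 5 × 1024 × 1024 bytes is an assumption; the exact byte count the API applies is not stated here.

```python
MAX_IMAGE_BYTES = 5 * 1024 * 1024  # assumption: "5MB" interpreted as 5 MiB


def check_image_size(data: bytes, limit: int = MAX_IMAGE_BYTES) -> None:
    """Raise ValueError if the image data is larger than the limit."""
    if len(data) > limit:
        raise ValueError(
            f"image is {len(data)} bytes; the limit is {limit} bytes"
        )
```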

simonw commented 7 months ago

https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/design-multimodal-prompts#prompt-design-fundamentals

Put your image first for single-image prompts: While Gemini can handle image and text inputs in any order, for prompts containing a single image, it might perform better if that image (or video) is placed before the text prompt. However, for prompts that require images to be highly interleaved with texts to make sense, use whatever order is most natural.

NightMachinery commented 7 months ago

> the maximum allowed image file size is 5MB per image
>
> Should I enforce this for the Claude model? Easiest to let Claude API return an error at first.
>
> I'm not yet sure if LLM should depend on Pillow and use it to resize large images before sending them.
>
> Maybe a plugin hook to allow things like resizing and HEIC conversion would be useful?

IMO llm should compress/resize images to avoid errors and make things easy to use. You can add an option --no-image-resize which disables this behavior, and people who care will disable it. The average user (myself included) just wants the image to go to the model, and the error is unhelpful.

BTW, OpenAI supports both low and high detail levels for processing images. Does Anthropic have something similar? Is this exposed in llm?

irthomasthomas commented 7 months ago

I made a simple cli for vision, if anyone needs it before llm-vision is ready. Only supports GPT4 for now. :( https://github.com/irthomasthomas/llm-vision

It supports specifying an output format that prompts the model to generate markdown, or json in addition to plain text. One thing odd about gpt-4-vision is that it doesn't know you have given it an image, and sometimes doesn't believe it has vision capabilities unless you give it a phrase like 'describe the image'. But, if you want to extract an image to json, then a text description isn't very useful. So, I prompt it with 'describe the image in your head, then write the json document'.

There's also a work-in-progress gpt4-vision-screen-compare.py - this takes a screenshot every few seconds and compares it with the last screenshot, and if it is different enough it sends it to the model, asking it to explain the changes between them.

And here's a demo of what you can do with it: https://twitter.com/xundecidability/status/1763219017160867840

Problem: I want to import a blocked domains list from Kagi to Bing Custom Search.

Solution: A little bash script that:

  • Screenshots the Kagi blocked domains list
  • Has GPT-4 Vision stream a text list of domains
  • Uses xdotool to type the domains into the Bing webpage as they stream in

simonw commented 7 months ago

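Current status:

  • Branch has -i support
  • I have GPT-4 Vision support, plus branches of llm-gemini and llm-claude-3

The main sticking point is what to do with the SQLite logging mechanism.

It's important that llm -c "..." works for sending follow-up prompts. This means it needs to be able to send the image again.

Some ways that could work:

  • For images on disk, store the path to that image on disk. Use that again in follow-up prompts, and throw a hard error if the file is no longer visible.
  • Some models support URLs. For public URLs to images I can store those URLs, and let the APIs themselves error if the URLs are 404ing
  • Images fed in to standard input could be stored in the database, maybe as BLOB columns
  • But since being able to compare prompts responses is so useful, maybe I should store images from disk in BLOB too? The cost in terms of SQLite space taken up may be worth it.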

irthomasthomas commented 7 months ago

Very nice! I'm not sure I'd want to include the image in every turn, though. I send a lot of full screenshots and my poor connection doesn't help. What I do currently is generate the description with a python script and pipe that to llm to chat about it. If it's important I might include the file path in the prompt. Then the llm can act on the file, and I can search for the file in the logs DB.

Cheers, Thomas

On Thu, 4 Apr 2024, 02:42 Simon Willison, @.***> wrote:

Current status:

  • Branch has -i support
  • I have GPT-4 Vision support, plus branches of llm-gemini and llm-claude-3

The main sticking point is what to do with the SQLite logging mechanism

It's important that llm -c "..." works for sending follow-up prompts. This means it needs to be able to send the image again.

Some ways that could work:

  • For images on disk, store the path to that image on disk. Use that again in follow-up prompts, and throw a hard error if the file is no longer visible.
  • Some models support URLs. For public URLs to images I can store those URLs, and let the APIs themselves error if the URLs are 404ing
  • Images fed in to standard input could be stored in the database, maybe as BLOB columns
  • But since being able to compare prompts responses is so useful, maybe I should store images from disk in BLOB too? The cost in terms of SQLite space taken up may be worth it.


NightMachinery commented 7 months ago

@simonw Just add an option --image-log-mode which can be set to db-blob. By default, don't store them, it will take disk space for probably junk files.

simonw commented 7 months ago

Another open question: how should this work in chat?

I'm inclined to add !image path-to-image.jpg as a thing you can use in chat to reference an image.

But then should it be submitted the moment you hit enter, or should you get the opportunity to add a prompt afterwards? I think adding a prompt afterwards makes sense.

Also should !image be allowed inside !multi? I'm not sure. If it IS, then how would you send that raw text to a model e.g. as part of a longer code sample you are pasting in?
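To make the open questions concrete, here is a deliberately naive sketch of collecting `!image` lines from a chat turn so that a prompt can still be typed afterwards. The function name `parse_chat_lines` is hypothetical. Note that it offers no way to escape a literal `!image` line and discards where each image sat relative to the text, which are exactly the issues raised above.

```python
from typing import List, Tuple

IMAGE_PREFIX = "!image "


def parse_chat_lines(lines: List[str]) -> Tuple[str, List[str]]:
    """Separate the lines of one chat turn into prompt text and image references."""
    text_lines: List[str] = []
    images: List[str] = []
    for line in lines:
        if line.startswith(IMAGE_PREFIX):
            # Everything after "!image " is treated as a path or URL.
            images.append(line[len(IMAGE_PREFIX):].strip())
        else:
            text_lines.append(line)
    return "\n".join(text_lines), images
```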

simonw commented 7 months ago

> @simonw Just add an option --image-log-mode which can be set to db-blob. By default, don't store them, it will take disk space for probably junk files.

Yeah I'm beginning to think I may need to add a whole settings/preferences mechanism to help solve this. llm settings set image_log_mode blob kind of thing.

NightMachinery commented 7 months ago

@simonw

> I'm inclined to add !image path-to-image.jpg as a thing you can use in chat to reference an image.

Perhaps you can use a TUI hotkey? E.g., Ctrl-i for inserting images. Though this will quickly spiral out of control ... E.g., should the TUI present a dialogue for selecting files?

The ideal case is to be able to just paste, and detect images from the clipboard. But this seems impossible to do using native paste. Perhaps you can add a custom hotkey for pasting that checks the clipboard.

I have some functions for macOS that paste images, e.g.,

class='«class PNGf»'
osascript -e "tell application \"System Events\" to ¬
                  write (the clipboard as ${class}) to ¬
                          (make new file at folder \"${dir}\" with properties ¬
                                  {name:\"${name}\"})"

simonw commented 7 months ago

For pasting I think I'll hold off until I have a web UI working - much easier to handle paste there (e.g. https://tools.simonwillison.net/ocr does that) than figure it out for the terminal.

It would be good to get this working though:

pbpaste | llm -m claude-3-opus 'describe this image' -i -

Oh, that's frustrating: it looks like pbpaste only works for text content, I tried pbpaste > /tmp/image.png and got a 0 byte file.

ChatGPT did come up with this recipe which seems to work:

osascript -e 'set theImage to the clipboard as «class PNGf»' \
  -e 'set theFile to open for access POSIX file "/tmp/clipboard.png" with write permission' \
  -e 'write theImage to theFile' \
  -e 'close access theFile' \
  && cat /tmp/clipboard.png && rm /tmp/clipboard.png

I imagine there are cleaner implementations than that. Would be easy to wrap one into a little zsh script or similar.

simonw commented 7 months ago

I saved this in ~/.local/bin (on my path) as impaste and chmod 755 ~/.local/bin/impaste and it seems to work:

#!/bin/zsh

# Generate a unique temporary filename
tempfile=$(mktemp -t clipboard.XXXXXXXXXX.png)

# Save the clipboard image to the temporary file
osascript -e 'set theImage to the clipboard as «class PNGf»' \
  -e "set theFile to open for access POSIX file \"$tempfile\" with write permission" \
  -e 'write theImage to theFile' \
  -e 'close access theFile'

# Output the image data to stdout
cat "$tempfile"

# Delete the temporary file
rm "$tempfile"

Opus conversation here: https://gist.github.com/simonw/736bcc9bcfaef40a55deaa959fd57ca8

simonw commented 7 months ago

Turned that into a TIL: https://til.simonwillison.net/macos/impaste

paulsmith commented 7 months ago

@simonw I was inspired by your TIL to try a little Swift. Here's an executable that does roughly the same thing: https://github.com/paulsmith/pbimg

Also used Claude Opus to help get started.

simonw commented 7 months ago

OK, design decision regarding logging of images.

All models will support URL input. If the model can handle URLs directly those will be passed to the model - for models that can't retrieve URLs themselves LLM will fetch the content and pass it to the model.

If you provide a URL, then just that URL string will be logged to the database.

If you provide a path to a file on disk, the full resolved path will be stored.

If you pipe an image into the tool (with -i -) the image will be stored as a BLOB in an llm_images table.

You can also pass image file names and use a --save-images option to write them to that table too. This is mainly useful if you are building a research database of prompts and responses and want to pass that around.
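The logging rules above can be sketched as a single function that decides what would be recorded for each kind of input. The function name `log_entry`, the `save_blob` parameter and the dictionary keys are all assumptions made for this example; they are not the actual schema.

```python
from pathlib import Path
from typing import Optional


def log_entry(source: str, data: Optional[bytes] = None, save_blob: bool = False) -> dict:
    """Describe what would be logged for one image input (hypothetical layout)."""
    if source.startswith(("http://", "https://")):
        # URLs are logged as the URL string only.
        return {"kind": "url", "url": source}
    if source == "-":
        # Piped images have no path to refer back to, so the bytes are stored.
        if data is None:
            raise ValueError("image data from standard input is required")
        return {"kind": "blob", "blob": data}
    path = Path(source).resolve()
    if save_blob:
        # Opt-in: also copy the file's contents into the database.
        return {"kind": "blob", "path": str(path), "blob": path.read_bytes()}
    # Default for files on disk: store the full resolved path only.
    return {"kind": "path", "path": str(path)}
```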

NightMachinery commented 7 months ago

@simonw I guess you should add a command to clean the database from image blobs, and automatically purge blobs older than LLM_CLEAN_OLDER_THAN which should by default be 90 days.
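A cleanup command along these lines might boil down to a single DELETE. In this sketch the `llm_images` table name comes from the discussion above, but the `created` column (a Unix timestamp) and the function name `purge_old_images` are assumptions, since no schema has been described.

```python
import sqlite3
import time
from typing import Optional


def purge_old_images(
    db: sqlite3.Connection, max_age_days: int = 90, now: Optional[float] = None
) -> int:
    """Delete image rows older than max_age_days and return how many were removed."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    with db:  # commits on success, rolls back on error
        cursor = db.execute(
            "DELETE FROM llm_images WHERE created < ?", (cutoff,)
        )
    return cursor.rowcount
```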

simonw commented 7 months ago

The option for storing the images should be --store for consistency with the llm embed-multi command. Which already has the ability to store images in BLOB columns: https://github.com/simonw/llm/blob/12e027d3e48cf3615396e4190a02ee04392771fe/llm/embeddings.py#L145

cmungall commented 6 months ago

I'm not sure if the discussion about ways to pass in multiple images is still open, but what about just using markdown? E.g. if llm encounters an [img](...) and it is being invoked against a vision-capable model, it checks to see if the linked image (local file or remote URL) is there, and if it is, it gets incorporated into the multimodal prompt.

This has the advantage that a lot of existing markdown files in the wild could just be passed in without modification, and the llm would "see" what the human sees.
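A sketch of the detection step, assuming the standard markdown image form `![alt](target)` (the comment above writes it without the leading `!`). The pattern and the function name `find_markdown_images` are illustrative and deliberately simple: targets that contain spaces or carry a quoted title are not matched.

```python
import re
from typing import List

# Matches ![alt text](target) where the target contains no spaces or ")".
IMAGE_PATTERN = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def find_markdown_images(markdown: str) -> List[str]:
    """Return image targets (paths or URLs) in the order they appear."""
    return [match.group(2) for match in IMAGE_PATTERN.finditer(markdown)]
```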

tomviner commented 6 months ago

> I'm not sure if the discussion about ways to pass in multiple images is still open, but what about just using markdown? E.g. if llm encounters an [img](...) and it is being invoked against a vision-capable model, it checks to see if the linked image (local file or remote URL) is there, and if it is, it gets incorporated into the multimodal prompt.
>
> This has the advantage that a lot of existing markdown files in the wild could just be passed in without modification, and the llm would "see" what the human sees.

@cmungall this is the simplest approach for sure. Could even support <img> tags or raw URLs/file paths.

The downside is you're now making network calls based on the input text. You also need a way of turning the feature off, and also escaping whatever syntax is used.

simonw commented 5 months ago

I'm going to change this to -a/--attachment instead of -i/--image because models that accept things like video or audio are rapidly starting to emerge.

simonw commented 5 months ago

... or maybe not. I don't actually know how all of the models that handle images/audio/video are going to work. If they need me to pass them inputs as a specific type - a URL to a video that's marked as video, or to audio that's marked as audio - then bundling everything together as --attachment might not make sense.

So maybe I do -i/--image and -v/--video and -a/--audio instead?

codecrack3 commented 5 months ago

> ... or maybe not. I don't actually know how all of the models that handle images/audio/video are going to work. If they need me to pass them inputs as a specific type - a URL to a video that's marked as video, or to audio that's marked as audio - then bundling everything together as --attachment might not make sense.
>
> So maybe I do -i/--image and -v/--video and -a/--audio instead?

Yes, I think we need to handle the many different attachment file extensions. Having only --attachment is very good, but I think we need to implement a filter list to choose the right processor for each extension.

NightMachinery commented 5 months ago

I also like --attachment, though perhaps a better name is simply --file. We can use either the extension or libmagic to detect the file type. Perhaps flags such as --image can also be added to force a particular format. (I.e., --attachment would auto-detect, while --image always assumes an image input.)
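Extension-based detection with an optional override could look like the sketch below, which uses the standard library's `mimetypes` module instead of libmagic. The function name `attachment_kind` and the `forced` parameter (standing in for a flag such as `--image`) are assumptions for this example.

```python
import mimetypes
from typing import Optional

SUPPORTED_KINDS = ("image", "audio", "video")


def attachment_kind(filename: str, forced: Optional[str] = None) -> str:
    """Return "image", "audio" or "video" for a file name or URL."""
    if forced is not None:
        # An explicit flag wins over auto-detection.
        return forced
    mime_type, _encoding = mimetypes.guess_type(filename)
    if mime_type is None:
        raise ValueError(f"cannot detect the type of {filename!r}")
    kind = mime_type.split("/", 1)[0]
    if kind not in SUPPORTED_KINDS:
        raise ValueError(f"unsupported attachment type {mime_type!r}")
    return kind
```

Guessing from the extension is cheap but trusts the file name; content sniffing with libmagic would be more robust at the cost of an extra dependency.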

thiswillbeyourgithub commented 4 months ago

No update on this? The lack of multimodality is really a major reason I'm not using llm as much anymore :/

cjcarroll012 commented 2 months ago

Really want multi-modal as well

simonw commented 2 weeks ago

I'm going to do this in here instead: