Closed: cmungall closed this issue 2 weeks ago.
Indeed this would be awesome. Does it require changes to llm
or can it be done in a plugin?
I suspect we'll be seeing more multimodal models so inclusion in core makes sense, but I defer to @simonw on this!
I've been thinking about this a lot.
The challenge here is that we need to be able to mix both text and images together in the same prompt - because you can call GPT-4 vision with this kind of thing:
Take a look at this image:
<image 1>
Now compare it to this:
<image 2>
My first instinct was to support syntax like this:
llm -m gpt-4-vision \
"Take a look at this image:" \
-i image1.jpeg \
"Now compare it to this:" \
-i https://example.com/image2.png
Note that the -i/--image
option here takes a filename or a URL, detecting files by seeing if they correspond to files on disk.
But... I don't think I can implement this, because Click really, really doesn't want to provide a mechanism for storing and retrieving the order of different arguments and parameters relative to each other:
I spent some time trying to get this to work with a custom Click command class and parse_args(), but determined that I'd effectively have to re-implement the whole Click argument parser from scratch to handle cases like --enable-logging boolean flags and -p key value multi-value parameters. This doesn't feel worthwhile to me.
So now I'm considering the following instead:
llm "look at this image" -i image.jpeg --tbc
llm -c "and compare it with" -i https://example.com/image.png
The trick here is that new --tbc flag, which stands for "to be continued". It causes the prompt to be stored but NOT executed against the model yet - instead, any following llm -c calls can be used to stack up more context in the prompt, which will be executed the first time --tbc is NOT used.
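A rough sketch of how those stacked fragments could be represented internally - purely hypothetical, the class and method names here are made up, not llm's actual API:

```python
# Hypothetical sketch: stacked --tbc prompt fragments accumulate in order
# and are only flushed to the model once a call arrives without --tbc.

class PendingPrompt:
    def __init__(self):
        self.parts = []  # ordered mix of text and image parts

    def add_text(self, text):
        self.parts.append({"type": "text", "text": text})

    def add_image(self, path_or_url):
        self.parts.append({"type": "image", "source": path_or_url})

    def flush(self):
        # Called the first time --tbc is NOT passed: return the full
        # interleaved prompt and reset the pending state.
        parts, self.parts = self.parts, []
        return parts


pending = PendingPrompt()
# llm "look at this image" -i image.jpeg --tbc
pending.add_text("look at this image")
pending.add_image("image.jpeg")
# llm -c "and compare it with" -i https://example.com/image.png
pending.add_text("and compare it with")
pending.add_image("https://example.com/image.png")

prompt = pending.flush()
```

The key property is that text and image parts keep their relative order, which is exactly what the single-command Click approach couldn't provide.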
On a related note: llm chat
could also support this - maybe letting you do this kind of thing:
llm chat -m gpt-4-vision
look at this image
!image image.jpeg
For multi-lined chats you would use the existing !multi
command:
llm chat -m gpt-4-vision
!multi
look at this image
!image image.jpeg
and compare it with
!image https://example.com/image.png
!end
Crucially, I want to leave the door open for other LLM models provided by plugins - like maybe https://github.com/SkunkworksAI/BakLLaVA - to also support multi-modal inputs like this.
So I think the model class would have a supports_images = True property it could set to tell LLM that images are supported - otherwise using -i/--image would return an error.
One note about the --tbc
thing is that we can get basic image support working without it - we could implement this and say that support for multiple images in the same prompt is coming later:
llm -m gpt-4-vision "Caption for this image" -i image.jpeg
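A minimal sketch of how that supports_images check could work - the class and function names here are illustrative assumptions, not llm's real code:

```python
# Sketch: models opt in to image support via a class attribute;
# passing -i/--image to a model that hasn't opted in is an error.

class Model:
    supports_images = False  # default: -i/--image is rejected

class GPT4Vision(Model):
    supports_images = True

def validate_prompt(model, images):
    # Hypothetical validation step run before the prompt is sent
    if images and not model.supports_images:
        raise ValueError(
            f"{model.__class__.__name__} does not support image inputs"
        )

validate_prompt(GPT4Vision(), ["image.jpeg"])  # fine
try:
    validate_prompt(Model(), ["image.jpeg"])
    raised = False
except ValueError:
    raised = True
```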
This work is blocked on:
Would be amazing to get this working with a Bakllava local model - relevant example code using llama.cpp here https://github.com/cocktailpeanut/mirror/blob/main/app.py
Another claimed bakllava example (not tried it yet), this one using llama-cpp-python
: https://advanced-stack.com/resources/multi-modalities-inference-using-mistral-ai-llava-bakllava-and-llama-cpp.html
(Actually uses from llm_core.llm import LLaVACPPModel. Trying to run the example code on my MacBook Pro M2 16GB and it just falls over; other chat models of a similar size seem to work okay.)
@simonw how about f-strings/templating style?
llm "look at this image {src_image} and compare it to {compare_image}" \
--infile src_image=sample.jpeg --infile compare_image=known.jpeg
def _infiles_to_dict(
    ctx: click.Context, attribute: click.Option, infiles: tuple[str, ...]
) -> dict[str, str]:
    # Split on the first "=" only, so filenames containing "=" survive
    return dict(f.split("=", 1) for f in infiles)


@click.command()
@click.option(
    "-i",
    "--infile",
    multiple=True,
    callback=_infiles_to_dict,
    help="Input files in the form key=filename. Multiple files can be included.",
)
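A sketch of how the parsed --infile dict could then be wired into those {name} placeholders - the marker format here is a made-up assumption, just to illustrate the flow:

```python
# Sketch: the --infile callback yields a dict of name -> filename;
# the prompt's {name} placeholders can then be expanded into markers
# that a model adapter would later replace with actual image parts.

infiles = {"src_image": "sample.jpeg", "compare_image": "known.jpeg"}
prompt = "look at this image {src_image} and compare it to {compare_image}"

# "<image:...>" is a hypothetical internal marker, not a real llm syntax
rendered = prompt.format(**{k: f"<image:{v}>" for k, v in infiles.items()})
```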
Misc thoughts:
- I like the --tbc idea as well.
- --image makes sense for now, but later might change to --infile when models can take audio, video, random multi-modal documents? The model would have to specify what formats it accepts? Then the prompt might have to be `llm --infile {video.mp4:v}` unless some auto-detection for file format is done.

SGPT also supports the GPT-4 Vision API. Include input images using the -i or --input flag, supporting both URLs and local images.
$ sgpt -m "gpt-4-vision-preview" -i "https://upload.wikimedia.org/wikipedia/en/c/cb/Marvin_%28HHGG%29.jpg" "what can you see on the picture?"
The image shows a figure resembling a robot with a humanoid form. It has a
$ sgpt -m "gpt-4-vision-preview" -i pkg/fs/testdata/marvin.jpg "what can you see on the picture?"
The image shows a figure resembling a robot with a sleek, metallic surface. It
It is also possible to combine URLs and local images:
$ sgpt -m "gpt-4-vision-preview" -i "https://upload.wikimedia.org/wikipedia/en/c/cb/Marvin_%28HHGG%29.jpg" -i pkg/fs/testdata/marvin.jpg "what is the difference between those two pictures"
The two images provided appear to be identical. Both show the same depiction of a
I built a prototype of this today, in the image-experimental
branch - just for OpenAI so far using docs on https://platform.openai.com/docs/guides/vision but I want to also ship support for Gemini and Claude (and eventually local models like LLaVA).
I gave it this image:
And ran this:
llm -m 4v 'describe this image' -i image.jpg -o max_tokens 200
And got back:
This image shows a young pig being held by a person. The pig has a light brown coat with some bristle-like hair and a prominent snout that is characteristic of pigs. It appears to be a juvenile, given its size. The pig's snout is a bit dirty, suggesting it may have been rooting around in the ground, which is common pig behavior. The person is out of frame with only their arm visible, dressed in a red garment with a seemingly soft texture. They are holding the pig securely against their body. The background indicates that this is an indoor setting with wooden structures, possibly inside a barn or a similar animal enclosure.
Lots still to do on this - I want it to support either URLs or file paths or -
as an input but those should then be made available to the model such that models like GPT-4 that support URL images can pass the URL in directly, while models like Claude 3 that only support base64 fetch that URL and then send it base64 encoded instead.
Maybe have a thing with Pillow as an optional dependency which can resize the images before sending them?
Have to decide what to do about logs. I think I need to log the images to the SQLite database (maybe in a new BLOB
table) because I need them in conversations so I can send follow-up prompts - but that could take a lot of space. So I need to add tooling that helps users clean up old images from their database if it gets too big.
I am going to pass around an image object that has a .url
property that may or may not return a URL string (otherwise None
) and a .bytes
and .base64
property that ALWAYS return binary data or that data base64 encoded.
That way plugins like OpenAI that can be sent URLs can use .url
first and fall back to .base64
if the URL is not available, and plugins like Claude 3 can use base64
every time.
I'm tempted to offer a .resized(max_width, max_height)
method which returns a Pillow resized image for models that know there is a maximum or recommended size limit and want to send a smaller request.
Idea: rather than store the images in the database, I'll store the path to the files on disk.
If you attempt to continue a conversation where the file paths no longer resolve to existing images, you'll get an error.
Would be nice if the API server gave you a reference for every uploaded image, that you could just refer back to
Came here looking for non-text API endpoints... I was hoping to have a direct view into the audio and text-to-speech API endpoints, in particular.
So while it would be nice to have llm offer a chat-like interface to interleave images, maybe an easier first step would be to have just simple "prompt-to-image", "prompt-to-audio", and "audio-to-text" kinds of commands?
Quick survey on Twitter: https://twitter.com/simonw/status/1768445876274635155
Consensus is loosely to do image and then text, rather than text then image:
[{"type": "image_url", "image_url": {"url": "..."}}, {"type": "text", "text": "Describe image"}]
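Building that image-first content list in the OpenAI content-part shape could look like this sketch - the URL and function name are placeholders:

```python
# Sketch: put the image part before the text part, matching the
# loose consensus from the Twitter survey.

def build_content(image_url: str, text: str) -> list:
    return [
        {"type": "image_url", "image_url": {"url": image_url}},
        {"type": "text", "text": text},
    ]


content = build_content("https://example.com/pig.jpg", "Describe image")
```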
Claude 3 Haiku is cheaper than GPT-3.5 Turbo and supports image inputs - a great incentive to finally get this feature shipped!
https://twitter.com/invisiblecomma/status/1768561708090417603
The Claude Vision docs recommend image first
https://docs.anthropic.com/claude/docs/vision#image-best-practices
Image placement: Just as with document-query placement, Claude works best when images come before text. Images placed after text or interpolated with text will still perform well, but if your use case allows it, we recommend image-then-text structure. See vision prompting tips for more details.
the maximum allowed image file size is 5MB per image
Should I enforce this for the Claude model? Easiest to let Claude API return an error at first.
I'm not yet sure if LLM should depend on Pillow and use it to resize large images before sending them.
Maybe a plugin hook to allow things like resizing and HEIC conversion would be useful?
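If client-side enforcement does happen, a minimal sketch might look like this - the function name is made up, and the 5MB figure comes from the Claude docs quoted above:

```python
MAX_IMAGE_BYTES = 5 * 1024 * 1024  # Claude's documented 5MB per-image limit


def check_image_size(data: bytes) -> None:
    # One option: fail fast client-side instead of waiting for the
    # API to return an error after uploading the whole image.
    if len(data) > MAX_IMAGE_BYTES:
        raise ValueError(
            f"Image is {len(data)} bytes, exceeding the 5MB limit; "
            "resize it before sending"
        )


check_image_size(b"x" * 1024)  # small image: passes
try:
    check_image_size(b"x" * (MAX_IMAGE_BYTES + 1))
    raised = False
except ValueError:
    raised = True
```

A resize/HEIC-conversion plugin hook could slot in right before this check.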
Put your image first for single-image prompts: While Gemini can handle image and text inputs in any order, for prompts containing a single image, it might perform better if that image (or video) is placed before the text prompt. However, for prompts that require images to be highly interleaved with texts to make sense, use whatever order is most natural.
IMO llm should compress/resize images to avoid errors and make things easy to use. You could add an option --no-image-resize which disables this behavior, and people who care will disable it. The average user (myself included) just wants the image to go to the model, and the error is unhelpful.
BTW, OpenAI supports both low and high detail levels for processing images. Does Anthropic have something similar? Is this exposed in llm?
I made a simple cli for vision, if anyone needs it before llm-vision is ready. Only supports GPT4 for now. :( https://github.com/irthomasthomas/llm-vision
It supports specifying an output format that prompts the model to generate markdown, or json in addition to plain text. One thing odd about gpt-4-vision is that it doesn't know you have given it an image, and sometimes doesn't believe it has vision capabilities unless you give it a phrase like 'describe the image'. But, if you want to extract an image to json, then a text description isn't very useful. So, I prompt it with 'describe the image in your head, then write the json document'.
There's also a work-in-progress gpt4-vision-screen-compare.py - this takes a screenshot every few seconds and compares the similarity with the last screenshot and if different enough it sends it to the model asking it explain the changes between them.
And here's a demo of what you can do with it: https://twitter.com/xundecidability/status/1763219017160867840 Problem: I Want to import blocked domains list from kagi to Bing Custom Search.
Solution: A little bash script that: Screenshots kagi blocked domains list Gpt4-vision streams a text list of domains xdotool types the domains into bing webpage as they stream in.
Current status:
- Branch has -i support
- I have GPT-4 Vision support, plus branches of llm-gemini and llm-claude-3

The main sticking point is what to do with the SQLite logging mechanism.
It's important that llm -c "..." works for sending follow-up prompts. This means it needs to be able to send the image again.
Some ways that could work:
- For images on disk, store the path to that image on disk. Use that again in follow-up prompts, and throw a hard error if the file is no longer visible.
- Some models support URLs. For public URLs to images I can store those URLs, and let the APIs themselves error if the URLs are 404ing.
- Images fed into standard input could be stored in the database, maybe as BLOB columns.
- But since being able to compare prompts and responses is so useful, maybe I should store images from disk in BLOB too? The cost in terms of SQLite space taken up may be worth it.

Very nice! I'm not sure I'd want to include the image in every turn, though. I send a lot of full screenshots and my poor connection doesn't help. What I do currently is generate the description with a Python script and pipe that to llm to chat about it. If it's important I might include the file path in the prompt. Then the llm can act on the file, and I can search for the file in the logs DB.
Cheers, Thomas
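A sketch of how those storage strategies could be resolved back into image data for a follow-up prompt - table and column names here are assumptions, not llm's actual schema:

```python
import sqlite3
from pathlib import Path
from urllib.parse import urlparse


def resolve_logged_image(ref: str, db: sqlite3.Connection):
    """Sketch: turn a logged image reference back into something
    sendable. Covers the three strategies: public URL, resolved
    disk path, and a BLOB-table key (e.g. for stdin images)."""
    if urlparse(ref).scheme in ("http", "https"):
        return ref  # public URL: pass through, let the API 404 if it's gone
    path = Path(ref)
    if path.is_absolute():
        if not path.exists():
            raise FileNotFoundError(f"Logged image no longer on disk: {ref}")
        return path.read_bytes()
    # Otherwise treat ref as a key into a hypothetical BLOB table
    row = db.execute(
        "select content from images where id = ?", (ref,)
    ).fetchone()
    if row is None:
        raise KeyError(ref)
    return row[0]


db = sqlite3.connect(":memory:")
db.execute("create table images (id text primary key, content blob)")
db.execute("insert into images values ('img1', ?)", (b"\x89PNG",))

url_result = resolve_logged_image("https://example.com/pig.jpg", db)
blob_result = resolve_logged_image("img1", db)
```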
@simonw Just add an option --image-log-mode
which can be set to db-blob
. By default, don't store them, it will take disk space for probably junk files.
Another open question: how should this work in chat?
I'm inclined to add !image path-to-image.jpg
as a thing you can use in chat to reference an image.
But then should it be submitted the moment you hit enter, or should you get the opportunity to add a prompt afterwards? I think adding a prompt afterwards makes sense.
Also should !image
be allowed inside !multi
? I'm not sure. If it IS, then how would you send that raw text to a model e.g. as part of a longer code sample you are pasting in?
> @simonw Just add an option --image-log-mode which can be set to db-blob. By default, don't store them, it will take disk space for probably junk files.
Yeah, I'm beginning to think I may need to add a whole settings/preferences mechanism to help solve this. llm settings set image_log_mode blob kind of thing.
@simonw

> I'm inclined to add !image path-to-image.jpg as a thing you can use in chat to reference an image.
Perhaps you can use a TUI hotkey? E.g., Ctrl-i for inserting images. Though this will quickly spiral out of control ... E.g., should the TUI present a dialogue for selecting files?
The ideal case is to be able to just paste, and detect images from the clipboard. But this seems impossible to do using native paste. Perhaps you can add a custom hotkey for pasting that checks the clipboard.
I have some functions for macOS that paste images, e.g.,
class='«class PNGf»'
osascript -e "tell application \"System Events\" to ¬
write (the clipboard as ${class}) to ¬
(make new file at folder \"${dir}\" with properties ¬
{name:\"${name}\"})"
For pasting I think I'll hold off until I have a web UI working - much easier to handle paste there (e.g. https://tools.simonwillison.net/ocr does that) than figure it out for the terminal.
It would be good to get this working though:
pbpaste | llm -m claude-3-opus 'describe this image' -i -
Oh, that's frustrating: it looks like pbpaste
only works for text content, I tried pbpaste > /tmp/image.png
and got a 0 byte file.
ChatGPT did come up with this recipe which seems to work:
osascript -e 'set theImage to the clipboard as «class PNGf»' \
-e 'set theFile to open for access POSIX file "/tmp/clipboard.png" with write permission' \
-e 'write theImage to theFile' \
-e 'close access theFile' \
&& cat /tmp/clipboard.png && rm /tmp/clipboard.png
I imagine there are cleaner implementations than that. Would be easy to wrap one into a little zsh
script or similar.
I saved this in ~/.local/bin
(on my path) as impaste
and chmod 755 ~/.local/bin/impaste
and it seems to work:
#!/bin/zsh
# Generate a unique temporary filename
tempfile=$(mktemp -t clipboard.XXXXXXXXXX.png)
# Save the clipboard image to the temporary file
osascript -e 'set theImage to the clipboard as «class PNGf»' \
-e "set theFile to open for access POSIX file \"$tempfile\" with write permission" \
-e 'write theImage to theFile' \
-e 'close access theFile'
# Output the image data to stdout
cat "$tempfile"
# Delete the temporary file
rm "$tempfile"
Opus conversation here: https://gist.github.com/simonw/736bcc9bcfaef40a55deaa959fd57ca8
Turned that into a TIL: https://til.simonwillison.net/macos/impaste
@simonw I was inspired by your TIL to try a little Swift. Here's a executable that does roughly the same thing: https://github.com/paulsmith/pbimg
Also used Claude Opus to help get started.
OK, design decision regarding logging of images.
All models will support URL input. If the model can handle URLs directly those will be passed to the model - for models that can't retrieve URLs themselves LLM will fetch the content and pass it to the model.
If you provide a URL, then just that URL string will be logged to the database.
If you provide a path to a file on disk, the full resolved path will be stored.
If you pipe an image into the tool (with -i -) the image will be stored as a BLOB in an llm_images table.
You can also pass image file names and use a --save-images option to write them to that table too. This is mainly useful if you are building a research database of prompts and responses and want to pass that around.
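A sketch of what writing into that llm_images table could look like - the schema and the content-hash keying are assumptions, not the actual implementation:

```python
import hashlib
import sqlite3


def store_image(db: sqlite3.Connection, data: bytes) -> str:
    """Sketch: write piped or --save-images data into a BLOB table,
    keyed by content hash so the same image is only stored once."""
    image_id = hashlib.sha256(data).hexdigest()
    db.execute(
        "insert or ignore into llm_images (id, content) values (?, ?)",
        (image_id, data),
    )
    return image_id


db = sqlite3.connect(":memory:")
db.execute("create table llm_images (id text primary key, content blob)")
first = store_image(db, b"fake image bytes")
second = store_image(db, b"fake image bytes")  # same content, same row
```

Content-hash keys would also make the cleanup tooling mentioned earlier simpler, since duplicate images never take extra space.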
@simonw I guess you should add a command to purge image blobs from the database, and automatically purge blobs older than LLM_CLEAN_OLDER_THAN, which should default to 90 days.
The option for storing the images should be --store
for consistency with the llm embed-multi
command. Which already has the ability to store images in BLOB columns: https://github.com/simonw/llm/blob/12e027d3e48cf3615396e4190a02ee04392771fe/llm/embeddings.py#L145
I'm not sure if the discussion about ways to pass in multiple images is still open, but what about just using markdown? E.g. if llm encounters an [img](...) and it is being invoked against a vision-capable model, it checks to see if the linked image (local or remote URL) is there, and if it is, it gets incorporated into the multimodal prompt.
This has the advantage that a lot of existing markdown files in the wild could just be passed in without modification, and the llm would "see" what the human sees.
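A sketch of the extraction side of that markdown idea - the regex is deliberately naive and just illustrative:

```python
import re

# Sketch: scan a prompt for markdown image links and pull out the
# targets as attachment candidates, leaving surrounding text in place.
IMAGE_LINK = re.compile(r"!?\[[^\]]*\]\(([^)]+)\)")


def extract_image_refs(prompt: str) -> list:
    return IMAGE_LINK.findall(prompt)


refs = extract_image_refs(
    "Compare ![pig](photos/pig.jpg) with ![robot](https://example.com/marvin.jpg)"
)
```

A real version would need the off-switch and escaping mentioned in the reply below this comment, plus a check that each target actually resolves to an image.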
> I'm not sure if the discussion about ways to pass in multiple images is still open, but what about just using markdown? E.g. if llm encounters an [img](...) and it is being invoked against a vision-capable model, it checks to see if the linked image (local or remote URL) is there, and if it is, it gets incorporated into the multimodal prompt. This has the advantage that a lot of existing markdown files in the wild could just be passed in without modification, and the llm would "see" what the human sees.
@cmungall this is the simplest approach for sure. Could even support tags or raw URLs/file paths.
The downside is you're now making network calls based on the input text. You also need a way of turning the feature off, and also escaping whatever syntax is used.
I'm going to change this to -a/--attachment
instead of -i/--image
because models that accept things like video or audio are rapidly starting to emerge.
... or maybe not. I don't actually know how all of the models that handle images/audio/video are going to work. If they need me to pass them inputs as a specific type - a URL to a video that's marked as video, or to audio that's marked as audio - then bundling everything together as --attachment might not make sense.
So maybe I do -i/--image and -v/--video and -a/--audio instead?
> ... or maybe not. I don't actually know how all of the models that handle images/audio/video are going to work. If they need me to pass them inputs as a specific type - a URL to a video that's marked as video, or to audio that's marked as audio - then bundling everything together as --attachment might not make sense. So maybe I do -i/--image and -v/--video and -a/--audio instead?
Yes, I think we need variants for the different attachment types instead. --attachment alone is nice, but I think we need a filter that picks the right processor for each extension.
I also like --attachment
, though perhaps a better name is simply --file
. We can use either the extension or libmagic
to detect the file type. Perhaps flags such as --image
can also be added to force a particular format. (I.e., --attachment
would auto-detect, while --image
always assumes an image input.)
No update on this? The lack of multimodality is really a major reason I'm not using llm as much anymore :/
Really want multi-modal as well
I'm going to do this in here instead:
https://platform.openai.com/docs/guides/vision
I think this is best handled by command line options --image and --image-urls to either encode and pass as base64, or to pass a URL.