Feature request: add vision capabilities to understand images

rksm / org-ai

Emacs as your personal AI assistant. Use LLMs such as ChatGPT or LLaMA for text generation or DALL-E and Stable Diffusion for image generation. Also supports speech input / output.

GNU General Public License v3.0

660 stars 52 forks source link

Feature request: add vision capabilities to understand images #122

Open tillydray opened 1 month ago

tillydray commented 1 month ago

I'll work on this. https://platform.openai.com/docs/guides/vision

tillydray commented 1 month ago

see new plan https://github.com/rksm/org-ai/issues/122#issuecomment-2264488201

~So far my design plan is~ ~1. create a new file to hold new functionality: org-ai-vision.el~ ~1. create a new file to hold common image functionality to be used by both org-ai-vision.el and org-ai-openai-image.el: org-ai-image.el~ ~1. extract functions from org-ai-openai-image.el and put them into org-ai-image.el~ ~1. add new functionality to org-ai-vision.el~

~If anyone has feedback let me know, especially on file naming~

tillydray commented 1 month ago

I assumed there would be commonalities to extract but that was wrong. So my new design plan is

rename org-ai-openai-image.el to something more specific to image generation with dall-e, like org-ai-generate-iamge.el
create org-ai-vision.el and add new functionality there

feedback welcome

rksm commented 1 month ago

Hey that sounds good! Vision capabilities would be super awesome! How do you imagine referring to an image?

tillydray commented 1 month ago

How do you imagine referring to an image?

Either base64 encoded or a link to the hosted image.

Per the documentation

Images are made available to the model in two main ways: by passing a link to the image or by passing the base64 encoded image directly in the request

“Two main ways” sounds to me like there are other ways but I didn’t see any others 🤷