sourcegraph / cody

AI that knows your entire codebase
https://cody.dev
Apache License 2.0
2.22k stars · 213 forks

Ollama: add image upload support to multimodal models #4694

Open abeatrix opened 4 days ago

abeatrix commented 4 days ago

NOTE: This is a proof of concept on supporting multimodal in Cody. The UI will need to be updated to make it more obvious to users when an image has been uploaded for the current chat.

Multimodal models from Ollama that we currently support: llava & bakllava

Demo

Loom: https://www.loom.com/share/4d2b72de1385436a960b0775434b933e

Follow-up works/ideas

Test plan

  1. Download Ollama
  2. Download one of the supported Ollama models: llava or bakllava
  3. Confirm Ollama is running
  4. Build and start Cody from this branch
  5. Verify the model you just downloaded appears in the model dropdown menu
  6. Choose the supported model
  7. Verify a paperclip icon appears when the supported model is selected
  8. Click the paperclip icon to select an image (PNG, JPG, or SVG files only)
  9. Ask Cody to explain the attached image
  10. Verify the response is correct
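Step 8 restricts the picker to png, jpg, and SVG files. A minimal sketch of such an extension check (names are illustrative assumptions, not Cody's actual code):

```typescript
// Hypothetical file-type filter for the paperclip image picker.
// The PR lists png, jpg, and SVG as the only supported formats.
const SUPPORTED_IMAGE_EXTENSIONS = new Set(['png', 'jpg', 'svg'])

function isSupportedImage(filename: string): boolean {
    // Take the text after the last dot and compare case-insensitively.
    const ext = filename.split('.').pop()?.toLowerCase() ?? ''
    return SUPPORTED_IMAGE_EXTENSIONS.has(ext)
}

console.log(isSupportedImage('diagram.PNG')) // prints true
console.log(isSupportedImage('notes.pdf'))   // prints false
```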

Example

Left image was shared with Cody, right image was built with the code provided by Cody:

[Screenshot 2024-06-25 at 9:55:10 AM] [Screenshot 2024-06-26 at 10:21:47 AM]
PriNova commented 3 days ago

@abeatrix This is amazing, even if the open-source vision models benchmark lower than GPT-4o or Sonnet 3.5. If the SG endpoint becomes compatible with Sonnet 3.5 and other multimodal models in the future, this tops it all.