janhq / cortex.cpp

Local AI API Platform
https://cortex.so
Apache License 2.0

planning: Supporting vision models (Llava and Llama 3.2) #1493

Open · namchuai opened this issue 1 month ago

namchuai commented 1 month ago

Problem Statement

To support vision models on Cortex, we need to solve the following problem. Let's take the image below as an example.

[Screenshot 2024-10-16 at 08:35:06]

Scenario steps:

  1. The user wants to download a llava model and expects it to support vision. The user inputs either:
    • a direct URL to the GGUF file (e.g. llava-v1.6-mistral-7b.Q3_K_M.gguf), or a URL to the repository (we list the files filtered to .gguf for the user to select from).
    • Since the mmproj file also ends with .gguf, it is listed in the selection as well.
  2. Cortex will pull only the selected GGUF file, ignoring that:
    • mmproj.gguf alone won't work.
    • the traditional GGUF file alone (e.g. llava-v1.6-mistral-7b.Q3_K_M.gguf) will not have the vision feature.

So we need to come up with a way for Cortex to know when to download the mmproj file along with the traditional GGUF file.

cc @dan-homebrew , @louis-jan , @nguyenhoangthuan99, @vansangpfiev

Feature Idea

A couple of thoughts:

  1. File-name based (see the sketch after this list).
    1.1. For CLI: hide file names containing mmproj from the selection list, and download the mmproj file along with the selected traditional GGUF file.
    1.2. For API: always scan the directory at the same level as the provided URL; if an mmproj file is found, Cortex adds it to the download list.
    • Edge case: if the user provides a direct URL to an mmproj file, return an error with a clear message.
  2. Thinking / You tell me.
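A minimal sketch of option 1, assuming the model lives in a Hugging Face repository and that mmproj companions can be identified purely by file name. The repo id, helper name, and flow below are illustrative assumptions, not existing Cortex code:

```python
# Illustrative sketch only: file-name-based mmproj detection for a vision model
# hosted on Hugging Face. Assumes mmproj companions are identifiable by name.
from huggingface_hub import list_repo_files


def plan_vision_download(repo_id: str, selected_gguf: str) -> list[str]:
    """Return the list of files Cortex would pull for `selected_gguf`."""
    files = [f for f in list_repo_files(repo_id) if f.endswith(".gguf")]

    # Edge case from the proposal: reject a direct mmproj selection with a clear error.
    if "mmproj" in selected_gguf.lower():
        raise ValueError("mmproj file alone won't work; select a chat model .gguf instead")

    # Hide mmproj files from the selection, but pull them alongside the chosen model.
    mmproj_files = [f for f in files if "mmproj" in f.lower()]
    return [selected_gguf] + mmproj_files


if __name__ == "__main__":
    # Repo id is an example, not a Cortex default.
    print(plan_vision_download("cjpais/llava-1.6-mistral-7b-gguf",
                               "llava-v1.6-mistral-7b.Q3_K_M.gguf"))
```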

gabrielle-ong commented 2 weeks ago

Updates:

  1. CLI cortex pull presents .gguf and mmproj files (see attached screenshot).
  2. The mmproj param was added to the /v1/models/start parameters in #1537 (a request sketch follows below).
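For reference, a hedged sketch of what starting a vision model with the new mmproj parameter could look like from a client. The host, port, and exact payload shape are assumptions for illustration, not confirmed Cortex behaviour:

```python
# Illustrative client call: start a chat model together with its mmproj projector.
# The port (39281) and field names other than model_path/mmproj are assumptions.
import requests

payload = {
    "model": "llava-v1.6-mistral-7b",                         # assumed model id
    "model_path": "/models/llava-v1.6-mistral-7b.Q3_K_M.gguf",
    "mmproj": "/models/mmproj-model-f16.gguf",                # vision projector file
}

resp = requests.post("http://127.0.0.1:39281/v1/models/start", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())
```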
dan-homebrew commented 2 weeks ago

We should ensure that model.yaml supports this type of abstraction, cc @hahuyhoang411

gabrielle-ong commented 2 weeks ago

@vansangpfiev and @hahuyhoang411 - can I get your thoughts to add to this list (from my naive understanding)?

To support Vision models on Cortex, we need the following:

  1. Download model - downloads the .gguf and mmproj files -> what is the model pull UX?
  2. v1/models/start takes in model_path (.gguf) and mmproj parameters ✅
  3. /chat/completions takes in messages content image_url ✅
  4. image_url has to be encoded in base64 (via Jan, or a link to a tool, e.g. https://base64.guru/converter/encode/image) - a sketch of such a request follows this list
  5. model support - (side note: Jan currently supports BakLlava 1, llava 7B, llava 13B) …
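A hedged sketch of point 4: encoding a local image to base64 on the client and sending it as an OpenAI-style image_url content part. The endpoint, port, and model id are assumptions for illustration:

```python
# Illustrative request: base64-encode an image and pass it via image_url.
import base64
import requests

with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "llava-v1.6-mistral-7b",                         # assumed model id
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in the image?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
}

resp = requests.post("http://127.0.0.1:39281/v1/chat/completions", json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])
```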
vansangpfiev commented 2 weeks ago

> @vansangpfiev and @hahuyhoang411 - can I get your thoughts to add to this list (from my naive understanding)?
>
> To support Vision models on Cortex, we need the following:
>
>   1. Download model - downloads the .gguf and mmproj files -> what is the model pull UX?
>   2. v1/models/start takes in model_path (.gguf) and mmproj parameters ✅
>   3. /chat/completions takes in messages content image_url ✅
>   4. image_url has to be encoded in base64 (via Jan, or a link to a tool, e.g. https://base64.guru/converter/encode/image)
>   5. model support - (side note: Jan currently supports BakLlava 1, llava 7B, llava 13B) …

Point-by-point:

  1. I'm not sure about this yet, since one folder can have multiple chat model files with one mmproj file.
  2. Yes.
  3. I'm not sure if this is a good UX.
  4. image_url can be a local path to the image; the llama-cpp engine supports encoding the image to base64 and passing it to the model (see the sketch after this list).
  5. The llama-cpp engine supports BakLlava 1, llava 7B, llava 13B. llama.cpp upstream already supports MiniCPM-V 2.6, which we can integrate into llama-cpp. llama.cpp upstream does not support Llama 3.2 vision yet.
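If the engine does the encoding, the request could shrink to a plain local path. This is a hedged variant of the request shape sketched earlier; whether Cortex accepts a bare file path in image_url is an assumption based on point 4 above, not confirmed behaviour:

```python
# Speculative variant: pass a local image path and let the llama-cpp engine
# handle base64 encoding, per point 4 above. Payload shape is an assumption.
import requests

payload = {
    "model": "llava-v1.6-mistral-7b",                         # assumed model id
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in the image?"},
            {"type": "image_url", "image_url": {"url": "/path/to/example.jpg"}},
        ],
    }],
}

print(requests.post("http://127.0.0.1:39281/v1/chat/completions",
                    json=payload, timeout=120).json())
```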

We probably need to consider changing the UX for inferencing with vision models, for example:

cortex run llava-7b --image xx.jpg -p "What is in the image?"
gabrielle-ong commented 2 weeks ago

Thank you @vansangpfiev and @hahuyhoang411! Quick notes from call: