janhq / cortex.cpp

Local AI API Platform
https://cortex.so
Apache License 2.0

planning: Supporting vision models (Llava and Llama 3.2) #1493

Open · namchuai opened this issue 1 month ago

namchuai commented 1 month ago

Problem Statement

To support vision models on Cortex, we need to solve the following problem. Let's take the image below as an example.

[Screenshot 2024-10-16 at 08:35:06]

Scenario steps:

  1. The user wants to download a llava model and expects it to support vision. The user inputs either:
    • a direct URL to the GGUF file (e.g. llava-v1.6-mistral-7b.Q3_K_M.gguf), or a URL to the repository (we list the files filtered to .gguf for the user to select from).
    • Since the mmproj file also ends with .gguf, it is listed in the selection as well.
  2. Cortex will pull only the selected GGUF file, ignoring that:
    • mmproj.gguf alone won't work.
    • the traditional GGUF file alone (e.g. llava-v1.6-mistral-7b.Q3_K_M.gguf) will not have the vision feature.

So we need to come up with a way for Cortex to know when to download the mmproj file along with the traditional GGUF file.

cc @dan-homebrew , @louis-jan , @nguyenhoangthuan99, @vansangpfiev

Feature Idea

A couple of thoughts:

  1. File-name based (see the sketch after this list).
    1.1. For CLI: hide file names containing mmproj from the selection list, and download the mmproj file along with the selected traditional GGUF file.
    1.2. For API: always scan the directory at the same level as the provided URL; if an mmproj file is found, Cortex adds it to the download list.
    • Edge case: if the user provides a direct URL to an mmproj file, return an error with a clear message.
  2. Thinking / You tell me.
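A minimal sketch of option 1, assuming the model lives in a Hugging Face repository and that mmproj companions can be identified purely by file name. The repo id, helper name, and flow below are illustrative assumptions, not existing Cortex code:

```python
# Illustrative sketch only: file-name-based mmproj detection for a vision model
# hosted on Hugging Face. Assumes mmproj companions are identifiable by name.
from huggingface_hub import list_repo_files


def plan_vision_download(repo_id: str, selected_gguf: str) -> list[str]:
    """Return the list of files Cortex would pull for `selected_gguf`."""
    files = [f for f in list_repo_files(repo_id) if f.endswith(".gguf")]

    # Edge case from the proposal: reject a direct mmproj selection with a clear error.
    if "mmproj" in selected_gguf.lower():
        raise ValueError("mmproj file alone won't work; select a chat model .gguf instead")

    # Hide mmproj files from the selection, but pull them alongside the chosen model.
    mmproj_files = [f for f in files if "mmproj" in f.lower()]
    return [selected_gguf] + mmproj_files


if __name__ == "__main__":
    # Repo id is an example, not a Cortex default.
    print(plan_vision_download("cjpais/llava-1.6-mistral-7b-gguf",
                               "llava-v1.6-mistral-7b.Q3_K_M.gguf"))
```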

gabrielle-ong commented 2 weeks ago

Updates:

  1. CLI cortex pull presents .gguf and mmproj files (see attached screenshot).
  2. The mmproj param was added to the /v1/models/start parameters in #1537 (a request sketch follows below).
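For reference, a hedged sketch of what starting a vision model with the new mmproj parameter could look like from a client. The host, port, and exact payload shape are assumptions for illustration, not confirmed Cortex behaviour:

```python
# Illustrative client call: start a chat model together with its mmproj projector.
# The port (39281) and field names other than model_path/mmproj are assumptions.
import requests

payload = {
    "model": "llava-v1.6-mistral-7b",                         # assumed model id
    "model_path": "/models/llava-v1.6-mistral-7b.Q3_K_M.gguf",
    "mmproj": "/models/mmproj-model-f16.gguf",                # vision projector file
}

resp = requests.post("http://127.0.0.1:39281/v1/models/start", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())
```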
dan-homebrew commented 2 weeks ago

We should ensure that model.yaml supports this type of abstraction, cc @hahuyhoang411

gabrielle-ong commented 2 weeks ago

@vansangpfiev and @hahuyhoang411 - can I get your thoughts to add to this list (from my naive understanding)?

To support Vision models on Cortex, we need the following:

  1. Download model - downloads the .gguf and mmproj files -> what is the model pull UX?
  2. v1/models/start takes in model_path (.gguf) and mmproj parameters ✅
  3. /chat/completions takes in messages content image_url ✅
  4. image_url has to be encoded in base64 (via Jan, or a link to a tool, e.g. https://base64.guru/converter/encode/image) - a sketch of such a request follows this list
  5. model support - (side note: Jan currently supports BakLlava 1, llava 7B, llava 13B) …
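A hedged sketch of point 4: encoding a local image to base64 on the client and sending it as an OpenAI-style image_url content part. The endpoint, port, and model id are assumptions for illustration:

```python
# Illustrative request: base64-encode an image and pass it via image_url.
import base64
import requests

with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "llava-v1.6-mistral-7b",                         # assumed model id
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in the image?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
}

resp = requests.post("http://127.0.0.1:39281/v1/chat/completions", json=payload, timeout=120)
print(resp.json()["choices"][0]["message"]["content"])
```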
vansangpfiev commented 2 weeks ago

> @vansangpfiev and @hahuyhoang411 - can I get your thoughts to add to this list (from my naive understanding)?
>
> To support Vision models on Cortex, we need the following:
>
>   1. Download model - downloads the .gguf and mmproj files -> what is the model pull UX?
>   2. v1/models/start takes in model_path (.gguf) and mmproj parameters ✅
>   3. /chat/completions takes in messages content image_url ✅
>   4. image_url has to be encoded in base64 (via Jan, or a link to a tool, e.g. https://base64.guru/converter/encode/image)
>   5. model support - (side note: Jan currently supports BakLlava 1, llava 7B, llava 13B) …

Point-by-point:

  1. I'm not sure about this yet, since one folder can have multiple chat model files with one mmproj file.
  2. Yes.
  3. I'm not sure if this is a good UX.
  4. image_url can be a local path to the image; the llama-cpp engine supports encoding the image to base64 and passing it to the model (see the sketch after this list).
  5. The llama-cpp engine supports BakLlava 1, llava 7B, llava 13B. llama.cpp upstream already supports MiniCPM-V 2.6, which we can integrate into llama-cpp. llama.cpp upstream does not support Llama 3.2 vision yet.
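If the engine does the encoding, the request could shrink to a plain local path. This is a hedged variant of the request shape sketched earlier; whether Cortex accepts a bare file path in image_url is an assumption based on point 4 above, not confirmed behaviour:

```python
# Speculative variant: pass a local image path and let the llama-cpp engine
# handle base64 encoding, per point 4 above. Payload shape is an assumption.
import requests

payload = {
    "model": "llava-v1.6-mistral-7b",                         # assumed model id
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in the image?"},
            {"type": "image_url", "image_url": {"url": "/path/to/example.jpg"}},
        ],
    }],
}

print(requests.post("http://127.0.0.1:39281/v1/chat/completions",
                    json=payload, timeout=120).json())
```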

We probably need to consider changing the UX for inferencing with vision models, for example:

cortex run llava-7b --image xx.jpg -p "What is in the image?"
gabrielle-ong commented 2 weeks ago

Thank you @vansangpfiev and @hahuyhoang411! Quick notes from call: