janhq / cortex.cpp

Run and customize Local LLMs.
https://cortex.so
Apache License 2.0

epic: llama.cpp params are settable via API call or `model.yaml` #1151

Open dan-homebrew opened 3 weeks ago

dan-homebrew commented 3 weeks ago

Goal

Tasklist

I am using this epic to aggregate all llama.cpp params issues, including llama3.1 function calling + tool use

model.yaml

Out-of-scope:

Related

nguyenhoangthuan99 commented 3 weeks ago

Generally, I will break down this epic into tasks:

model.yaml From my side, all the information needed to run a model should be in one file: a model file can only run with one engine, and the user can modify/tune every parameter needed to run the model in the same place, which is more convenient.
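As a rough sketch of that idea (field names are illustrative assumptions, not an agreed schema), a single model.yaml would carry both the model metadata and the llama.cpp load/sampling params in one place:

```yaml
# Illustrative sketch only - keys are assumptions, not the final model.yaml schema.
name: llama3.1-8b-instruct
model: llama3.1:8b-gguf
engine: cortex.llamacpp        # one model file runs with exactly one engine

# Sampling params forwarded to llama.cpp
temperature: 0.7
top_p: 0.9
top_k: 40
max_tokens: 4096

# Model load params
ctx_len: 8192
ngl: 33                        # number of layers to offload to GPU
prompt_template: |
  <|system|>{system_message}<|user|>{prompt}<|assistant|>
```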

Function calling According to this comment, function calling is just a more complicated chat template that asks the model to find the proper function and params to answer the input question. There is no standard way to do it: each model has a different training process, so the prompt for each model is also different.

Function calling is essentially an advanced form of prompt engineering. It involves crafting a specialized prompt that instructs the model to identify appropriate functions and their parameters based on the input question. However, there's no universal approach to implementing this feature, as each model has undergone unique training processes, necessitating model-specific prompting strategies. Developing a generic function calling feature presents significant challenges:

  • Model variability: Different models require distinct prompting techniques, making it difficult to create a one-size-fits-all solution.
  • Extensive experimentation: Even for a single model (e.g., llama3.1 - 8B), substantial testing is required to optimize performance across various scenarios.
  • User-defined functions: Since users will define custom functions, it's challenging to ensure that a preset system prompt will work effectively for all possible function definitions.
  • Quality assurance: Maintaining consistent output quality across diverse models and user-defined functions is extremely difficult.
  • Unpredictable responses: The complexity of the task increases the likelihood of unexpected or incorrect outputs.

Given these challenges, it's crucial to approach the implementation of a generalized function calling feature with caution. The goal of supporting every model and every user-defined function is likely unattainable due to the inherent variability and complexity involved. Instead, it may be more practical to focus on optimizing the feature for specific, well-defined use cases or a limited set of models.

I also checked ChatGPT, Mistral, Groq, etc. They also support function calling, but the difference is that they build this feature for their own models only. llama.cpp hasn't supported function calling yet either, since it is not that useful for the average user, and developers can implement it themselves with better results.
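To make the "function calling is prompt engineering" point concrete, here is a purely hypothetical sketch of the kind of model-specific prompt involved. It is not the actual llama3.1 tool-use format; each model would need its own wording and output syntax:

```yaml
# Hypothetical illustration only - not the real llama3.1 format.
function_calling_system_prompt: |
  You have access to the following functions:
  {{ tool_definitions }}
  If a function is needed to answer the user, reply with only a JSON object:
  {"name": "<function_name>", "arguments": { ... }}
  Otherwise, answer the user directly.
```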


louis-jan commented 3 weeks ago

model.yaml

Lessons learned from Jan

Sync parameters between Jan and engines

It would be great if we could apply something like protoc. Say there is a .proto file (just as an example) that defines the entities; it could be used across projects from JS to C++ to automatically generate the entity classes. That way we only maintain one entity file defining the model.yaml DTO, which can be shared across projects (there are many engines to maintain as well).
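A minimal sketch of that idea, assuming a protobuf-style schema (all field names here are made up): one shared definition of the model.yaml DTO, from which protoc (e.g. `protoc --cpp_out=...`) would generate the entity classes for both the C++ and JS projects.

```proto
// Illustrative only - not an agreed schema.
syntax = "proto3";

package cortex.model;

// Shared model.yaml DTO, maintained once and code-generated per language.
message ModelConfig {
  string name = 1;
  string engine = 2;           // e.g. cortex.llamacpp
  string prompt_template = 3;
  uint32 ctx_len = 4;
  int32 ngl = 5;               // GPU layers to offload
  float temperature = 6;
  float top_p = 7;
  int32 top_k = 8;
}
```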

Template parsing should be done by cortex.cpp?

We currently have to parse the model template in order to convert the Jinja template into ai_prompt, user_prompt, and system_prompt, so that engines can load it accordingly. The load model request should be simplified.
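For illustration (the ai_prompt/user_prompt/system_prompt names come from the comment above; the template itself is made up and simplified, not real Jinja), the conversion is roughly:

```yaml
# Illustrative sketch only.
# A single chat template string in model.yaml...
prompt_template: "<|system|>\n{system_message}</s><|user|>\n{prompt}</s><|assistant|>\n"

# ...currently has to be split client-side into the pieces the engine loads:
system_prompt: "<|system|>\n"
user_prompt: "</s><|user|>\n"
ai_prompt: "</s><|assistant|>\n"
```

If cortex.cpp did this parsing itself, the load model request could carry just the original template.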

tikikun commented 3 weeks ago

Research input:

So by using a separate model.yaml we just create another wrapper around the model config that is already there inside either the GGUF or the Hugging Face config file. In practice, this has proven to be extremely inconvenient to use.

The config of the model should be bound to the entity of the user; the model is already contained within itself.

dan-homebrew commented 3 weeks ago

Research input:

  • The market has a tendency to consolidate on the GGUF or Hugging Face config file -> the model already has its own config
  • What we want in the description is not related to the model, but to how the user stores the config for that model

So by using a separate model.yaml we just create another wrapper around the model config that is already there inside either the GGUF or the Hugging Face config file. In practice, this has proven to be extremely inconvenient to use.

The config of the model should be bound to the entity of the user; the model is already contained within itself.

I agree - given that GGUF already has built-in configs, we should make model.yaml optional (i.e. it just overrides existing GGUF params).

However:

dan-homebrew commented 3 weeks ago
  • [ ] Response body: add an option to return log probs -> needs changes to the cortex.llamacpp source. This task will need more effort because it relates to the inference implementation and must be done carefully, otherwise it will break inference or degrade performance.

@nguyenhoangthuan99 If log probs requires an upstream PR to llama.cpp, let's move it to out-of-scope for this epic.

My focus for now is to catch up to llama.cpp and ensure a stable product - we can explore upstream improvements later on.

dan-homebrew commented 3 weeks ago

Function calling According to this comment, function calling is just a more complicated chat template that asks the model to find the proper function and params to answer the input question. There is no standard way to do it: each model has a different training process, so the prompt for each model is also different.

Function calling is essentially an advanced form of prompt engineering. It involves crafting a specialized prompt that instructs the model to identify appropriate functions and their parameters based on the input question. However, there's no universal approach to implementing this feature, as each model has undergone unique training processes, necessitating model-specific prompting strategies. Developing a generic function calling feature presents significant challenges:

  • Model variability: Different models require distinct prompting techniques, making it difficult to create a one-size-fits-all solution.
  • Extensive experimentation: Even for a single model (e.g., llama3.1 - 8B), substantial testing is required to optimize performance across various scenarios.
  • User-defined functions: Since users will define custom functions, it's challenging to ensure that a preset system prompt will work effectively for all possible function definitions.
  • Quality assurance: Maintaining consistent output quality across diverse models and user-defined functions is extremely difficult.
  • Unpredictable responses: The complexity of the task increases the likelihood of unexpected or incorrect outputs.

Given these challenges, it's crucial to approach the implementation of a generalized function calling feature with caution. The goal of supporting every model and every user-defined function is likely unattainable due to the inherent variability and complexity involved. Instead, it may be more practical to focus on optimizing the feature for specific, well-defined use cases or a limited set of models.

I also checked ChatGPT, Mistral, Groq, etc. They also support function calling, but the difference is that they build this feature for their own models only. llama.cpp hasn't supported function calling yet either, since it is not that useful for the average user, and developers can implement it themselves with better results.

@nguyenhoangthuan99 @louis-jan I agree. Let's scope this to supporting per-model function calling:

We can do this for llama3.1 first, and use it as a test case to develop a framework that can be generalized to other models in the future.

Given the high number of llama3.1 finetunes, this may mean prioritizing the cortex presets story, which ultimately is a model.yaml story as well.

nguyenhoangthuan99 commented 3 weeks ago

Defining the default model.yaml first? If the GGUF model binary is missing the header metadata, what is the default? If a GGUF binary is missing its header metadata, the file is invalid and llama.cpp cannot load it. The first 4 bytes of a GGUF file are a magic number; when parsing a GGUF file we read this magic number first, and if it doesn't match, the file is invalid. llama.cpp and other tools like Hugging Face also do this when reading data from a GGUF file.

If the model binary fails this check -> we won't create any model.yaml file, because we cannot use the model.
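A minimal sketch of that magic-number check (the real GGUF parser reads much more - version, tensor info, metadata KV pairs - but the first gate is this 4-byte comparison):

```cpp
#include <cstring>
#include <fstream>
#include <iostream>
#include <string>

// Returns true if the file starts with the 4-byte GGUF magic ("GGUF").
// Sketch only: a real parser would then read the version and metadata KV pairs.
bool HasGgufMagic(const std::string& path) {
  std::ifstream file(path, std::ios::binary);
  if (!file) return false;

  char magic[4] = {};
  file.read(magic, sizeof(magic));
  if (file.gcount() != static_cast<std::streamsize>(sizeof(magic))) return false;

  return std::memcmp(magic, "GGUF", 4) == 0;
}

int main(int argc, char** argv) {
  if (argc < 2) {
    std::cerr << "usage: check_gguf <model.gguf>\n";
    return 1;
  }
  // Invalid magic -> the binary is unusable, so no model.yaml is generated for it.
  bool ok = HasGgufMagic(argv[1]);
  std::cout << (ok ? "valid GGUF header" : "invalid GGUF file") << "\n";
  return ok ? 0 : 1;
}
```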

How do we intend to do versioning? Example: we added the wrong template to a new model and need to fix it. How will cortex know that the current model is outdated and needs an update?

Currently, we only download models based on the repo name/branch on Hugging Face; the version in model.yaml is parsed from the GGUF file. This part may relate to @namchuai .

nguyenhoangthuan99 commented 3 weeks ago

This PR can resolve:

Since function calling has been split out as a separate issue (#1181), I'll move it out of this epic.

dan-homebrew commented 2 weeks ago

@nguyenhoangthuan99 Quick check: there's a Jan issue asking for Beam search. Do we support it?

If it's not in the llama.cpp main branch, we don't need to support it. I just want to keep up with stable for now.

nguyenhoangthuan99 commented 2 weeks ago

In llama.cpp, beam search is effectively covered, since it is a very important sampling technique: in llama.cpp it takes the form of the top_k sampler. Each step uses top_k=40, which corresponds to the num_beams of beam search, to search for the result. llama.cpp also combines many sampler methods; by default it combines 5-6 of them. I also added a top_k option to the params for cortex.llamacpp.
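Tying this back to the epic, the new option would then be settable like any other sampling param; a hedged sketch (the exact key name is assumed to mirror llama.cpp's):

```yaml
# Illustrative model.yaml snippet - the same field could also be overridden
# per request in the chat completion body.
top_k: 40            # roughly plays the role of num_beams in beam-search terms
temperature: 0.7
```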

dan-homebrew commented 2 weeks ago

In llama.cpp, beam search is effectively covered, since it is a very important sampling technique: in llama.cpp it takes the form of the top_k sampler. Each step uses top_k=40, which corresponds to the num_beams of beam search, to search for the result. llama.cpp also combines many sampler methods; by default it combines 5-6 of them. I also added a top_k option to the params for cortex.llamacpp.

Fantastic - yup, I was hoping it was a nomenclature difference