janhq / cortex.cpp

Run and customize Local LLMs.
https://cortex.so
Apache License 2.0

epic: llama.cpp params are settable via API call or `model.yaml` #1151

Open dan-homebrew opened 3 weeks ago

dan-homebrew commented 3 weeks ago

Goal

Tasklist

I am using this epic to aggregate all llama.cpp params issues, including llama3.1 function calling + tool use

model.yaml

Out-of-scope:

Related

nguyenhoangthuan99 commented 3 weeks ago

Generally, I will break down this epic into tasks:

model.yaml From my side, all the information needed to run a model should be in one file: a model file can only run with one engine, and the user can modify/tune every parameter needed to run the model in the same place, which is more convenient.
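As a rough sketch of that idea (field names are illustrative assumptions, not an agreed schema), a single model.yaml would carry both the model metadata and the llama.cpp load/sampling params in one place:

```yaml
# Illustrative sketch only - keys are assumptions, not the final model.yaml schema.
name: llama3.1-8b-instruct
model: llama3.1:8b-gguf
engine: cortex.llamacpp        # one model file runs with exactly one engine

# Sampling params forwarded to llama.cpp
temperature: 0.7
top_p: 0.9
top_k: 40
max_tokens: 4096

# Model load params
ctx_len: 8192
ngl: 33                        # number of layers to offload to GPU
prompt_template: |
  <|system|>{system_message}<|user|>{prompt}<|assistant|>
```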

Function calling According to this comment, function calling is just a more complicated chat template that asks the model to find the proper function and params to answer the input question. There is no standard way to do it: each model has a different training process, so the prompt for each model is also different.

Function calling is essentially an advanced form of prompt engineering. It involves crafting a specialized prompt that instructs the model to identify appropriate functions and their parameters based on the input question. However, there's no universal approach to implementing this feature, as each model has undergone unique training processes, necessitating model-specific prompting strategies. Developing a generic function calling feature presents significant challenges:

  • Model variability: Different models require distinct prompting techniques, making it difficult to create a one-size-fits-all solution.
  • Extensive experimentation: Even for a single model (e.g., llama3.1 - 8B), substantial testing is required to optimize performance across various scenarios.
  • User-defined functions: Since users will define custom functions, it's challenging to ensure that a preset system prompt will work effectively for all possible function definitions.
  • Quality assurance: Maintaining consistent output quality across diverse models and user-defined functions is extremely difficult.
  • Unpredictable responses: The complexity of the task increases the likelihood of unexpected or incorrect outputs.

Given these challenges, it's crucial to approach the implementation of a generalized function calling feature with caution. The goal of supporting every model and every user-defined function is likely unattainable due to the inherent variability and complexity involved. Instead, it may be more practical to focus on optimizing the feature for specific, well-defined use cases or a limited set of models.

I also checked ChatGPT, Mistral, Groq, etc. They also support function calling, but the difference is that they build this feature for their own models only. llama.cpp hasn't supported function calling yet either, since it is not that useful for the average user, and developers can implement it themselves with better results.
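To make the "function calling is prompt engineering" point concrete, here is a purely hypothetical sketch of the kind of model-specific prompt involved. It is not the actual llama3.1 tool-use format; each model would need its own wording and output syntax:

```yaml
# Hypothetical illustration only - not the real llama3.1 format.
function_calling_system_prompt: |
  You have access to the following functions:
  {{ tool_definitions }}
  If a function is needed to answer the user, reply with only a JSON object:
  {"name": "<function_name>", "arguments": { ... }}
  Otherwise, answer the user directly.
```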


louis-jan commented 3 weeks ago

model.yaml

Lessons learned from Jan

Sync parameters between Jan and engines

It would be great if we could apply something like protoc. Say there is a .proto file (just as an example) that defines the entities; it could be used across projects from JS to C++ to automatically generate the entity classes. That way we only maintain one entity file defining the model.yaml DTO, which can be shared across projects (there are many engines to maintain as well).
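A minimal sketch of that idea, assuming a protobuf-style schema (all field names here are made up): one shared definition of the model.yaml DTO, from which protoc (e.g. `protoc --cpp_out=...`) would generate the entity classes for both the C++ and JS projects.

```proto
// Illustrative only - not an agreed schema.
syntax = "proto3";

package cortex.model;

// Shared model.yaml DTO, maintained once and code-generated per language.
message ModelConfig {
  string name = 1;
  string engine = 2;           // e.g. cortex.llamacpp
  string prompt_template = 3;
  uint32 ctx_len = 4;
  int32 ngl = 5;               // GPU layers to offload
  float temperature = 6;
  float top_p = 7;
  int32 top_k = 8;
}
```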

Template parsing should be done by cortex.cpp?

We currently have to parse the model template in order to convert the Jinja template into ai_prompt, user_prompt, and system_prompt, so that engines can load it accordingly. The load model request should be simplified.
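For illustration (the ai_prompt/user_prompt/system_prompt names come from the comment above; the template itself is made up and simplified, not real Jinja), the conversion is roughly:

```yaml
# Illustrative sketch only.
# A single chat template string in model.yaml...
prompt_template: "<|system|>\n{system_message}</s><|user|>\n{prompt}</s><|assistant|>\n"

# ...currently has to be split client-side into the pieces the engine loads:
system_prompt: "<|system|>\n"
user_prompt: "</s><|user|>\n"
ai_prompt: "</s><|assistant|>\n"
```

If cortex.cpp did this parsing itself, the load model request could carry just the original template.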

tikikun commented 3 weeks ago

Research input:

So by using a separate model.yaml we just create another wrapper around the model config that is already there inside either the GGUF or the Hugging Face config file. In practice, this has proven to be extremely inconvenient to use.

The config of the model should be bound to the entity of the user; the model is already contained within itself.

dan-homebrew commented 3 weeks ago

Research input:

  • The market has a tendency to consolidate on the GGUF or Hugging Face config file -> the model already has its own config
  • What we want in the description is not related to the model, but to how the user stores the config for that model

So by using a separate model.yaml we just create another wrapper around the model config that is already there inside either the GGUF or the Hugging Face config file. In practice, this has proven to be extremely inconvenient to use.

The config of the model should be bound to the entity of the user; the model is already contained within itself.

I agree - given that GGUF already has built-in configs, we should make model.yaml optional (i.e. it just overrides existing GGUF params).

However:

dan-homebrew commented 3 weeks ago
  • [ ] Response body: add an option to return log probs -> needs changes to the cortex.llamacpp source. This task will need more effort because it relates to the inference implementation and must be done carefully, otherwise it will break inference or degrade performance.

@nguyenhoangthuan99 If log probs requires an upstream PR to llama.cpp, let's move it to out-of-scope for this epic.

My focus for now is to catch up to llama.cpp and ensure a stable product - we can explore upstream improvements later on.

dan-homebrew commented 3 weeks ago

Function calling According to this comment, function calling is just a more complicated chat template that asks the model to find the proper function and params to answer the input question. There is no standard way to do it: each model has a different training process, so the prompt for each model is also different.

Function calling is essentially an advanced form of prompt engineering. It involves crafting a specialized prompt that instructs the model to identify appropriate functions and their parameters based on the input question. However, there's no universal approach to implementing this feature, as each model has undergone unique training processes, necessitating model-specific prompting strategies. Developing a generic function calling feature presents significant challenges:

  • Model variability: Different models require distinct prompting techniques, making it difficult to create a one-size-fits-all solution.
  • Extensive experimentation: Even for a single model (e.g., llama3.1 - 8B), substantial testing is required to optimize performance across various scenarios.
  • User-defined functions: Since users will define custom functions, it's challenging to ensure that a preset system prompt will work effectively for all possible function definitions.
  • Quality assurance: Maintaining consistent output quality across diverse models and user-defined functions is extremely difficult.
  • Unpredictable responses: The complexity of the task increases the likelihood of unexpected or incorrect outputs.

Given these challenges, it's crucial to approach the implementation of a generalized function calling feature with caution. The goal of supporting every model and every user-defined function is likely unattainable due to the inherent variability and complexity involved. Instead, it may be more practical to focus on optimizing the feature for specific, well-defined use cases or a limited set of models.

I also checked ChatGPT, Mistral, Groq, etc. They also support function calling, but the difference is that they build this feature for their own models only. llama.cpp hasn't supported function calling yet either, since it is not that useful for the average user, and developers can implement it themselves with better results.

@nguyenhoangthuan99 @louis-jan I agree. Let's scope this to supporting per-model function calling:

We can do this for llama3.1 first, and use it as a test case to develop a framework that can be generalized to other models in the future.

Given the high number of llama3.1 finetunes, this may mean prioritizing the cortex presets story, which ultimately is a model.yaml story as well.

nguyenhoangthuan99 commented 3 weeks ago

Defining the default model.yaml first? If the GGUF model binary is missing the header metadata, what is the default? If a GGUF binary is missing its header metadata, the file is invalid and llama.cpp cannot load it. The first 4 bytes of a GGUF file are a magic number; when parsing a GGUF file we read this magic number first, and if it doesn't match, the file is invalid. llama.cpp and other tools like Hugging Face also do this when reading data from a GGUF file.

If the model binary fails this check -> we won't create any model.yaml file, because we cannot use the model.
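A minimal sketch of that magic-number check (the real GGUF parser reads much more - version, tensor info, metadata KV pairs - but the first gate is this 4-byte comparison):

```cpp
#include <cstring>
#include <fstream>
#include <iostream>
#include <string>

// Returns true if the file starts with the 4-byte GGUF magic ("GGUF").
// Sketch only: a real parser would then read the version and metadata KV pairs.
bool HasGgufMagic(const std::string& path) {
  std::ifstream file(path, std::ios::binary);
  if (!file) return false;

  char magic[4] = {};
  file.read(magic, sizeof(magic));
  if (file.gcount() != static_cast<std::streamsize>(sizeof(magic))) return false;

  return std::memcmp(magic, "GGUF", 4) == 0;
}

int main(int argc, char** argv) {
  if (argc < 2) {
    std::cerr << "usage: check_gguf <model.gguf>\n";
    return 1;
  }
  // Invalid magic -> the binary is unusable, so no model.yaml is generated for it.
  bool ok = HasGgufMagic(argv[1]);
  std::cout << (ok ? "valid GGUF header" : "invalid GGUF file") << "\n";
  return ok ? 0 : 1;
}
```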

How do we intend to do versioning? Example: we added the wrong template to a new model and need to fix it. How will cortex know that the current model is outdated and needs an update?

Currently, we only download models based on the repo name/branch on Hugging Face; the version in model.yaml is parsed from the GGUF file. This part may relate to @namchuai .

nguyenhoangthuan99 commented 3 weeks ago

This PR can resolve:

Since function calling has been split out as a separate issue (#1181), I'll move it out of this epic.

dan-homebrew commented 2 weeks ago

@nguyenhoangthuan99 Quick check: there's a Jan issue asking for Beam search. Do we support it?

If it's not in the llama.cpp main branch, we don't need to support it. I just want to keep up with stable for now.

nguyenhoangthuan99 commented 2 weeks ago

In llama.cpp, beam search is effectively covered, since it is a very important sampling technique: in llama.cpp it takes the form of the top_k sampler. Each step uses top_k=40, which corresponds to the num_beams of beam search, to search for the result. llama.cpp also combines many sampler methods; by default it combines 5-6 of them. I also added a top_k option to the params for cortex.llamacpp.
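Tying this back to the epic, the new option would then be settable like any other sampling param; a hedged sketch (the exact key name is assumed to mirror llama.cpp's):

```yaml
# Illustrative model.yaml snippet - the same field could also be overridden
# per request in the chat completion body.
top_k: 40            # roughly plays the role of num_beams in beam-search terms
temperature: 0.7
```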

dan-homebrew commented 2 weeks ago

In llama.cpp, beam search is effectively covered, since it is a very important sampling technique: in llama.cpp it takes the form of the top_k sampler. Each step uses top_k=40, which corresponds to the num_beams of beam search, to search for the result. llama.cpp also combines many sampler methods; by default it combines 5-6 of them. I also added a top_k option to the params for cortex.llamacpp.

Fantastic - yup, I was hoping it was a nomenclature difference