janhq / jan

Jan is an open source alternative to ChatGPT that runs 100% offline on your computer. Multiple engine support (llama.cpp, TensorRT-LLM)
https://jan.ai/
GNU Affero General Public License v3.0

planning: Migrate Threads, Messages to Cortex, deprecate Conversation Extension #3904

Open dan-homebrew opened 4 weeks ago

dan-homebrew commented 4 weeks ago

Goal

Tasklist

louis-jan commented 4 weeks ago

According to this:

https://github.com/janhq/cortex.cpp/issues/1567#issuecomment-2444740659

## Problems

`/messages` is quite straightforward for now, but Jan's `/threads` are a combination of model preset, assistant parameters, assistant tools, and threads. Also, `/assistants` is not well designed: it defaults to a hard-coded template.

See a Jan `thread.json` example:

```json
{
  "id": "jan_1729768043",
  "object": "thread",
  "title": "0.5.8 llama 3.2 1b",
  "assistants": [
    {
      "assistant_id": "jan",
      "assistant_name": "Jan",
      "tools": [
        {
          "type": "retrieval",
          "enabled": true,
          "settings": {
            "top_k": 2,
            "chunk_size": 1024,
            "chunk_overlap": 64,
            "retrieval_template": "Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n----------------\nCONTEXT: {CONTEXT}\n----------------\nQUESTION: {QUESTION}\n----------------\nHelpful Answer:"
          }
        }
      ],
      "model": {
        "id": "llama3.2-1b-instruct",
        "settings": {
          "engine": "llama-cpp",
          "ctx_len": 3072,
          "ngl": 100,
          "prompt_template": "<|start_header_id|>system<|end_header_id|>\n\n{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
          "text_model": false
        },
        "parameters": {
          "engine": "llama-cpp",
          "frequency_penalty": 0,
          "max_tokens": 3072,
          "presence_penalty": 0,
          "stop": ["<|eot_id|>"],
          "stream": true,
          "temperature": 0.699999988079071,
          "top_p": 0.949999988079071
        },
        "engine": "llama-cpp"
      },
      "instructions": ""
    }
  ],
  "created": 1729768043312,
  "updated": 1730195853233,
  "metadata": {
    "lastMessage": "Hello!"
  }
}
```

See OpenAI Assistant and Thread:

```json
{
  "id": "asst_abc123",
  "object": "assistant",
  "created_at": 1698984975,
  "name": "Math Tutor",
  "description": null,
  "model": "gpt-4o",
  "instructions": "You are a personal math tutor. When asked a question, write and run Python code to answer the question.",
  "tools": [
    { "type": "code_interpreter" }
  ],
  "metadata": {},
  "top_p": 1.0,
  "temperature": 1.0,
  "response_format": "auto"
}
```

```json
{
  "id": "thread_abc123",
  "object": "thread",
  "created_at": 1699012949,
  "metadata": {},
  "tool_resources": {}
}
```

## So should we:

1. Introduce a new structure, similar to the existing one, scoped to `/threads` and `/messages`, or
2. Follow a popular schema such as OpenAI's, which could scale to `/assistants`?

I think 2 is preferred, since we could take advantage of existing test suites and client SDKs. Otherwise, we would eventually do another migration to scale to `/assistants` and double the workload, such as writing tests.

## Decouple `/threads` & `/models`

Currently they are coupled, and fairly similar to a preset, which is not really well defined. E.g., `thread.json` defines model settings, which creates a side effect where switching between threads also reloads the model. It's an antipattern, and we should find a way to decouple it:

1. Inference parameters & tools go to `/assistants`. This scales `/assistants` better: users can have more than one assistant persona (instructions + parameters) instead of a hard-coded one.
2. Model parameters go to `/models`, where `PUT` takes effect (right now it's used nowhere).
3. The thread becomes fairly thin, which also scales better to `/run`: a thread is likely just a container that glues components together (assistant, run, file_stores). A sketch of the resulting thin thread follows this list.
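For illustration, here is a minimal sketch of what the same thread could look like under option 2, assuming we adopt the OpenAI thread schema as-is. Values are carried over from the `thread.json` example above; the `id` prefix and parking Jan-specific fields under `metadata` are assumptions:

```jsonc
// Hypothetical thin thread.json, OpenAI-style: everything model- and
// assistant-related has moved out to /models and /assistants.
{
  "id": "thread_1729768043",        // assumed prefix; was "jan_1729768043"
  "object": "thread",
  "created_at": 1729768043,
  "metadata": {
    "title": "0.5.8 llama 3.2 1b",  // Jan-specific field, kept as metadata (assumption)
    "lastMessage": "Hello!"
  },
  "tool_resources": {}
}
```

Switching threads would then never touch model state, since the thread no longer carries model settings.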

There would be many conclusions that affect Jan's UX, such as:

Threads are currently coupled with model settings, which introduces bad UX: users get their model restarted every time they switch to a new thread, even when it uses the same model.

1. Moving model configurations to per-model settings would be beneficial. Those settings have a global effect.
2. Assistants become clearly defined, and users can have more than one assistant persona (instructions + parameters).

As a new user to this space, it's quite hard to grasp a thread's parameters and settings. Clearly separated Assistant Personas (instructions and parameters) and Model Capability Settings (with more hardware-oriented explanation) would help onboard users better.
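For example, the hard-coded "Jan" template could become just one persona file following OpenAI's assistant schema. A rough sketch, with tool settings trimmed for brevity; note that `frequency_penalty`/`presence_penalty` are not fields of OpenAI's assistant object, so keeping them on the assistant would be a Jan extension:

```jsonc
// Hypothetical assistant persona: instructions, tools, and inference
// parameters live here once, instead of being copied into every thread.
{
  "id": "asst_jan",                 // hypothetical id
  "object": "assistant",
  "name": "Jan",
  "model": "llama3.2-1b-instruct",
  "instructions": "",               // per-persona instructions go here
  "tools": [
    { "type": "retrieval" }         // retrieval settings omitted for brevity
  ],
  "temperature": 0.7,
  "top_p": 0.95,
  "frequency_penalty": 0,           // Jan extension, not in OpenAI's schema (assumption)
  "presence_penalty": 0,            // Jan extension, not in OpenAI's schema (assumption)
  "metadata": {}
}
```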

dan-homebrew commented 4 weeks ago

> As a new user to this space, it's quite hard to grasp a thread's parameters and settings. The Writing Assistant Persona (instructions and parameters) and Model Capability Settings (with more hardware-oriented explanation) would help onboard users better.

Can you elaborate a bit more about:

louis-jan commented 4 weeks ago

ah @dan-homebrew, I just mean:

1. A thread's inference parameters, such as temperature, frequency penalty, and presence penalty, are quite incomprehensible. Moving those to the Assistant would make building an assistant persona easier to grasp.
2. Modifying a thread's settings parameters, such as context window and ngl, causes bad UX. Moving them to per-model settings might help. From there we can add more hardware-detection information, such as the recommended GPU layers to load and context length based on the device's specs -> global effect per model, not per thread (see the sketch below).
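For instance, a rough sketch of a per-model `model.json` after the move; the `ctx_len`/`ngl` values come from the thread example above, while the `recommended` block and its numbers are hypothetical, only to illustrate the hardware-detection idea:

```jsonc
// Hypothetical per-model settings under /models, where PUT takes effect.
// Loader settings apply globally per model, not per thread.
{
  "id": "llama3.2-1b-instruct",
  "object": "model",
  "engine": "llama-cpp",
  "settings": {
    "ctx_len": 3072,
    "ngl": 100
  },
  "recommended": {                  // hypothetical block: derived from detected
    "ctx_len": 4096,                // hardware (VRAM/RAM), not set by the user
    "ngl": 24
  }
}
```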
dan-homebrew commented 4 weeks ago

> ah @dan-homebrew, I just mean:
>
> 1. A thread's inference parameters, such as temperature, frequency penalty, and presence penalty, are quite incomprehensible. Moving those to the Assistant would make building an assistant persona easier to grasp.
> 2. Modifying a thread's settings parameters, such as context window and ngl, causes bad UX. Moving them to per-model settings might help. From there we can add more hardware-detection information, such as the recommended GPU layers to load and context length based on the device's specs -> global effect per model, not per thread.

Got it. Can you proceed with recommendations for how we can break down the Assistants, Threads/Messages, and Models endpoints (and the related data structures)?

I think it's better we bite the bullet and move to the correct data structures.