Open dan-homebrew opened 1 day ago
@nguyenhoangthuan99 - Please transfer https://github.com/janhq/internal/issues/160 to this issue (can be public)
API reference: https://platform.openai.com/docs/api-reference/chat/create
Missing supported fields from /v1/chat/completions
API:
- store (boolean or null, Optional, Defaults to false): Whether or not to store the output of this chat completion request for use in our model distillation or evals products. To support this, we should come up with an architecture to save and store the output of users' chat completion requests (e.g. MinIO for storage and Postgres for the DB).
- metadata (object or null, Optional): Developer-defined tags and values used for filtering completions in the dashboard. This also requires some logic to save results to the DB so that users can query them later.
- logit_bias (map, Optional, Defaults to null): Modify the likelihood of specified tokens appearing in the completion. Accepts a JSON object that maps tokens (specified by their token ID in the tokenizer) to an associated bias value from -100 to 100. Mathematically, the bias is added to the logits generated by the model prior to sampling. The exact effect will vary per model, but values between -1 and 1 should decrease or increase likelihood of selection; values like -100 or 100 should result in a ban or exclusive selection of the relevant token. -> Need to confirm whether llama.cpp supports this, but it would be a nice-to-have feature. Issue: https://github.com/janhq/cortex.llamacpp/issues/263
- logprobs (boolean or null, Optional, Defaults to false): Whether to return log probabilities of the output tokens or not. If true, returns the log probabilities of each output token in the content of message. This feature is partially supported; cortex.llamacpp needs to be updated to return logprobs in both stream and non-stream modes (a request sketch follows the example response below). Issue: https://github.com/janhq/cortex.llamacpp/issues/262 The result should look like this:
```json
{
"id": "chatcmpl-123",
"object": "chat.completion",
"created": 1702685778,
"model": "gpt-4o-mini",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I assist you today?"
},
"logprobs": {
"content": [
{
"token": "Hello",
"logprob": -0.31725305,
"bytes": [72, 101, 108, 108, 111],
"top_logprobs": [
{
"token": "Hello",
"logprob": -0.31725305,
"bytes": [72, 101, 108, 108, 111]
},
{
"token": "Hi",
"logprob": -1.3190403,
"bytes": [72, 105]
}
]
},
{
"token": "!",
"logprob": -0.02380986,
"bytes": [
33
],
"top_logprobs": [
{
"token": "!",
"logprob": -0.02380986,
"bytes": [33]
},
{
"token": " there",
"logprob": -3.787621,
"bytes": [32, 116, 104, 101, 114, 101]
}
]
},
{
"token": " How",
"logprob": -0.000054669687,
"bytes": [32, 72, 111, 119],
"top_logprobs": [
{
"token": " How",
"logprob": -0.000054669687,
"bytes": [32, 72, 111, 119]
},
{
"token": "<|end|>",
"logprob": -10.953937,
"bytes": null
}
]
},
{
"token": " can",
"logprob": -0.015801601,
"bytes": [32, 99, 97, 110],
"top_logprobs": [
{
"token": " can",
"logprob": -0.015801601,
"bytes": [32, 99, 97, 110]
},
{
"token": " may",
"logprob": -4.161023,
"bytes": [32, 109, 97, 121]
}
]
},
{
"token": " I",
"logprob": -3.7697225e-6,
"bytes": [
32,
73
],
"top_logprobs": [
{
"token": " I",
"logprob": -3.7697225e-6,
"bytes": [32, 73]
},
{
"token": " assist",
"logprob": -13.596657,
"bytes": [32, 97, 115, 115, 105, 115, 116]
}
]
},
{
"token": " assist",
"logprob": -0.04571125,
"bytes": [32, 97, 115, 115, 105, 115, 116],
"top_logprobs": [
{
"token": " assist",
"logprob": -0.04571125,
"bytes": [32, 97, 115, 115, 105, 115, 116]
},
{
"token": " help",
"logprob": -3.1089056,
"bytes": [32, 104, 101, 108, 112]
}
]
},
{
"token": " you",
"logprob": -5.4385737e-6,
"bytes": [32, 121, 111, 117],
"top_logprobs": [
{
"token": " you",
"logprob": -5.4385737e-6,
"bytes": [32, 121, 111, 117]
},
{
"token": " today",
"logprob": -12.807695,
"bytes": [32, 116, 111, 100, 97, 121]
}
]
},
{
"token": " today",
"logprob": -0.0040071653,
"bytes": [32, 116, 111, 100, 97, 121],
"top_logprobs": [
{
"token": " today",
"logprob": -0.0040071653,
"bytes": [32, 116, 111, 100, 97, 121]
},
{
"token": "?",
"logprob": -5.5247097,
"bytes": [63]
}
]
},
{
"token": "?",
"logprob": -0.0008108172,
"bytes": [63],
"top_logprobs": [
{
"token": "?",
"logprob": -0.0008108172,
"bytes": [63]
},
{
"token": "?\n",
"logprob": -7.184561,
"bytes": [63, 10]
}
]
}
]
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 9,
"completion_tokens": 9,
"total_tokens": 18,
"completion_tokens_details": {
"reasoning_tokens": 0
}
},
"system_fingerprint": null
}
```
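For reference, a request that exercises logprobs and logit_bias might look like the sketch below. The shapes follow the OpenAI reference linked above (top_logprobs is the companion parameter there that controls how many alternatives are returned per token); the token ID in logit_bias is a placeholder, and actual behavior depends on the cortex.llamacpp issues above.

```json
{
  "model": "gpt-4o-mini",
  "messages": [
    { "role": "user", "content": "Hello!" }
  ],
  "logprobs": true,
  "top_logprobs": 2,
  "logit_bias": {
    "15339": -100
  }
}
```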
- n (integer or null, Optional, Defaults to 1): How many chat completion choices to generate for each input message. Note that you will be charged based on the number of generated tokens across all of the choices. Keep n as 1 to minimize costs. -> Need to check whether llama.cpp supports this option. Issue: https://github.com/janhq/cortex.llamacpp/issues/264
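For illustration, a request asking for two choices would look like the sketch below; the response would then carry two entries in choices (index 0 and 1). Whether llama.cpp can serve n > 1 per request is exactly what the linked issue needs to confirm.

```json
{
  "model": "gpt-4o-mini",
  "messages": [
    { "role": "user", "content": "Suggest a name for a cat." }
  ],
  "n": 2
}
```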
- service_tier (string or null, Optional, Defaults to auto): Specifies the latency tier to use for processing the request. This parameter is relevant for customers subscribed to the scale tier service:
  - If set to 'auto', and the Project is Scale tier enabled, the system will utilize scale tier credits until they are exhausted.
  - If set to 'auto', and the Project is not Scale tier enabled, the request will be processed using the default service tier with a lower uptime SLA and no latency guarantee.
  - If set to 'default', the request will be processed using the default service tier with a lower uptime SLA and no latency guarantee.
  - When not set, the default behavior is 'auto'.
  When this parameter is set, the response body will include the service_tier utilized.
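As a sketch of the request shape (these are OpenAI semantics; cortex.cpp would need its own notion of tiers for this to be meaningful):

```json
{
  "model": "gpt-4o-mini",
  "messages": [
    { "role": "user", "content": "Hello!" }
  ],
  "service_tier": "default"
}
```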
- stream_options (object or null, Optional, Defaults to null): Options for the streaming response. Only set this when you set stream: true. -> cortex.llamacpp needs to be updated to support this. Issue: https://github.com/janhq/cortex.llamacpp/issues/265
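For example, a streaming request that also asks for usage stats in the final chunk might look like this (a sketch following the OpenAI shape, where include_usage is the documented stream option):

```json
{
  "model": "gpt-4o-mini",
  "messages": [
    { "role": "user", "content": "Hello!" }
  ],
  "stream": true,
  "stream_options": {
    "include_usage": true
  }
}
```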
- modalities and audio: reference: https://platform.openai.com/docs/api-reference/chat/create#chat-create-modalities. We need a roadmap to support multimodality for audio.
- user: reference: https://platform.openai.com/docs/api-reference/chat/create#chat-create-user.
The following fields cannot be supported directly with cortex.cpp and need a roadmap for them in the enterprise version (a request sketch using store, metadata, and user follows this list):
- store (boolean or null, Optional, Defaults to false): Whether or not to store the output of this chat completion request for use in our model distillation or evals products. To support this, we should come up with an architecture to save and store the output of users' chat completion requests (e.g. MinIO for storage and Postgres for the DB).
- metadata (object or null, Optional): Developer-defined tags and values used for filtering completions in the dashboard. This also requires some logic to save results to the DB so that users can query them later.
- service_tier (string or null, Optional, Defaults to auto): Specifies the latency tier to use for processing the request. This parameter is relevant for customers subscribed to the scale tier service:
  - If set to 'auto', and the Project is Scale tier enabled, the system will utilize scale tier credits until they are exhausted.
  - If set to 'auto', and the Project is not Scale tier enabled, the request will be processed using the default service tier with a lower uptime SLA and no latency guarantee.
  - If set to 'default', the request will be processed using the default service tier with a lower uptime SLA and no latency guarantee.
  - When not set, the default behavior is 'auto'.
  When this parameter is set, the response body will include the service_tier utilized.
- modalities and audio: reference: https://platform.openai.com/docs/api-reference/chat/create#chat-create-modalities. We need a roadmap to support multimodality for audio.
- user: reference: https://platform.openai.com/docs/api-reference/chat/create#chat-create-user.
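A request exercising store, metadata, and user might look like the sketch below (field shapes follow the OpenAI reference; the metadata keys and user ID are made-up placeholders). On our side, this is roughly the data that would need to be persisted, e.g. the request/response payloads in MinIO and the metadata/user tags in Postgres, so completions can be filtered and queried later.

```json
{
  "model": "gpt-4o-mini",
  "messages": [
    { "role": "user", "content": "Hello!" }
  ],
  "store": true,
  "metadata": {
    "project": "demo",
    "environment": "staging"
  },
  "user": "user-1234"
}
```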
- [x] Linked documentation to this issue in PR #1589
Goal

- /chat/completions should have parameters similar to OpenAI
- Planning: create roadmap issues for any parameter that is not supported yet

Tasklist