## Overview

The goal of this task is to implement APIs that are OpenAI-API-compatible. Existing APIs such as `generate()` will still be kept. Essentially, we want JSON-in and JSON-out, resulting in an interface like:
```typescript
import * as webllm from "@mlc-ai/web-llm";

async function main() {
  const chat = new webllm.ChatModule();
  await chat.reload("Llama-2-7b-chat-hf-q4f32_1");
  const completion = await chat.chat_completion({
    messages: [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Hello!" }
    ],
    // optional generative configs here
  });
  console.log(completion.choices[0]);
}

main();
```
If streaming:
```typescript
const completion = await chat.chat_completion({
  messages: [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Hello!" }
  ],
  stream: true,
  // optional generative configs here
});

for await (const chunk of completion) {
  console.log(chunk.choices[0].delta.content);
}
```
## Action items

- [x] O1: Implement the basic `chat_completion()` (both streaming and non-streaming); support configs/features that we currently do not have inside `llm_chat.ts`
- [ ] O2: Support function calling (`tools`)
- [ ] O3: Documentation and tests for the WebLLM repo
## Existing gaps

There are some fields/features that are not yet supported in WebLLM compared to OpenAI's `openai-node`.

### Fields in `ChatCompletionRequest`

- `model`: in WebLLM, we need to call `reload(model)` instead of making it an argument in `ChatCompletionRequest`
- `response_format` (JSON formatting)
- Function-calling related (see the sketch after this list):
  - `tool_choice`
  - `tools`
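To make these gaps concrete, here is a sketch of what a function-calling request could look like once these fields are supported. The shapes mirror `openai-node`; everything below (the `get_weather` tool and the options accepted by `chat_completion()`) is an assumption, not an existing WebLLM API:

```typescript
import * as webllm from "@mlc-ai/web-llm";

// Hypothetical sketch only: `tools` and `tool_choice` are not yet supported
// in WebLLM; the request shape follows OpenAI's openai-node client.
async function functionCallingSketch() {
  const chat = new webllm.ChatModule();
  await chat.reload("Llama-2-7b-chat-hf-q4f32_1");  // model is chosen via reload(), not a request field
  const completion = await chat.chat_completion({
    messages: [{ "role": "user", "content": "What is the weather in Tokyo?" }],
    tools: [
      {
        type: "function",
        function: {
          name: "get_weather",  // hypothetical app-defined function
          description: "Get the current weather for a city",
          parameters: {
            type: "object",
            properties: { city: { type: "string" } },
            required: ["city"],
          },
        },
      },
    ],
    tool_choice: "auto",  // let the model decide whether to call the tool
  });
  // If the model chooses to call the tool, the call would surface here:
  console.log(completion.choices[0].message.tool_calls);
}
```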
### Fields in the `ChatCompletion` response

- `system_fingerprint`: not applicable in our case (OpenAI needs it because they serve requests remotely on their servers)

### Others

- We do not support `n > 1` when streaming, since `llm_chat.ts` does not support maintaining multiple sequences. We would have to finish one sequence before starting to generate another, which conflicts with the goal of streaming in chunks. A sketch of the intended behavior follows below.
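The following sketch illustrates the limitation. The validation behavior shown (and the `n` option itself) is an assumption about how the finished API could behave, not current WebLLM behavior:

```typescript
import * as webllm from "@mlc-ai/web-llm";

// Sketch of the assumed n > 1 behavior; not current WebLLM behavior.
async function nGreaterThanOneSketch(chat: webllm.ChatModule) {
  // Non-streaming: n = 2 can be served by generating the two sequences
  // one after another and returning both choices at once.
  const completion = await chat.chat_completion({
    messages: [{ "role": "user", "content": "Hello!" }],
    n: 2,
  });
  console.log(completion.choices.length);  // 2

  // Streaming: all chunks for choice 0 would finish before choice 1 even
  // starts, so this combination would be rejected up front.
  try {
    await chat.chat_completion({
      messages: [{ "role": "user", "content": "Hello!" }],
      n: 2,
      stream: true,  // unsupported together with n > 1
    });
  } catch (err) {
    console.error(err);  // e.g. "n > 1 is not supported when streaming"
  }
}
```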
## Future Items

- Support chat completion with image inputs (e.g. LLaVA), with a Gradio frontend
- Add support for low-level APIs for post-forward logit processing (see the sketch below)
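For the logit-processing item, one possible shape is a callback applied to the logits after each forward pass, before sampling. The `LogitProcessor` interface below is purely hypothetical and does not exist in WebLLM yet:

```typescript
// Purely hypothetical sketch of a post-forward logit-processing hook;
// the interface name and how it would be registered are assumptions.
interface LogitProcessor {
  // Called after each forward pass with the raw logits for the next token;
  // returns the (possibly modified) logits to sample from.
  processLogits(logits: Float32Array): Float32Array;
}

// Example: forbid one token id by setting its logit to -Infinity.
const banOneToken: LogitProcessor = {
  processLogits(logits: Float32Array): Float32Array {
    logits[12345] = -Infinity;  // 12345 is a placeholder token id
    return logits;
  },
};
```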
@CharlieFRuan Thanks for creating the tracking issue. Just wanted to let you know that @shreygupta2809 and I are currently working on supporting function calling.