## Overview

The goal of this task is to implement APIs that are OpenAI-API-compatible. Existing APIs such as `generate()` will still be kept. Essentially, we want JSON-in and JSON-out, resulting in an interface like:
```typescript
import * as webllm from "@mlc-ai/web-llm";

async function main() {
  const chat = new webllm.ChatModule();
  await chat.reload("Llama-2-7b-chat-hf-q4f32_1");
  const completion = await chat.chat_completion({
    messages: [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Hello!" }
    ],
    // optional generative configs here
  });
  console.log(completion.choices[0]);
}

main();
```
If streaming:
```typescript
const completion = await chat.chat_completion({
  messages: [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Hello!" }
  ],
  stream: true,
  // optional generative configs here
});

for await (const chunk of completion) {
  console.log(chunk.choices[0].delta.content);
}
```
## Action items

- [x] O1: Implement the basic `chat_completion()` (both streaming and non-streaming); support configs/features that we currently do not have inside `llm_chat.ts`
- [ ] O2: Support function calling (`tools`)
- [ ] O3: Documentation and tests for the WebLLM repo
## Existing gaps

There are some fields/features that are not yet supported in WebLLM compared to OpenAI's `openai-node`.

### Fields in `ChatCompletionRequest`

- `model`: in WebLLM, we need to call `reload(model)` instead of making it an argument in `ChatCompletionRequest`
- `response_format` (JSON formatting)
- Function-calling related (see the sketch after this list):
  - `tool_choice`
  - `tools`
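To make these gaps concrete, here is a sketch of what a function-calling request could look like once these fields are supported. The shapes mirror `openai-node`; everything below (the `get_weather` tool and the options accepted by `chat_completion()`) is an assumption, not an existing WebLLM API:

```typescript
import * as webllm from "@mlc-ai/web-llm";

// Hypothetical sketch only: `tools` and `tool_choice` are not yet supported
// in WebLLM; the request shape follows OpenAI's openai-node client.
async function functionCallingSketch() {
  const chat = new webllm.ChatModule();
  await chat.reload("Llama-2-7b-chat-hf-q4f32_1");  // model is chosen via reload(), not a request field
  const completion = await chat.chat_completion({
    messages: [{ "role": "user", "content": "What is the weather in Tokyo?" }],
    tools: [
      {
        type: "function",
        function: {
          name: "get_weather",  // hypothetical app-defined function
          description: "Get the current weather for a city",
          parameters: {
            type: "object",
            properties: { city: { type: "string" } },
            required: ["city"],
          },
        },
      },
    ],
    tool_choice: "auto",  // let the model decide whether to call the tool
  });
  // If the model chooses to call the tool, the call would surface here:
  console.log(completion.choices[0].message.tool_calls);
}
```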
### Fields in the `ChatCompletion` response

- `system_fingerprint`: not applicable in our case (OpenAI needs it because they serve requests remotely on their servers)

### Others

- We do not support `n > 1` when streaming, since `llm_chat.ts` does not support maintaining multiple sequences. We would have to finish one sequence before starting to generate another, which conflicts with the goal of streaming in chunks. A sketch of the intended behavior follows below.
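The following sketch illustrates the limitation. The validation behavior shown (and the `n` option itself) is an assumption about how the finished API could behave, not current WebLLM behavior:

```typescript
import * as webllm from "@mlc-ai/web-llm";

// Sketch of the assumed n > 1 behavior; not current WebLLM behavior.
async function nGreaterThanOneSketch(chat: webllm.ChatModule) {
  // Non-streaming: n = 2 can be served by generating the two sequences
  // one after another and returning both choices at once.
  const completion = await chat.chat_completion({
    messages: [{ "role": "user", "content": "Hello!" }],
    n: 2,
  });
  console.log(completion.choices.length);  // 2

  // Streaming: all chunks for choice 0 would finish before choice 1 even
  // starts, so this combination would be rejected up front.
  try {
    await chat.chat_completion({
      messages: [{ "role": "user", "content": "Hello!" }],
      n: 2,
      stream: true,  // unsupported together with n > 1
    });
  } catch (err) {
    console.error(err);  // e.g. "n > 1 is not supported when streaming"
  }
}
```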
## Future Items

- Support chat completion with image inputs (e.g. LLaVA), with a Gradio frontend
- Add support for low-level APIs for post-forward logit processing (see the sketch below)
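For the logit-processing item, one possible shape is a callback applied to the logits after each forward pass, before sampling. The `LogitProcessor` interface below is purely hypothetical and does not exist in WebLLM yet:

```typescript
// Purely hypothetical sketch of a post-forward logit-processing hook;
// the interface name and how it would be registered are assumptions.
interface LogitProcessor {
  // Called after each forward pass with the raw logits for the next token;
  // returns the (possibly modified) logits to sample from.
  processLogits(logits: Float32Array): Float32Array;
}

// Example: forbid one token id by setting its logit to -Infinity.
const banOneToken: LogitProcessor = {
  processLogits(logits: Float32Array): Float32Array {
    logits[12345] = -Infinity;  // 12345 is a placeholder token id
    return logits;
  },
};
```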
@CharlieFRuan Thanks for creating the tracking issue. Just wanted to let you know that @shreygupta2809 and I are currently working on supporting function calling.