mlc-ai / web-llm-chat

Chat with AI large language models running natively in your browser. Enjoy private, server-free, seamless AI conversations.
https://chat.webllm.ai/
Apache License 2.0

[Feature Request]: Use custom models #23

Closed · Neet-Nestor closed this 4 months ago

Neet-Nestor commented 5 months ago

Problem Description

https://github.com/mlc-ai/web-llm/issues/421

Users want to be able to upload their own models from local machine.

Solution Description

WebLLM Engine is capable of loading any MLC format models.

https://github.com/mlc-ai/web-llm/tree/main/examples/simple-chat-upload is an example of supporting local models in the app.

We want to do something similar to allow uploading.
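For reference, here is a minimal sketch of what "loading an MLC-format model" looks like with the WebLLM engine, assuming a recent `@mlc-ai/web-llm` release where custom entries are appended to the app config's `model_list` (the field names follow current releases and may differ in older ones; the URLs and model id below are placeholders, not real artifacts):

```ts
// Sketch only: point WebLLM at a custom MLC-format model via appConfig.
import { CreateMLCEngine, prebuiltAppConfig } from "@mlc-ai/web-llm";

const appConfig = {
  ...prebuiltAppConfig,
  model_list: [
    ...prebuiltAppConfig.model_list,
    {
      // Placeholder locations for the converted weights and the compiled wasm library.
      model: "https://huggingface.co/your-user/your-model-MLC",
      model_id: "YourModel-q4f16_1-MLC",
      model_lib: "https://example.com/libs/your-model-webgpu.wasm",
    },
  ],
};

// Load the custom model and run one chat completion.
const engine = await CreateMLCEngine("YourModel-q4f16_1-MLC", { appConfig });
const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(reply.choices[0].message.content);
```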

0wwafa commented 5 months ago

Hmm, no... in the example there is a list of models. I wish to be able to upload a model from a local directory without downloading it from the internet.

0wwafa commented 5 months ago

For example, let's say I wish to have Mistral Instruct v0.3 quantized as f16 (output and embed) and q6_k for the other tensors. How should I proceed?

Neet-Nestor commented 5 months ago

@0wwafa I understand the need here. Let me explain.

First, the prerequisite for custom models to run on WebLLM Chat is that the models must be compiled to MLC format. For more details, check the MLC LLM instructions here.

Once you have the MLC-format models on your local machine, the proposal here is to allow one of the following three ways to use them in the webapp:

  1. You select the weight files and wasm files on your local machine, then the webapp loads the files and uses them for inference (a rough sketch follows after this list);
  2. You upload the weight files and wasm files to Hugging Face, then input the URL to the webapp. The webapp will download the files from Hugging Face and use them for inference;
  3. You host your model on a local port using the mlc-llm CLI, then the webapp connects to the port to use the model for inference.

These are planned to be released in the coming months. Does any of these fulfill what you need?
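A rough sketch of what option 1 could look like on the webapp side: the file selection uses standard Web APIs, while the hand-off to the engine (`loadFromFiles` below) is hypothetical, since that API is exactly what this issue proposes:

```ts
// Sketch only: let the user pick the compiled .wasm library and weight shards.
// "#model-files" is an assumed <input type="file" multiple> element.
const picker = document.querySelector<HTMLInputElement>("#model-files")!;

picker.addEventListener("change", async () => {
  const files = Array.from(picker.files ?? []);
  const wasm = files.find((f) => f.name.endsWith(".wasm"));
  const weights = files.filter((f) => f.name.endsWith(".bin"));
  if (!wasm || weights.length === 0) {
    throw new Error("Select the compiled .wasm library and the weight shards");
  }
  // Hypothetical hand-off: the webapp would read these File objects and feed
  // them to the engine instead of fetching them over the network.
  // await loadFromFiles(wasm, weights);
});
```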

0wwafa commented 5 months ago

Well, I just wish to see how Mistral works in the web browser using one of my quantizations, specifically f16/q6, f16/q5, q8/q6, and q8/q5: https://huggingface.co/ZeroWw/Test

In other words, I quantized the output and embed tensors to f16 (or q8) and the other tensors to q6 or q5. This keeps the "understanding" and "expressing" at an almost lossless quantization (f16) while quantizing the other tensors in a "good" way. The results of my tests confirm that the model is less degraded this way and works almost like the original. I could not see any difference during interactive inference...

Neet-Nestor commented 4 months ago

The app has been updated to support custom models through the MLC-LLM REST API by switching the model type in settings.

https://github.com/mlc-ai/web-llm-chat/commit/2fb025c3f999cf90c1b2cd38452f0e6fc5e49e63
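In practice this means the webapp talks to the OpenAI-compatible endpoint exposed by `mlc_llm serve`. A minimal sketch of such a request, assuming the default port 8000 and a placeholder model id:

```ts
// Sketch only: send a chat completion request to a locally hosted MLC-LLM server.
const response = await fetch("http://127.0.0.1:8000/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "Mistral-7B-Instruct-v0.3-q4f16_1-MLC", // whatever model mlc_llm serve is hosting
    messages: [{ role: "user", content: "Hello from the browser!" }],
    stream: false,
  }),
});

const data = await response.json();
console.log(data.choices[0].message.content);
```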

Neet-Nestor commented 4 months ago

@0wwafa Could you let me know whether the update above fulfills your use case by hosting your models through the mlc_llm serve command of MLC-LLM?

0wwafa commented 4 months ago

My models are available here. I still don't understand how to use them with mlc_llm.