Thanks for the question! The `.wasm` is composed of various parts, including the kernel of the model (in WGSL) and runtime support (C++ code compiled into WASM).
- The kernel is implemented in MLC-LLM and compiled to WGSL: https://llm.mlc.ai/docs/deploy/webllm.html#bring-your-own-model-library
- Runtime support from MLC-LLM: https://github.com/mlc-ai/mlc-llm/blob/main/web/emcc/mlc_wasm_runtime.cc
- Runtime support from TVM (one of the three files): https://github.com/apache/tvm/blob/main/web/emcc/wasm_runtime.cc
The kernel and the runtime support (compiled into `.bc` files) are then linked together to form the final `.wasm` file: https://github.com/apache/tvm/blob/main/python/tvm/contrib/emcc.py
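To make that last step concrete, here is a minimal sketch of how the emcc.py helper is typically invoked. It assumes the MLC-LLM compile flow has already produced a `tvm.runtime.Module`; `compiled_model` is just a placeholder name for illustration, not a variable from the actual sources:

```python
import tvm
from tvm.contrib import emcc

# Placeholder: in the real pipeline this runtime.Module comes out of the
# MLC-LLM compile flow (model kernels lowered to WGSL plus host glue code).
compiled_model: tvm.runtime.Module = ...

# emcc.create_tvmjs_wasm drives emcc: it takes the object emitted from the
# module above, pulls in the pre-built runtime-support bitcode (the TVM and
# MLC-LLM runtime .bc files), and links everything into a single .wasm.
compiled_model.export_library(
    "Llama-3-8B-Instruct-q4f32_1-ctx1k_cs1k-webgpu.wasm",
    fcompile=emcc.create_tvmjs_wasm,
)
```

The resulting file is the model library that WebLLM later fetches at runtime (e.g. via `model_lib_url`).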
When I use the web-llm example (path: /web-llm/examples/simple-chat) and look at the source file (@mlc-ai/web-llm/lib/index.js), I notice a lot of interaction with wasm files, which makes the source code somewhat difficult to read. I would be very grateful if you could explain the logical content of all the wasm files!

Additionally, it looks like there may be room for optimization in how the model library files are handled (for example: "model_lib_url": modelLibURLPrefix + modelVersion + "/Llama-3-8B-Instruct-q4f32_1-ctx1k_cs1k-webgpu.wasm"). Should I optimize by modifying the TVM compilation process?