ml-explore / mlx

MLX: An array framework for Apple silicon
https://ml-explore.github.io/mlx/
MIT License

Add ONNX Runtime in-browser Web-GPU support for MLX #762

Closed · sugatoray closed this issue 8 months ago

sugatoray commented 9 months ago

🔥 ONNX Runtime recently released in-browser WebGPU support. The demo shows it running on an NVIDIA RTX GPU and an Intel CPU, but the blog also points out that WebGPU support is available on MacBooks.

👉 Source: ONNX Runtime Web unleashes generative AI in the browser using WebGPU

This has tremendous potential to multiply the impact of a good locally running, in-browser LLM use case. It could also drive demand for higher-spec laptops, possibly offering an incentive for laptop manufacturers (including Apple).

I would suggest we explore this option and see whether mlx can elevate the in-browser local LLM experience.

I am setting this up as a place to discuss this option. Would love to hear thoughts, concerns, ideas and advice.

Quoting from the blogpost:

ONNX Runtime Web is a JavaScript library to enable web developers to deploy machine learning models directly in web browsers, offering multiple backends leveraging hardware acceleration. For CPU inference, it compiles the native ONNX Runtime CPU engine into the WebAssembly (WASM) backend. By doing that, ONNX Runtime Web can effectively run common machine learning models and it has been widely adopted by various web applications such as Transformers.js.

To address the challenges posed by large and complex generative models in browsers, which demand greater computational and memory resources beyond the capabilities of CPU execution, ONNX Runtime Web now enables the WebGPU backend. Moreover, Microsoft and Intel are collaborating closely to bolster WebGPU backend further. This includes implementing WebGPU operators for broader model coverage, enabling IOBinding to save GPU-CPU data copies, adding FP16 support for improved performance, memory efficiency, and more.
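
For concreteness, here is a minimal sketch of what that looks like from the web developer's side, using the `onnxruntime-web` package; the model file and the `input` tensor name below are placeholders, not from the blog post:

```ts
// Minimal sketch: run an ONNX model in the browser on the WebGPU backend.
// "model.onnx" and the "input" feed name are illustrative placeholders.
import * as ort from "onnxruntime-web";

async function run() {
  // Request the WebGPU execution provider, falling back to WASM (CPU).
  const session = await ort.InferenceSession.create("model.onnx", {
    executionProviders: ["webgpu", "wasm"],
  });

  // A dummy 1x3 float32 input; real models define their own input shapes.
  const input = new ort.Tensor("float32", new Float32Array([1, 2, 3]), [1, 3]);

  const results = await session.run({ input });
  console.log(results);
}

run();
```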

sugatoray commented 9 months ago

cc: @awni

awni commented 9 months ago

I don't think it's possible to have MLX be a backend for ONNX Runtime Web 🤔 . I think that requires JavaScript or some kind of code/API the browser can execute. But maybe I am missing something there...?

The easiest way to use MLX in the browser is through a local server (see e.g. https://github.com/qnguyen3/chat-with-mlx/), which runs entirely locally. You can also build native apps with e.g. MLX Swift. I expect more to be built over the coming weeks.
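
As a sketch of that local-server pattern, assuming an OpenAI-compatible chat endpoint such as the one `mlx_lm.server` exposes (the host, port, route, and model name below are illustrative assumptions), a web page only needs a plain `fetch`:

```ts
// Sketch: a browser page calling a locally running MLX server.
// Assumes an OpenAI-compatible chat endpoint (e.g. `python -m mlx_lm.server`);
// the URL and model name are assumptions for illustration.
async function askLocalMlx(prompt: string): Promise<string> {
  const response = await fetch("http://localhost:8080/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "mlx-community/Mistral-7B-Instruct-v0.2-4bit", // assumption
      messages: [{ role: "user", content: prompt }],
      max_tokens: 256,
    }),
  });
  const data = await response.json();
  return data.choices[0].message.content;
}

askLocalMlx("Hello from the browser!").then(console.log);
```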

To use MLX-built models with the ONNX web runtime, we would need a path to export to ONNX. That is definitely a possibility; it's not the top priority, but it is something we'd like to get to when we can.

dc-dc-dc commented 8 months ago

IIRC Metal is a supported backend for WebGPU on Chrome and Firefox (Chrome's implementation uses Dawn, Firefox's uses wgpu). But these are abstracted away behind the WebGPU API, so yes, you are running on RTX / Metal, but you won't have any real control over the GPU in the same way. WebGPU even has its own language for compute shaders (WGSL).
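
To make that abstraction concrete, here is a minimal, hypothetical WebGPU compute example: the application only sees the portable WebGPU API plus a WGSL kernel (which doubles each element of a buffer), and Dawn or wgpu lowers it to Metal (or D3D12 / Vulkan) underneath:

```ts
// Minimal WebGPU compute sketch (requires @webgpu/types for TypeScript).
// The WGSL kernel below doubles every element of a 64-float storage buffer.
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter!.requestDevice();

const wgsl = `
  @group(0) @binding(0) var<storage, read_write> data: array<f32>;
  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    data[id.x] = data[id.x] * 2.0;
  }
`;

const pipeline = device.createComputePipeline({
  layout: "auto",
  compute: { module: device.createShaderModule({ code: wgsl }), entryPoint: "main" },
});

// Upload 64 ones into a storage buffer.
const input = new Float32Array(64).fill(1);
const buffer = device.createBuffer({
  size: input.byteLength,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
  mappedAtCreation: true,
});
new Float32Array(buffer.getMappedRange()).set(input);
buffer.unmap();

const bindGroup = device.createBindGroup({
  layout: pipeline.getBindGroupLayout(0),
  entries: [{ binding: 0, resource: { buffer } }],
});

// A second buffer to read the result back to the CPU.
const readback = device.createBuffer({
  size: input.byteLength,
  usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
});

const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
pass.dispatchWorkgroups(1); // 1 workgroup x 64 invocations covers the array
pass.end();
encoder.copyBufferToBuffer(buffer, 0, readback, 0, input.byteLength);
device.queue.submit([encoder.finish()]);

await readback.mapAsync(GPUMapMode.READ);
console.log(new Float32Array(readback.getMappedRange())); // all 2s
```

Note that nothing in this code is Metal-specific; the same script runs on an RTX card under D3D12/Vulkan, which is exactly the control trade-off described above.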

awni commented 8 months ago

I'm going to close this as somewhat out of scope for MLX. It would be great to continue the discussion about ways to export out of MLX, but a GitHub Discussion is probably a better place for that.