triton-lang / triton

Development repository for the Triton language and compiler
https://triton-lang.org/
MIT License

Emscripten support to compile to WebAssembly + WebGPU; huge potential for cost savings in operating data-centers and running large deployments, e.g. ChatGPT #3631

kyr0 opened this issue 7 months ago

kyr0 commented 7 months ago

Dear team,

Triton uses libLLVM to generate intermediate code. Do you think it would be feasible to implement a transformation to be compatible with Emscripten, an LLVM-to-WebAssembly compiler, or with Binaryen directly?

There is already a community effort in this direction:

If Triton were able to natively support WebAssembly as a target, we could have all kinds of machine learning applications running directly in the browser, at close to top speed.

Please, let's not underestimate the potential here. Running ML applications in-browser might seem silly at first, but model sizes continue to decrease, algorithms continue to be optimized, and consumer hardware becomes more performant and energy-efficient with every new generation. We're close to a new boom of ML-driven web applications that run locally.

This can lead to a distribution of computation with the potential for huge cost savings in data centers: you can easily auto-detect the capabilities of the client-side GPU in-browser, download and cache a smaller, optimized model as part of a "setup process" that takes only a few seconds at today's connection speeds, and then execute it in parallel in a Worker context.
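
As a rough illustration of that detection step, here is a minimal sketch, assuming WebGPU type definitions (`@webgpu/types`) are available; the worker script and model URLs are hypothetical placeholders:

```ts
// Minimal sketch: feature-detect WebGPU and pick a model size based on what
// the adapter reports. Assumes @webgpu/types for the TypeScript definitions;
// "model-worker.js" and the model URLs are hypothetical placeholders.
async function setupLocalInference(): Promise<Worker | null> {
  if (!("gpu" in navigator)) return null; // no WebGPU support at all
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) return null;
  // Pick a larger model only if the device can hold big buffers (> 1 GiB).
  const modelUrl =
    adapter.limits.maxBufferSize > (1 << 30)
      ? "/models/medium.bin"
      : "/models/small.bin";
  const worker = new Worker("model-worker.js");
  worker.postMessage({ type: "load", modelUrl }); // cacheable via the Cache API
  return worker;
}
```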

The same goes for vector search applications. If we focus on user-centered local execution, we only need to index a single user's data with vector embeddings. Triton-backed HNSW would make mid-tier vector search feasible in-browser, especially with smaller embedding models. You could easily implement a local "search chat history" feature in ChatGPT, with search for meaning, without added cost or infrastructure needs. My open-source repo vectorstore demonstrates this possibility without even applying HNSW (yet), and without a WebGPU backend (yet).
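
Even without HNSW, the baseline is easy to picture: a brute-force scan over typed arrays. A minimal sketch, assuming L2-normalized embeddings so the dot product equals cosine similarity; this is exactly the O(n) pass an HNSW index would accelerate:

```ts
// Minimal sketch: brute-force top-k search over locally stored embeddings.
// Assumes all vectors are L2-normalized, so the dot product equals cosine
// similarity. An HNSW index would replace this linear scan.
function topK(
  query: Float32Array,
  embeddings: Float32Array[],
  k: number,
): { id: number; score: number }[] {
  return embeddings
    .map((vec, id) => {
      let dot = 0;
      for (let i = 0; i < query.length; i++) dot += query[i] * vec[i];
      return { id, score: dot };
    })
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```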

If done well, even browser vendors could consider integrating something like this in addition to their default page search feature. You might even be able to run something like Whisper in-browser in the future.

Thank you for taking the time to consider this idea.

If I missed something, please point it out, and if you think I can assist building this, please ping me.

Thanks and best, Aron

edit: Grammar and formatting

jsdevtom commented 7 months ago

MediaPipe by Google already demonstrates a lot of the use cases for this

ThomasRaoux commented 7 months ago

Interesting idea. Note that WebAssembly would be analogous to a CPU target; Triton currently doesn't support any CPU backend upstream (there is a functional path downstream going through linalg).

There is a recent community effort starting to look at a new CPU backend that would be developed downstream first (this is being discussed on Slack). I don't know all the details of WebAssembly, but I assume supporting it through that path could be possible.

jlebar commented 7 months ago

Note that WebAssembly would be analogous to a CPU target

Not necessarily, there's WebGPU.

jlebar commented 7 months ago

But I also don't think this is necessary to run Triton programs in the browser using WebGPU. You can compile Triton to LLVM IR offline, then compile the LLVM IR to WebGPU offline, and then include the WebGPU code in your webpage.
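
For illustration, including and loading such precompiled WebGPU code could look roughly like this. A minimal sketch, assuming the offline pipeline emits WGSL; the shader string below is a trivial stand-in, not actual Triton output:

```ts
// Minimal sketch: ship a precompiled compute shader with the page and build a
// pipeline at load time. Assumes @webgpu/types; the WGSL below is a trivial
// stand-in for whatever an offline Triton -> LLVM IR -> WebGPU pipeline emits.
const shaderSource = /* wgsl */ `
  @group(0) @binding(0) var<storage, read_write> data: array<f32>;
  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    data[gid.x] = data[gid.x] * 2.0;
  }
`;

async function buildPipeline(): Promise<GPUComputePipeline> {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) throw new Error("WebGPU not available");
  const device = await adapter.requestDevice();
  const module = device.createShaderModule({ code: shaderSource });
  return device.createComputePipeline({
    layout: "auto",
    compute: { module, entryPoint: "main" },
  });
}
```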

ThomasRaoux commented 7 months ago

Note that WebAssembly would be analogous to a CPU target

Not necessarily, there's WebGPU.

WASM and WebGPU are separate things as far as I know. WebGPU has a shader language derived from SPIR-V. I don't believe it can be generated from LLVM directly.

jlebar commented 7 months ago

WASM and WebGPU are separate things as far as I know. WebGPU has a shader language derived from SPIR-V. I don't believe it can be generated from LLVM directly.

I think the proposal here is to compile Triton to wasm so it can compile code targeting WebGPU in the browser.

But my point is that we're conflating two things: (1) running the compiler in the browser, and (2) running the compiled code in the browser.

ThomasRaoux commented 7 months ago

WASM and WebGPU are separate things as far as I know. WebGPU has a shader language derived from SPIR-V. I don't believe it can be generated from LLVM directly.

I think the proposal here is to compile Triton to wasm so it can compile code targeting WebGPU in the browser.

Ah, I had understood it differently. Maybe that needs clarification: would the compiler itself run in the browser, or only the kernels?

kyr0 commented 7 months ago

But I also don't think this is necessary to run Triton programs in the browser using WebGPU. You can compile Triton to LLVM IR offline, then compile the LLVM IR to WebGPU offline, and then include the WebGPU code in your webpage.

Exactly, that is what I was thinking about. You can, however, also let WebGPU interface with WASM first, so that the deploy target is a WASM module. Why would you? JS is always JIT-compiled and memory-managed, which is fine for UI/UX and typical workloads, but it can lead to significant performance drawbacks for calculation-heavy tasks that execute the same instructions over and over; those instructions should be exactly the ones you wrote, and heuristics-based JIT optimization would often lead to worse results there.

Then there is memory management. Memory I/O across the interface is a huge pitfall in JS. But when using WASM, you can have a pointer from the JS context into the WASM context, and you can read/write TypedArrays via DMA (direct memory access). I demonstrated DMA between the WASM and JS memory scopes in this repo, years ago: https://github.com/kyr0/assemblyscript-js-wasm-interop-example/blob/master/index.js#L20
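
The core trick is that a TypedArray can be a view directly over the WASM module's linear memory, so nothing is copied across the boundary. A minimal sketch; the `kernel.wasm` module and its `alloc` export are illustrative assumptions:

```ts
// Minimal sketch: zero-copy data exchange between JS and WASM. "kernel.wasm"
// and its "alloc" export are illustrative; real module interfaces vary.
const { instance } = await WebAssembly.instantiateStreaming(fetch("kernel.wasm"));
const { memory, alloc } = instance.exports as {
  memory: WebAssembly.Memory;
  alloc: (bytes: number) => number;
};

const n = 1024;
const ptr = alloc(n * Float32Array.BYTES_PER_ELEMENT);
// This view aliases WASM linear memory: writes here are visible inside the
// module without any copy. (Recreate the view if the memory grows.)
const view = new Float32Array(memory.buffer, ptr, n);
view.fill(1.0);
```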

This alone can boost runtime performance by orders of magnitude. Someone from the Babylon.js team used this code to boost his game from 5 FPS to 73+ FPS; unfortunately, I cannot find the demo link anymore. His use case was physics calculations running on the CPU, so the GPU was basically waiting for I/O all the time, which made it the bottleneck for the frame rate. If you have JS instructing WebGPU to do things, I expect a similar scenario.

Btw, TensorFlow.js has a WebGPU backend that could be inspiring for research and experimentation on WebGPU's features and how to best use them: https://github.com/tensorflow/tfjs/tree/master/tfjs-backend-webgpu/src

For example, softmax: https://github.com/tensorflow/tfjs/blob/master/tfjs-backend-webgpu/src/softmax_webgpu.ts#L41
It is used as a Program: https://github.com/tensorflow/tfjs/blob/master/tfjs-backend-webgpu/src/kernels/Softmax.ts#L42

And of course, here is the interfacing code, which for this implementation is done in JS scope. I believe at least the hot path for memory I/O could be more performant when done using WASM: https://github.com/tensorflow/tfjs/blob/master/tfjs-backend-webgpu/src/backend_webgpu.ts

I don't want to bring too many topics into play, but compiling to both WASM and WebGPU might be a bit much to ask for here, and the interfacing code itself would probably remain largely the same anyway. So: AssemblyScript is a language that compiles to WASM via Binaryen (an LLVM alternative), and its syntax is very close to TypeScript. A PoC could be done by forking the TF.js WebGPU backend management code and refactoring it to partly use AssemblyScript, which would emit a WASM module for the I/O part. The WASM module is loaded in a Worker for parallel execution. Both this project and TensorFlow.js could benefit from the learnings.

DMA could be used between JS and WASM for fast I/O, and the WASM module could load a precompiled Triton softmax WebGPU shader, compiled down from LLVM IR or in any other way. I have no idea how hard it would be to hack a quick PoC for that part of the experiment, but if I'm not mistaken, this would be quite a nice way to demo an MVP.
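
To make the Worker part concrete, here is a minimal sketch of the message protocol, assuming the TypeScript "webworker" lib. The CPU softmax is only a placeholder for the real path (stage input in WASM memory, dispatch the precompiled WebGPU shader, read the result back):

```ts
// worker.ts - minimal sketch of the Worker-side glue for the PoC. The CPU
// softmax below is a placeholder for "stage input in WASM linear memory,
// dispatch the precompiled WebGPU shader, read the result back".
function softmax(x: Float32Array): Float32Array {
  const m = x.reduce((a, b) => Math.max(a, b), -Infinity);
  const out = new Float32Array(x.length);
  let sum = 0;
  for (let i = 0; i < x.length; i++) {
    out[i] = Math.exp(x[i] - m);
    sum += out[i];
  }
  for (let i = 0; i < x.length; i++) out[i] /= sum;
  return out;
}

self.onmessage = (e: MessageEvent<Float32Array>) => {
  const result = softmax(e.data);
  // Transfer the underlying buffer back instead of copying it.
  postMessage(result, [result.buffer]);
};
```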

eatcosmos commented 6 months ago

https://github.com/iree-org/iree

kyr0 commented 6 months ago

@eatcosmos Interesting! WebGPU and WebAssembly are marked as experimental on the website (as are ONNX and PyTorch), but the architecture looks promising! Why do you think this project hasn't gained more traction yet? And how does it perform, performance-wise?

eatcosmos commented 5 months ago

Hi @kyr0

Translated by ChatGPT, possibly with some mistakes:

Sorry for the late reply; I am still getting used to GitHub's notification system. I believe that many developments in technology are influenced by a combination of chance and necessity. Technological advancements are often constrained by market demand and human resources. On one hand, the market demand for certain technologies may not be strong enough yet; on the other hand, there may not be enough personnel to develop them.

Over time, as foundational tools become more prevalent, certain technologies will naturally emerge. For example, the proliferation of websites led to the creation of WebAssembly, which in turn gave rise to native WebAssembly. Although native WebAssembly shows great promise, Docker emerged earlier, illustrating the mix of chance and inevitability in technology products.

The development of WebAssembly required consensus and collaboration among numerous browser vendors, whereas Docker technology could be promoted with minimal consensus. Docker's growth has significantly advanced the software industry, indirectly contributing to the development of WebAssembly.

As foundational tools continue to improve, WebGPU and WebAssembly will undoubtedly receive more natural attention. If we want to accelerate this process, individual efforts and contributions could help speed things up.