iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Port the IREE runtime to WebAssembly+JavaScript without Emscripten #8327

Open ScottTodd opened 2 years ago

ScottTodd commented 2 years ago

We've demonstrated support for the platform features we need for a good web deployment story with the samples at experimental/web/.

Emscripten was relatively easy to bootstrap with, offers fairly mature optimization modes, and provides various ways for connecting compiled code with JavaScript.

Porting without Emscripten would let us precisely control our binary size, runtime overheads, and specific implementation choices. Our runtime is designed to support bare metal platforms, so there are only a few syscalls and other platform-specific pieces of functionality to implement. See the WebAssembly without Emscripten blog post for a high-level overview of the porting process, with some specific examples.

Rough task list:

kripken commented 2 years ago

This is definitely possible and may be worthwhile for a very low-level project like this, but I think it would take significant work. It looks like you already have a list of the things that will take the most effort:

Both of those are only stable in Emscripten at the moment, so you'd be charting your own path there. In particular, last I heard LLVM doesn't have a non-Emscripten triple that supports all you need there. These might end up being large amounts of effort, as you'd need to start from a design and then implement it. But it would get easier the smaller your subset of pthreads and dynamic linking is.

Another path you can look at is standalone wasm in Emscripten, which is sort of a compromise between Emscripten and non-Emscripten modes. It uses Emscripten, but it only emits a wasm file, without any JS runtime. You can then build a minimal JS runtime that is suitable for exactly what you need. The benefit of doing it that way is that you'd get all the wasm side of pthreads and dynamic linking and everything else "for free" from Emscripten. Writing your own JS runtime would still take work, though a lot less than doing it all from scratch. A downside to this approach, however, is that Emscripten's wasm/JS ABI can change over time, so there may be some amount of work during upgrades.

Other stuff you'd need to do in either option, that you already have listed:

I'm not sure how much work those would take as it depends on what JS API you want.

Aside from that, stuff I don't see listed here:

benvanik commented 2 years ago

Thanks for the tips and for sanity-checking our (very rough) plan! Experience with these parts of the system can be extremely hard to develop, and I'm happy you stumbled across this issue and shared it šŸ™‡

As you note, being a library puts a lot of pressure on parts of Emscripten that are not currently designed for composition. For us, in the long term you can imagine potentially a dozen different instances of the IREE runtime on a page, of which several may be compiled from different versions and deployed by different organizations; the lighter weight they are and the more isolated they can remain, the better, so we tend to focus on optimizing for that vs. ease of implementation on our side. Saving even a few MB of wired memory, or a few workers by being able to share workers across different instances, can have an outsized impact on user experience vs. us just needing to type some more once :)

TL;DR: Emscripten for standalone wasm is probably what we'll end up with, with early milestones using the pthreads/WebGPU libraries that are eventually swapped out for custom versions tailored to library-style requirements as we extend our API. The rest of this wall of text is context that's only been in my head but may be useful for the rest of the team and others coming across this issue:

Our experience with the tradeoffs of Emscripten's runtime libraries vs. our own comes from a previous project @ScottTodd and I worked on, where we were doing multithreaded video encoding/decoding and WebGL rendering/compute with similar code size and efficiency constraints (both execution perf and memory consumption). That helped us ensure that this time around things were put together so we could do the reimplementation/slimming piecemeal (well, hopefully šŸ¤ž :). When operating on many big blobs of data like video/image frames that can be 10-20MB each, or ML models with hundreds of MB of parameters/transients, even small scalar factors on things like allocations and copies can be killer. It's definitely not an easy space to design around, and something most people won't need (or know they need) until they hit it, so it's understandable that we're off the supported path (it's more fun off-road anyway :).

pthreads is an example of one area we've looked at where we don't need the full generality, as we're already factored for multi-platform and bare metal/RTOS: we pretty much only need worker creation and futexes (Atomics.notify/Atomics.wait), and control over the lifetime of the workers is easier to reason about without an emulation layer in between. Rolling our own would be less of a need if we were just trying to compile and run as a single process, as with normal Emscripten usage for full-page apps, and having an easy porting story by following the pthreads API shape is great for most users not starting from scratch.

For us, though, we want to expose things like a worker pool on our public API that is not tied to Emscripten internals, so as to share and optimize the expensive part of the system (worker creation) and solve critical performance issues (contention from having more than one instance creating its own worker pool and not collaborating on scheduling). This extends further with shared workers (though today that's just a dream): if something like Google Docs had ~5 models per page (e.g. spelling/grammar, type-ahead, gesture control, and summarization) and you had even 10 pages loaded, you'd want ~Navigator.hardwareConcurrency workers instead of 50 * Navigator.hardwareConcurrency. Figuring out how to manage such complex ownership across instances within the bounds of the Emscripten libraries is (for me, anyway) more difficult than just making the API calls directly. It's a problem unique to libraries using these expensive resources, though, so not something people usually hit.

The Emscripten WebGPU library is something we will at least use to bootstrap and may want long-term, but it roughly fits into the same bucket as pthreads: all we need is the method calls, and we explicitly don't want anything related to context creation/device management/etc. This is because as a library we want to be able to seamlessly interop with existing external devices and contexts, and the state tracking that happens internally for things like the id<->JS object mappings needs to be something we can control via our own APIs and our own ref counting (vs. what the Managers do in there). We previously tried tight interop with Emscripten's WebGL layer, but it grew to be quite complex and tied to the internal implementation details of library_webgl.js; in the end it was easier to just generate the thunks/marshaling against a stable API surface (the browser) than to keep up with the Emscripten internals and monkey-patch library functions.

It's also another example of where coming in with a library approach vs. a full process-style compilation changes the structural requirements: we want to be able to upload/download buffers to ArrayBuffers independent of our runtime heap. If a user has fetched 100MB of model parameters into an ArrayBuffer, we don't want to have to malloc 100MB of heap (and likely grow), copy it in, copy it back out to upload, and be stuck with that 100MB heap growth for the lifetime of the page even if we really only need ~1MB steady-state (things like https://github.com/emscripten-core/emscripten/blob/7ce8b5853492df005a2f586b4e432b760b4abb4e/src/library_webgpu.js#L1889-L1894 are what kill us). In this case, whether we roll our own is not just a code size optimization but existential.

I don't think any of these issues or capabilities we want are incompatible with Emscripten, but they are things Emscripten has traditionally not focused on, likely because what are hard constraints for resource-intensive libraries are usually just nice-to-haves for full-page apps, and as of today there aren't many people building resource-intensive libraries (and that's probably a good thing given the complexity :). It'd be great if there were a design effort around making Emscripten's runtime bits more composable, and maybe what we come up with as we navigate this can be useful as examples/use cases. We sit somewhere between bare wasm and full Emscripten in terms of what we need, and upstream Emscripten being able to reliably scale closer to bare wasm would be great for everyone else having to make these tradeoffs; hopefully it'd encourage more efficient web apps, as not everyone can afford to be as bonkers as us. Traditionally it's been a hard gap to manage, though, and when faced with that completely open-ended design space the pragmatic all-or-nothing approach usually ends up winning out (as seen in this issue!). I still wish we didn't have to write it all, but I'd rather save millions of aggregate compute hours over time than save a thousand lines of boilerplate; I understand that's a privileged viewpoint :)

As for the Emscripten toolchain: I think once we get the core libraries we need implemented (workers and WebGPU, which can be done independently) we can see what's left. From an IREE user's perspective, not requiring anything but clang/lld is an excellent story, so understanding what we get from the toolchain when targeting bare wasm would be useful. I think the point about the tail end of the pipeline (binary size optimization/etc.) is a big one to watch out for, as compiler driver logic can get pretty complex. My hope is that we start with what we have today (the full Emscripten toolchain and runtime libraries) and then, as we edge toward productionization along various axes, start slimming things down as needed. Given WebGPU's early/volatile status we'll have the CPU/worker side shipping first, for example, and can focus on that. But various uses have differing requirements, and if you're just trying to run one small model on one page, then even if there are efficiency issues there's lots of slack available: if you can fit your entire audio hotwording model in less memory than the favicon requires, you can just ship it without worrying about any of this stuff :P

kripken commented 2 years ago

@benvanik

Very interesting, thanks for the details!

Btw, I think there is some similarity between your needs and those of Unity who are contributing related things to Emscripten, also with the goals of minimalism. For example they recently added a Wasm Workers API, which is much more minimal than pthreads, and suitable if you don't need all of pthreads. There is also the MINIMAL_RUNTIME option they are working on (in upstream emscripten; I don't think that has an official doc page yet). Might be worth looking at those if you haven't already.

benvanik commented 2 years ago

Oh wow, Wasm Workers looks interesting, particularly the hierarchical nature (great for separating libraries) and the lack of a thread main (great for sharing workers). It may be a real option for us!

Also, that documentation is fantastic! The hardest part about a lot of these toolbox-style systems is figuring out when to use one option vs. another; I think we could learn a lot from that style :)

ScottTodd commented 2 years ago

Nice! I'd mainly been looking at https://emscripten.org/docs/porting/pthreads.html for how to interface with threading/workers. We have a threading_pthreads.c file implementing our limited threading.h API and a few branches in synchronization.c for futex usage. We can swap those out for threading_emscripten.c/threading_wasm_workers.c implementations to give that a try. Emscripten has been really useful for bootstrapping, and if we can peel away the layers we want more direct control over one by one, that could satisfy our requirements without a lot of upfront work.