iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/

Map WASM memory allocation APIs to IREE's HAL #5137

Open ScottTodd opened 3 years ago

ScottTodd commented 3 years ago

Splitting this off from https://github.com/google/iree/pull/5096 and a discussion on Discord here.

TL;DR: WAMR allocates memory per module (executable). IREE wants to define an allocator up a level, shared across executables. What should we do?


WASM runtimes limit what memory WASM modules can access to a single contiguous memory address range that the module can suballocate within. Applications can typically create this block of memory, resize it, offer it to instantiated modules, etc. See this article for a pretty good overview.

IREE follows several APIs (like Vulkan) in using a hierarchical setup going from application contexts down to executables:

driver registry (iree_hal_driver_registry_t)
  - driver (iree_hal_driver_t, VkInstance)
    - device (iree_hal_device_t, VkPhysicalDevice + VkDevice)
    - device
  - driver
    - device
    - device

executable (iree_hal_executable_t, VkShaderModule + VkPipeline[])
  - executable instances are created by devices and _may_ be cached/reused across devices by drivers

hal allocator (iree_hal_allocator_t)
  - may be independent from drivers/devices (e.g. for CPU implementations), or linked to a specific driver/device
  - each device has _one_ allocator, which it uses for all of its loaded executables

drivers and devices should be isolated from each other, except where resource sharing is explicitly used
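To make that invariant concrete, here is a minimal hedged sketch against IREE's runtime C API (names roughly as of this era; signatures may have drifted since): every executable loaded on a device is expected to draw its buffers from that device's single allocator.

```c
#include "iree/hal/api.h"

// Hedged sketch: assumes `device` is an already-created iree_hal_device_t*.
// Each device owns exactly one allocator; all executables loaded on that
// device share it (the buffer allocation call itself is omitted because its
// signature has changed across IREE versions).
static iree_hal_allocator_t* shared_allocator_for(iree_hal_device_t* device) {
  return iree_hal_device_allocator(device);  // exactly one per device
}
```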

While implementing a WASM HAL driver using WAMR in https://github.com/google/iree/pull/5096, we found that WAMR has a different memory allocation architecture in its "iwasm" VM core:

wasm_runtime_full_init (static)
  - wasm_runtime_malloc: allocate from runtime memory environment

wasm_runtime_load(module_bytes) -> wasm_module_t

wasm_runtime_instantiate(wasm_module_t, heap_size) -> wasm_module_inst_t
  - wasm_runtime_module_malloc(wasm_module_inst_t, size): allocate from WASM module instance

wasm_runtime_create_exec_env(wasm_module_inst_t, stack_size) -> wasm_exec_env_t
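For concreteness, a hedged sketch of that flow against WAMR's embedding API (`wasm_export.h`); the stack/heap sizes are placeholders and error handling is trimmed. The point of friction is visible in the middle: the heap that `wasm_runtime_module_malloc` draws from is created per module instance.

```c
#include <stdint.h>
#include "wasm_export.h"  // WAMR "iwasm" embedding API

// Hedged sketch of the per-module-instance allocation flow described above.
static void load_and_allocate(uint8_t* wasm_bytes, uint32_t wasm_size) {
  char error_buf[128];
  wasm_runtime_init();

  wasm_module_t module =
      wasm_runtime_load(wasm_bytes, wasm_size, error_buf, sizeof(error_buf));

  // The heap used by wasm_runtime_module_malloc() below is created here, per
  // module instance; there is no runtime- or device-level allocator that
  // several instances could share.
  wasm_module_inst_t inst = wasm_runtime_instantiate(
      module, /*stack_size=*/16 * 1024, /*heap_size=*/1024 * 1024,
      error_buf, sizeof(error_buf));

  // Suballocate inside this instance's heap (newer WAMR versions return a
  // 64-bit offset here).
  void* native_ptr = NULL;
  uint32_t offset = wasm_runtime_module_malloc(inst, 4096, &native_ptr);
  (void)offset;

  wasm_exec_env_t env =
      wasm_runtime_create_exec_env(inst, /*stack_size=*/16 * 1024);
  (void)env;
}
```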

So, at face value, an "IREE WAMR device" would need to either limit itself to one executable or maintain multiple allocators.

WAMR has a WAMR_BUILD_SHARED_MEMORY CMake option / WASM_ENABLE_SHARED_MEMORY C define that could help, but we still want isolation between drivers/devices. One of the main reasons we'd be using WASM would be for the memory sandbox.

Notably, the WASM C API, which WAMR partially implements, has a different model:

wasm_engine_t
  - wasm_store_t
    - wasm_memory_t
    - wasm_module_t

This model seems to map more directly onto IREE's architecture: the device or driver would own a wasm_engine_t, the device would own a wasm_store_t and wasm_memory_t, and each executable would be a wasm_module_t.
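A hedged sketch of how that mapping might be wired up with the standard C API header (`wasm.h`): `binary` is assumed to already hold a compiled module that declares a memory import, the import-matching details are simplified, and some implementations take a `wasm_extern_vec_t` for imports rather than a raw array.

```c
#include "wasm.h"  // standard WebAssembly C API (wasm-c-api)

// Hedged sketch: create a memory independently of any module and hand it to
// the module at instantiation time as an import.
static void instantiate_with_shared_memory(const wasm_byte_vec_t* binary) {
  wasm_engine_t* engine = wasm_engine_new();     // ~ IREE driver
  wasm_store_t* store = wasm_store_new(engine);  // ~ IREE device

  // Memory created outside any module (~ the device-level allocator).
  wasm_limits_t limits = {/*min=*/16, /*max=*/1024};  // in 64 KiB pages
  wasm_memorytype_t* mem_type = wasm_memorytype_new(&limits);
  wasm_memory_t* memory = wasm_memory_new(store, mem_type);

  wasm_module_t* module = wasm_module_new(store, binary);  // ~ executable

  // The step this issue hinges on: several modules could be instantiated
  // with the same `memory` import (some runtimes take a wasm_extern_vec_t
  // here instead of a plain array).
  wasm_extern_t* imports[] = {wasm_memory_as_extern(memory)};
  wasm_trap_t* trap = NULL;
  wasm_instance_t* instance = wasm_instance_new(store, module, imports, &trap);
  (void)instance;
}
```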


Here are a few of our options, none of which seem too favorable:

(A) Wait for WAMR to implement the latest WASM C API and use that, instead of their "iwasm" VM core

(B) Continue using WAMR's "iwasm" VM core, finding a workaround using shared memory

(C) Restrict devices in IREE's WASM HAL to one executable

(D) Externalize memory using native read/write functions loading/storing from our own heap, making each wasm environment use no local heap and contain compute-only code. This would require some further ahead-of-time compilation work on our end (turning loads/stores into calls and batching them by page, since individual loads/stores would be slow).

(E) Use a different WASM runtime (see this list). We picked WAMR as an initial target for its low footprint, portability, performance, and ease of integration (C/C++ and CMake with few dependencies). If we want an IREE WASM HAL to serve as a flexible deployment path, we can't really compromise on those points (e.g. by taking a Rust dependency or using a runtime that can't run on embedded systems).

ScottTodd commented 3 years ago

I spent some time evaluating wasm3 and Wasmtime's APIs from this perspective:

Both seem architecturally more flexible when it comes to memory allocation and memory space management, but I don't quite see a way to satisfy our requirements yet. More discussions on Discord about this here (wasm3) and here (wasmtime).

Generally, we want to allocate a block of memory to be managed by an IREE "device" and shared between wasm modules (IREE "executables"). I think this can be solved using memory exports and imports (it's basically how SharedArrayBuffer is used on the web?), though I'm still searching for concrete examples and documentation from these runtimes. We're also wondering about thread safety (Discord discussion) - locking to allocate or load/unload a module is fine but we shouldn't need to take a lock to safely call stateless functions, for example.

mykmartin commented 3 years ago

Quick question: would the WebAssembly multi-memory proposal be useful for this? To avoid the external call cost, the device/HAL allocator logic could be baked into each module but operate on a single buffer shared between all modules, with other buffer(s) allocated per-module for isolated memory space as needed.

benvanik commented 3 years ago

@mykmartin it may be (my hope is that it is :)

Our usage really needs a way to allocate a growable block of memory, independent of any wasm module, that we can then provide to each wasm module as we load it. The default memory of each wasm module is where stacks would live, while the bulk data we work with would come from the shared memory.

It's hard for me to tell from the spec whether this is something the spec even cares about or whether it's purely up to the engines. From what we've looked into, most engines assume they create all the memory for the loaded modules rather than allowing imports. The proposal spec looks very compatible with this approach: we could do this with just two data segments, and we can easily assign the pointer address spaces in LLVM that end up as the data segment identifiers on instructions in our generated code. Then we just need the engines to offer "allocate a growable memory instance" and "import this memory instance during module instantiation".

We also could get by without multi-memory support if we could do the same "allocate a growable memory instance" and "use this memory instance during module instantiation" - we'd then assign the stack offsets ourselves and have all modules share the same exact memory. In browser land this would be like having a SharedArrayBuffer that all loaded wasm modules used - which would be useful (in my previous life I worked on Google Maps and wanted the same feature there for multithreaded decoding into staging buffers for GPU upload).

(posted some more details about what we are doing here: https://github.com/google/iree/issues/2863#issuecomment-881572348, which shows where multi-memory may help)

benvanik commented 3 years ago

The other thing multi-memory may allow in the future is importing large read-only constant buffers. In ML inference this would be things like your model weights (which can be 10-100 MB, or much larger, 2 GB+). Today we would need to copy those into the wasm-accessible memory - similar to what we need to do for GPUs with discrete memory - but it'd be nice not to have to, given that the bytes already exist. If we could create a wasm_memory_t with existing read-only contents then we could import that without the full alloc+copy.

mykmartin commented 3 years ago

fyi the wasmtime team have just added the multi-memory option in the C API: https://github.com/bytecodealliance/wasmtime/issues/3066 - would that have an impact on the analysis in Scott's comment?

benvanik commented 3 years ago

> fyi the wasmtime team have just added the multi-memory option in the C API

Nice! That would require a small bit of compiler work to enable (just tagging LLVM pointers with the right address space, I believe) but nothing major - then we could evaluate an allocator that sliced from a wasm memory block and import that into each executable in the prototype wasmtime implementation Scott has.
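For illustration, a hedged sketch of what "tagging LLVM pointers with the right address space" might look like at the C source level, using Clang's address_space attribute; whether a given wasm backend lowers address space 1 to memory index 1 is an assumption here, not something this thread establishes.

```c
// Hedged illustration only: assumes a toolchain where pointers tagged with
// __attribute__((address_space(1))) lower to wasm loads/stores against
// memory index 1 (the imported, shared memory) rather than the module's
// default memory 0. That mapping is an assumption, not a guarantee.
#define SHARED_MEM __attribute__((address_space(1)))

// Bulk data lives in the shared memory; locals/stack stay in memory 0.
float dot(const SHARED_MEM float* a, const SHARED_MEM float* b, int n) {
  float sum = 0.0f;
  for (int i = 0; i < n; ++i) sum += a[i] * b[i];
  return sum;
}
```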

aaron-schneider commented 1 year ago

Ho there! This bug hasn't been updated in a long time. Good intentions and all, but we're moving this to the backlog. Feel free to bring it back if you think there's a reasonable chance it'll get worked on in the next 6mo!