orca-app / orca

A Wasm environment for cross-platform, sandboxed graphical applications.
https://orca-app.dev

Memory model #36

Open martinfouilleul opened 9 months ago

martinfouilleul commented 9 months ago

The 32-bit linear memory model has all kinds of drawbacks and limitations. There are several memory proposals, but imo they don't really solve those problems, and they somewhat complicate the bytecode and the interpreter. I'd like to try going a much simpler route for Orca:

This way we can get all the nice properties of virtual memory without adding complexity to the bytecode or the vm.

--edit: see below for a more detailed discussion--

martinfouilleul commented 3 months ago

Ben and I recently had a conversation about virtual memory, and here's a summary of my notes:

Target-defined, import-based memory model

One issue with a virtual memory proposal (and other "platform-level" proposals) is that it ties into a lot of different/incompatible platform constraints and use-case requirements. In my opinion it is a very bad idea to try to parameterize the core wasm spec (eg the bytecode format or the general execution model) with every possible platform concern. It is much better to accept the fact that some programs will be written against an expected target, ie a set of platforms with common characteristics (for example, targets could be "Browser", "WASI", "Orca", some embedded platform, etc.). Portability should be a concern inside the set of platforms that form a target, but not across targets.

The target against which a module is written can then imply the presence of some features, either as normal wasm imports or through an upcoming "builtins" mechanism. It can also imply some conventions, eg around memory: a target could decide that accesses to address zero trap, or that out-of-bounds accesses wrap instead of trapping, etc. The point is that a number of features and behaviours can be target-defined, without multiplying the complexity of the spec (and the amount of work needed to implement it) by the number of platforms. This also allows starting to leverage useful platform features without having to second-guess their impact on completely unrelated (and possibly future) platforms.

There's also an idea about builtins that's being explored, for example in https://github.com/WebAssembly/js-string-builtins/blob/main/proposals/js-string-builtins/Overview.md. The idea is to have sets of builtins that can be provided to the module at compile time (instead of instantiation time, like imports). This looks like a very promising step. Not only can calls to these builtins be optimized better than imports, replacing the need to add new instructions to extend wasm in many cases, but they could also be provided "by default" on a per-target basis. For example, browsers could expose their APIs through this mechanism without needing any JS glue, WASI could do the same for its system calls, etc. This would of course be a great way to expose per-target virtual memory APIs.

Needed wasm core spec changes

So what would be left to specify at the wasm core level? Basically we would only need pointers and memory sizes to be 64 bits, and to amend the memory model so it can trap on target-defined conditions, even for in-bounds accesses. On this last point, Ben said it would be important to make this new memory model opt-in by adding a new type of "sparse" memory (ie a large memory that can trap), eg:

(import "" "mem" (memory (sparse 2^20)))

This way we would maintain backward compatibility and allow modules that don't care about virtual memory at all to still rely on the old model.

Discussion about pages, page size, and out-of-bounds

Wasm has a "page size" of 64k. In my opinion this is an unfortunate misnomer, since pages only make sense in the context of virtual memory, and current wasm... does not have virtual memory. This "page size" is actually just some granularity that is used to specify memory sizes, eg memory limits, or operands to mem.grow, etc. Although the name creates confusion, this does not mean it has to correspond to any future virtual memory page size.

The rationale for the 64k page size (at least the one I read/heard) seems to be that it allows runtimes to use guard pages to ensure trapping on out-of-bounds accesses. First, this seems to leak implementation concerns into the spec, which could be ok, but it also enshrines a specific number that is already obsolete for systems with huge pages. Imo this could have been solved in other ways:

In my ideal world, there wouldn't be such a "wasm page size" in the first place, and memory sizes would always be specified in bytes. Runtimes could still be free to allocate memory in page-aligned chunks, and add guard pages to catch out-of-bounds accesses. This means you could ask for a 16000-byte memory, and a runtime with a 4k page size would reserve 16384 bytes for you. Sure, the last 384 bytes would not trap, but accessing them would not touch any important memory either. As long as you don't escape the sandbox, I don't see why out-of-bounds accesses should be treated in a special way compared to other memory bugs.

I can see the desire for consistency in mandating trapping on out-of-bounds, but wasm programs can already trash their own memory without trapping, so this argument doesn't really convince me. Furthermore, the guard page trick isn't directly usable anymore for a 64-bit address space, so trapping requires costlier checks. Relaxing that constraint would allow aligning the size of the memory actually reserved by the runtime to whatever is most convenient for the runtime, and replacing these checks with cheap shift and mask operations.
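As a rough illustration of what that cheap path could look like (a hypothetical interpreter helper, not anything that exists in Orca today): if the reserved size is rounded up to a power of two, every address can be wrapped into the reservation with a single mask instead of a compare-and-branch.

// Illustration only: hypothetical interpreter helper, not actual runtime code.
// If the reservation size is a power of two, a 64-bit address can be wrapped
// into the reservation with a mask instead of an explicit bounds check.
#include <stdint.h>

typedef struct wasm_memory
{
    uint8_t* base; // start of the reserved region
    uint64_t mask; // reservation size minus one (size is a power of two)
} wasm_memory;

static inline uint8_t* resolve_addr(wasm_memory* mem, uint64_t addr)
{
    // Out-of-bounds addresses wrap into the reservation instead of trapping.
    return mem->base + (addr & mem->mask);
}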

Now we probably have to accept the "wasm page size" as a given, but the point is that we shouldn't confuse it with the virtual memory page size of our targets, and it could be beneficial to relax the trapping behaviour of out-of-bounds accesses for 64-bit memories.

Orca virtual memory API

The Orca runtime reserves the entire size declared by the sparse memory. Inside this reserved area the app can create mappings. In-bounds accesses outside of mapped ranges trap. Out-of-bounds access may wrap or trap. Null-address access traps.

Note that the runtime doesn't need to actually commit mapped ranges until they are touched. It can keep track of mappings and commit pages on-demand when they are first accessed.
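For reference, here's a rough sketch of that reserve-then-commit shape on a POSIX host (illustration only, not actual runtime code; committing could just as well happen lazily from a fault handler, which is omitted here).

// Sketch of reserve-then-commit on a POSIX host (illustration only).
// The whole sparse memory is reserved with PROT_NONE, so it consumes address
// space but no physical pages; ranges become accessible when they are mapped.
#include <stddef.h>
#include <sys/mman.h>

// Reserve the full declared size of the sparse memory up front.
static void* reserve_sparse_memory(size_t size)
{
    void* base = mmap(NULL, size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (base == MAP_FAILED) ? NULL : base;
}

// Commit a page-aligned range inside the reservation, eg in response to a
// mapping request from the app.
static int commit_range(void* addr, size_t size)
{
    return mprotect(addr, size, PROT_READ | PROT_WRITE);
}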

Ideally the sparse memory would declare the absolute maximum reservation it needs, and we would disable the mem.grow instruction. "Growing" would be done by mapping more memory, until the reserved address space is full. Alternatively we could allow the runtime to try growing the reservation if it can't honour a mapping request. However, this would potentially require copying / remapping the whole wasm memory, so we may want to just mandate that the wasm module declare its absolute max memory size upfront.

Some mappings can be created by the runtime itself at instantiation time, for the program's data segment and stack. Mappings can also be created by the runtime in response to some API calls, eg to create buffers inside wasm memory that are used to share data between the app and the runtime (eg ring buffers used for IO).

Ben mentioned forbidding overlapping mappings, and after some thought I think it would be a good idea. For comparison, Windows seems to track ranges allocated through VirtualAlloc, whereas Linux/macOS seem to treat pages individually. For Orca it would allow the runtime to keep track of the ranges that are mapped by the app and efficiently find and check those ranges.
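As an illustration of that bookkeeping (hypothetical helper code, not necessarily how the runtime would do it): non-overlapping ranges can be kept in an array sorted by start address and binary-searched.

// Hypothetical bookkeeping sketch: because mappings never overlap, the runtime
// can keep them sorted by start address and find the range containing a given
// address with a binary search.
#include <stddef.h>
#include <stdint.h>

typedef struct mapped_range
{
    uint64_t start; // offset into wasm memory
    uint64_t len;
} mapped_range;

typedef struct mapping_table
{
    mapped_range* ranges; // sorted by start, non-overlapping
    uint64_t count;
} mapping_table;

// Returns the mapped range containing addr, or NULL if addr is unmapped.
static mapped_range* find_mapping(mapping_table* table, uint64_t addr)
{
    uint64_t lo = 0, hi = table->count;
    while(lo < hi)
    {
        uint64_t mid = lo + (hi - lo) / 2;
        mapped_range* r = &table->ranges[mid];
        if(addr < r->start)
        {
            hi = mid;
        }
        else if(addr >= r->start + r->len)
        {
            lo = mid + 1;
        }
        else
        {
            return r;
        }
    }
    return NULL;
}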

So with this in mind, the API would look something like (in Orca API terms, so not necessarily an exact mapping to the wasm builtins):

// mapping / unmapping
typedef struct oc_mem_range
{
    char* ptr;
    u64 len;
} oc_mem_range;

oc_mem_range oc_mem_map(u64 size, oc_mem_map_flags flags);
oc_mem_range oc_mem_remap(oc_mem_range range, oc_mem_map_flags flags);
void oc_mem_unmap(oc_mem_range range);

void oc_mem_advise(oc_mem_range range, u64 offset, u64 size, oc_mem_advise_flags flags);

// split / coalesce
typedef struct oc_mem_range_pair
{
    oc_mem_range range1;
    oc_mem_range range2;
} oc_mem_range_pair;

oc_mem_range_pair oc_mem_split(oc_mem_range range, u64 size);
oc_mem_range oc_mem_coalesce(oc_mem_range_pair pair);

// shared memory. If we feel like it these could also be folded into the normal map functions

typedef struct oc_mem_shared { u64 h; } oc_mem_shared;

oc_mem_shared oc_mem_shared_create(u64 size, oc_mem_map_flags flags);
oc_mem_range oc_mem_shared_map(oc_mem_shared shared, oc_mem_map_flags flags);
oc_mem_range oc_mem_shared_remap(oc_mem_range range, oc_mem_shared shared, oc_mem_map_flags flags);

Mapped ranges' starting points and sizes are aligned to the host virtual memory page size. As long as the ranges returned can be bigger than what the user asked for, I'm not sure we actually have to mandate a specific page size. We could also provide a call to discover the page size used by the runtime. And perhaps the app could ask for a preferred page size?
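If we did expose such a discovery call, rounding a requested size up to the page size would be trivial (oc_mem_page_size is hypothetical here, just the call suggested above):

// Hypothetical helper: oc_mem_page_size() is the page-size discovery call
// suggested above, not an existing API. VM page sizes are powers of two,
// so rounding up is a simple add-and-mask.
u64 oc_mem_round_to_page(u64 size)
{
    u64 pageSize = oc_mem_page_size();
    return (size + pageSize - 1) & ~(pageSize - 1);
}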

With this, here's an example of how you would do the ringbuffer trick:

// reserve a 2*bufferSize range and split it into two halves
oc_mem_range range = oc_mem_map(2*bufferSize, OC_MEM_NONE);
oc_mem_range_pair pair = oc_mem_split(range, bufferSize);

// create a shared memory object of one bufferSize
oc_mem_shared obj = oc_mem_shared_create(bufferSize, OC_MEM_READ|OC_MEM_WRITE);

// map the same shared object into both halves, so they alias the same pages
oc_mem_shared_remap(pair.range1, obj, OC_MEM_READ|OC_MEM_WRITE);
oc_mem_shared_remap(pair.range2, obj, OC_MEM_READ|OC_MEM_WRITE);
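Assuming the two remapped halves really do alias the same pages, using the buffer then needs no special wrap handling (writeOffset, data and dataSize are made up for the example):

// Continuing the example above: both halves alias the same physical pages,
// so a copy that runs past bufferSize lands back at the start of the buffer.
char* buffer = pair.range1.ptr;
memcpy(buffer + writeOffset, data, dataSize); // dataSize may cross the seam
writeOffset = (writeOffset + dataSize) % bufferSize;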

One tricky use case that will require more research is aliasing buffers created by some native API into wasm memory. A typical example is the WebGL or WebGPU APIs, which have functions to map GPU buffers but don't let you choose where to map them. In order to avoid copies, we would like to alias these buffers into wasm memory by creating a mapping to the same physical pages. This is somewhat similar to the ringbuffer example, except I don't know of any API to do that after the fact, ie for an already mapped range. If we do find such an API, we would simply set up that aliasing in our WebGL/WebGPU bridging functions, so this would all be transparent to the app.

Roadmap

We can start exploring 64-bit virtual memory in Orca once we have support for it in bytebox. Meanwhile, I think nothing is stopping us from exposing the same calls to 32-bit wasm modules and just trapping on excessive sizes. Since we're dealing with small memories and we don't have compiler support for our "sparse memory" proposal, we can just reserve 4GiB anyway and experiment with the mapping API for now.

rdunnington commented 3 months ago

I like the direction here overall, but one thing that gives me pause is that I believe the way this would be implemented on Linux is with overcommit. Overcommit is a configurable setting on user systems; there is a "typical" default, but it may vary between OSes. When overcommit is on and the system passes some memory usage threshold, it will selectively kill the applications it deems the worst memory offenders via the "OOM killer". If overcommit is off, huge allocations that cannot actually be satisfied by the system will simply fail, meaning that any Orca app with a large sparse memory will simply fail to run. At least, this is what I believe will happen - I haven't tested it myself yet.

It turns out that what overcommit setting to use is a somewhat controversial topic among Linux users - some really care about applications only using the amount of memory they reserve, so they either refuse to use apps that rely on overcommit, or only use them begrudgingly. I'm not personally super in tune with what the "majority" of Linux users care about - maybe most of them just don't care either way and it's just a vocal minority that dislikes overcommit. Either way, I think we should talk about whether this group's needs are important to Orca and, if so, what to do about it.

Also, it would be great to get some input from regular Linux users to get a read on the overall temperature around overcommit and how much people care either way.

bvisness commented 3 months ago

I don't think the initial reservation would trigger any limits; presumably it would be mapped with PROT_NONE and shouldn't actually result in any physical memory mappings. But maybe I'm wrong about that?

rdunnington commented 3 months ago

Yeah, it could be that I'm mistaken as well; I should probably just write some code on Linux to confirm either way. It's just that all the docs I read about overcommit don't seem to distinguish between the initial allocation and the different mapping types.
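Something like this untested sketch is probably enough to check: reserve a deliberately huge region with PROT_NONE, then commit and touch a single page, and see how it behaves under the different /proc/sys/vm/overcommit_memory settings.

// Untested sketch: check whether a large PROT_NONE reservation succeeds and
// whether committing a page on demand works, under a given overcommit setting.
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t reserveSize = (size_t)64 * 1024 * 1024 * 1024; // 64 GiB reservation

    void* base = mmap(NULL, reserveSize, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if(base == MAP_FAILED)
    {
        perror("reserve failed");
        return 1;
    }

    // Commit and touch a single page to confirm on-demand commit works.
    if(mprotect(base, 4096, PROT_READ | PROT_WRITE) != 0)
    {
        perror("commit failed");
        return 1;
    }
    ((char*)base)[0] = 1;

    printf("reserved %zu bytes at %p, first page committed and touched\n",
           reserveSize, base);
    return 0;
}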

bvisness commented 3 months ago

I know that for Android specifically, the OOM killer really only cares about resident memory, i.e. memory with physical mappings. At least that is what I gather from this page. (RSS, PSS, and USS are all metrics of resident memory, just with different accounting of shared memory.)

In general I think this makes sense, and I would be surprised if other Linux distros were killing things based on commit alone, since killing any process with low physical memory use will not really improve memory pressure!

rdunnington commented 3 months ago

Yeah, I agree that makes sense, but my concern is for users that have completely disabled overcommit (e.g. on desktop Linux). My understanding is that in that scenario, if a program requests an allocation size that the system cannot satisfy with physical pages, the allocation will fail (return NULL). But again, my understanding could be wrong - that was just my impression from reading the docs, and I need to verify.