In-Module 'Dirty Pages' Array

kettle11 / tangle

Radically simple multiplayer / networked WebAssembly

MIT License

1.18k stars 37 forks source link

In-Module 'Dirty Pages' Array #7

Open AThilenius opened 1 year ago

AThilenius commented 1 year ago

Firstly, what a clever idea, I'm fascinated by it! Lockstep has been done by snapshotting runtimes (LUA of note) before, but this is a whole new level of flexible. It would fit a game I've been toying with for years quite well, which requires untrusted networked user code to run alongside the game in a browser. Many kudos to you.

Let me test some understanding:

You have wasm_guardian which rewrites the WASM module to call a host (JS) function on each memory mutation. That is an immense amount of overhead I'm guessing?
Which I assume is why it was removed in this commit. But wasm_guardian is still being applied to the WASM module.
Both of those were in an effort to not have to snapshot the entire WASM module memory each time, which can very quickly grow into gigabytes.
gzip was introduced as a workaround for snapshot memory size. It's computationally expensive.

Taking a page out of hardware TLBs (sorry for the pun) I could see tracking dirty pages in WASM memory itself. Allocate a static array, probably 1 byte per flag, say 64KiB pages, so 64k entry dirty array. That reduces the runtime overhead to dirty_flags[mem_addr >> 16] = 1. It's still a lot for each and every memory write, but without access to the hardware TLB (don't see that happening from a browser any time soon) I can't see any way to avoid that cost. CPUs all use barrel-shifters, so that whole thing is likely to be 3 or 4 cycles and cache coherency should be really good. I saw notes in your code somewhere about this too, but excluding the stack from this tracking would be a huge win.

Again, awesome work, I love it.

kettle11 commented 1 year ago

Excellent analysis!

One small point: wasm_guardian has most of its transformations turned off at the moment: it's only being used to force exporting of all Wasm globals.

My thought process is similar to yours and Tangle should eventually adopt the approach you've described. The main difficult question to answer is where do the dirty_flags get stored?

If you store them in regular Wasm memory a poorly behaved program could accidentally or intentionally clobber them. Maybe that's OK for now? I need to research more but perhaps there's some sort of mutable global array in a Wasm extension that could be used. A 'nuclear' option that may not work that well is to generate a ton of global variables.

kettle11 commented 1 year ago

Some more thoughts on this: Ideally multi-memory could be used, but that's not widely supported yet. Then the dirty_flags array could be stored in a separate memory and the Wasm code could be preprocessed to make sure the other memory isn't accessed.

But what alternatives work in today's browser?

table is usually used for function pointers but I think it could work here as well. A table could be allocated to point to dummy functions and the host environment (browser JS) could skim the table to see which flags are set. My concern is that using tables for a purpose they weren't designed for might mess with performance. I'll do some tests.

AThilenius commented 1 year ago

where do the dirty_flags get stored

I think initially you go simple with this, a 64KiB global in WASM of whatever type as long as it's contiguous memory. Then on both WASM and JS side just treat that contiguous block of memory as a byte array and do unchecked writes to it.

a poorly behaved program could accidentally or intentionally clobber them

I wouldn't worry about solving that initially. A WASM module is "allowed" to use up to 4GiB of memory, so handle the degenerate case were all 4GiB is updated every single sync. During sync, quickly check that the total number of updated bytes isn't greater than the total allocated memory though, then it's up to host code to set sane limits based on application. In my case WASM modules will be very limited for untrusted user code (~25MiB memory limit per module) with soft limits on other resource usage.

kettle11 commented 1 year ago

I think initially you go simple with this, a 64KiB global in WASM of whatever type as long as it's contiguous memory. Then on both WASM and JS side just treat that contiguous block of memory as a byte array and do unchecked writes to it.

The problem is that I can't think of a way to do that without coordination from the language compiling to Wasm. If the dirty_flags array is in the same memory as the rest of the Wasm then the Wasm language may accidentally write over dirty_flags. Wasm has individual global variables but they're just single numbers. What we need is some sort of global array, that's not within Wasm memory.

It'd be possible for each Wasm program to allocate an array and report it to the host, but that makes Tangle more annoying to integrate with. It also allows the Wasm program to accidentally desync (by writing over dirty_flags). If possible I'd like to aim Tangle's design such that all desyncs are bugs within the Tangle library and are impossible to trigger from user Wasm.

But if that's the only way it could be a pragmatic stepping stone towards a better long-term solution.

AThilenius commented 1 year ago

The problem is that I can't think of a way to do that without coordination from the language compiling to Wasm

Ah. Right. Yeah this is a hairy one. Initially I thought "oh easy, just add a 64k offset to all memory access within the module and use byte [0, 64k) as your array" but that cause out of bound memory access (4G + 64K) which the JIT won't be happy about. Probably also causes issues with null-deref fencing if you try to use byte 0, and who knows what else.

Because WASM memory access can be statically analyzed I feel like there is still a way to rewrite the final WASM module to leave a 64k block alone without help from the compiler that produced it. But I'm thinking that could be a non-trivial challenge. Maybe reach out to the Walrus authors and see if they have suggestions. Having the memory live outside the WASM module might be doable, but I would be worried about overhead if the JIT can't guarantee memory bounds and compile to raw CPU ISA.

Maybe this whole approach is wrong though 🤔 What if you keep the paging approach, but do change detection during sync with a linear scan? Seems modern CPUs are optimized to within an inch of their lives for linear scans. That also presents a nice O(N) computational cost where N is the amount of memory actually used by the module. It also means the module can be left untouched, so you can remove Walrus and much more easily tune page sizes. Change detection can be either a cheap content hash or a direct comparison with the previous state, which ever is faster (or select based on 'optimize performance' vs 'optimize memory usage').

AThilenius commented 1 year ago

The more I think about this the more benefits I find with a linear scan. You ideally want a content hash of the entire module state during syncs for reliable de-sync detection. That final hash can be built up by hashing all the (page location, page hash) tuples. So the total overhead for change detection is "free" in that you need/want that final content hash anyway.

I'm starting to wonder what else you can do with granular content hashes too! Interesting.

kettle11 commented 1 year ago

This is untested, but I was experimenting with what the Wasm modifications would look like to track edits using the initially proposed pages technique but using a Wasm Table as the dirty_flags array:

https://github.com/kettle11/tangle/commit/d3cbaa002a2816d69d3909848945c82938407a8c

A problem is that Wasm store ops aren't aligned so a store could overlap pages. My thought on how to approach that was just mark the start and end address of the store as dirty.

The draft I linked adds 15-19 extra ops per store op which is certainly not amazing. And I'd have to test to see how much overhead there is for the table.set opcode, it may not be designed to be called that frequently.

Linearly scanning memory as you're proposing is certainly an option and wouldn't be too hard to add, but it would hurt for programs that use lots of memory. I'll keep thinking about it.

AThilenius commented 1 year ago

The advantage of a scan is predictable O(N) performance with memory usage. Per-write overhead is unpredictable. For example, when the compiler runs out of physical registers to map, it's going to start spilling over to the stack (though I'm not sure exactly how that plays with WASM). Even just for heap writes it can be hard to know how much pressure you're putting on memory. If performance were equal, I would probably take the predicable linear-scan hit over the unpredictable one.

Here's some totally un-scientific numbers I cooked up (from the code below)

Write overhead: 2.40s
Slice cmp impl: 693.33ms
Iter elem cmp:  2.59s
Fastcmp crate:  62.20ms
XXHash (xxh3):  149.54ms

The write-overhead is a lot, that's the ideal overhead for 2^32 writes which seems easy to get to, especially for things like matrix arithmetic that involve many intermittent writes (I would love to see some metrics on that too).

Second, Third and Forth are various ways of comparing 4GiB with another 4GiB block. fastcmp is the clear winner there, but I'm surprised it's not faster.

xxh3 I was really surprised by! It's within an order of magnitude of direct comparison. Sub-millisecond for any reasonably sized program.

#![feature(test)]
extern crate test;

use std::time::Instant;

use fastcmp::Compare;
use test::black_box;
use xxhash_rust::xxh3::xxh3_64;

// 2^32
const LEN: usize = 4_294_967_296;

fn main() {
    let v1 = vec![0_u8; LEN];
    let v2 = vec![0_u8; LEN];

    let now = Instant::now();

    {
        // Overhead of writing 2_32 times to memory. This is the 'ideal' case because writes are linear.
        // It's purely the overhead, not the actual write.
        let pages = vec![0_u8; LEN >> 15];
        for i in 0..LEN {
            black_box(pages[i >> 15] == 1);
        }
        println!("Write overhead:\t{:.2?}", now.elapsed());
    }

    let now = Instant::now();

    {
        // Compare the `cmp` impl.
        black_box(v1.cmp(&v2));
        println!("Slice cmp impl:\t{:.2?}", now.elapsed());
    }

    let now = Instant::now();

    {
        // Iterating through each element and comparing them byte by byte. It's slow. This is with
        // Rust's zip optimizations though (it can skip bounds checks because it knows the lengths
        // are equal).
        for (i1, i2) in v1.iter().zip(&v2) {
            black_box(i1 == i2);
        }
        println!("Iter elem cmp:\t{:.2?}", now.elapsed());
    }

    let now = Instant::now();

    {
        // Use the `fastcmp` crate, which uses `memcmp` under the hood.
        black_box(v1.feq(&v2));
        println!("Fastcmp crate:\t{:.2?}", now.elapsed());
    }

    {
        // Xxh3 is shockingly fast, probably because it's half the memory pressure as comparisons and
        // CPUs are fast.
        let now = Instant::now();
        black_box(xxh3_64(&v1));
        println!("XXHard (xxh3):\t{:.2?}", now.elapsed());
    }
}

kettle11 commented 1 year ago

Thanks for investigating further.

I did some more non-scientific exploration as well. I measured the speed of the above more granular tracking that I shared.

For the landing page it makes the example run about 4x slower, for the ball_pit example it makes it run 4x -> 10x+ slower (based on the number of balls increasing). Not good!

I also did a very basic test of how quickly browser JavaScript can compare arrays:

const size = Math.pow(2, 28);
const some_data = new Uint8Array(size);
const other_data = new Uint8Array(size);

some_data[some_data.length - 1] = 2;

let now = performance.now();

let are_equal = true;
for (let i = 0; i < some_data.length; ++i) {
    if (some_data[i] - other_data[i]) {
        are_equal = false;
        break;
    }
}
let time_spent = performance.now() - now;

console.log("TIME SPENT: ", time_spent);

For me that's not performing well either. With 64 KiB it's taking ~1.5ms in Chrome. For 268 MB it's taking ~347 ms in Chrome. Safari is way slower.

Some quick thoughts:

Native (non-browser) hosts could handle this way better, as your tests show.
Excluding the stack could from the granular tracking could make the overhead vastly smaller, but it may not work for all languages. But this could be very good for languages where it's possible. I think Rust / C / C++ / Zig fall into this category, but other languages may not.
Introducing multiple code-paths and configuration is a pain, but this could almost be a configurable based on if a program is memory-heavy or compute-heavy.

I'll keep thinking about this.

AThilenius commented 1 year ago

Yeah it'll be really slow from JS because the runtime is forced to do all kinds of weird JS nonsense each loop; you'll definitely want to do it from WASM. JS can be a tad faster by treating it as a u32 array though, and pre-warming caches:

const size = Math.pow(2, 28);
const some_data = new Uint8Array(size);
const other_data = new Uint8Array(size);

some_data[some_data.length - 1] = 2;

const some_data_as_32 = new Uint32Array(some_data.buffer)
const other_data_as_32 = new Uint32Array(other_data.buffer)

// Warm caches
for (let i = 0; i < some_data_as_32.length; ++i) {
    some_data_as_32[i] = some_data_as_32[i];
    other_data_as_32[i] = other_data_as_32[i];
}

let now = performance.now();

let are_equal = true;
for (let i = 0; i < some_data_as_32.length; ++i) {
    if (some_data_as_32[i] !== other_data_as_32[i]) {
        are_equal = false;
        break;
    }
}
let time_spent = performance.now() - now;

console.log("TIME SPENT: ", time_spent);

That gives 44ms on my machine.

Both xxh3 and an assembly-based memcmp in WASM will be plenty fast though. Might be missing SIMD support, but otherwise should be native speed +/- a few percent.

kettle11 commented 1 year ago

I did some more investigations yesterday. I basically implemented your initial proposal (minus the actual rollback integration) and made it so the Wasm program needs to allocate and pass an array to the host to store dirty_flags. On the ball_pit example I found the performance hit to be ~20%, which isn't that bad! That example seems to be particularly heavy on calls to store.

Potentially a hybrid approach could be used: for programs with less memory just hash / compare their pages to find differences. But high memory-usage programs could opt-in to the code transformation strategy. Eventually if Wasm gets something like multi-memory (https://github.com/WebAssembly/multi-memory/issues/42) Tangle could be changed so the program doesn't need to allocate its own dirty_flags array.

What do you think?

I don't love introducing multiple code-paths for maintainability reasons but it'd be great to remove Tangle's limitations on memory-usage.

AThilenius commented 1 year ago

Nice! 20% isn't too bad, most use cases can live with that. Another metric that needs testing is what percentage of pages are written each loop to for 'real-world uses' though. It will also change depending on task.

What do you think?

Take my thought with a large grain of salt because it isn't my codebase and I think you've already done really fantastic work. If I were taking a crack at this problem I would probably start with something like this:

Segment the lockstep module into compile-time configurable page sizes
Page data is stored is a page_data: HashMap<xxh3, Vec<u8>> - keyed by content hash and storing compressed data.
Memory snapshots are stored as memory_snapshot: HashMap<xxh3, Vec<xxh3>> - keyed by the content hash of the the Vec<xxh3>, which is a vector of xxh3 content hashes in page_data that makes up the entire module memory (we just build a DAG, more on this later).
Other storage data is also content addressed (like globals, inputs, ...) the same way.
History is stored in a Vec of something like { monotonic_iteration_num: usize, memory_hash: xxh3, globals_hash: xxh3, input_hash: xxh3, ... }
At the end of each cycle, the WASM module memory pages are hashed, missing pages are compressed and added to page_data, and the full Vec<xxh3> of memory is added to memory_snapshot if missing.
Then, the copy_memory function can then take in the target buffer, a from: Vec<xxh3>, a to: Vec<xxh3> and copy only the minimal set of pages that were updated.
You'll obviously also need to run a GC each iteration, removing the oldest history, then clearing out any orphaned memory_snapshot and finally page_data. But that can be flexible based on a memory cap, or some other configurable parameter.

Once thats working, then you start getting fancy...

Page data doesn't need to be stored in fixed-size blocks. If a single byte changed in the page, it's cheaper to store that single byte and it's location than the entire page. There are a few ways to approach this, the simplest is just to store 'sub-page changes' like a Vec<{ offset: u8, change: [u8; 16] }>. You could also go totally nuts and build up a tree of changes, down to some minimal leaf size but I highly doubt that'll be faster or even more memory efficient. You also have to store the parent page xxh3 which the diff was computed from, and reconstructing an arbitrary page is a tad harder. Git internals are a good reference for this kind of stuff, You'll also need to 'compact' the base page by applying deltas when a parent page drops out of scope (is GCed).
memory_snapshot values (the Vec<xxh3>) are probably better stored in a single Trie, or something. Less memory and makes GC much faster.
If a memory delta is smaller than the input that produced it... you can just send the memory delta over the wire and 'fast-forward' the remote. This has security implications though, and probably isn't a good idea, but kind of neat.

This list goes on and on, but should be driven by real-world tests so I'll digress.

Interestingly this strategy isn't mutually exclusive with tracking writes within the WASM module, that can be used to hint which pages have changes and only xxh3/delta-compress those. But that's an optimization IMO and only for some workloads, so I would start it later. This approach also means I wouldn't need Walrus, or to open up the Lockstep WASM module. I would personally also write the host code in Rust.

Again, with a grain of salt! I think you're doing just fine, these are just my thoughts on how I would approach this problem.

kettle11 commented 1 year ago

This is a great analysis, thanks for thinking it through!

I like the perspective of sticking to hashing pages to track deltas and then treating the Wasm pre-processing to report stores as an eventual possible optimization (always good to be able to defer work!).

I've been wanting to rewrite Tangle back to Rust (it was originally Rust) so I'll experiment with some of these ideas at the same time.

The only thing I'm considering changing from your above proposal is delta-tracking instead of storing 'snapshots' at each iteration. Snapshots do feel simpler / cleaner, but as I'm picturing it you'd end up storing a bunch of redundant hashes.

Another thing I was thinking about: your approach of using a HashMap<xxh3, Vec<u8>> has the cool property of deduping page storage. But I genuinely wonder how often that'd occur in practice. It feels like generally most pages would either be all zeroes or a pattern only ever seen once, but I could be wrong about that.

AThilenius commented 1 year ago

You're very welcome, and thank you for the idea and library!

I think I might not have a explained a few things well (or am misunderstanding what you're saying). I approached this similar to how Git works internally, which is quite elegant despite the CLI being a total dumpster-fire.

The initial pass (without getting fancy) would store only changed pages, but it would store the page content in it's entirety for each revision. Let's consider the following two loops of Tangle, and use a letter to represent the content-hash of each page of memory:

Loop N:   [ A B C ]
Loop N+1: [ A D C ]

In this case we have the following unique pages: [A, B, C, D]. We can't drop page B because we need it for rollback. We also only want to store page A once in memory. That's where the HashMap<xxh3, Vec<u8>> comes in. You have to address unique pages somehow, and content hashing is the common approach when dealing with opaque blobs of data. As you noted, it has the added benefit of handling de-dups for free. I think duplicate pages of zeros will be somewhat common, but probably not pages of non-zeros. So you're already 'delta-tracking' memory, at the granularity of one page size.

The snapshots are Vec<xxh3>, which are those [ A D C ]. It's a small vector (limited by how many pages you can have) of small (16 byte) values, just the hashes of the contents of each page. Even for hundreds of snapshots the memory usage will be insignificant.

Then we get to the fancy parts.

Going back to the above memory layout

Loop N:   [ A B C ]
Loop N+1: [ A D C ]

We can note that page B turned into page D, possible with very few changes. So we can again save memory, by storing the full pages [A, B, C] but page D can be a 'delta' from page A. This delta can be very fast, as easy as a list of offsets into the page at which a change occurred, and a Vec<u8> of bytes for the changes. This gets a bit tricky later on though, when you want to free page B from memory. You need to 'compact' the contents of B into D. In other words, page D needs to become a full page snapshot and not a delta. It's easy to do (decompress page B, apply the delta, and call it page D, which by content it now is). I would expect that to be a huge memory saving for most use cases.

Second part was the snapshot vectors. Those are essentially a Vec<Vec<xxh3>>, or... a list of snapshots, where each snapshot it a vector of content hashes (just the hash itself, not the actual page content). There will be lots of duplicates between different versions though (assuming most pages stay the same). Instead you can store the entire thing in something like an RRB Tree. The easiest way to do that (because those data structures get complicated) is to use something like https://docs.rs/im/latest/im/ for the inner vectors. In other words, Vec<im::Vec<xxh3>>. When you construct a new snapshot, you start by cloning the previous one, and then changing / inserting values. That library will handle de-duping for you, using an RRB tree. End result is that you'll use less memory. But this is small potatoes compared to delta-compressing page data!