frank-emrich closed this pull request 5 months ago.
Some benchmarking results:
First, I compare the fat pointer implementation against the existing tagged pointer one. Enabling them actually makes all benchmarks except c10m fail, because they overflow the counter. Thus, I've had to slightly tweak their parameters.
In the list below, each line shows the value of X/Y, where X is the runtime of that particular benchmark with tagged pointers, and Y is the runtime with fat pointers. As usual, the difference between, say, `c10m_wasmfx` and `c10m_wasmfx_fiber` is that the latter uses the fiber interface, while the former uses handwritten wat files.
Suite: c10m
c10m_wasmfx: 1.0125704809561387
c10m_wasmfx_fiber: 0.9931000528537908
Suite: sieve (cut number of primes in half)
sieve_wasmfx: 0.9637743103971731
sieve_wasmfx_fiber: 0.9910300798077857
Suite: skynet (5 instead of 6 levels)
skynet_wasmfx: 0.9970199355953799
skynet_wasmfx_fiber: 0.9912801597259853
Suite: state
This suite only runs when counting up to 8000, at which point the runtime is just 10ms.
I now compare the performance impact of enabling vs disabling the linearity check when using this PR (i.e., whether or not `unsafe_disable_continuation_linearity_check` is enabled). Again, the values shown are X/Y, where X is the runtime without linearity checks, and Y is the runtime with linearity checks.
Suite: c10m
c10m_wasmfx: 0.9162058249858285
c10m_wasmfx_fiber: 0.9677704802233246
Suite: sieve
sieve_wasmfx: 0.9758646600083649
sieve_wasmfx_fiber: 0.9808578875186281
Suite: skynet
skynet_wasmfx: 0.9675361140008778
skynet_wasmfx_fiber: 0.9859123548277564
Suite: state
state_wasmfx: 0.9729201800828162
state_wasmfx_fiber: 0.983206464991699
I noticed that there is an issue when continuation tables are allocated in a `TablePool`. I'll update the PR once I have time to fix it.
What's the problem/error?
The `TablePool` manages a single mmapped memory region from which it allocates all tables. To this end, it calculates the required overall size of this region as `max_number_of_allowed_tables * max_allowed_element_count_per_table * size_of_each_table_entry`. Thus, the memory for the table with index `i` in the pool starts at offset `i * max_allowed_element_count_per_table * size_of_each_table_entry`.
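The layout arithmetic described above can be sketched as follows. This is a hypothetical illustration, not the actual `TablePool` code; the struct and field names are made up for clarity:

```rust
use std::mem::size_of;

/// Hypothetical sketch of the TablePool size/offset arithmetic described above.
struct TablePoolLayout {
    /// max_number_of_allowed_tables
    max_tables: usize,
    /// max_allowed_element_count_per_table
    max_elements_per_table: usize,
    /// size_of_each_table_entry
    entry_size: usize,
}

impl TablePoolLayout {
    /// Total size of the single mmapped region backing all tables.
    fn total_region_size(&self) -> usize {
        self.max_tables * self.max_elements_per_table * self.entry_size
    }

    /// Byte offset at which the memory for the table with index `i` starts.
    fn table_offset(&self, i: usize) -> usize {
        assert!(i < self.max_tables);
        i * self.max_elements_per_table * self.entry_size
    }
}

fn main() {
    // Pre-PR assumption: every table entry is pointer-sized.
    let layout = TablePoolLayout {
        max_tables: 4,
        max_elements_per_table: 1024,
        entry_size: size_of::<*mut u8>(),
    };
    assert_eq!(layout.total_region_size(), 4 * 1024 * layout.entry_size);
    assert_eq!(layout.table_offset(2), 2 * 1024 * layout.entry_size);
    // Tables are laid out back to back, so consecutive offsets differ by
    // exactly one maximally-sized table.
    assert_eq!(
        layout.table_offset(3) - layout.table_offset(2),
        layout.max_elements_per_table * layout.entry_size
    );
}
```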
However, all of this is based on the (hardcoded) assumption that all table entries across all table types are pointer-sized (i.e., `size_of_each_table_entry` is `sizeof(*mut u8)`). But as of this PR, this is no longer the case.
I will address this as follows: the overall size of the region is now calculated as `max_number_of_allowed_tables * max_allowed_element_count_per_table * max_size_of_each_table_entry`, where `max_size_of_each_table_entry` is now `sizeof(VMContObj)` == 16. This effectively doubles the amount of address space occupied by the table pool. The calculation of the start address of each table is changed accordingly.

In summary, these changes mean that while the table pool occupies more virtual address space, the number of actually committed pages for non-continuation tables does not change.
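A minimal sketch of the proposed fix, assuming (as the text states) a 16-byte maximum entry size; the function names are hypothetical:

```rust
/// Hypothetical sketch of the fix: size the pool for the *largest* possible
/// entry, sizeof(VMContObj) == 16, instead of assuming pointer-sized entries.
const MAX_SIZE_OF_EACH_TABLE_ENTRY: usize = 16;

/// Total reserved (not committed) size of the table pool region.
fn total_region_size(max_tables: usize, max_elements_per_table: usize) -> usize {
    max_tables * max_elements_per_table * MAX_SIZE_OF_EACH_TABLE_ENTRY
}

/// Start offset of the table with index `i`, adjusted for the new entry size.
fn table_offset(i: usize, max_elements_per_table: usize) -> usize {
    i * max_elements_per_table * MAX_SIZE_OF_EACH_TABLE_ENTRY
}

fn main() {
    // On a 64-bit target, the old calculation used 8-byte entries, so the
    // reserved address space doubles. Pages are still committed only on use,
    // so non-continuation tables commit no more memory than before.
    let old_region = 4 * 1024 * 8;
    assert_eq!(total_region_size(4, 1024), 2 * old_region);
    assert_eq!(table_offset(3, 1024), 3 * 1024 * 16);
}
```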
There are some other solutions, which seem less preferable:

- Keep entries pointer-sized, but allow continuation tables to hold only `max_allowed_element_count_per_table / 2` entries. That seems dodgy.
- Allocate continuation tables in a separate `TablePool`. However, this has the following drawback, defeating the whole purpose of the separation: the current design of the `TablePool` assumes that you allocate (but don't commit) all the required memory upfront. But the size of the mmapped region for small tables plus the size of the region for large tables would together be larger than the single unified region proposed above.

I have now implemented this fix independently in #192, meaning that the current PR needs to land after #192.
This should be good to go now.
This PR changes the representation introduced in #182, where continuation objects were turned into tagged pointers, containing a pointer to a `VMContRef` as well as a 16-bit sequence counter to perform linearity checks.

In this PR, the representation is changed from 64-bit tagged pointers to 128-bit fat pointers, with 64 bits each for the pointer and the sequence counter.
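The shape of the new representation can be sketched as below. This is an illustrative stand-in, not the actual `VMContObj` definition from the PR:

```rust
/// Hypothetical sketch of the 128-bit fat-pointer representation: a full
/// 64-bit pointer plus a full 64-bit revision counter, replacing the old
/// 64-bit tagged pointer that had only 16 bits for the counter.
#[repr(C)]
#[derive(Clone, Copy, Debug)]
struct VMContObjSketch {
    /// Stands in for *mut VMContRef.
    contref: *mut u8,
    /// Sequence counter used for linearity checks.
    revision: u64,
}

/// A continuation may only be resumed if its stored revision matches the
/// current revision of the underlying continuation reference.
fn linearity_check_passes(obj: VMContObjSketch, current_revision: u64) -> bool {
    obj.revision == current_revision
}

fn main() {
    // The fat pointer is 16 bytes on a 64-bit target, twice the old size.
    assert_eq!(std::mem::size_of::<VMContObjSketch>(), 16);

    let obj = VMContObjSketch { contref: std::ptr::null_mut(), revision: 7 };
    assert!(linearity_check_passes(obj, 7));
    // A stale copy of the continuation fails the check.
    assert!(!linearity_check_passes(obj, 8));
}
```

With a 64-bit counter, overflowing the revision count in practice is no longer a concern, unlike the 16-bit counter mentioned in the benchmarking notes above.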
Some implementation details:

- The existing `disassemble_contobj` and `assemble_contobj` logic for going from effectively `Optional<VMContObj>` to `Optional<VMContRef>` (and back) is preserved.
- The behavior of `unsafe_disable_continuation_linearity_check` is preserved: if it is enabled, we do not use fat (or tagged) pointers at all, and all revision checks are disabled.
- Values of type `(ref $continuation)` and `(ref null $continuation)` are represented as `I8X16`. See the comment on `vm_contobj_type` in shared.rs for why we cannot use `I128` or `I64X2` instead.
- Some `translate_*` functions in the `FuncEnvironment` trait now need to take a `FunctionBuilder` parameter instead of a `FuncCursor`, which slightly increases the footprint of this PR.
- The implementation of `table.fill` for continuation tables was missing. I've added this and, in the process, extended `cont_table.wast` to be generally more exhaustive.
- Instead of making the existing table libcalls handle `VMContObj`, I've introduced dedicated versions for the `VMContObj` case, namely `table_fill_cont_obj` and `table_grow_cont_obj` in libcalls.rs. These manually split the `VMContObj` into two parts.
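The splitting done by those dedicated libcalls can be illustrated as follows. This is a hedged sketch: the function names `split` and `reassemble` and the struct are hypothetical, standing in for what `table_fill_cont_obj`/`table_grow_cont_obj` do with the two halves of a `VMContObj`:

```rust
/// Stand-in for the 128-bit VMContObj (pointer + revision counter).
#[derive(Debug)]
struct VMContObjSketch {
    contref: *mut u8,
    revision: u64,
}

/// Caller side: decompose the fat pointer into two scalar values that can be
/// passed as ordinary libcall arguments.
fn split(obj: &VMContObjSketch) -> (*mut u8, u64) {
    (obj.contref, obj.revision)
}

/// Callee side: rebuild the fat pointer from the two scalars.
fn reassemble(contref: *mut u8, revision: u64) -> VMContObjSketch {
    VMContObjSketch { contref, revision }
}

fn main() {
    let obj = VMContObjSketch { contref: 0x1000 as *mut u8, revision: 42 };
    let (ptr, rev) = split(&obj);     // pass as two arguments
    let back = reassemble(ptr, rev);  // rebuilt inside the libcall
    assert_eq!(back.contref as usize, 0x1000);
    assert_eq!(back.revision, 42);
}
```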