frank-emrich closed this pull request 5 months ago.
Some benchmarking results:
First, I compare the fat pointer implementation against the existing tagged pointer one. Enabling them actually makes all benchmarks except c10m fail, because they overflow the counter. Thus, I've had to slightly tweak their parameters.
In the list below, each line shows the value of X/Y, where X is the runtime of that particular benchmark with tagged pointers, and Y is the runtime with fat pointers. As usual, the difference between, say, `c10m_wasmfx` and `c10m_wasmfx_fiber` is that the latter uses the fiber interface, while the former uses handwritten wat files.
Suite: c10m
c10m_wasmfx: 1.0125704809561387
c10m_wasmfx_fiber: 0.9931000528537908
Suite: sieve (cut number of primes in half)
sieve_wasmfx: 0.9637743103971731
sieve_wasmfx_fiber: 0.9910300798077857
Suite: skynet (5 instead of 6 levels)
skynet_wasmfx: 0.9970199355953799
skynet_wasmfx_fiber: 0.9912801597259853
Suite: state
This suite only runs when counting up to 8000, at which point the runtime is just 10ms.
I now compare the performance impact of enabling vs disabling the linearity check when using this PR (i.e., whether or not `unsafe_disable_continuation_linearity_check` is enabled). Again, the values shown are X/Y, where X is the runtime without linearity checks, and Y is the runtime with linearity checks.
Suite: c10m
c10m_wasmfx: 0.9162058249858285
c10m_wasmfx_fiber: 0.9677704802233246
Suite: sieve
sieve_wasmfx: 0.9758646600083649
sieve_wasmfx_fiber: 0.9808578875186281
Suite: skynet
skynet_wasmfx: 0.9675361140008778
skynet_wasmfx_fiber: 0.9859123548277564
Suite: state
state_wasmfx: 0.9729201800828162
state_wasmfx_fiber: 0.983206464991699
I noticed that there is an issue when continuation tables are allocated in a `TablePool`. I'll update the PR once I have time to fix it.
What's the problem/error?
The `TablePool` manages a single mmapped memory region from which it allocates all tables. To this end, it calculates the required overall size of this region as `max_number_of_allowed_tables * max_allowed_element_count_per_table * size_of_each_table_entry`. Thus, the memory for the table with index `i` in the pool starts at offset `i * max_allowed_element_count_per_table * size_of_each_table_entry`.
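The layout arithmetic described above can be sketched as follows. This is a hypothetical illustration, not the actual `TablePool` code; the struct and field names are made up for clarity:

```rust
use std::mem::size_of;

/// Hypothetical sketch of the TablePool size/offset arithmetic described above.
struct TablePoolLayout {
    /// max_number_of_allowed_tables
    max_tables: usize,
    /// max_allowed_element_count_per_table
    max_elements_per_table: usize,
    /// size_of_each_table_entry
    entry_size: usize,
}

impl TablePoolLayout {
    /// Total size of the single mmapped region backing all tables.
    fn total_region_size(&self) -> usize {
        self.max_tables * self.max_elements_per_table * self.entry_size
    }

    /// Byte offset at which the memory for the table with index `i` starts.
    fn table_offset(&self, i: usize) -> usize {
        assert!(i < self.max_tables);
        i * self.max_elements_per_table * self.entry_size
    }
}

fn main() {
    // Pre-PR assumption: every table entry is pointer-sized.
    let layout = TablePoolLayout {
        max_tables: 4,
        max_elements_per_table: 1024,
        entry_size: size_of::<*mut u8>(),
    };
    assert_eq!(layout.total_region_size(), 4 * 1024 * layout.entry_size);
    assert_eq!(layout.table_offset(2), 2 * 1024 * layout.entry_size);
    // Tables are laid out back to back, so consecutive offsets differ by
    // exactly one maximally-sized table.
    assert_eq!(
        layout.table_offset(3) - layout.table_offset(2),
        layout.max_elements_per_table * layout.entry_size
    );
}
```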
However, all of this is based on the (hardcoded) assumption that all table entries across all table types are pointer-sized (i.e., `size_of_each_table_entry` is `sizeof(*mut u8)`). But as of this PR, this is no longer the case.
I will address this as follows: the overall size of the region is now calculated as `max_number_of_allowed_tables * max_allowed_element_count_per_table * max_size_of_each_table_entry`, where `max_size_of_each_table_entry` is now `sizeof(VMContObj)` == 16. This effectively doubles the amount of address space occupied by the table pool. The calculation of the start address of each table is changed accordingly.

In summary, these changes mean that while the table pool occupies more virtual address space, the number of actually committed pages for non-continuation tables does not change.
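A minimal sketch of the proposed fix, assuming (as the text states) a 16-byte maximum entry size; the function names are hypothetical:

```rust
/// Hypothetical sketch of the fix: size the pool for the *largest* possible
/// entry, sizeof(VMContObj) == 16, instead of assuming pointer-sized entries.
const MAX_SIZE_OF_EACH_TABLE_ENTRY: usize = 16;

/// Total reserved (not committed) size of the table pool region.
fn total_region_size(max_tables: usize, max_elements_per_table: usize) -> usize {
    max_tables * max_elements_per_table * MAX_SIZE_OF_EACH_TABLE_ENTRY
}

/// Start offset of the table with index `i`, adjusted for the new entry size.
fn table_offset(i: usize, max_elements_per_table: usize) -> usize {
    i * max_elements_per_table * MAX_SIZE_OF_EACH_TABLE_ENTRY
}

fn main() {
    // On a 64-bit target, the old calculation used 8-byte entries, so the
    // reserved address space doubles. Pages are still committed only on use,
    // so non-continuation tables commit no more memory than before.
    let old_region = 4 * 1024 * 8;
    assert_eq!(total_region_size(4, 1024), 2 * old_region);
    assert_eq!(table_offset(3, 1024), 3 * 1024 * 16);
}
```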
There are some other solutions, which seem less preferable:

- Keep entries pointer-sized, but allow continuation tables to hold only `max_allowed_element_count_per_table / 2` entries. That seems dodgy.
- Allocate continuation tables in a separate `TablePool`. However, this has the following drawback, defeating the whole purpose of the separation: the current design of the `TablePool` assumes that you allocate (but don't commit) all the required memory upfront. But the size of the mmapped region for small tables plus the size of the region for large tables would together be larger than the single unified region proposed above.

I have now implemented this fix independently in #192, meaning that the current PR needs to land after #192.
This should be good to go now.
This PR changes the representation introduced in #182, where continuation objects were turned into tagged pointers, containing a pointer to a `VMContRef` as well as a 16-bit sequence counter to perform linearity checks.

In this PR, the representation is changed from 64-bit tagged pointers to 128-bit fat pointers, with 64 bits each for the pointer and the sequence counter.
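The shape of the new representation can be sketched as below. This is an illustrative stand-in, not the actual `VMContObj` definition from the PR:

```rust
/// Hypothetical sketch of the 128-bit fat-pointer representation: a full
/// 64-bit pointer plus a full 64-bit revision counter, replacing the old
/// 64-bit tagged pointer that had only 16 bits for the counter.
#[repr(C)]
#[derive(Clone, Copy, Debug)]
struct VMContObjSketch {
    /// Stands in for *mut VMContRef.
    contref: *mut u8,
    /// Sequence counter used for linearity checks.
    revision: u64,
}

/// A continuation may only be resumed if its stored revision matches the
/// current revision of the underlying continuation reference.
fn linearity_check_passes(obj: VMContObjSketch, current_revision: u64) -> bool {
    obj.revision == current_revision
}

fn main() {
    // The fat pointer is 16 bytes on a 64-bit target, twice the old size.
    assert_eq!(std::mem::size_of::<VMContObjSketch>(), 16);

    let obj = VMContObjSketch { contref: std::ptr::null_mut(), revision: 7 };
    assert!(linearity_check_passes(obj, 7));
    // A stale copy of the continuation fails the check.
    assert!(!linearity_check_passes(obj, 8));
}
```

With a 64-bit counter, overflowing the revision count in practice is no longer a concern, unlike the 16-bit counter mentioned in the benchmarking notes above.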
Some implementation details:

- The existing `disassemble_contobj` and `assemble_contobj` logic for going from effectively `Optional<VMContObj>` to `Optional<VMContRef>` (and back) is preserved.
- The behavior of `unsafe_disable_continuation_linearity_check` is preserved: if it is enabled, we do not use fat (or tagged) pointers at all, and all revision checks are disabled.
- Values of type `(ref $continuation)` and `(ref null $continuation)` are represented as `I8X16`. See the comment on `vm_contobj_type` in shared.rs for why we cannot use `I128` or `I64X2` instead.
- Some `translate_*` functions in the `FuncEnvironment` trait now need to take a `FunctionBuilder` parameter instead of a `FuncCursor`, which slightly increases the footprint of this PR.
- The implementation of `table.fill` for continuation tables was missing. I've added this and, in the process, extended `cont_table.wast` to be generally more exhaustive.
- Instead of making the existing table libcalls handle `VMContObj`, I've introduced dedicated versions for the `VMContObj` case, namely `table_fill_cont_obj` and `table_grow_cont_obj` in libcalls.rs. These manually split the `VMContObj` into two parts.
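The splitting done by those dedicated libcalls can be illustrated as follows. This is a hedged sketch: the function names `split` and `reassemble` and the struct are hypothetical, standing in for what `table_fill_cont_obj`/`table_grow_cont_obj` do with the two halves of a `VMContObj`:

```rust
/// Stand-in for the 128-bit VMContObj (pointer + revision counter).
#[derive(Debug)]
struct VMContObjSketch {
    contref: *mut u8,
    revision: u64,
}

/// Caller side: decompose the fat pointer into two scalar values that can be
/// passed as ordinary libcall arguments.
fn split(obj: &VMContObjSketch) -> (*mut u8, u64) {
    (obj.contref, obj.revision)
}

/// Callee side: rebuild the fat pointer from the two scalars.
fn reassemble(contref: *mut u8, revision: u64) -> VMContObjSketch {
    VMContObjSketch { contref, revision }
}

fn main() {
    let obj = VMContObjSketch { contref: 0x1000 as *mut u8, revision: 42 };
    let (ptr, rev) = split(&obj);     // pass as two arguments
    let back = reassemble(ptr, rev);  // rebuilt inside the libcall
    assert_eq!(back.contref as usize, 0x1000);
    assert_eq!(back.revision, 42);
}
```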