Enable the bulk memory operations WASM feature

koute commented 2 years ago

(This is a subissue of https://github.com/paritytech/substrate/issues/10707; I'm creating a new issue to focus on just the bulk memory operations.)

We should seriously consider enabling the bulk memory operations feature in our WASM executor and our runtimes. We have recently discovered that the contracts' benchmarks currently can spend up to 75% of their time within the WASM calling memset; not only that, since wasmtime doesn't cache-align loops depending on how the instructions are laid out in memory the performance of memset/memcpy/etc. can very widely vary, and can become up to ~40% slower when compared to the cache-aligned case.

I've done a quick test to compare the performance of the pallet_contracts/seal_return_per_kb benchmark as it currently stands today; here's its execution time in each case:

without bulk memory operations, loop not cache-aligned: 362598
without bulk memory operations, loop cache-aligned: 230293
with bulk memory operations: 113970

Not only does it cut down the benchmark's execution time by half, but also should prevent wasmtime's codegen roulette from regressing the performance.

Now, we could probably just optimize this on wasmi's side so that it doesn't preallocate and clear 1MB buffer on each invocation (assuming this hasn't been done already; we're still using quite an old version of wasmi, and I haven't checked the newest version). Nevertheless I think this has demonstrated that there's concrete value in enabling this extension - even if wasmi is fixed something else could conceivably allocate large buffers and tank the performance, and then we'll be back to square one; there's also potential for this extension to speed things up in general, considering how widely memset/memcpy are used under the hood. I don't think it's worth holding back on this extension anymore, and we should just pull the trigger and enable it. In the worst case it won't make any difference, in the best case it can significantly speed things up.

What needs to be done? (high level plan)

Add support for the bulk memory ops to wasmi, wasmi-validation and wasm-instrument, if it hasn't been done yet.
Enable the bulk feature on the parity-wasm.
Call config.wasm_bulk_memory(true) when initializing wasmtime.
Do a burn in with the runtime compiled with -C target-feature=+bulk-memory.
Release a new version of polkadot.
Wait for a few releases. (Essentially the same as when introducing a new host function.)
Permanently enable -C target-feature=+bulk-memory flag when building runtimes.
???
Profit.

Anything else I'm missing?

cc @pepyakin @athei @Robbepop

Robbepop commented 2 years ago

Now, we could probably just optimize this on wasmi's side so that it doesn't preallocate and clear 1MB buffer on each invocation (assuming this hasn't been done already; we're still using quite an old version of wasmi, and I haven't checked the newest version).

The newer wasmi versions allow to set the initial and maximum value stack length in the Config similar to how Wasmtime Config works. Therefore this wasmi specific problem should be resolved once we upgrade to wasmi 0.16.0 or above.

Add support for the bulk memory ops to wasmi, wasmi-validation and wasm-instrument, if it hasn't been done yet.

We probably do not need support for Wasm bulk-memory proposal in wasmi since this is only relevant if we want to keep wasmi as a Substrate runtime execution engine but I honestly see no reason why we should keep it. In the very past it was useful when Wasmtime was not as stable as it is nowadays (or maybe some other reason). Although support for Wasm bulk-memory proposal is planned for wasmi since certain smart contracts could potentially benefit from it.

pepyakin commented 2 years ago

Citing the parent issue:

It would be great to have data on how exactly performance is improved. This would help to evaluate how much priority it has for implementation and for the runtime writers user for upgrading.

wasmi does not support pretty much most of the newest features. wasmtime is the primary engine for now, but it would still be good to have the second engine.

The polkadot side should be taken into account. How would PVF execution migrated? Or those features won't hit PVF?

It's clear about the pt. 1. Re pt. 2, I think either way is fine. (UPD: Robin beat me up to it, and I agree, I think getting rid of wasmi executor is on the table)

The PVF is a bit more complicated.

We need to consider that if we unconditionally enable wasm_bulk_memory it will be enabled for PVFs. There are two problems with that:

There is a problem with the upgrade. If we just YOLO enable it, an adversary could take advantage of that. Unfortunately, until https://github.com/paritytech/polkadot-sdk/issues/917 is landed our hands are tied.

That implies that the executor configuration should allow disabling bulk mem ops. It will be disabled for PVF execution.

Since the Cumulus PDK uses the same binary for the Runtime and PVF, the parachains won't be able to take advantage of bulk mem ops. Parachains are the overwhelming majority of the users of Substrate, so the impact would be limited until we upgrade PVF.

Then, the blocker for enabling bulk mem ops for PVFs is the question about the metering. It is likely coming at least for PVFs. That means we have to squash this concern touched on in the parent issue.

athei commented 2 years ago

(UPD: Robin beat me up to it, and I agree, I think getting rid of wasmi executor is on the table)

For the foreseeable future we will run contracts with an in-runtime wasmi.

pepyakin commented 2 years ago

Tangential? The wasmi executor refers to sc-executor-wasmi and not the sandbox backend.

ggwpez commented 2 years ago

Should be sanity checked with something outside of pallet benchmarks, but the seal_return_per_kb sound great.
Probably historic import times, there is benchmark block to measure re-import times of old blocks.
I will do so once you put up an MR :smile:

paritytech / polkadot-sdk

Enable the bulk memory operations WASM feature #36

What needs to be done? (high level plan)