paritytech / substrate

Substrate: The platform for blockchain innovators

Move the allocator inside of the runtime #11883

Open pepyakin opened 2 years ago

pepyakin commented 2 years ago

This is actually not a new idea; it was considered very early on. We ended up having the allocator provided by the host.

Lately, there have been several discussions about moving the allocator back into the runtime.

koute commented 2 years ago

Another disadvantage of the current approach is that we can't effectively make any non-trivial behavior-affecting changes to the current allocator. There are various improvements to the allocator's internals which would be nice to make; however, due to backward-compatibility concerns that carries a lot of risk.

Since the allocator is provided by the host any changes to its behavior will affect all of the historical blocks, and those blocks might depend on the allocator's exact behavior due to Hyrum's law. Even if we can guarantee that our chains won't be affected we can't do that for every other parachain and project building on Substrate.

Of course, an alternative to this would be to just use our existing host function versioning machinery and provide a malloc_v2, etc., but then we'd have to maintain two allocators. And what if we want to make further improvements in the future? Are we going to maintain 10 copies of different allocators?

So I'm highly in favor of this. There might have been good reasons to keep the allocator on the host's side in the past; today I don't really see why we wouldn't want to move it into the runtime.

bkchr commented 2 years ago

How would the host allocate the return value of a host function?

koute commented 2 years ago

How would the host allocate the return value of a host function?

That is a good question. It is currently one of the biggest issues we'd need to resolve, since some types marshaled across the FFI boundary are directly allocated by the client itself inside of WASM's linear memory.

Some random ideas off the top of my head (not all are necessarily great ideas; just typing out whatever comes to my mind):

  1. Simply do this in two steps. The first call into the host would return a size instead of a pointer; then the runtime could allocate the necessary space to hold the data and call into the client again to fetch the data itself.
  2. Allow the host to call back into the runtime's allocator to allocate the necessary space.
  3. Designate a scratch space at the end of the linear memory and write the data there (obviously also adding a mechanism to make sure that if there isn't enough free space it will generate an error instead of overwriting memory).
  4. Have two linear memories - one would be used by the runtime with its own allocator, while the second one would be used purely to marshal data between the runtime and the host (so wouldn't need an allocator), and the runtime could simply copy the data from the scratch space into its own memory with the memory.copy instruction from the bulk memory operations extension.

It should be doable. Of course, we'd need to make sure that this doesn't regress performance and that whatever we pick will be maintainable into the future.
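To make option 1 a bit more concrete, here is a minimal sketch of how the two-step convention could look from the runtime's side; the host function names and the u32::MAX sentinel are purely illustrative, not existing host functions:

// Hypothetical host imports: the first call returns only the size of the
// value, the second call copies it into a buffer allocated by the runtime.
extern "C" {
    fn ext_storage_get_size(key_ptr: *const u8, key_len: u32) -> u32;
    fn ext_storage_get_into(key_ptr: *const u8, key_len: u32, out_ptr: *mut u8, out_len: u32);
}

fn storage_get(key: &[u8]) -> Option<Vec<u8>> {
    // Step 1: ask the host for the size (u32::MAX acting as the "no value" sentinel).
    let size = unsafe { ext_storage_get_size(key.as_ptr(), key.len() as u32) };
    if size == u32::MAX {
        return None;
    }
    // Step 2: allocate with the runtime's own allocator and let the host fill the buffer.
    let mut buf = vec![0u8; size as usize];
    unsafe { ext_storage_get_into(key.as_ptr(), key.len() as u32, buf.as_mut_ptr(), size) };
    Some(buf)
}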

pepyakin commented 2 years ago

Since this is not a new idea, back in the day I thought we would go with something like what @koute described in the 2nd point. Specifically, I imagined that the host API would gain a function that initializes the allocator. That function would accept function pointers to malloc and free as parameters (and today I would maybe add realloc). The runtime would then be supposed to initialize the allocator before calling any host functions that perform allocations. An alternative would be a similar construction, but where the host finds the allocator functions itself by looking up exports by name. That works too, and I think it is more appropriate in the current setting than I used to think.
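As a rough, wasmtime-flavored sketch of that second variant (simplified and assumed setup, not Substrate's actual executor code; wasmtime and anyhow are taken as given), a host function could re-enter the instance through the runtime's exported alloc and then write the data into linear memory:

use wasmtime::{Caller, Extern, TypedFunc};

/// Allocate `data.len()` bytes inside the runtime by calling its exported
/// `alloc`, copy `data` there and return the pointer (as a wasm offset).
fn write_into_runtime(mut caller: Caller<'_, ()>, data: &[u8]) -> anyhow::Result<u32> {
    // Find the runtime's allocator export by name.
    let alloc: TypedFunc<u32, u32> = caller
        .get_export("alloc")
        .and_then(Extern::into_func)
        .ok_or_else(|| anyhow::anyhow!("runtime does not export `alloc`"))?
        .typed(&caller)?;

    // Re-enter the wasm instance; note the reentrancy/stack concerns discussed below.
    let ptr = alloc.call(&mut caller, data.len() as u32)?;

    // Copy the data into the runtime's linear memory at the returned offset.
    let memory = caller
        .get_export("memory")
        .and_then(Extern::into_memory)
        .ok_or_else(|| anyhow::anyhow!("runtime does not export `memory`"))?;
    memory.write(&mut caller, ptr as usize, data)?;
    Ok(ptr)
}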

However, if possible, I would like to avoid reentrance into the same wasm instance [^1]. One annoying thing is the stack, especially in the current implementation of wasmtime. That's because the stack limit is shared between nested calls and the host functions. Maybe it's fine for Substrate, but PVF requires it to be robust against malicious code.

We now live in 2022 and can think about enabling new features.

  1. We could tap into the multiple memories proposal (phase 3 at the moment of writing[^3]) and the bulk memory proposal (finished), as @koute mentioned in the 4th point.
  2. We could tap into the reference types proposal (finished[^3]). A host function that needs to return a buffer whose size is not known in advance would return two values, one i32 and one externref. The i32 holds the size of the buffer and the externref is an opaque pointer to the buffer. Then the runtime would internally allocate memory to fit the buffer by whatever means it wants. Finally, the runtime would call another host function, passing the externref, which would copy the data from the internal buffer into the memory at the designated location. We still need to limit the number and sizes of available buffers[^2].

So yeah, it's doable. There are plenty of options. The option we pick should be robust to malicious PVFs.

[^1]: Not entirely sure if it is possible; it may happen that we need it with something like paritytech/polkadot-sdk#370.
[^2]: If we reduce the number to 1, we get "the scratch buffer" technique that was/is employed in pallet_contracts.
[^3]: The toolchain currently does not support this feature well, to my knowledge.

yjhmelody commented 1 year ago

Hi @koute @bkchr, I wonder about the progress status of these runtime-allocator-related features. Has anyone designed this part, or have there been any new discussions?

bkchr commented 1 year ago

No, none of this has been designed so far.

yjhmelody commented 1 year ago

TBH, I have designed this part before; I could just port it into Substrate. I have considered various situations around upgrading the runtime specification, including version compatibility and performing spec version checks during code-update upgrades.

Of course, it only covers Substrate; the Polkadot part may be more complicated, since there is verification logic for running parachains, but the thinking should be the same.

Should I open a draft PR to show the idea?

bkchr commented 1 year ago

TBH, I have designed this part before; I could just port it into Substrate.

Can you give a brief overview of your design?

yjhmelody commented 1 year ago

The overview

First, we should disable the host allocator and export the runtime allocator. So we define a new allocator crate, sp-allocator-v1, wrapping sp-io:

[dependencies]
lol_alloc = { version = "0.3", optional = true }

# not used directly, only here to enable features
sp-io = { version = "7.0.0", path = "../io", default-features = false }

[features]
default = ["std"]
std = [
    "sp-io/std",
]

disable_oom = [
    "sp-io/disable_oom",
]

allocator-v1 = [
    "lol_alloc",
    "sp-io/disable_allocator",
    # TODO: This part need to be redesigned
    "sp-io/disable_panic_handler",
]

The crate might contain the following code:

// Crate-level attributes needed for the handlers below (at the time this also
// required a nightly toolchain for `alloc_error_handler`):
#![cfg_attr(not(feature = "std"), no_std)]
#![cfg_attr(not(feature = "std"), feature(alloc_error_handler))]

use core::alloc::GlobalAlloc;
use lol_alloc::{AssumeSingleThreaded, FreeListAllocator};

#[global_allocator]
pub static ALLOCATOR: AssumeSingleThreaded<FreeListAllocator> =
    unsafe { AssumeSingleThreaded::new(FreeListAllocator::new()) };

/// Exported so the host can allocate return buffers inside the runtime's
/// linear memory.
#[no_mangle]
pub unsafe extern "C" fn alloc(size: usize) -> *mut u8 {
    ALLOCATOR.alloc(core::alloc::Layout::array::<u8>(size).unwrap())
}

/// Exported so memory handed out via `alloc` can be returned to the
/// runtime's allocator.
#[no_mangle]
pub unsafe extern "C" fn dealloc(ptr: *mut u8, size: usize) {
    ALLOCATOR.dealloc(ptr, core::alloc::Layout::array::<u8>(size).unwrap())
}

/// Exported to allow resizing an existing allocation.
#[no_mangle]
pub unsafe extern "C" fn realloc(ptr: *mut u8, size: usize, new_size: usize) -> *mut u8 {
    ALLOCATOR.realloc(ptr, core::alloc::Layout::array::<u8>(size).unwrap(), new_size)
}

// TODO: maybe it's better to rename this crate to `sp-runtime-abi`.
/// The dummy function represents the version of runtime ABI.
///
/// This name should be checked by host.
#[no_mangle]
pub extern "C" fn v1() {
    // nop
}

/// A default panic handler for WASM environment.
#[panic_handler]
#[no_mangle]
pub fn panic(_info: &core::panic::PanicInfo) -> ! {
    core::arch::wasm32::unreachable()
}

/// A default OOM handler for WASM environment.
#[alloc_error_handler]
pub fn oom(_layout: core::alloc::Layout) -> ! {
    core::arch::wasm32::unreachable()
}

The versioning here only manages the specification of the exported primitive functions; of course, I think it could also cover more of the specification.

Then we could add this crate as a dependency of node-template-runtime:

sp-allocator-v1 = { version = "7.0", default-features = false, path = "../../../primitives/allocator-v1" }
# ...
[features]
default = ["std"]
allocator-v1 = [
    "sp-allocator-v1/allocator-v1",
]
# ...

And we enable it in the runtime with extern crate sp_allocator_v1; (see the sketch below).
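For illustration, a minimal sketch of what that could look like in the runtime's src/lib.rs (the cfg gate is an assumption, so the allocator crate is only linked into the wasm build):

// Link the allocator crate into the wasm build so that its #[global_allocator]
// and the exported alloc/dealloc/realloc symbols end up in the runtime blob.
#[cfg(not(feature = "std"))]
extern crate sp_allocator_v1;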

Now we can build a Substrate wasm runtime without the host allocator.

And since we mark the version info in the exported function name (or mark it in a custom section, but I think the former design is better), we could refactor sc-executor to check the version info when handling the runtime_blob. The only thing we need to be careful about is that when we meet a code update we should recreate the instance.

Maybe define an AbiVersion enum to handle the different versions:

/// The ABI version for the runtime.
#[derive(Debug, Default, Copy, Clone, Eq, PartialEq)]
#[non_exhaustive]
pub enum AbiVersion {
    /// Legacy allocator provided by the host.
    #[default]
    Legacy,
    /// V1, see sp-allocator-v1.
    V1,
}
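As a rough sketch (not the actual sc-executor code) of how the executor could pick the AbiVersion, it could inspect the module's exports with wasmtime, assuming the hypothetical v1 marker and alloc exports from above:

/// Decide which ABI to use for a compiled runtime module.
fn detect_abi_version(module: &wasmtime::Module) -> AbiVersion {
    let has_export = |name: &str| module.exports().any(|e| e.name() == name);
    // The `v1` marker together with the runtime-side `alloc` signals the new ABI;
    // anything else falls back to the legacy host-side allocator.
    if has_export("v1") && has_export("alloc") {
        AbiVersion::V1
    } else {
        AbiVersion::Legacy
    }
}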

@bkchr

yjhmelody commented 1 year ago

You can try building node-template-runtime here: https://github.com/yjhmelody/substrate/tree/executor-inside-allocator

cargo build -p node-template-runtime --release

Now the wasm executor no longer works, since the v1 spec is not yet supported:

./target/release/node-template --dev --execution=wasm

yjhmelody commented 1 year ago

Hi @bkchr @koute, should I make a PR to show the idea? Although the changes to the wasmtime part will be somewhat troublesome.

tomaka commented 1 year ago

And since we mark the version info in the exported function name (or mark it in a custom section, but I think the former design is better)

IMO it's not necessary to add yet another version number to this. The client should just check whether the runtime exports a function named alloc (or a different name if for some reason we're afraid that alloc might be accidentally exported). There's also no need for dealloc and realloc, just alloc is enough.


However, in my opinion it wouldn't be a bad idea to also transition host functions to not perform allocations at all.

This is very straightforward for the ext_crypto_*, ext_hashing_*, ext_trie_*, and *_clear_prefix functions by making them write to a pre-existing buffer rather than allocate a new one, and I think that everyone would agree that this is a good change (I personally don't understand why functions such as ext_hashing_sha2_256_version_1 don't already do that). For the storage read functions, it's a bit less straightforward as we need to indicate how much we've read, but it should be simple as well.

The only not-so-simple ones could be the *_next_key, *_get storage functions, and ext_misc_runtime_version_version_1, as they return a variable-length value whose maximum length can't really be known in advance. We can always make them work by introducing the necessity to call them a second time if the buffer is too small the first time. This could lead to a performance degradation; my intuition says that in practice it wouldn't, but it should still be measured.
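To illustrate that calling convention, here is a hedged sketch with hypothetical function names and sentinel values (not the actual host interface) of a non-allocating read that retries once when the caller's buffer turns out to be too small:

// Hypothetical non-allocating host import: writes at most `out_len` bytes into
// `out_ptr` and returns the value's full length (u32::MAX meaning "no value").
extern "C" {
    fn ext_storage_read(key_ptr: *const u8, key_len: u32, out_ptr: *mut u8, out_len: u32) -> u32;
}

fn storage_get_with_retry(key: &[u8]) -> Option<Vec<u8>> {
    // Optimistic first attempt with a fixed-size buffer.
    let mut buf = vec![0u8; 1024];
    let len = unsafe { ext_storage_read(key.as_ptr(), key.len() as u32, buf.as_mut_ptr(), buf.len() as u32) };
    if len == u32::MAX {
        return None;
    }
    let len = len as usize;
    if len > buf.len() {
        // The value did not fit: allocate the exact size and call a second time
        // (the potential performance cost mentioned above).
        buf = vec![0u8; len];
        unsafe { ext_storage_read(key.as_ptr(), key.len() as u32, buf.as_mut_ptr(), buf.len() as u32) };
    }
    buf.truncate(len);
    Some(buf)
}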

yjhmelody commented 1 year ago

There's also no need for dealloc and realloc, just alloc is enough.

The current implementation obviously needs dealloc, and here is a realloc issue: https://github.com/paritytech/substrate/issues/11884

However, in my opinion it wouldn't be a bad idea to also transition host functions to not perform allocations at all.

For this part, I don't think it precludes the idea of exporting alloc/dealloc. When executing a big block (such as a code update or a storage migration), memory must be freed to avoid OOM. It's not about a single tx execution; it's about the time/space of executing a whole big block.

tomaka commented 1 year ago

What I mean is that the client never frees the memory it allocates; it's always the runtime that does that, so there's no need for the runtime to export this function.
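To spell out the ownership flow under this scheme, a small sketch (assuming the host obtained the pointer from the runtime's exported alloc): the runtime takes the returned buffer back as an owned value, and dropping it frees the memory through the runtime's own allocator, so no dealloc export is needed.

/// Reconstruct an owned Vec from a (ptr, len) pair returned by a host function.
/// Safety: assumes the host allocated exactly `len` bytes via the runtime's
/// exported `alloc`, so the layout matches and ownership can be taken back here.
unsafe fn take_host_return_value(ptr: *mut u8, len: usize) -> Vec<u8> {
    // Dropping the Vec later releases the memory through the runtime's
    // global allocator; the client never needs to free it.
    Vec::from_raw_parts(ptr, len, len)
}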

yjhmelody commented 1 year ago

Oh yeah, you are right. But from the implementation point of view, it is not hard to export dealloc, just like alloc, and it guarantees a certain degree of interface completeness.

tomaka commented 1 year ago

Whether it is hard or not is completely irrelevant.

There exists a certain interface between the client and the runtime, and exporting alloc is about modifying this interface so that the client uses alloc to allocate the return values. Also exporting dealloc modifies this interface further and for no reason.