paritytech / substrate

Substrate: The platform for blockchain innovators

Better allocator for Wasm #300

Closed: gavofyork closed this issue 5 years ago

gavofyork commented 6 years ago

We currently use a Linear Allocator. It's very inefficient and means we need ~128 x 64KB pages to validate a block with a 350KB transaction in it. Switch this for a Buddy allocator or Slab allocator.
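
For intuition, here is a minimal sketch of the buddy-allocator idea (illustrative only, not the eventual implementation): requests are rounded up to a power-of-two number of minimum-size blocks, and a freed block can be coalesced with its equal-sized "buddy", so memory is actually reusable, unlike a bump/linear allocator that can only grow.

/// Compute the buddy "order" for a request: the allocation would be served
/// from a block of `min_block << order` bytes. Sketch only.
fn buddy_order(size: usize, min_block: usize) -> u32 {
    let blocks = (size + min_block - 1) / min_block;
    blocks.next_power_of_two().trailing_zeros()
}

fn main() {
    // A 350KB value rounds up to a single 512KB block (order 16 with
    // 8-byte minimum blocks) rather than growing the heap forever.
    let order = buddy_order(350 * 1024, 8);
    assert_eq!(8usize << order, 512 * 1024);
}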

Furthermore, sr-io::storage was recently pessimised through adding a needless memcpy in to_vec. The code should be reverted thus:

/// Get `key` from storage and return it as a `Vec`, or `None` if there's a problem.
pub fn storage(key: &[u8]) -> Option<Vec<u8>> {
    let mut length: u32 = 0;
    unsafe {
        let ptr = ext_get_allocated_storage(key.as_ptr(), key.len() as u32, &mut length);
        // The host signals a missing value with a length of `u32::max_value()`.
        if length == u32::max_value() {
            None
        } else {
            // Reclaim the host-allocated buffer without copying;
            // `length` serves as both length and capacity.
            Some(<Vec<u8>>::from_raw_parts(ptr, length as usize, length as usize))
        }
    }
}

To ensure this code doesn't result in UB on the part of Vec, ext_get_allocated_storage should ensure that the pointer is aligned according to u8 allocations on the unknown-unknown-webassembly platform (almost certainly 4 bytes). This won't be necessary as long as the Heap implementation ensures all allocations are on such a boundary.

pepyakin commented 6 years ago

Now the heap is controlled by code external to the runtime: there are ext_malloc and ext_free functions that lay out the heap and provide the allocation machinery. This approach has some upsides: it should be more performant, and it is somewhat more convenient (since some external functions actually require allocations).

But it has downsides too. The allocator must be implemented outside of the runtime, which for us means at least two different implementations: one for Rust and one for JS. Such a mechanism may increase the chance of a consensus issue.

So maybe we should reconsider the question of the allocation mechanism. To start, we probably need to measure the performance impact of placing allocation inside the runtime.

pepyakin commented 6 years ago

There are a few places where unsafe code can be broken by changing the allocator, for example this.

It's unfortunate that, even though we have a custom system allocator, we can't be sure that memory allocated with the system allocator is safe to free with Vec::drop.

pepyakin commented 5 years ago

We need this for contracts because they can use a lot of allocations (bounded only by the gas limit).

gavofyork commented 5 years ago

There are a few places...

I'm unconvinced any of these are unsafe, or at least cannot be made safe, in practice.

I already checked to ensure that Vec::drop frees using ext_free for sanity.

Is there any way that Vec::drop could possibly deallocate other than by calling ext_free? Assuming that there is appropriate alignment (which is surely trivial given that we're dealing only with Vec<u8>), do we not satisfy those invariants in practice?

pepyakin commented 5 years ago

My worry is about the case where somebody violates an invariant, thus invoking UB: "in practice" is a very weak condition. Sure, these safety requirements depend on a lot of variables: the allocator used, the platform, rustc and the standard library, the type of T. By fixing these you reduce the chance of actual breakage in practice.

But my main point is that there is still a chance of breakage, and we can't guarantee that the compiled code does what we expect. In order to check that in practice, we would have to check the behavior every time the code is compiled. To make it worse, the documentation doesn't actually mention that this is related to the allocator; it just requires these invariants to be upheld. Thus, in theory, you have to check every function used, because any of them might rely on these invariants. And I don't mean just cases like push (which performs a reallocation, which performs a free); I mean literally every other method, including .len(). Even testing doesn't help much. This is the nature of UB, unfortunately.

Yes, I agree, this sounds over-conservative, pessimistic, and theoretical. And yes, I agree that the chance of breakage is rather low.

But it is still there and we have to admit it.

pepyakin commented 5 years ago

The good news is: the mentioned code is no longer a problem.

gavofyork commented 5 years ago

So while I certainly don't want to risk introducing UB into consensus code, I also don't want to be a slave to dogma.

The only thing we actually need to check is that the data at the pointer passed in to from_raw_parts is allocated in precisely the same way that a Vec<u8> in the runtime would allocate it. This is a small piece of fairly trivial logic and is easy to audit prior to a production runtime release.

Other, non-Rust, runtimes may potentially need to memcpy it and free the pointer manually if they do not have a similar API, or if they cannot guarantee that their language's abstractions won't keep using the malloced data and will free it as expected. But that's not really our problem, at least not yet.

I don't want to add a needless memcpy to every single storage fetch due to an over-conservative reading of the documentation.

pepyakin commented 5 years ago

Note that my remark was more of an argument in favour of the approach proposed here. If we decided to implement that approach (i.e. the runtime allocates memory and substrate fills it), then we wouldn't even need to bother with this issue. But, as described and discussed, it requires some research.

Here is another idea, though still an invasive one:

As far as I can see, this should satisfy all the use cases and not pessimise things too much: if a user wants to use it as a slice, Deref has them covered. If the user wants to use it as a Vec, i.e. push something, the case is not pessimised compared to Vec::from_raw_parts, because length is equal to capacity and pushing would require a reallocation anyway. Cases like pop will see a regression, though.
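
The snippet for the idea didn't survive in the thread above, but judging from the description its shape is presumably something like the following: a wrapper that owns the host-allocated buffer, derefs to a slice, and converts into a Vec with length == capacity. All names here are illustrative, not the actual proposal.

use std::ops::Deref;

extern "C" {
    // Assumed host-side deallocator, per the discussion above.
    fn ext_free(ptr: *mut u8);
}

// Hypothetical wrapper owning a buffer handed to us by the host.
pub struct StorageBytes {
    ptr: *mut u8,
    len: usize,
}

impl Deref for StorageBytes {
    type Target = [u8];
    fn deref(&self) -> &[u8] {
        unsafe { std::slice::from_raw_parts(self.ptr, self.len) }
    }
}

impl From<StorageBytes> for Vec<u8> {
    fn from(b: StorageBytes) -> Vec<u8> {
        // length == capacity, so a subsequent `push` reallocates anyway.
        let v = unsafe { Vec::from_raw_parts(b.ptr, b.len, b.len) };
        std::mem::forget(b); // the Vec now owns the buffer; skip our Drop
        v
    }
}

impl Drop for StorageBytes {
    fn drop(&mut self) {
        // Return the buffer to the host allocator.
        unsafe { ext_free(self.ptr) }
    }
}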

gavofyork commented 5 years ago

runtime allocates memory and substrate fills it

I don't like that approach as it involves an extra trip over the native/wasm divide, and an extra lookup into the hashmap or trie.

The other approach is, as you say, invasive. That said, that call is essentially only made when reading a typed storage item so it may not be much of an issue.

eira-fransham commented 5 years ago

I mentioned this to @pepyakin in a private conversation, but I think that I have a design that removes the need for calling into the contract's malloc implementation at all (and therefore can allow malloc to be just another function, and we can have some implementation of dynamic libraries that addresses the code size issues).

Essentially, read_storage(key_ptr: *const u8, key_len: u32) (or whatever the signature is) returns some handle (probably just a u32) that can then be used to read a fixed number of bytes at an offset with a different function, using a set of functions like:

extern "sysv64" fn open_handle(key_ptr: *const u8, key_len: u32) -> u32;
// Returns number of bytes written
extern "sysv64" fn read_from_handle(handle: u32, offset: u64, dst: *mut u8, dst_len: u32) -> u32;
// This would likely be called in the `Drop::drop` impl of some wrapper type
extern "sysv64" fn free_handle(key_ptr: *const u8, key_len: u8);

We can start by only allowing one handle to be active at once, since that massively reduces implementation complexity, and only expose it via a shim of the existing read_storage(key: &[u8]) -> Vec<u8> that would be implemented in userspace. In the future we can write a wrapper that implements Read (and we can also implement Write using a similar approach), which would allow streaming the data. AFAICT this would be zero-overhead compared to the current approach if the userspace also had a way to get the length of the value given a handle to it, so it could preallocate a vector with that space; we wouldn't even need to write an optimised memcpy that the contract could call out to, as I was originally suggesting to Sergei, since the memcpy itself would be done by the host.
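
A minimal sketch of that userspace shim, assuming the host functions above plus a hypothetical handle_len call for querying a value's total length (extern "C" and all names here purely for illustration):

extern "C" {
    fn open_handle(key_ptr: *const u8, key_len: u32) -> u32;
    fn read_from_handle(handle: u32, offset: u64, dst: *mut u8, dst_len: u32) -> u32;
    fn free_handle(handle: u32);
    fn handle_len(handle: u32) -> u32; // hypothetical length query
}

pub fn read_storage(key: &[u8]) -> Vec<u8> {
    unsafe {
        let handle = open_handle(key.as_ptr(), key.len() as u32);
        let len = handle_len(handle);
        // Preallocate once; the single memcpy happens on the host side.
        let mut buf = vec![0u8; len as usize];
        let written = read_from_handle(handle, 0, buf.as_mut_ptr(), len);
        buf.truncate(written as usize);
        free_handle(handle);
        buf
    }
}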

Obviously it's quite the rearchitecture, but the way I see it this removes most of the complexities around malloc, is no slower than the current implementation, and allows a lot more flexibility in userland code: it allows contracts that do no allocation at all and only operate on fixed-size buffers or streaming data.

gavofyork commented 5 years ago

if the userspace also had a way to get the length of the value given a handle to it so they could preallocate a vector with that space

That would probably necessitate a further call across the wasm/native divide, and potentially having to go back to the hashmap, without a substantial redesign of the native side.

Streaming data in user-land from native is a non-starter. We do it already with IncrementalInput and it's slow since a round trip to native land is far slower than a pointer addition. One of my tasks will be to remove it entirely.

gavofyork commented 5 years ago

I checked the impl of Vec and, unsurprisingly, the code demonstrates that ensuring no UB is really not an intractable problem. Here's the code in question: https://github.com/rust-lang/rust/blob/master/src/liballoc/raw_vec.rs#L80

let align = mem::align_of::<T>();
let layout = Layout::from_size_align(alloc_size, align).unwrap();
let result = if zeroed {
    a.alloc_zeroed(layout)
} else {
    a.alloc(layout)
};

For our Vec, A is Global, which means a.alloc effects a direct call into our ext_malloc via __rust_alloc which handles alignment.
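
As a rough sketch of that routing (assuming the ext_malloc/ext_free externs discussed above; this is not necessarily how Substrate actually wires it up), a #[global_allocator] forwarding to the host is what makes __rust_alloc, and therefore Vec, end up in ext_malloc:

use std::alloc::{GlobalAlloc, Layout};

extern "C" {
    // Host-provided allocation machinery, per the discussion above.
    fn ext_malloc(size: u32) -> *mut u8;
    fn ext_free(ptr: *mut u8);
}

struct HostAllocator;

unsafe impl GlobalAlloc for HostAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // The host heap is assumed to hand back suitably aligned pointers.
        ext_malloc(layout.size() as u32)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, _layout: Layout) {
        ext_free(ptr)
    }
}

// Registering it routes `__rust_alloc`/`__rust_dealloc` through the host.
#[global_allocator]
static ALLOCATOR: HostAllocator = HostAllocator;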

To make the existing code correspondingly correct, we just need to ensure the data is aligned to mem::align_of::<u8>(), as the runtime would require. This is a fixed platform value, unlikely to change, which can be determined by running https://play.rust-lang.org/?version=nightly&mode=debug&edition=2018&gist=7853a458ee9621a5b742fd64414f9c70
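
The gist isn't reproduced here, but it is presumably something along these lines:

use std::mem;

fn main() {
    // `u8` has alignment 1 on every Rust target, wasm32 included.
    println!("{}", mem::align_of::<u8>());
}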

eira-fransham commented 5 years ago

if the userspace also had a way to get the length of the value given a handle to it so they could preallocate a vector with that space

That would probably necessitate a further call across the wasm/native divide, and potentially having to go back to the hashmap, without a substantial redesign of the native side.

Streaming data in user-land from native is a non-starter. We do it already with IncrementalInput and it's slow since a round trip to native land is far slower than a pointer addition. One of my tasks will be to remove it entirely.

A round trip to native code need not be expensive; it is with wasmi, but all calls are expensive with wasmi. With lightbeam it would be essentially free, or as free as any sysv function call can be. Obviously lightbeam is pre-pre-pre-pre-pre-alpha right now, but I don't think reducing host calls is necessarily a reason to throw the idea out the window.

My idea is that when you request a handle, the native side fetches the vector into memory, either just holding references with rental or by prehashing the key and making subsequent accesses fast that way. This would make getting the length cost just a host call plus a single pointer deref.

EDIT: We can also work on making host calls faster in wasmi somehow.
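
An illustrative sketch of that host-side handle table (all names invented; the one-active-handle simplification is omitted for clarity):

use std::collections::HashMap;

// Host side: the value is fetched once at `open`; subsequent reads and
// length queries are just map lookups and memcpys, with no trie re-query.
struct Handles {
    next: u32,
    open: HashMap<u32, Vec<u8>>,
}

impl Handles {
    fn open(&mut self, value: Vec<u8>) -> u32 {
        let handle = self.next;
        self.next += 1;
        self.open.insert(handle, value);
        handle
    }
    fn len(&self, handle: u32) -> usize {
        self.open[&handle].len()
    }
    fn read(&self, handle: u32, offset: usize, dst: &mut [u8]) -> usize {
        let value = &self.open[&handle];
        if offset >= value.len() {
            return 0;
        }
        let n = dst.len().min(value.len() - offset);
        dst[..n].copy_from_slice(&value[offset..offset + n]);
        n
    }
    fn free(&mut self, handle: u32) {
        self.open.remove(&handle);
    }
}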

gavofyork commented 5 years ago

Sure, but this would require a lot of redesign, refactoring, and optimisation; it's complex, and it accomplishes essentially nothing more than what we already have.

gavofyork commented 5 years ago

Made a start in gav-start-refactor

pepyakin commented 5 years ago

Closed by #1460

pepyakin commented 5 years ago

Reopened due to the revert in https://github.com/paritytech/substrate/pull/1502

pepyakin commented 5 years ago

Closed by https://github.com/paritytech/substrate/pull/1506