polkadot-fellows / RFCs

Proposals for change to standards administered by the Fellowship.
https://polkadot-fellows.github.io/RFCs/
Creative Commons Zero v1.0 Universal

Remove unnecessary host functions allocator usage #4

Open tomaka opened 1 year ago

tomaka commented 1 year ago

I have followed the template in https://github.com/polkadot-fellows/RFCs/pull/2.

tomaka commented 1 year ago

Is my understanding of the RFC correct that it proposes to introduce the new functions but not to remove the old ones?

Yes!

This seems valid, but after being accepted/implemented, the old versions should be deprecated and then removed entirely (after some significant period of time, to preserve backwards compatibility).

They should be deprecated, yes, but removing them would mean that you could no longer sync older blocks. I'm not necessarily against breaking the possibility to sync older blocks, but that's a big/controversial change not in scope of this RFC.

pepyakin commented 1 year ago

This is definitely a step in the right direction. I also agree that the C API approach makes sense under the current conditions. I am a bit worried, though, that this proposal doesn't provide any numbers.

I am thinking that in the future we might need to consider an API design overhaul, as if we were designing it from scratch in 2023. But this is outside the scope of this proposal.

Just for linking purposes, here is a relevant discussion on Substrate: https://github.com/paritytech/substrate/issues/11883

burdges commented 1 year ago

Is there an obstacle to host functions actually doing borrowing aka just pass pointers? We need not even copy memory in say:

fn host_function<'a, 'b>(foo: &'a mut [u8], bar: &'b [u8; 32]) -> &'a mut [u8];

Also..

A hash interface like pub fn keccak_256(data: &[u8]) -> [u8; 32] is inherently inefficient for variable sized data because you must build some Vec<u8> before doing the hashing. Although RustCrypto/digest is imperfect, at least their traits avoid this.

We could have Wasm types designed for use directly by the host, like demanding u64 or SIMD alignment, and then implement the appropriate digest::{Update, Reset, FixedOutput*, ExtendableOutput*, XofReader} traits directly on those.

#[derive(Clone)]
#[repr(align(8))]
pub(crate) struct Sha3State {
    state: [u8; 26 * 8],
    //  [u64; PLEN] with PLEN = 25 plus a usize=u64
    // https://github.com/RustCrypto/hashes/blob/master/sha3/src/state.rs#L9
}

impl Update for Sha3State {
    fn update(&mut self, data: &[u8]) {
        // Host call.  On the host side unsafely casts our Sha3State to sha3::Sha3State
    }
}
...

You'd then use a cargo patch to swap sp-sha3 in over sha3, and then you'd have a faster hashing interface. It's more work to do this, however..

tomaka commented 1 year ago

This is definitely a step in the right direction. I also agree that the C API approach makes sense under the current conditions. I am a bit worried, though, that this proposal doesn't provide any numbers.

I have opened this RFC with the desire of making things move. I would be more than happy if someone else took over this effort and provided numbers. Personally, it seems obvious to me that "allocating memory, writing a hash to it, and freeing said memory" is slower than "writing a hash to memory". And if it turned out that in practice it wasn't slower, I would search for the problem elsewhere. Since nobody else is proposing these changes, it will have to be done by me, and without numbers.

Is there an obstacle to host functions actually doing borrowing aka just pass pointers? We need not even copy memory in say:

A hash interface like pub fn keccak_256(data: &[u8]) -> [u8; 32] is inherently inefficient for variable sized data because you must build some Vec<u8> before doing the hashing. Although RustCrypto/digest is imperfect, at least their traits avoid this.

This is very precisely what this RFC fixes.

burdges commented 12 months ago

This is very precisely what this RFC fixes.

Only for the case where a fixed-length buffer is being hashed. sp_io::Hashing inherently demands allocations when hashing a variable-length buffer. I just pointed out that the variable-length buffer interface looks like:

#[runtime_interface]
pub trait Hashing {
    // Existing fixed-length buffer methods...

    fn keccak_256_update(state: &mut Sha3State, data: &[u8]);
    fn keccak_256_finalize(state: &Sha3State) -> [u8; 32];

    fn keccak_512_update(state: &mut Sha3State, data: &[u8]);
    fn keccak_512_finalize(state: &Sha3State) -> [u8; 64];

    fn sha2_256_update(state: &mut Sha2State, data: &[u8]);
    fn sha2_256_finalize(state: &Sha2State) -> [u8; 32];

    fn blake2_256_update(state: &mut Blake2State, data: &[u8]);
    fn blake2_256_finalize(state: &Blake2State) -> [u8; 32];
}

Anyways..

If we typically hash fixed-length buffers, then maybe this variable-length buffer allocation is best kept, as the above interface demands quite a few more calls.

But before the host-side allocator can be deprecated, all the host functions that make use of it need to be updated to not use it.

I see, we need roughly this for backwards compatibility anyways.

yjhmelody commented 10 months ago

Hi, I want to know whether there is any plan to move the allocator inside the wasm runtime. It would definitely make the Substrate runtime more generic.

tomaka commented 10 months ago

Same remark as https://github.com/polkadot-fellows/RFCs/pull/8#issuecomment-1722447950

I would like to propose this RFC, but I don't know how to proceed.

bkchr commented 10 months ago

To move forward with this pull request, I would like to see some basic benchmarks being done. @yjhmelody could you please help with this?

bkchr commented 10 months ago

I would also like to see this directly move the entire allocator to the runtime. Not sure why we should only do this halfway.

tomaka commented 10 months ago

I can expand the RFC to cover all the host functions.

When I opened this RFC (two and a half months ago), I actually expected a swift and unanimous approval. But if RFCs inherently need a long time, I might as well make it as big as possible.

koute commented 10 months ago

+1 for going all the way and moving the whole allocator inside of the runtime, so that for new blocks the allocator host functions are never called anymore.

bkchr commented 10 months ago

When I opened this RFC (two and a half months ago), I actually expected a swift and unanimous approval.

Yeah, sorry about that, but it actually also shouldn't be "swift". Let's do it properly.

tomaka commented 10 months ago

I'm currently updating the RFC.

When it comes to ext_offchain_network_state_version_1, I'm going in the direction of deprecating the function due to its unclear semantics. It's not clear from the runtime's perspective whether the PeerId or addresses returned by this function can change over time or not (hint: they can change and this function is very racy).

The only place where this function is used is in the node-authorization pallet, in order to grab the PeerId of the local node and mark it as reserved (I don't really understand why you'd mark yourself as reserved, but whatever). While this is slightly off-topic, I also think that the mechanism of reserved peers should be deprecated due to unclear semantics.

For the sake of not breaking things, I'm going to add a ext_offchain_network_peer_id_version_1 function, but I do think that this function should not exist.

tomaka commented 10 months ago

I've updated the RFC.

As I've mentioned in the "unresolved questions" section, I expect the offchain worker functions to be slightly slower than right now, but this is IMO negligible since they are offchain.

To me, the biggest potential source of performance degradation comes from the deprecation of ext_storage_get_version_1, as the runtime might have to call ext_storage_read_version_2 multiple times instead of calling ext_storage_get_version_1 once. In theory it should be possible to progressively read storage items (in a streaming way) whose size isn't known ahead of time, rather than loading them completely in one go, but I understand that this would be a pretty large change to the runtime code, and I'm afraid that half-assing the change might make reading slow.

The changes to ext_storage_next_key might also cause performance degradation if not implemented properly. This is also a moderately sized change to the runtime code (that will degrade performance if half-assed), but I have the impression that it shouldn't be too hard to do it properly.

tomaka commented 9 months ago

To move forward with this pull request, I would like to see some basic benchmarks being done. @yjhmelody could you please help with this?

Is something in progress?

While I agree that RFCs shouldn't necessarily be swift, if nothing is done then the ETA of this RFC is "infinity" and it would be great if it was finite. The more we wait, the more other changes (such as #31) will conflict with this RFC.

(and the reason why I originally submitted a stripped-out version is that I anticipated this happening)

burdges commented 9 months ago

The more we wait, the more other changes .. will conflict with this RFC.

Agreed; our unstable host calls for elliptic curves pass Vec<u8>s too, for example. I managed to get them down to only one or two allocations each, and in some cases those are unavoidable, but.. They could be fixed with a PR now, given they're not stable or used anywhere. I just do not know how to pass a &[u8] to a host call without doing one allocation. This matters for halo2 usage.

yjhmelody commented 9 months ago

To move forward with this pull request, I would like to see some basic benchmarks being done. @yjhmelody could you please help with this?

Is something in progress?

While I agree that RFCs shouldn't necessarily be swift, if nothing is done then the ETA of this RFC is "infinity" and it would be great if it was finite. The more we wait, the more other changes (such as #31) will conflict with this RFC.

(and the reason why I originally submitted a stripped-out version is that I anticipated this happening)

Sorry, I hadn't seen this message before. I personally think there is no need to do benchmarking; it would take a lot of effort. And there is no reason not to support the runtime's own allocator, which could greatly promote the Substrate runtime ecosystem, and obviously it will not degrade performance very much unless you use a very poor memory allocator.

You can see that when various Ethereum ecosystems use wasm or other virtual machines, they all use a runtime allocator instead of the host's. This is very convenient in verification scenarios, because verification is just a pure function, and it helps support more ecosystem use cases; it is a pity that Substrate has not supported it.

Sorry again, I don't have the energy to push related improvements forward at the moment.

tomaka commented 8 months ago

I've run a quick experiment: I tried to extract the allocator out of smoldot (in other words, to do to smoldot the opposite of what this RFC wants to do to the runtime), and then measured the size of smoldot compiled to Wasm. Smoldot uses dlmalloc, which is the allocator used by default by the Rust standard library.

With the allocator outside of smoldot: 2,789.9 kB
With the allocator inside of smoldot: 2,794.3 kB

So basically the size of the code of the allocator is <5 kiB. And that's uncompressed. According to this page, compression seems to reduce the size by ~75%, so we're closer to 1kiB to 1.5kiB of overhead if we moved the allocator to the runtime.

radkomih commented 7 months ago

Sorry if this isn't the right place for this, but I would like to share our insights on the topic:

I'm part of a team developing runtimes in Go, and we've had to modify the compiler to ensure compatibility with the external allocator. We've implemented a custom garbage collector that integrates seamlessly with this external allocator, but this is one of the reasons we're not using the standard Go compiler (another is related to the Wasm target). For us, integrating the host allocator into the runtime is not solely a matter of speed, it's also about facilitating easier adoption and generalization for writing Polkadot runtimes in various languages.

+1 this change to happen.

koute commented 5 months ago

This RFC is kind of dead, so unless anyone else steps up I volunteer to push it forward.

I will implement this in Substrate and report back; let's get this over the finish line.

burdges commented 5 months ago

As discussed above, it appears twox is sufficiently deprecated to ignore here.

A polkavm-ish question: Is there some minimum overhead for a host function call? If it's small enough, then, as I described above..

One could open up the hash functions' guts and provide host functions for digest::Update::update and digest::FixedOutput::finalize_into. We'd avoid doing so many allocations, even inside the runtime, this way.

koute commented 5 months ago

A polkavm-ish question: Is there some minimum overhead for a host function call? If it's small enough, then, as I described above..

One could open up the hash functions' guts and provide host functions for digest::Update::update and digest::FixedOutput::finalize_into. We'd avoid doing so many allocations, even inside the runtime, this way.

For WASM VMs, depending on the implementation, hostcalls can be expensive (e.g. if the VM fiddles with the signal handlers when crossing the boundary). For PolkaVM specifically, a hostcall does involve crossing a thread/process boundary (essentially similar to sending a message through a channel to another thread, but possibly a little cheaper than your average Rust async channel implementation due to the optimistic spinlocking I can do). So it's really hard to say; it depends on the ratio of work saved to the number of host calls.

Another alternative would be a vectored call, that is, something like fn hashv(slices: &[*const u8], out_hash: &mut [u8]), which is sort of a hybrid: it takes only a single call, but you can hash multiple disjoint chunks of memory as if they were contiguous.

Nevertheless, I'm not sure if we need this?

burdges commented 5 months ago

These calls are purely for performance and have no reason to transfer control to the host. We could maybe have machine code which gets linked in directly? I guess that'd bring a different auditing problem..

Nevertheless, I'm not sure if we need this?

Nah. It helps avoid some performance footguns, but whatever.

IETF standards where those footguns look likely, like hash-to-curve and HKDF, pay some small cost to make this a non-problem, not that we support any of their hash functions anyways.

yjhmelody commented 5 months ago

This RFC is kind of dead, so unless anyone else steps up I volunteer to push it forward.

I will implement this in Substrate and report back; let's get this over the finish line.

I don't have much free time at the moment, but if you implement it, I can help review it :)