solana-labs / solana

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://solanalabs.com
Apache License 2.0

BPF VM stack frame size limit is too limiting for programs #13391

Closed jstarry closed 1 year ago

jstarry commented 3 years ago

Problem

The BPF VM limits stack frames to 4KB, which can easily be surpassed by non-trivial programs. There are workarounds for this limit, but they sacrifice developer ergonomics and program efficiency. Since accounts can be up to 10MB in size, it's not unreasonable to think devs will want stack-allocated objects that use a moderate portion of that limit. For instance, the Serum order book state is 3228 bytes. Processing more than one order book state at the same time in a program becomes impossible without heap allocations or breaking the state up into smaller views.

Workarounds:
  1. Allocate on the heap instead of the stack

This is a big headache for Rust programs since Rust allocates on the stack by default and doesn't yet provide an easy way to avoid stack allocations when allocating on the heap (a minimal sketch follows this list):

https://docs.rs/bytemuck/1.4.1/bytemuck/allocation/fn.try_zeroed_box.html
https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=c78da94118e5e09925138548cd7093cb

  2. Break up large account states into smaller views

Rather than deserializing the full account state, programs can be more efficient about the data they deserialize to reduce stack allocations and instruction count. There isn't a convention in the Rust SDK yet that makes this easy.

  3. Zero-copy deserialization

Like 2) above, this would probably be a good pattern to promote, but there isn't a convention or tooling that makes it easy yet.

  4. Micro optimizations

There are probably lots of small tricks that could be used to break up program logic and minimize stack allocations. Rust programs can eagerly drop stack-allocated objects by using nested scopes, or they can split logic into smaller functions to avoid accumulating too many allocations.
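
For workaround 1 above, here is a minimal sketch using the bytemuck helper linked there (this assumes bytemuck's extern_crate_alloc feature is enabled; OrderBookState is a hypothetical stand-in for a large account state):

use bytemuck::{allocation::try_zeroed_box, Zeroable};

// Hypothetical large account state, standing in for something like the Serum order book.
#[repr(C)]
pub struct OrderBookState {
    pub bids: [u64; 256],
    pub asks: [u64; 256],
}

// Safety: the struct is plain old data, so the all-zeroes bit pattern is a valid value.
unsafe impl Zeroable for OrderBookState {}

pub fn load_state() -> Box<OrderBookState> {
    // Allocates and zeroes the value directly on the heap, never materializing it on the stack.
    try_zeroed_box::<OrderBookState>().expect("allocation failed")
}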

Proposed Solution

Increase stack frame size considerably

tag @jackcmay

jackcmay commented 3 years ago

The BPF stack consists of fixed stack frames rather than a mutable stack pointer. The unfortunate side effect of this is that each function call uses its own full stack frame. Increasing the stack frame size to 10MB would be a very large burden given that we support call depths of 64.

There have been discussions recently about doubling the stack frame size to 8KB, since most of the reports we've gotten from developers and internally are within 8KB. We could make it larger than 8KB to be on the safe side, but any increase in the stack frame size is applied to every call in a call chain. There are optimizations we could make to help minimize the cost, like allocating chunks of frames as needed or using a heap pool, but the sheer amount of memory being allocated for a program instantiation has to be taken into consideration for performance.
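
As a quick back-of-the-envelope illustration of that cost (a sketch assuming the stack memory is reserved up front for every frame in a maximum-depth call chain, as described above):

const MAX_CALL_DEPTH: usize = 64;

// Total stack memory reserved per program instantiation if every frame in a
// maximum-depth call chain gets the full frame size.
fn total_stack_bytes(frame_size: usize) -> usize {
    frame_size * MAX_CALL_DEPTH
}

fn main() {
    assert_eq!(total_stack_bytes(4 * 1024), 256 * 1024);                // 4KB frames  -> 256KB
    assert_eq!(total_stack_bytes(8 * 1024), 512 * 1024);                // 8KB frames  -> 512KB
    assert_eq!(total_stack_bytes(10 * 1024 * 1024), 640 * 1024 * 1024); // 10MB frames -> 640MB
}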

We've also considered breaking away further from standard eBPF by changing from stack frames to stack pointers...

@jstarry from your experience, is there a stack frame size that you think would meet most folks' needs?

jstarry commented 3 years ago

Doubling to 8KB would certainly help! I did a bit of digging and it seems the biggest hits come from rather unexpected places, like when LLVM inlines functions and when local variables are pushed onto the stack again when passed in a function call.

aeyakovenko commented 3 years ago

@jstarry why can't you use Box<> to allocate on the heap?

jstarry commented 3 years ago

You can, but Rust doesn't make it easy to avoid an allocation on the stack when allocating on the heap. You have to use a library or do some unsafe stuff:

use std::alloc::{alloc_zeroed, Layout};

let boxed = unsafe {
  // Allocate zeroed memory for BigStruct directly on the heap...
  let layout = Layout::new::<BigStruct>();
  let raw_allocation = alloc_zeroed(layout) as *mut BigStruct;
  // ...and take ownership of it without ever constructing the value on the stack.
  Box::from_raw(raw_allocation)
};

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=c78da94118e5e09925138548cd7093cb

jstarry commented 3 years ago

And I stopped short of recommending devs do this in our docs because I feel like it's an anti-pattern that will affect performance. Stack allocation should be faster if the BPF VM allows it, right?

aeyakovenko commented 3 years ago

@jstarry it will make the programs larger and more expensive to load, since it affects every frame. Can we write a macro for heap alloc?

jstarry commented 3 years ago

Yeah, we can; I laid out a few other workarounds above too. I think zero-copy deserialization could be helpful; I'm not sure why we haven't been doing that in SPL.
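
A rough, hypothetical sketch of what such a macro could look like (not an existing SDK macro): it just wraps the unsafe alloc_zeroed + Box::from_raw pattern shown above behind a single invocation.

#[macro_export]
macro_rules! boxed_zeroed {
    ($ty:ty) => {{
        let layout = ::std::alloc::Layout::new::<$ty>();
        // Safety: only valid for types where the all-zeroes bit pattern is a valid value.
        unsafe {
            let raw = ::std::alloc::alloc_zeroed(layout) as *mut $ty;
            if raw.is_null() {
                ::std::alloc::handle_alloc_error(layout);
            }
            ::std::boxed::Box::from_raw(raw)
        }
    }};
}

// Usage with a hypothetical large state type:
// let state: Box<BigStruct> = boxed_zeroed!(BigStruct);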

dzmitry-lahoda commented 3 years ago

Porting borsh to zero-copy from account data would help a lot with stack sizes, and I guess with performance too. borsh looks like a simple serialization format. Also, some cool third-party crates like enum_dispatch may not work with scoped references out of the box, so the pattern (a switch/match instead of dyn) would have to be implemented manually.

jstarry commented 3 years ago

Experimenting with https://github.com/djkoloski/rkyv would be cool. @dzmitry-lahoda are you interested in trying it out to see what the stack usage improvements would be?

dzmitry-lahoda commented 3 years ago

I strongly believe that zero-copy is the best approach for writing Solana programs properly. Ideally, borsh would provide the option to either deserialize onto the stack or onto the heap, or to generate a zero-copy view. But if they don't go that route, then yes, rkyv is an option to research.

What would be the minimal research output needed to consider adopting rkyv in SPL?

jstarry commented 3 years ago

What would be the minimal research output needed to consider adopting rkyv in SPL?

A working demo that deserializes a large account struct without blowing up the stack or using heap allocation would be cool.

dzmitry-lahoda commented 3 years ago

There is no default rkyv code path that works with a slice of account data... It works well only with an allocated vector. It either panics (hangs the VM) or lacks trait implementations for serializing against sliced serializers: https://github.com/dzmitry-lahoda/solana-zero-copy-serde-example/blob/290d2c36a6460e6d23dd3b4b4d889acb0c14361d/src/processor.rs#L41

So on a first attempt it does not work in the VM, and something will certainly need to be implemented to make it work with a slice of account data.

As of now, rkyv generates wrappers around the raw data (hopefully without copying it; why does a zero-copy API insist on an allocator?). The wrapper provides plain struct field access rather than a view via functions. So if it is zero-copy, then the data layout on storage equals the data layout in memory (because no view is generated).

Not sure I like rkyv now.

jstarry commented 3 years ago

Hey @djkoloski 👋🏻 we are exploring the use of rkyv and it looks really promising, but it's not clear how to leverage zero-copy deserialization over a slice of data. Can you provide some pointers?

djkoloski commented 3 years ago

Hey all, thanks for the interest in rkyv. Let me know if I missed addressing anything:

Responses

You can but Rust doesn't make it easy to avoid an allocation on the stack when allocating to the heap.

Yeah, this is annoying for stuff like Box::new that takes a variable by value and moves it to a heap allocation. You end up needing stack space to pass the value in, then heap space to store the value. When you have an array of objects that you want to avoid stack allocation for, you can always use a Vec which will always heap-allocate and can have elements added one at a time. There's also an unstable Box API for doing the heap-allocating pattern given above (new_zeroed / assume_init).
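
A minimal sketch of that unstable Box API (nightly-only at the time of writing, behind the new_uninit feature; BigStruct is a hypothetical placeholder):

#![feature(new_uninit)]

// Hypothetical large value that would otherwise be constructed on the stack.
#[repr(C)]
pub struct BigStruct {
    pub data: [u8; 8192],
}

fn main() {
    // Box::new_zeroed allocates zeroed heap memory without building the value on the stack;
    // assume_init is sound here because the all-zeroes bit pattern is valid for BigStruct.
    let boxed: Box<BigStruct> = unsafe { Box::<BigStruct>::new_zeroed().assume_init() };
    assert_eq!(boxed.data[0], 0);
}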

There are no default rkyv codes which uses a slice of account data... It works well only with an allocated vector.

I'm the first to admit that rkyv is a little complicated especially compared to serde. Most of the examples in the book are for sized types; that's just because sized types are what most users will be working with. Slices and other unsized types are supported by rkyv, you just have to use different functions in a few places. In this case, I'd suggest using serialize_unsized_value to serialize a slice and archived_unsized_value to get the slice back out. You can read more about how rkyv handles unsized values in the book and the documentation for ArchiveUnsized.

As of now rkyv generates wrappers around raw data (hopefully not copies them, why zero copy api so insisting on allocator?).

All of the work rkyv does is at serialization time, which makes sense as it's a zero-copy deserialization framework. 🙂 You can read more about how it works in the book. As a quick summary, regular types aren't suitable for zero-copying if they contain any data like pointers that may change from execution to execution. rkyv generates "archived" variants of types that are made of types that act the same but are zero-copy safe. Then when you serialize a regular type, rkyv converts it to that archived variant and writes that to your output stream. The process of getting data back out is as simple as a pointer cast to the archived type.

It panics(hangs VM)

There are a few situations that can cause this to happen, especially surrounding buffer alignment. As of 0.4.3 there should be some checks verifying that, but they're behind a #[cfg(debug_assertions)] and won't get run if you're only running release code. If you ever run into crashes or hangs when using a buffer, I'd definitely recommend running it through check_archive and bytecheck should diagnose any issues. If you run into any unexpected behavior, feel free to file an issue and I'll look into it as soon as I can.

Sample code

Here's an example of serializing a slice:

let my_slice = // ...
let writer = // ...

let mut serializer = WriteSerializer::new(writer);
let pos = serializer.serialize_unsized_value(my_slice)?;

And accessing it:

// Make sure that your data buffer is properly aligned for the types inside
let data = // ...

let archived = unsafe { archived_unsized_value::<[MySliceType]>(data, pos) };

If you have any other questions, feel free to ping me or hop into the Discord and ask.

dzmitry-lahoda commented 3 years ago

Producing a zero-copy view

rkyv requires the AccountInfo data (Rc<RefCell<&'a mut [u8]>>) to be aligned to 8 bytes. If I take the data and slice it at data[1..] it fails; if I slice it at data[8..] it works. Deserialization works in the local test VM, but I'm not sure whether the alignment guarantee will always hold, or whether it will hold for a program deployed to a Solana cluster.
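
A small illustration of that alignment constraint (my own sketch, not an rkyv API): a Vec<u8> buffer is only guaranteed 1-byte alignment, so an arbitrarily offset slice of it can easily violate the 8-byte requirement.

fn is_aligned_to_8(bytes: &[u8]) -> bool {
    (bytes.as_ptr() as usize) % 8 == 0
}

fn main() {
    let data = vec![0u8; 64];
    // Slicing at offset 1 shifts the start of the view by one byte, so even if the
    // allocation itself happened to be 8-byte aligned, data[1..] cannot be.
    println!("data[0..] aligned: {}", is_aligned_to_8(&data[0..]));
    println!("data[1..] aligned: {}", is_aligned_to_8(&data[1..]));
}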

How to mutate state?

Views are immutable. As I understood from https://davidkoloski.me/rkyv/architecture/deserialize.html, mutating works like this:

  1. zero-copy deserialize
  2. find the part of the object we want to mutate (ArchivedState) and take a pointer to it
  3. deserialize that part onto the heap, with copying, as a mutable SomePartOfState
  4. mutate it
  5. serialize it back through the pointer as ArchivedSomePartOfState

This looks a little complicated and low-level (for something that may need to be verified by auditors of smart contract code).

Ideal mutation API

Some wrappers (perhaps written by Solana) could simplify the steps above. Compare how easy this is with borsh (the borsh serialization format looks simple enough to traverse; at least I was lucky enough to hit the proper bits by calculating the layout in my head).

But I was thinking of traits generated for each Archived struct, so that once I have traversed to the needed view I could mutate any part of it via archive.some_part_of_state_as_mut(), which returns a struct with functions to mutate it.

djkoloski commented 3 years ago

For mutation, you can also use archived_value_mut / archived_unsized_value_mut to get a mutable reference to archived data. I'll add some more examples of its use in the future. It essentially gives you a Pin<&mut ArchivedState> that you can traverse and mutate. You may have to write some extra accessors to get the pinned fields:

#[derive(Archive, Serialize, Deserialize)]
pub struct AccountState {
    pub number: i32,
    pub string: String,
}

impl ArchivedAccountState {
    // i32 is Unpin, so we can choose `number` to have non-structural pinning
    // We could choose to have structural pinning instead if we wanted
    pub fn number_pin(self: Pin<&mut Self>) -> &mut i32 {
        unsafe { &mut self.get_unchecked_mut().number }
    }

    // ArchivedString is not Unpin, so `string` must have structural pinning
    pub fn string_pin(self: Pin<&mut Self>) -> Pin<&mut Archived<String>> {
        unsafe { self.map_unchecked_mut(|s| &mut s.string) }
    }
}

fn execute(account_info: &AccountInfo) {
    let mut data = account_info.data.try_borrow_mut().unwrap();
    let buf = &mut data[..];
    // Not sure where you'd store the root object position, but there are some
    // alternative ways to do this that don't require storing the data pos
    let pos = account_info.data.pos();

    let mut value = unsafe { archived_value_mut::<AccountState>(Pin::new(buf), pos) };
    // Reborrow with as_mut() so `value` can still be used after the mutation
    *value.as_mut().number_pin() = 42;
    println!("archived string: {}", value.as_ref().string);
}

I haven't tested this 100%, so my apologies if anything doesn't work out of the box.

If you use the mutable API for rkyv, you should always have structural pinning for fields that are not Unpin. I'm not 100% sure that's what you were looking for, but hopefully that helps. This isn't a general solution for mutations that change the sizes of dynamically allocated objects like strings and arrays, but it might work for some cases.

ryoqun commented 3 years ago

@djkoloski Hi, glad to see the maintainer of rkyv here. :) rkyv has been on my watch list for a while. If we can adapt rkyv in some way, it'll be huge. :)

This isn't a general solution for mutations that change the sizes of dynamically allocated objects like strings and arrays, but it might work for some cases.

I'm wondering whether we could realize this by combining it with Rust's unstable allocator_api (https://github.com/rust-lang/rust/issues/32838). My idea is that Archived could be type-parameterized with something like an ArchiveAllocator. It could then lend out an (mmap-backed) rkyv-controlled memory region that is basically convertible to rkyv's RelativePtr on collection re-allocation, etc. Some space inefficiency/fragmentation would be okay. Can this wild idea come true, in the eyes of its actual author?

This would then allow zero-copy, growable account.data serialization/deserialization for dynamic collections like HashMap and BTreeMap. Sounds cool. :)

djkoloski commented 3 years ago

@ryoqun Thank you for your kind words. This seems like something that could be possible with some work, but I also don't want to derail this issue too much. If this is something you're interested in, I definitely encourage you to write it up in an issue on the rkyv repo. 🙂

dzmitry-lahoda commented 3 years ago

@djkoloski thanks.

@jstarry it works for deeply layered structures: https://github.com/dzmitry-lahoda/solana-zero-copy-serde-example/tree/dz/zerocopy . borsh overflows the stack (see the comment there).

It seems possible to wrap rkyv in a nicer API (though I'm not sure it can be made safer).

I dislike the need to align to 8 bytes (it wastes up to 7 bytes per instruction input).

Also, because the structs are generated, rust-analyzer in VS Code is not very friendly while typing, but I hope those issues will be fixed in future Rust tooling.

Another issue: how can the memory layout be provided to external (non-Rust) consumers (could C headers be generated from the generated structs?)

djkoloski commented 3 years ago

Minor update, you should be able to use the archived_root helper functions to avoid length-prefixing your buffers now (as long as you can meet certain conditions).
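
For reference, a minimal sketch of that access pattern (written against the rkyv 0.7-style API, so exact names may differ by version; this AccountState is a simplified, hypothetical type):

use rkyv::{Archive, Deserialize, Serialize};

#[derive(Archive, Serialize, Deserialize)]
pub struct AccountState {
    pub number: i32,
}

fn read_number(bytes: &[u8]) -> i32 {
    // archived_root expects the root object at the end of the buffer, so no separately
    // stored position is needed; the buffer must still satisfy rkyv's alignment rules.
    let archived = unsafe { rkyv::archived_root::<AccountState>(bytes) };
    archived.number
}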

dzmitry-lahoda commented 3 years ago

other concerns:

jackcmay commented 3 years ago

@dmakarov Would you work up what would be necessary to increase the stack size?

jstarry commented 2 years ago

Getting some requests for an 8KB stack size; think we could prioritize this?

dmakarov commented 2 years ago

Yes, I'll work on this on Monday.

Lichtso commented 2 years ago

How about we drop the fixed stack frame size altogether, so that the reserved stack area can be utilized better? E.g., currently a program that wants to use more than one stack frame's worth of space at once will fail even if it has no recursion at all and there is enough space on the stack in total. The same goes for a program with deep recursion that only uses very small stack frames: it will hit the depth limit even though there is still enough space in total.

IIRC the only reason we use aligned stack frames is for easier debugging. Also, this would require programs to be able to write to the stack pointer to manage their stack allocations dynamically.

jackcmay commented 2 years ago

Fixed BPF stack frames are part of the BPF arch spec but not ideal for our use case for many reasons. We've agreed that a single stack region with a mutable stack pointer, like more modern architectures use, is the way we want to go.

The decision on when that happens is based on how far we want to diverge from BPF and when.

@dmakarov do you have an idea of what it would take to modify the BPF backend to make the switch from fixed-sized stack frame to a mutable stack pointer?

dmakarov commented 2 years ago

We'd have to implement the emitPrologue/emitEpilogue methods in the BPFFrameLowering class (currently these are empty functions). Maybe some other minor changes are needed, but other than that I don't anticipate big changes.

jackcmay commented 2 years ago

This kind of architectural change should coincide with our move from BPF to SBF as a target architecture. @dmakarov any objection to adding this to the list of changes that are needed when we do that move?

dmakarov commented 2 years ago

No objections. I think you're right.

jstarry commented 2 years ago

@jackcmay @dmakarov any updates here?

Lichtso commented 2 years ago

See the parent issue https://github.com/solana-labs/solana/issues/20323: "Variable stack pointer instead of fixed stack frames" is done. However, the entire SBFv2 project still needs testing and is not activated yet.

dmakarov commented 1 year ago

@alessandrod can we close this as you have enabled variable size stacks?

alessandrod commented 1 year ago

This is done but not available yet, since it's part of SBFv2 which hasn't been rolled out yet. Up to you if you want to close or leave open until rollout!

dmakarov commented 1 year ago

This is resolved by https://github.com/solana-labs/rbpf/pull/274. Closing this for now.