Solana BPF v2 - Githubissues

jackcmay commented 2 years ago

Problem

Solana Bytecode Format is a variation of eBPF. Some of the architectural requirements are different from eBPF and what Solana needs. A new version 2 is needed to move away from BPF and meet the specific needs of Solana

Proposed Solution

[x] #34258
- [ ] BTF ELF-section parsing
- [ ] Static type inference
- [ ] Instruction stream verification
[ ] Dynamic linking / dynamic dispatch
[ ] Decide about implicit vs explicit sign extension in 32bit ALU operations (https://github.com/solana-labs/solana/issues/32924)
[ ] ISA changes: #34250
[x] Reduce the overhead for the loader at runtime, by letting the linker concatenate sections to become consecutive in virtual address space (https://github.com/solana-labs/rbpf/pull/284 and https://github.com/solana-labs/rbpf/pull/301)
[x] Restrict the control flow:
- [x] Only symbols registered in the ELF or referenced by call are valid destinations for callx. (https://github.com/solana-labs/rbpf/pull/397 and https://github.com/solana-labs/rbpf/pull/439)
- [x] Functions can not start in the middle of a lddw instruction (https://github.com/solana-labs/rbpf/pull/440)
- [x] Jumps have to stay within a function, only call / callx can reach other functions. (https://github.com/solana-labs/rbpf/pull/454)
- [x] Forbids the fall-through at the end of a function, which now must be a unconditional-jump back into the function. (https://github.com/solana-labs/rbpf/pull/522)
[x] toolchain administration
- [x] New target triple SBFv2
- [x] New or featurized llvm backend
- [x] New tool cargo-build-sbf
- [x] Tooling changes to specify version (default to v1 for now, eventually to v2), eg. cargo build-sbf --arch=sbfv2

From a documentation perspective SBFv2 will still be referred to as "SBF" and when calling tools like cargo build-sbf a default arch will be provided, v1 initially, then a switch over to v2.

jackcmay commented 2 years ago

FYI @Lichtso @dmakarov

Lichtso commented 2 years ago

I have been thinking, that instead of adding a few new instructions and removing some others from BPF, it might be better to go with a portable-code approach directly.

We are breaking compatibility with BPF in a major anyway and we want to support other native architectures besides x86-64, such as ARMv8 AArch64 and maybe SPIR-V. These would require an entire data flow analysis and register allocation at least, because they are so different from x86-64 in their data-flow style and would not work as well as the current BPF to x86-64 conversion.

Furthermore, we could think a lot more about the final conversion steps and e.g. do things which BPF is not meant for like:

Staying in SSA or implicit operand addressing to simplify register allocation and avoid the need for data flow analysis
Staying in structured programming to avoid the need for control flow analysis
Variable length instruction encoding and text section compression
Prefetch-like hits for address translation
Hints for loop vectorization
Wide range of data encodings such as 128-, 256-, 512-bit values and vectors

dmakarov commented 2 years ago

I have been thinking, that instead of adding a few new instructions and removing some others from BPF, it might be better to go with a portable-code approach directly.

What does this mean in practical terms? What other changes to the BPF instruction set would you suggest (for now considering only supporting non-x86 ISAs)? Maybe some BPF instructions do not map directly to ARM instructions, but is it so radical that we need to redefine the BPF instruction set completely?

Lichtso commented 2 years ago

For the current BPF to x86-64 conversion it is really hard to do any optimizations, because most rely on static analysis that would not be necessary if that information was not thrown away by the previous compilation step.

For ARMv8 AArch64: It has almost double the registers (31) that x86-64 (16) and BPF (11) have. So not to waste them requires an entire reallocation of the registers and reorganization of the stack spilling. Again requiring a data flow and control flow analysis beforehand. All of this could also be skipped if we would not compile down to a specific register set in the first place.

For SPIR-V: The entire architecture works completely different. While you still program on what is seemingly a single thread, it is executed in bundles of 32 or 64 threads in lockstep. So control-flow-divergence has to be minimized. Also there are much more registers (thousands) for cheap context switching, but almost no cache in turn. Data is best touched only once.

Lichtso commented 2 years ago

What does this mean in practical terms? redefine the BPF instruction set completely?

What I suggest is not to fork & modify BPF any further but use something else entirely. So far I looked at WASM and GNU lightning. Designing our own from scratch is also possible.

dmakarov commented 2 years ago

We could change the BPF registers to unlimited number of virtual registers and do the register allocation at run-time. It would add some overhead, of course. I'm not sure it's possible to come up with an instruction set that would map to all real machine instruction sets equally well. Java bytecode is not quite a good example, and optimizing it works well for long running server applications.

SPIR-V also requires a different programming model. It seems one cannot take an arbitrary rust program and compile it for execution on SPIR-V.

dmakarov commented 2 years ago

What does this mean in practical terms? redefine the BPF instruction set completely?

What I suggest is not to fork & modify BPF any further but use something else entirely. So far I looked at WASM and GNU lightning. Designing our own from scratch is also possible.

I agree this is indeed something to think about.

Lichtso commented 2 years ago

SPIR-V also requires a different programming model. It seems one cannot take an arbitrary rust program and compile it for execution on SPIR-V.

Definitely, while there have been some funny trials such as rust-gpu, we would really need to think the "more parallelism" thing through from end-to-end throughout the entire ecosystem, not just the VM and ISA.

We could change the BPF registers to unlimited number of virtual registers and do the register allocation at run-time.

You mean as a form of SSA, right? What about an implicit operand addressing like a stack (FILO) or queue (FIFO) machine. These might even reach a better instruction encoding as they don't need to encode their destination operands.

I'm not sure it's possible to come up with an instruction set that would map to all real machine instruction sets equally well. Java bytecode is not quite a good example, and optimizing it works well for long running server applications.

That is the reason why I mentioned WASM. It has its shortcomings, but it works well over a wide range of hardware architectures.

dmakarov commented 2 years ago

You mean as a form of SSA, right? What about an implicit operand addressing like a stack (FILO) or queue (FIFO)

Yes, this essentially echoes your suggestion of staying in SSA. By stack machine you mean we would redefine BPF abstract machine to be a stack machine the same as Java VM? I guess it's a possibility, but then we really need a good run-time binary translator (in rbpf) to optimize the code. I'm not sure why you're so concerned with instruction encoding -- is it to reduce the code size of on-chain programs?

Lichtso commented 2 years ago

is it to reduce the code size of on-chain programs?

Yes, I think the BPF 8-byte fixed instruction size is terribly wasteful. It might not be a concern for the Linux kernel, for us however, program size is limited by account size.

would redefine BPF abstract machine to be a stack machine

The queue (FIFO) variant might also do the trick and be more straight forward as the instruction schedule can stay in the same order.

Let's say we have a SSA form like this:

...
%3824 = add %3818, %3821
%3825 = add %3815, %3823
%3826 = mul %3824, %3825

To convert to the queue machine representation you simply:

Remove the destination operand as that implicitly is defined by the instruction index.
Encode the source operands as relative to the instruction index of the next instruction.

...
add %5, %2
add %9, %1
mul %1, %0

Not only does it need no destination operands, but also the numeric value of the source operands is bounded. The instruction scheduler can achieve staying in the window of possible relative operand addresses with the same strategy as it uses for the window of possible short jumps.

jackcmay commented 2 years ago

There is room for both here, I think we should formally move away from "BPF" anyway. In the shorter term, the proposed changes described in this issue's description can provide benefits as well as pave the way for a more flexible evolution path for other SBF formats. Aka, SBFvX doesn't have to be anything like SBFv1 or 2.

jackcmay commented 2 years ago

@dmakarov How long do you think the last 4 items in the issue description would take to implement?

dmakarov commented 2 years ago

@dmakarov How long do you think the last 4 items in the issue description would take to implement?

The support for signed division I think we already have enabled in the llvm backend (i'll have to double check).

I started implementing the variable stack size and I believe the remaining work is in RBPF only. The same as now the programs do not manage the stack pointer, but RBPF does, I want to make RBPF manage both frame pointer and stack pointers. On the compiler side there will be only one additional instruction generated in function prologue and epilogue, that communicates the stack size to RBPF. The RBPF at run-time updates its internal data structures managing the frame and stack pointers, without exposing this to the executing program. Both frame and stack pointer are need to know the boundaries of the variable stack frame at all times. I estimate I would need another week to finish this work (but not this current week probably).

dmakarov commented 2 years ago

The two other items input buffers and callx restriction I haven’t looked at yet. I need to understand better what it would take. Restricting callx to known targets is probably simple. What to do if the target is not known? Issue an error? Removing input buffers instructions also does not take much time, but what should be instead of these instructions? Are they not used at all? Then why is it necessary to remove them? Maybe we can discuss this in more detail?

jackcmay commented 2 years ago

Removing those instructions cleans up the instruction set, turns them into invalid instructions, and removes the burden on the runtime to support these instructions.

seanyoung commented 2 years ago

We could change the BPF registers to unlimited number of virtual registers and do the register allocation at run-time. It would add some overhead, of course.

Register allocation is expensive to do, and that's one of the reasons why wasm isn't so great. I think we should tread carefully there.

The other way round, from BPF registers to virtual registers/ssa form is not so expensive. A single pass can create SSA form, and it's not expensive.

Lichtso commented 2 years ago

Register allocation is expensive

That depends on the quality you want. But for ahead-of-time, maximum performance I agree. Then again, in that case, we don't need to compile often so it can take more time.

A single pass can create SSA form, and it's not expensive.

We have that implemented here. Still does not solve the miss match in the number of registers between different native target architectures.

dmakarov commented 2 years ago

We could change the BPF registers to unlimited number of virtual registers and do the register allocation at run-time. It would add some overhead, of course.

Register allocation is expensive to do, and that's one of the reasons why wasm isn't so great. I think we should tread carefully there.

The other way round, from BPF registers to virtual registers/ssa form is not so expensive. A single pass can create SSA form, and it's not expensive.

My original response was to the problem of supporting multiple architectures with various number of available registers (x86, ARM, etc). Excuse me, but I don't quite understand the purpose of creating SSA form at run-time other than for simplifying JIT optimizations. It seems at that point we still would need to do the register allocation for the specific hardware registers of the machine the RBPF is running on. If this is the case, why add an extra step of converting limited number of BPF registers to SSA form at run-time?

seanyoung commented 2 years ago

My original response was to the problem of supporting multiple architectures with various number of available registers (x86, ARM, etc). Excuse me, but I don't quite understand the purpose of creating SSA form at run-time other than for simplifying JIT optimizations. It seems at that point we still would need to do the register allocation for the specific hardware registers of the machine the RBPF is running on. If this is the case, why add an extra step of converting limited number of BPF registers to SSA form at run-time?

I assume the SSA form is needed to support spir-v, that's all.

In order to justify doing register allocation at runtime (with the extra overhead), there would have to be sufficient register spilling in BPF code. I haven't seen this in the BPF code I've seen.

Lichtso commented 2 years ago

It is already buried in the depths of this conversation:

ARMv8 AArch64: It has almost double the registers (31) that x86-64 (16) and BPF (11) have.

The reason for register allocation is not spilling but simply utilizing all the registers ARM has, as we want to target that in the future as well.

seanyoung commented 2 years ago

If the BPF code is not spilling, it has no need for extra registers.

Lichtso commented 2 years ago

Do you have a representative body of BPF programs to analyze how much register pressure we currently have?

Even then, future programs could change everything.

seanyoung commented 2 years ago

I've been staring a lot at generated BPF code and I haven't seen a lot of spilling. The eBPF was designed explicitly to avoid the runtime register-allocation problem; changing this is should not be based on a hunch.

Lichtso commented 2 years ago

The eBPF was designed explicitly to avoid the runtime register-allocation problem

Keep in mind that ARM is RISC, it has no memory operands unlike x86-64.

changing this is should not be based on a hunch

That is certainly true. And I think we need a body of representative BPF programs to benchmark and analyze anyway if we want to design driven by our needs.

Lichtso commented 2 years ago

Support for signed division

It just occured to me that signed integer division would introduce another edge case: isize::MIN / -1 which overflows as the result is isize::MAX + 1.

seanyoung commented 2 years ago

Support for signed division

It just occured to me that signed integer division would introduce another edge case: isize::MIN / -1 which overflows as the result is isize::MAX + 1.

Absolutely. Now the code emitter needs two compares and branches before emitting idiv. div is already an expensive instruction (in cycles). This becomes even worse with signed div.

There is possibly a case to be made that the bpf code can deal with signed div like it has to do now.

Lichtso commented 2 years ago

We should try to get the linker to concatenate similar sections (such as all read-only sections) and place them consecutively in virtual address space, so that we don't need to allocate, copy and fill the gaps with zeros here: https://github.com/solana-labs/rbpf/blob/cd37c0bf61dcce22fa48b594005237eeb130af76/src/elf.rs#L430.

bernieblume commented 2 years ago

Very interesting discussion. Any update on that? Does anyone know the reason why the initial Solana design ended up being based on eBPF rather than WASM, given the fact that there are a bunch of other, pre-existing smart contract chains that are WASM-based. Was it just the Solana founders' familiarity with eBPF?

Lichtso commented 1 year ago

sub imm is kind of useless as it can be represented by add imm with a negative intermediate as well. However, swapping the operands so that the register is subtracted from the immediate would make sense.

neg reg could then also be easily replaced by sub 0 reg. Additionally, That would make overflow conditions and sign extensions more consistent.

Lichtso commented 1 year ago

I finished the task "Restrict the callx jump targets to only allow known symbols": https://github.com/solana-labs/rbpf/pull/397

However, there are some new ideas what to add to SBFv2:

BTF for type tracking at load time. Can potentially be used for skipping memory access checks at runtime.
Restricting jumps to their local function.
Removing le16, le32 and le64 (see https://github.com/solana-labs/rbpf/issues/354)
Removing sub imm or swapping the register and immediate operands roles.

Lichtso commented 1 year ago

In the last two weeks I spent some time refactoring, cleaning up and testing SBPFv2 in RBPF. During that I discovered multiple issues that we should consider changing. I think points 1, 3 and 4 should definitely be implemented before releasing SBPFv2 in program-runtime-v1. The rest is only required for program-runtime-v2 but IMO it would still make sense to add it now so that we can get more testing exposure.

Proposed Solution

In order to speed up and harden the ELF parsing we would ignore relocations and instead apply all of them in lld beforehand. Relocations may still be emitted for debugging purposes or future use. The issue with relocations is that they can modify any byte in the ELF, including the headers and other sections and thus influence the rest of the parsing process. Currently, if the output type is set to "shared" then lld outright refuses to --apply-dynamic-relocs and only supports --emit-relocs. This needs to be fixed.
In order to unify the target addressing scheme of call and callx and allow both to address internal and external functions, we would switch from relative addressing to hash keys for internal calls too, just as we did for external calls (see static syscalls).
In llvm the code emitter for the call instructions encoding would be changed:
- from src=1 to offset=0 for internal function calls
- from src=0 to offset>0 for external function calls

That way the offset and imm fields together form an 56 bit function address:

The MSB 16 bit (from offset) mark the program the address belongs to: 0 is self and everything higher is the index in the list of DT_NEEDED entries plus one.
The LSB 32 bit (from imm) are the hash key inside the program

Previously the function registry was the unified set of addresses which are relative call targets and relocation targets. With both of these gone, we would instead build the function registry directly from the dynamic symbol table. For this, lld should build the dynamic symbol table from local and global functions and nothing else.
In order to avoid having to hash all symbol names when parsing the ELFs, we would instead encode the hashed value in the name field of the symbols in binary directly. This would only apply to the dynamic symbol table and the normal symbol table would still use string names for debugging.
In order to link against external programs lld would emit a list of DT_NEEDED entries with base58 encoded pubkeys (instead of file names).

ripatel-fd commented 1 year ago

@Lichtso All of the above sounds great if we stick with ELF. I had a few thoughts as well:

we would switch from relative addressing to hash keys for internal calls too

This forces a sequential dependency on a memory load for every function call (to look up hash=>destination). Since function calls are quite common, this may worsen performance and trash caches.

Instead, I would go the other direction:

call: Internal function call, relative addressing via immediate
callx: Internal indirect function call, relative addressing via register
farcall (new opcode): External function call, hash addressing via immediate

we would instead encode the hashed value in the name field of the symbols in binary directly

Can you be more specific how this would work? Does the name field point to a hex representation of the hash? If both name and hash are specified, what is the behavior when the hash does not match the name?

DT_NEEDED entries with base58 encoded pubkeys

Perhaps we can save some performance using hex instead. A quick benchmark shows that the reference base58_decode_32 algorithm (used in bs58) takes about 1079ns, while fd_base58 takes 123ns. Not the end of the world, I guess.

Let's use this opportunity to specify strict validation rules for the number of entries of the .dynamic section and the order of these entries (unless this is unreasonably complicated). There should probably be some versioning mechanism for how external imports get resolved. We could use PT_INTERP to point to a dummy loader such as /lib/ld-solana-sbfv2.so.2.0.0.

Generally though, I would prefer a new Solana-specific object format instead of working around limitations in ELF. ELF lacks the features we need. I realize that the additional work required to make such a format may be impractical, but I want to at least give it a try.

Lichtso commented 1 year ago

This forces a sequential dependency on a memory load for every function call (to look up hash=>destination). Since function calls are quite common, this may worsen performance and trash caches.

Good point, at least for interpreters I guess.

Instead, I would go the other direction

We already do distinguish between what you refer to as call and farcall with the src field in the call instruction. Instead what we need is a way for dynamic calls (the callx instruction) to target external functions. How about the following slight modification, which would still unify call and callx instructions and avoid an additional lookup for internal function calls:

There is an 58 bit unified address space for all functions:

internal calls: 16 MSBs are zero, 32 LSBs are the absolute IP / PC
external calls: 16 MSBs are the DT_NEEDED index plus one, 32 LSBs are the hash key
call instruction encoding: 16 MSBs of the unified address in offset, 32 LSBs of the unified address in imm
callx register content: 16 MSBs fixed zeros followed by 58 LSBs of the unified address

Can you be more specific how this would work? Does the name field point to a hex representation of the hash? If both name and hash are specified, what is the behavior when the hash does not match the name?

We could either encode the u32 key directly where the index into the string table would be, or Hex-ASCII encode it as 9 characters (including NULL termination) in the string table. The later would be slightly less efficient because of the indirection but keep compatibility with other tooling such as disassemblers, readelf etc.

Perhaps we can save some performance using hex instead. A quick benchmark shows that the reference base58_decode_32 algorithm (used in bs58) takes about 1079ns, while fd_base58 takes 123ns. Not the end of the world, I guess.

Hex is fine too, base58 is just a bit nicer for debugging but probably not worth it.

Let's use this opportunity to specify strict validation rules. Generally though, I would prefer a new Solana-specific object format instead of working around limitations in ELF. ELF lacks the features we need. I realize that the additional work required to make such a format may be impractical, but I want to at least give it a try.

I think the consensus with our team ATM is that we stick with ELF to keep the tooling reach, but restrict it down as much as possible for ease of parsing.

ripatel-fd commented 1 year ago

@Lichtso Mocking a flat address space is an interesting solution. I wasn't aware that code might need to indirectly call into another object, so that makes sense. That considered, I agree with having both call+callx opcodes handle "near" and "far" calls, and prefer this solution over a mix of hashes and relative addresses.

external calls: 16 MSBs are the DT_NEEDED index plus one, 32 LSBs are the hash key

Using a hash as the lower bits of an address seems a bit exotic ... But not all that new in eBPF land, where low addresses map to kernel function calls.

The remaining issue is the use of binary/opaque identifiers where ELF only supports strings. Ideally, we would find a solution that works the same for symbol hashes and program addresses. Seems like we could either (1) specify a string encoding and use the ELF standard structures, or (2) roll our own custom dynamic imports table and call destination table. I begrudgingly prefer 1.

Lichtso commented 1 year ago

@alessandrod Pointed out to me, that we don't have to hack the symbol hashes into the symbol table or symbol strings directly. Instead we can use the DT_GNU_HASH section to define hash values for the symbols in the ELF with almost no overhead. He also recommended the paper Drepper, Ulrich. "How to write shared libraries." on the topic of linking and shared objects in ELF.

Also, instead of using hash keys in the 32 LSBs of a function pointer, we could use the index into the symbol table. This has the tremendous advantage that without hashing, there also can't be any collisions, which would otherwise be a problem at runtime in the function pointers. An implementation could either lookup the target IP/PC in the symbol table of the selected dependency or allocate and initialize a PLT (procedure linking table) at the beginning of a transaction and do lazy loading (doing the lookup once and then patching the PLT so the next time the same entry is used it becomes a single indirect call).

This approach would mandate that the order of exported symbols in a program stays stable over the course of recompilations / redeployments. During redeployment we have to compare the replacement of a program against its previous version anyway in order to make sure that the existing function signatures stay the same. Thus, this step could also be used to reorder the symbol table accordingly.

solana-labs / solana

Solana BPF v2 #20323

Problem

Proposed Solution

Proposed Solution