jackcmay opened this issue 2 years ago
FYI @Lichtso @dmakarov
I have been thinking that, instead of adding a few new instructions to BPF and removing some others, it might be better to go with a portable-code approach directly.
We are breaking compatibility with BPF in a major way anyway, and we want to support other native architectures besides x86-64, such as ARMv8 AArch64 and maybe SPIR-V. These would require at least an entire data-flow analysis and register allocation, because they differ so much from x86-64 in their data-flow style that the current BPF-to-x86-64 conversion would not work as well for them.
Furthermore, we could think a lot more about the final conversion steps and e.g. do things which BPF is not meant for like:
> I have been thinking, that instead of adding a few new instructions and removing some others from BPF, it might be better to go with a portable-code approach directly.
What does this mean in practical terms? What other changes to the BPF instruction set would you suggest (for now considering only supporting non-x86 ISAs)? Maybe some BPF instructions do not map directly to ARM instructions, but is it so radical that we need to redefine the BPF instruction set completely?
For the current BPF-to-x86-64 conversion it is really hard to do any optimizations, because most rely on static analysis which would not be necessary if that information had not been thrown away by the previous compilation step.
For ARMv8 AArch64: it has almost double the registers (31) of x86-64 (16) and BPF (11). Not wasting them requires a complete reallocation of the registers and a reorganization of the stack spilling, again requiring data-flow and control-flow analysis beforehand. All of this could also be skipped if we did not compile down to a specific register set in the first place.
For SPIR-V: the entire architecture works completely differently. While you seemingly still program a single thread, it is executed in bundles of 32 or 64 threads in lockstep, so control-flow divergence has to be minimized. There are also many more registers (thousands) for cheap context switching, but almost no cache in turn; data is best touched only once.
> What does this mean in practical terms? [...] redefine the BPF instruction set completely?
What I suggest is not to fork & modify BPF any further, but to use something else entirely. So far I have looked at WASM and GNU lightning. Designing our own from scratch is also possible.
We could change the BPF registers to an unlimited number of virtual registers and do the register allocation at run-time. It would add some overhead, of course. I'm not sure it's possible to come up with an instruction set that would map to all real machine instruction sets equally well. Java bytecode is not quite a good example, and optimizing it only works well for long-running server applications.
SPIR-V also requires a different programming model. It seems one cannot take an arbitrary Rust program and compile it for execution on SPIR-V.
> What does this mean in practical terms? [...] redefine the BPF instruction set completely?
> What I suggest is not to fork & modify BPF any further but use something else entirely. So far I looked at WASM and GNU lightning. Designing our own from scratch is also possible.
I agree this is indeed something to think about.
> SPIR-V also requires a different programming model. It seems one cannot take an arbitrary rust program and compile it for execution on SPIR-V.
Definitely. While there have been some funny trials such as rust-gpu, we would really need to think the "more parallelism" thing through end-to-end, throughout the entire ecosystem, not just the VM and ISA.
> We could change the BPF registers to unlimited number of virtual registers and do the register allocation at run-time.
You mean as a form of SSA, right? What about implicit operand addressing, like a stack (FILO) or queue (FIFO) machine? These might even achieve a better instruction encoding, as they don't need to encode their destination operands.
> I'm not sure it's possible to come up with an instruction set that would map to all real machine instruction sets equally well. Java bytecode is not quite a good example, and optimizing it works well for long running server applications.
That is the reason why I mentioned WASM. It has its shortcomings, but it works well over a wide range of hardware architectures.
> You mean as a form of SSA, right? What about an implicit operand addressing like a stack (FILO) or queue (FIFO) [...]
Yes, this essentially echoes your suggestion of staying in SSA. By stack machine, you mean we would redefine the BPF abstract machine to be a stack machine, the same as the Java VM? I guess it's a possibility, but then we really need a good run-time binary translator (in rbpf) to optimize the code. I'm not sure why you're so concerned with instruction encoding -- is it to reduce the code size of on-chain programs?
> is it to reduce the code size of on-chain programs?
Yes, I think the BPF 8-byte fixed instruction size is terribly wasteful. It might not be a concern for the Linux kernel; for us, however, program size is limited by account size.
> would redefine BPF abstract machine to be a stack machine
The queue (FIFO) variant might also do the trick and be more straightforward, as the instruction schedule can stay in the same order.
Let's say we have an SSA form like this:

```
...
%3824 = add %3818, %3821
%3825 = add %3815, %3823
%3826 = mul %3824, %3825
```
To convert to the queue machine representation you simply:

```
...
add %5, %2
add %9, %1
mul %1, %0
```
Not only does it need no destination operands, but the numeric value of the source operands is also bounded. The instruction scheduler can stay within the window of possible relative operand addresses with the same strategy it uses for the window of possible short jumps.
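The conversion above can be sketched as a tiny helper. This is only my reading of the example, under the assumption (not stated in the thread) that an operand encodes how many values were produced between its definition and the current instruction:

```rust
// Sketch: convert an SSA operand reference into the relative "distance"
// operand of the queue machine sketched above.
// Assumption: rel = current_index - defining_index - 1.
fn to_queue_operand(current: u32, defined_at: u32) -> u32 {
    current - defined_at - 1
}

fn main() {
    // %3824 = add %3818, %3821  ->  add %5, %2
    assert_eq!(to_queue_operand(3824, 3818), 5);
    assert_eq!(to_queue_operand(3824, 3821), 2);
    // %3825 = add %3815, %3823  ->  add %9, %1
    assert_eq!(to_queue_operand(3825, 3815), 9);
    assert_eq!(to_queue_operand(3825, 3823), 1);
    // %3826 = mul %3824, %3825  ->  mul %1, %0
    assert_eq!(to_queue_operand(3826, 3824), 1);
    assert_eq!(to_queue_operand(3826, 3825), 0);
}
```

This reproduces every operand in the example, which is why "the numeric value of the source operands is bounded": the distance only depends on how far back the value was defined.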
There is room for both here, I think we should formally move away from "BPF" anyway. In the shorter term, the proposed changes described in this issue's description can provide benefits as well as pave the way for a more flexible evolution path for other SBF formats. Aka, SBFvX doesn't have to be anything like SBFv1 or 2.
@dmakarov How long do you think the last 4 items in the issue description would take to implement?
> @dmakarov How long do you think the last 4 items in the issue description would take to implement?
The support for signed division I think we already have enabled in the llvm backend (I'll have to double-check).
I started implementing the variable stack size and I believe the remaining work is in RBPF only. Just as programs do not manage the stack pointer now but RBPF does, I want RBPF to manage both the frame pointer and the stack pointer. On the compiler side there will be only one additional instruction generated in the function prologue and epilogue, which communicates the stack size to RBPF. RBPF at run-time updates its internal data structures managing the frame and stack pointers, without exposing this to the executing program. Both frame and stack pointer are needed to know the boundaries of the variable stack frame at all times. I estimate I would need another week to finish this work (but probably not this current week).
The two other items, input buffers and the callx restriction, I haven't looked at yet. I need to understand better what they would take. Restricting callx to known targets is probably simple. What should we do if the target is not known? Issue an error? Removing the input buffer instructions also does not take much time, but what should replace them? Are they not used at all? Then why is it necessary to remove them? Maybe we can discuss this in more detail?
Removing those instructions cleans up the instruction set, turns them into invalid instructions, and removes the burden on the runtime to support these instructions.
> We could change the BPF registers to unlimited number of virtual registers and do the register allocation at run-time. It would add some overhead, of course.
Register allocation is expensive to do, and that's one of the reasons why wasm isn't so great. I think we should tread carefully there.
The other way round, from BPF registers to virtual registers / SSA form, is not so expensive: a single pass can create SSA form.
> Register allocation is expensive
That depends on the quality you want. But for ahead-of-time, maximum performance I agree. Then again, in that case, we don't need to compile often so it can take more time.
> A single pass can create SSA form, and it's not expensive.
We have that implemented here. It still does not solve the mismatch in the number of registers between different native target architectures.
> We could change the BPF registers to unlimited number of virtual registers and do the register allocation at run-time. It would add some overhead, of course.
> Register allocation is expensive to do, and that's one of the reasons why wasm isn't so great. I think we should tread carefully there.
> The other way round, from BPF registers to virtual registers/ssa form is not so expensive. A single pass can create SSA form, and it's not expensive.
My original response was to the problem of supporting multiple architectures with varying numbers of available registers (x86, ARM, etc.). Excuse me, but I don't quite understand the purpose of creating SSA form at run-time other than simplifying JIT optimizations. It seems at that point we would still need to do the register allocation for the specific hardware registers of the machine RBPF is running on. If that is the case, why add an extra step of converting the limited number of BPF registers to SSA form at run-time?
> My original response was to the problem of supporting multiple architectures with various number of available registers (x86, ARM, etc). Excuse me, but I don't quite understand the purpose of creating SSA form at run-time other than for simplifying JIT optimizations. It seems at that point we still would need to do the register allocation for the specific hardware registers of the machine the RBPF is running on. If this is the case, why add an extra step of converting limited number of BPF registers to SSA form at run-time?
I assume the SSA form is needed to support spir-v, that's all.
In order to justify doing register allocation at runtime (with the extra overhead), there would have to be sufficient register spilling in BPF code. I haven't observed this in the BPF code I've looked at.
It is already buried in the depths of this conversation:
> ARMv8 AArch64: It has almost double the registers (31) that x86-64 (16) and BPF (11) have.
The reason for register allocation is not spilling but simply utilizing all the registers ARM has, as we want to target that in the future as well.
If the BPF code is not spilling, it has no need for extra registers.
Do you have a representative body of BPF programs to analyze how much register pressure we currently have?
Even then, future programs could change everything.
I've been staring a lot at generated BPF code and I haven't seen a lot of spilling. eBPF was designed explicitly to avoid the runtime register-allocation problem; changing this should not be based on a hunch.
> The eBPF was designed explicitly to avoid the runtime register-allocation problem
Keep in mind that ARM is RISC, it has no memory operands unlike x86-64.
> changing this should not be based on a hunch
That is certainly true. And I think we need a body of representative BPF programs to benchmark and analyze anyway if we want to design driven by our needs.
> Support for signed division
It just occurred to me that signed integer division would introduce another edge case: `isize::MIN / -1`, which overflows, as the result would be `isize::MAX + 1`.
> Support for signed division
>
> It just occurred to me that signed integer division would introduce another edge case: `isize::MIN / -1`, which overflows, as the result would be `isize::MAX + 1`.
Absolutely. Now the code emitter needs two compares and branches before emitting `idiv`. `div` is already an expensive instruction (in cycles); this becomes even worse with signed `div`. There is possibly a case to be made that BPF code can keep dealing with signed division the way it has to now.
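To make the two guards concrete, here is a hedged Rust sketch of the checks a JIT would have to emit before a native `idiv`: divide-by-zero and the `MIN / -1` overflow case (the VM's actual error handling is not shown; Rust's own `checked_div` encodes the same rule):

```rust
// Sketch of the guards needed before emitting a native signed division.
// On x86-64, idiv raises #DE both for division by zero and for
// i64::MIN / -1, whose mathematical result (i64::MAX + 1) does not fit.
fn checked_sdiv(dividend: i64, divisor: i64) -> Option<i64> {
    if divisor == 0 {
        return None; // guard 1: division by zero
    }
    if dividend == i64::MIN && divisor == -1 {
        return None; // guard 2: the overflow edge case discussed above
    }
    Some(dividend / divisor)
}

fn main() {
    assert_eq!(checked_sdiv(-6, 3), Some(-2));
    assert_eq!(checked_sdiv(1, 0), None);
    assert_eq!(checked_sdiv(i64::MIN, -1), None);
    // Rust's standard library applies the same rule:
    assert_eq!(i64::MIN.checked_div(-1), None);
}
```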
We should try to get the linker to concatenate similar sections (such as all read-only sections) and place them consecutively in virtual address space, so that we don't need to allocate, copy and fill the gaps with zeros here: https://github.com/solana-labs/rbpf/blob/cd37c0bf61dcce22fa48b594005237eeb130af76/src/elf.rs#L430.
Very interesting discussion. Any update on that? Does anyone know the reason why the initial Solana design ended up being based on eBPF rather than WASM, given the fact that there are a bunch of other, pre-existing smart contract chains that are WASM-based. Was it just the Solana founders' familiarity with eBPF?
`sub imm` is kind of useless, as it can be represented by `add imm` with a negative immediate as well. However, swapping the operands so that the register is subtracted from the immediate would make sense. `neg reg` could then also easily be replaced by `sub 0, reg`. Additionally, that would make overflow conditions and sign extensions more consistent.
I finished the task "Restrict the callx jump targets to only allow known symbols": https://github.com/solana-labs/rbpf/pull/397
However, there are some new ideas for what to add to SBFv2:

- removing `le16`, `le32` and `le64` (see https://github.com/solana-labs/rbpf/issues/354)
- removing `sub imm`, or swapping the register and immediate operand roles

In the last two weeks I spent some time refactoring, cleaning up and testing SBPFv2 in RBPF. During that I discovered multiple issues that we should consider addressing. I think points 1, 3 and 4 should definitely be implemented before releasing SBPFv2 in program-runtime-v1. The rest is only required for program-runtime-v2, but IMO it would still make sense to add it now so that we can get more testing exposure.
In order to speed up and harden the ELF parsing, we would ignore relocations and instead apply all of them in `lld` beforehand. Relocations may still be emitted for debugging purposes or future use. The issue with relocations is that they can modify any byte in the ELF, including the headers and other sections, and thus influence the rest of the parsing process.
Currently, if the output type is set to "shared", then `lld` outright refuses to `--apply-dynamic-relocs` and only supports `--emit-relocs`. This needs to be fixed.
In order to unify the target addressing scheme of `call` and `callx` and allow both to address internal and external functions, we would switch from relative addressing to hash keys for internal calls too, just as we did for external calls (see static syscalls).
In `llvm` the code emitter for the `call` instruction encoding would be changed:

- from `src=1` to `offset=0` for internal function calls
- from `src=0` to `offset>0` for external function calls

That way the `offset` and `imm` fields together form a 48-bit function address:

- the 16 MSBs (`offset`) mark the program the address belongs to: `0` is self, and everything higher is the index in the list of `DT_NEEDED` entries plus one
- the 32 LSBs (`imm`) are the hash key inside the program

Previously the function registry was the unified set of addresses which are relative call targets and relocation targets. With both of these gone, we would instead build the function registry directly from the dynamic symbol table. For this, `lld` should build the dynamic symbol table from local and global functions and nothing else.
In order to avoid having to hash all symbol names when parsing the ELFs, we would instead encode the hashed value directly in binary in the name field of the symbols. This would only apply to the dynamic symbol table; the normal symbol table would still use string names for debugging.
In order to link against external programs, `lld` would emit a list of `DT_NEEDED` entries with base58-encoded pubkeys (instead of file names).
@Lichtso All of the above sounds great if we stick with ELF. I had a few thoughts as well:
> we would switch from relative addressing to hash keys for internal calls too
This forces a sequential dependency on a memory load for every function call (to look up hash => destination). Since function calls are quite common, this may worsen performance and thrash caches.
Instead, I would go the other direction:
> we would instead encode the hashed value in the name field of the symbols in binary directly
Can you be more specific how this would work? Does the name field point to a hex representation of the hash? If both name and hash are specified, what is the behavior when the hash does not match the name?
> DT_NEEDED entries with base58 encoded pubkeys
Perhaps we can save some performance using hex instead. A quick benchmark shows that the reference base58_decode_32 algorithm (used in bs58) takes about 1079ns, while fd_base58 takes 123ns. Not the end of the world, I guess.
Let's use this opportunity to specify strict validation rules for the number of entries of the `.dynamic` section and the order of these entries (unless this is unreasonably complicated). There should probably be some versioning mechanism for how external imports get resolved. We could use `PT_INTERP` to point to a dummy loader such as `/lib/ld-solana-sbfv2.so.2.0.0`.
Generally though, I would prefer a new Solana-specific object format instead of working around limitations in ELF. ELF lacks the features we need. I realize that the additional work required to make such a format may be impractical, but I want to at least give it a try.
> This forces a sequential dependency on a memory load for every function call (to look up hash=>destination). Since function calls are quite common, this may worsen performance and trash caches.
Good point, at least for interpreters I guess.
> Instead, I would go the other direction
We already do distinguish between what you refer to as call and farcall with the `src` field in the `call` instruction. Instead, what we need is a way for dynamic calls (the `callx` instruction) to target external functions. How about the following slight modification, which would still unify the `call` and `callx` instructions and avoid an additional lookup for internal function calls:
There is a 48-bit unified address space for all functions:

- internal calls: 16 MSBs are zero, 32 LSBs are the absolute IP / PC
- external calls: 16 MSBs are the `DT_NEEDED` index plus one, 32 LSBs are the hash key
- `call` instruction encoding: 16 MSBs of the unified address in `offset`, 32 LSBs of the unified address in `imm`
- `callx` register content: 16 MSBs fixed zeros, followed by the 48 LSBs of the unified address
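A sketch of this layout (function and field names are mine, assuming a plain split into a 16-bit program selector and a 32-bit target):

```rust
// Sketch of the proposed unified function address: the upper 16 bits
// select the program (0 = self, n > 0 = DT_NEEDED index n - 1), and the
// lower 32 bits are either an absolute IP (internal) or a hash key
// (external).
fn encode(program: u16, target: u32) -> u64 {
    ((program as u64) << 32) | target as u64
}

fn is_internal(addr: u64) -> bool {
    addr >> 32 == 0
}

fn decode(addr: u64) -> (u16, u32) {
    ((addr >> 32) as u16, addr as u32)
}

fn main() {
    // Internal call: program 0 ("self"), absolute IP 0x1234.
    let internal = encode(0, 0x1234);
    assert!(is_internal(internal));
    assert_eq!(decode(internal), (0, 0x1234));

    // External call: second DT_NEEDED entry (index 1 plus one), hash key.
    let external = encode(2, 0xdead_beef);
    assert!(!is_internal(external));
    assert_eq!(decode(external), (2, 0xdead_beef));
}
```

Since the top 16 bits of a `callx` register are fixed zeros, the same 64-bit register value can carry either kind of target without ambiguity.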
> Can you be more specific how this would work? Does the name field point to a hex representation of the hash? If both name and hash are specified, what is the behavior when the hash does not match the name?
We could either encode the `u32` key directly where the index into the string table would be, or Hex-ASCII encode it as 9 characters (including `NULL` termination) in the string table. The latter would be slightly less efficient because of the indirection, but would keep compatibility with other tooling such as disassemblers, readelf, etc.
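The Hex-ASCII variant could look like this (a sketch; the helper names are hypothetical): a `u32` key rendered as 8 hex characters plus the NUL terminator, i.e. 9 bytes in the string table:

```rust
// Sketch: a u32 hash key as a 9-byte (8 hex chars + NUL) string-table entry.
fn encode_symbol_name(key: u32) -> Vec<u8> {
    let mut name = format!("{:08x}", key).into_bytes();
    name.push(0); // NUL terminator, as in any ELF string table entry
    name
}

fn decode_symbol_name(bytes: &[u8]) -> Option<u32> {
    let hex = std::str::from_utf8(bytes.strip_suffix(&[0])?).ok()?;
    u32::from_str_radix(hex, 16).ok()
}

fn main() {
    let name = encode_symbol_name(0xdead_beef);
    assert_eq!(name.len(), 9);
    assert_eq!(&name[..8], b"deadbeef");
    assert_eq!(decode_symbol_name(&name), Some(0xdead_beef));
}
```

Because the entry is plain printable ASCII, `readelf` and disassemblers would display it as an ordinary (if cryptic) symbol name.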
> Perhaps we can save some performance using hex instead. A quick benchmark shows that the reference base58_decode_32 algorithm (used in bs58) takes about 1079ns, while fd_base58 takes 123ns. Not the end of the world, I guess.
Hex is fine too, base58 is just a bit nicer for debugging but probably not worth it.
> Let's use this opportunity to specify strict validation rules. [...] Generally though, I would prefer a new Solana-specific object format instead of working around limitations in ELF. ELF lacks the features we need. I realize that the additional work required to make such a format may be impractical, but I want to at least give it a try.
I think the consensus with our team ATM is that we stick with ELF to keep the tooling reach, but restrict it down as much as possible for ease of parsing.
@Lichtso Mocking a flat address space is an interesting solution. I wasn't aware that code might need to indirectly call into another object, so that makes sense. That considered, I agree with having both call+callx opcodes handle "near" and "far" calls, and prefer this solution over a mix of hashes and relative addresses.
> external calls: 16 MSBs are the DT_NEEDED index plus one, 32 LSBs are the hash key
Using a hash as the lower bits of an address seems a bit exotic ... But not all that new in eBPF land, where low addresses map to kernel function calls.
The remaining issue is the use of binary/opaque identifiers where ELF only supports strings. Ideally, we would find a solution that works the same for symbol hashes and program addresses. Seems like we could either (1) specify a string encoding and use the ELF standard structures, or (2) roll our own custom dynamic imports table and call destination table. I begrudgingly prefer 1.
@alessandrod pointed out to me that we don't have to hack the symbol hashes into the symbol table or symbol strings directly. Instead we can use the `DT_GNU_HASH` section to define hash values for the symbols in the ELF with almost no overhead. He also recommended the paper Drepper, Ulrich, "How to Write Shared Libraries", on the topic of linking and shared objects in ELF.
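For reference, the hash used by `DT_GNU_HASH` is the well-known dl_new_hash (`h = h * 33 + c`, seeded with 5381):

```rust
// The GNU hash function used by the DT_GNU_HASH section.
fn gnu_hash(name: &[u8]) -> u32 {
    name.iter()
        .fold(5381u32, |h, &c| h.wrapping_mul(33).wrapping_add(c as u32))
}

fn main() {
    // Well-known reference values for this hash:
    assert_eq!(gnu_hash(b""), 0x0000_1505);
    assert_eq!(gnu_hash(b"printf"), 0x156b_2bb8);
}
```

Note that `DT_GNU_HASH` stores these values to speed up lookups of string-named symbols; using it to carry our own keys would be a repurposing, not its standard role.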
Also, instead of using hash keys in the 32 LSBs of a function pointer, we could use the index into the symbol table. This has the tremendous advantage that, without hashing, there also can't be any collisions, which would otherwise be a problem in the function pointers at runtime. An implementation could either look up the target IP/PC in the symbol table of the selected dependency, or allocate and initialize a PLT (procedure linkage table) at the beginning of a transaction and do lazy loading (doing the lookup once and then patching the PLT, so that the next time the same entry is used it becomes a single indirect call).
This approach would mandate that the order of exported symbols in a program stays stable over the course of recompilations / redeployments. During redeployment we have to compare the replacement of a program against its previous version anyway in order to make sure that the existing function signatures stay the same. Thus, this step could also be used to reorder the symbol table accordingly.
Problem
The Solana Bytecode Format is a variation of eBPF. Some of eBPF's architectural choices differ from what Solana needs. A new version 2 is needed to move away from BPF and meet Solana's specific requirements.
Proposed Solution
- Only known function symbols (registered `call` targets) are valid destinations for `callx`. (https://github.com/solana-labs/rbpf/pull/397 and https://github.com/solana-labs/rbpf/pull/439)
- Remove the `lddw` instruction (https://github.com/solana-labs/rbpf/pull/440)
- `call`/`callx` can reach other functions. (https://github.com/solana-labs/rbpf/pull/454)
- `cargo-build-sbf`: `cargo build-sbf --arch=sbfv2`

From a documentation perspective, SBFv2 will still be referred to as "SBF", and when calling tools like `cargo build-sbf` a default arch will be provided: v1 initially, then a switch over to v2.