solana-labs / solana

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://solanalabs.com
Apache License 2.0
12.89k stars 4.11k forks

An extensible `NativeVM` that fuses native operations, reducing context-switch overhead #18465

Closed jon-chuang closed 1 year ago

jon-chuang commented 3 years ago

Motivation

The motivation for this discussion stems from several requests to include native code for compute-intensive operations, such as cryptographic operations. https://github.com/solana-labs/solana/pull/17393 https://github.com/solana-labs/solana/issues/17991

The major difficulties lie in how such application-specific operations should be maintained. There is a worry that introducing a large dependency would lead to vulnerabilities.

One option to get around large external dependencies while achieving performance targets would be to allow users to compose their application-specific code out of a set of core primitives, such as bignum and field operations.

Although this is not as pleasant as compiling an existing Rust library directly into BPF, it's the limitation we have, given that BPF compilation is not highly optimised.

The alternative is to expose "intrinsics". However, the method to do so via BPF is unclear. A lightweight syscall that directly modifies the BPF AlignedMemory in place could be the answer here. However, as far as I can tell, syscalls are far from "lightweight".

In the following, we describe the alternate approach, which takes the idea of intrinsics further: abstracting intrinsics into serialized opcodes that are handled by a separate NativeVM. The problem with this approach, compared with the intrinsics approach, is that it is difficult to add control flow and to mix BPF with native instructions.

An extensible NativeVM

The VM should encompass any optimised x86 operation, handled as a NativeVM program. The NativeVM composes native x86 operations and uses a byte array to store inputs, outputs, and intermediate results.

The native VM takes as input two raw byte slices (&[u8]).

  1. a data buffer consisting of the serialized inputs and outputs of the native operations, as well as a buffer to store intermediate results.
  2. a program containing a series of opcodes and operand offsets, the construction of which is inlined into calling programs via API functions that resolve the opcodes.

However, outside of the primitive operations themselves, there is no control flow: NativeVM programs are linear sequences of primitive opcodes.

There is no write protection for the single-buffer memory. This means it is the caller's responsibility to ensure that the opcodes and offsets are serialized into the opcode buffer correctly.
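To make the execution model concrete, here is a minimal, self-contained sketch of a linear opcode interpreter over a single shared byte buffer. The `AddU64`/`MulU64` opcodes and all names here are illustrative stand-ins, not proposed opcodes:

```rust
// Each opcode names an operation plus operand *offsets* into one shared
// buffer; execution is a straight-line walk with no control flow.

#[derive(Clone, Copy)]
enum Opcode {
    // add the u64 at `a` to the u64 at `b`, store the result at `dst`
    AddU64 { a: usize, b: usize, dst: usize },
    // multiply likewise
    MulU64 { a: usize, b: usize, dst: usize },
}

fn read_u64(buf: &[u8], off: usize) -> u64 {
    u64::from_le_bytes(buf[off..off + 8].try_into().unwrap())
}

fn write_u64(buf: &mut [u8], off: usize, v: u64) {
    buf[off..off + 8].copy_from_slice(&v.to_le_bytes());
}

// Opcodes execute once, in order, mutating the buffer in place. There is
// no write protection: a bad offset silently clobbers other operands.
fn run(program: &[Opcode], buf: &mut [u8]) {
    for op in program {
        match *op {
            Opcode::AddU64 { a, b, dst } => {
                let v = read_u64(buf, a).wrapping_add(read_u64(buf, b));
                write_u64(buf, dst, v);
            }
            Opcode::MulU64 { a, b, dst } => {
                let v = read_u64(buf, a).wrapping_mul(read_u64(buf, b));
                write_u64(buf, dst, v);
            }
        }
    }
}
```

The point of the sketch is the shape of the design: the caller lays out operands in the buffer, the program is just a flat list of (opcode, offsets), and results are read back out of the same buffer afterwards.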

Here is an example of how to construct a NativeVM syscall:

    fn my_bignum_op(x: BigNum<8>, y: BigNum<8>) -> BigNum<24> {
        // two 8-limb inputs plus one 24-limb output = 40 limbs in total
        let mut data_buffer = Vec::with_capacity(size_of::<BigNum<8>>() * 5);

        data_buffer.extend_from_slice(x.to_bytes());
        data_buffer.extend_from_slice(y.to_bytes());
        // reserve zeroed space for the 24-limb output
        data_buffer.resize(size_of::<BigNum<8>>() * 5, 0);

        invoke_native_vm(&mut data_buffer, native_program.key)?;

        let offset = size_of::<BigNum<8>>() * 2;
        let result: BigNum<24> = data_buffer[offset..offset + size_of::<BigNum<24>>()].into();

        result
    }

Here is how the opcodes ought to be constructed:

    use native_vm::bignum::inline_opcodes as bignum_opcodes;
    let mut ops = OpcodeBuffer::new();
    let mem = MemoryRegion::new();
    let x = mem.add::<BigNum<8>>();
    let y = mem.add::<BigNum<8>>();
    let z = mem.add::<BigNum<24>>();

    ops.add_mem(&mem);
    ops.add_opcode(bignum_opcodes::mul(x, y, z)?);
    ops.add_opcode(bignum_opcodes::mul_in_place(z, x)?);
    ops.add_opcode(bignum_opcodes::add_in_place(z, y)?);

    let data = ops.to_program_data();
    deploy_account(data)

The opcode data could be serialized into the transaction data, constructed on the fly, inlined into the calling program, or live in a designated program account.

Error handling will be done on a per-opcode basis. Errors will be defined per module and live in a u64 namespace, with each module occupying one band (defined by the first 6 bytes, say).
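One way the banded u64 namespace could work is to pack the module identifier into the high bytes and the per-module error code into the low bytes. A hypothetical sketch (the 6-byte/2-byte split and all names are illustrative):

```rust
// Per-module error bands in a u64 namespace: the top 6 bytes identify
// the module (the "band"), the low 2 bytes identify the error within it.

const BAND_SHIFT: u32 = 16; // low 2 bytes hold the per-module error code

fn make_error(module_band: u64, code: u16) -> u64 {
    (module_band << BAND_SHIFT) | code as u64
}

fn error_band(err: u64) -> u64 {
    err >> BAND_SHIFT
}

fn error_code(err: u64) -> u16 {
    (err & 0xFFFF) as u16
}
```

This lets the runtime route an error back to the module that defined it without any central error registry: each module only needs a unique band.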


Here is another application, performing a Monte Carlo simulation for options pricing:

    use native_vm::option_pricing::inline_opcodes as options_opcodes;
    let mut ops = OpcodeBuffer::new();
    let mem = MemoryRegion::new();
    let x = mem.add::<f64>();
    let seed = mem.add::<[u8; 32]>();
    let res = mem.add::<[f64; 6]>();

    ops.add_mem(&mem);
    for i in 0..6 {
        ops.add_opcode(options_opcodes::black_scholes(x, seed, res[i])?);
    }

    let data = ops.to_program_data();
    deploy_account(data)

Just like rbpf, the NativeVM ought to live in a separate repo, away from the main hustle and bustle. Upgrades to the NativeVM, in terms of the opcode set etc., ought to be feature-gated and carefully considered.

CC: @jackcmay @Lichtso @dmakarov

dmakarov commented 3 years ago

An intrinsic call is a call the VM recognizes and handles specially, e.g. by using a HW-accelerated instruction sequence to implement the semantics of the intrinsic function. Currently, when RBPF encounters an unresolved call, it tries to reroute it to the runtime system as a syscall. It could be extended to recognize intrinsic calls. In general, in compiler/VM design, intrinsic functions are a relatively inexpensive way to expose hardware features without changing the instruction set a VM must be able to execute or a compiler backend has to support. I'm not following what the advantages of a NativeVM are over adding intrinsics to RBPF.
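The dispatch described here can be sketched in a few lines: an unresolved call identifier is first checked against a table of registered intrinsics, and only falls back to the syscall path if no intrinsic matches. This is a hypothetical illustration of the idea, not RBPF's actual resolution code:

```rust
use std::collections::HashMap;

// An intrinsic is handled natively by the VM, operating on VM memory.
type Intrinsic = fn(&mut [u8]);

enum Resolution {
    Intrinsic(Intrinsic), // handled via a HW-accelerated native sequence
    Syscall,              // rerouted to the runtime system
}

struct CallResolver {
    // keyed by the (hashed) symbol of the unresolved call
    intrinsics: HashMap<u32, Intrinsic>,
}

impl CallResolver {
    fn resolve(&self, call_hash: u32) -> Resolution {
        match self.intrinsics.get(&call_hash) {
            Some(&f) => Resolution::Intrinsic(f),
            None => Resolution::Syscall,
        }
    }
}
```

Because RBPF translates ahead of time (see the next comment), this lookup could happen once during translation rather than on every call at run time.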

dmakarov commented 3 years ago

Moreover, in the case of RBPF there are no dynamic features in the execution environment, as is the case for the JVM, for example. The "JIT" in RBPF is misleading nomenclature; it is an AOT compiler. All intrinsic calls could be resolved during the AOT translation, before program execution starts.

jon-chuang commented 3 years ago

@dmakarov I agree, intrinsics seem the better option.

I was thinking of the fact that bignum intrinsics might have to operate over byte arrays, but even things like AVX already use special "vector" registers. On the other hand, a bignum intrinsic could simply take as input a pointer into the VM memory.

I have to say I'm still a bit worried about the intrinsics direction, as there are many primitive operations in cryptography besides bignum, such as finite fields.

Further, it's not clear that optimised finite field operations alone will be enough to close the gap between native and BPF for complex primitives like elliptic curves and pairings.


I like what you said about resolving ahead of time.

However, for truly optimised performance in cryptography one typically wants the operations to be inlined.

I'm not sure if the intrinsic functions can be inlined in the emitted x86.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.