Open maxgillett opened 1 year ago
Should we define both `STOREI16` (low) and `STOREh16` (high) to be able to write to any byte of a word?
I guess we can start by sorting out the calling convention and the related code in `TargetLowering`.
That is to say, successfully compiling this snippet may be one of the items in the first milestone:
define i16 @main(i32 %a, i16 %b, i8 %c) {
ret i16 %b
}
I think this LLVM backend can be done in a way similar to WebAssembly, something like https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/WebAssembly/WebAssemblyReplacePhysRegs.cpp
> I guess we can start by sorting out the calling convention and the related code in `TargetLowering`. That is to say, successfully compiling this snippet may be one of the items in the first milestone:
> define i16 @main(i32 %a, i16 %b, i8 %c) { ret i16 %b }
Yes, I think that's a good plan. I started working towards this goal by implementing call lowering in GlobalISel. I still need to implement `CallLowering::OutgoingValueHandler`. IR translation is working for lowering formal arguments into the stack, but legalization and instruction selection are not complete (see issues #3 and #4).
> I think this LLVM backend can be done in a way similar to WebAssembly, something like https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/WebAssembly/WebAssemblyReplacePhysRegs.cpp
Yes, I think there are similarities. I think the problem simplifies if using GlobalISel (which WebAssembly does not use), because we can match instructions directly to virtual registers in that framework, and never have to allocate physical registers (edit: I might be wrong about this. It seems you do need to assign virtual registers to a physical register bank first, but like you pointed out, this can probably still be replaced with a virtual register.)
The implemented `READ_ADVICE` and `WRITE_ADVICE` instructions are missing from the spec.
Note that the spec below may be subject to change, and includes contributions made by @clvv and @dlubarov.
Last updated: 6/21/2023
Architecture
A Valida zkVM consists of a CPU and several coprocessors, which are connected with communication buses. A basic example of a machine layout, omitting some standard chips for simplicity, would be
Communication buses are implemented using the logarithmic derivative lookup argument, and are multiplexed for efficiency (i.e. CPU interactions with multiple chips may share the same bus).
There are multiple VM configurations. The "Core" configuration is always present, and provides instructions for basic control flow and memory access. Additional configurations, such as "Field Arithmetic" or "Additional Jump", build upon the core configuration and offer additional instructions.
Instruction format
Instructions are encoded in groups of 6 field elements. The first element in the group contains the opcode, followed by three elements representing the operands and two immediate value flags: $\text{opcode}, \text{op}_a$, $\text{op}_b$, $\text{op}_c$, $\text{imm}_b$, $\text{imm}_c$.
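As a rough illustration of the 6-element layout described above, the following Python sketch flattens an instruction into field elements. The opcode number, the `Instruction` type, and the choice to map negative offsets to their representative modulo $p = 2^{31} - 1$ are assumptions made for the example, not part of the spec.

```python
from typing import NamedTuple

P = 2**31 - 1  # field modulus taken from the "Field choice" design note


class Instruction(NamedTuple):
    """Hypothetical container mirroring opcode, op_a, op_b, op_c, imm_b, imm_c."""
    opcode: int
    op_a: int
    op_b: int
    op_c: int
    imm_b: int  # 1 if op_b is an immediate value, else 0
    imm_c: int  # 1 if op_c is an immediate value, else 0


def encode(inst: Instruction) -> list:
    """Flatten one instruction into its group of 6 field elements."""
    if inst.imm_b not in (0, 1) or inst.imm_c not in (0, 1):
        raise ValueError("immediate flags must be 0 or 1")
    # Assumption: a negative offset such as -4 is stored as p - 4.
    return [x % P for x in inst]


def encode_program(program: list) -> list:
    """Concatenate the encoded groups, one group of 6 per instruction."""
    rom = []
    for inst in program:
        rom.extend(encode(inst))
    return rom
```

With this layout, instruction `i` of a program occupies elements `6*i` through `6*i + 5` of the flattened list.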
Program ROM
Our VM operates under the Harvard architecture, where program code is stored separately from main memory. Code is addressed by any field element, starting from $0$. The program counter `pc` stores the location (a field element) of the instruction that is being executed.
Memory
Memory is composed of word-addressable cells. A given cell contains 4 field elements, each of which is typically used to store a single byte (arbitrary field elements can also be stored). All core and ALU-related instructions operate on cells (i.e. any operand address is word-aligned: a multiple of 4). In the VM compiler, the address of newly added local variables in the stack is word-aligned.
For example, a U32 is represented in memory by its byte decomposition (4 elements). To initialize a U32 from an immediate value, we use the `SETL16` instruction (see the complete instruction list below), which sets the first two bytes in memory. To initialize a U32 value greater than 16 bits, we can also call the `SETH16` instruction to set the upper two bytes.
Immediate Values
Our VM cannot represent operand values that are greater than the prime $p$, and cannot distinguish between $0$ and $p$. Therefore, any immediate values greater than or equal to $p$ need to be expanded into smaller values.
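To make the byte decomposition and the 16-bit split concrete, here is a small Python sketch. It assumes little-endian byte order (the "first two bytes" are the low half) and uses $p = 2^{31} - 1$ from the "Field choice" note; the helper names are hypothetical.

```python
P = 2**31 - 1  # field modulus taken from the "Field choice" design note


def u32_to_cell(value: int) -> list:
    """Decompose a U32 into the 4 byte-sized elements of one memory cell.

    Assumption: bytes are stored little-endian (lowest byte first).
    """
    if not 0 <= value < 2**32:
        raise ValueError("value does not fit in 32 bits")
    return [(value >> (8 * i)) & 0xFF for i in range(4)]


def split_halves(value: int) -> tuple:
    """Split a U32 into (low 16 bits, high 16 bits), one half per instruction."""
    if not 0 <= value < 2**32:
        raise ValueError("value does not fit in 32 bits")
    return value & 0xFFFF, value >> 16


def fits_as_operand(value: int) -> bool:
    """An operand must be a canonical field element, i.e. strictly below p."""
    return 0 <= value < P
```

For instance, `0xFFFFFFFF` is not a valid operand on its own, but both of its 16-bit halves are.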
Registers
Our zkVM does not operate on general purpose registers. Instead, instructions refer to variables local to the call frame, i.e. relative to the current frame pointer `fp`.
Notation
The following notation is used throughout this document:
- Operand values: `opa`, `opb`, `opc` denote the value encoded in the operand a, b, or c of the current instruction.
- CPU registers: `fp`, `pc` denote the value of the current frame pointer and program counter, respectively.
- Relative addressing: `[a]` denotes the cell value at address `a` offset from `fp`, i.e. `fp + a`. Variables local to the call frame are denoted in this form. Note that we are omitting `fp` in the expression here, but that the first dereference of an operand is always relative to the frame pointer.
- Absolute addressing: `[[a]]` denotes the cell value at absolute address `[a]`. Heap-allocated values are denoted in this form.

To refer to relative or absolute element values, we use the notation $[a]\text{elem}$ or $[[a]]\text{elem}$ respectively.
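The two addressing forms can be sketched in Python as follows. This is only an illustration: the `Memory` class, the rule that unwritten cells read as zeros, and the little-endian interpretation of a cell as a pointer are all assumptions of the example.

```python
WORD = 4  # elements per cell, as described in the Memory section


class Memory:
    """Hypothetical cell store keyed by word-aligned address."""

    def __init__(self):
        self.cells = {}

    def _check(self, addr: int) -> None:
        if addr % WORD != 0:
            raise ValueError(f"address {addr} is not word-aligned")

    def read(self, addr: int) -> list:
        self._check(addr)
        # Assumption: cells that were never written read as zeros.
        return self.cells.get(addr, [0] * WORD)

    def write(self, addr: int, cell: list) -> None:
        self._check(addr)
        if len(cell) != WORD:
            raise ValueError("a cell holds exactly 4 elements")
        self.cells[addr] = list(cell)


def cell_to_int(cell: list) -> int:
    """Assumption: a pointer is stored as little-endian bytes in one cell."""
    return sum(byte << (8 * i) for i, byte in enumerate(cell))


def load_relative(mem: Memory, fp: int, a: int) -> list:
    """[a]: the cell at address fp + a."""
    return mem.read(fp + a)


def load_absolute(mem: Memory, fp: int, a: int) -> list:
    """[[a]]: the cell whose absolute address is stored in [a]."""
    return mem.read(cell_to_int(load_relative(mem, fp, a)))
```

The second function performs two reads: the first is relative to `fp`, and the second uses the value found there as an absolute address.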
Instruction list
Each instruction contains 5 field element operands, $a, b, c, d, e$. Often, $d$ and $e$ are binary flags indicating whether operands $a$ and $b$ are immediate values or relative offsets.
Listed below are the instructions offered in each configuration.
Core
a(fp), c(fp)
b(fp), c(fp)
a(fp), b, c
a(fp), b(fp), c(fp)
a, b(fp), c(fp)
a, b(fp), c
a, b(fp), c(fp)
a, b(fp), c
a, b, c, d, e
Field arithmetic
a(fp), b(fp), c(fp)
a(fp), b(fp), c(fp)
a(fp), b(fp)
a(fp), b(fp)
Note that field arithmetic instructions only operate on the first element in a cell, which represents a field element instead of a single byte.
U32 Arithmetic
a(fp), b(fp), c(fp)
a(fp), b(fp), c
a(fp), b(fp), c(fp)
a(fp), b(fp), c
a(fp), b(fp), c(fp)
a(fp), b(fp), c
a(fp), b(fp), c(fp)
a(fp), b(fp), c(fp)
a(fp), b(fp), c
a(fp), b, c(fp)
a(fp), b(fp), c(fp)
a(fp), b(fp), c
Bitwise
a(fp), b(fp), c(fp)
a(fp), b(fp), c(fp)
a(fp), b(fp), c(fp)
Byte Manipulation
Note: These will not be supported in the initial version.
a(fp), b(fp)
b(fp), c(fp)
b(fp), c
Heap allocation
Notes:
`free`)
Assembly
Instructions
We will closely follow RISC-V assembly, making modifications as necessary. The most important difference between our zkVM assembly and RV32IM is that instead of registers `x0-31`, we only have two special-purpose registers `fp` and `pc`. However, we have (up to $2^{31}-1$) local variables, addressed relative to the current frame pointer `fp`.
Calling convention / stack frame
We follow the RISC-V convention and grow the stack downwards. For a function call, the arguments are pushed onto the stack in reverse order. We only allow statically sized allocation on the stack, unlike traditional architectures where `alloca` can be used to allocate dynamically. All dynamic allocation will be compiled to heap allocations. Instead of using a frame pointer that points at the beginning of the frame, we use a stack pointer which points at the first free stack cell.
Note that:
Pseudo instructions
`call label`: `imm32 (-b+8)(fp), 0, 0, 0, -b(fp)`; `jal -b(fp), label, -b(fp)`, where `b` is the size of the current stack frame plus the call frame size for instantiating a call to `label`
`ret`: `jalv -4(fp), 0(fp), 8(fp)`
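The expansions listed above amount to a textual substitution of `b` and `label`, which an assembler pass might perform roughly as in this Python sketch. The function names are hypothetical, and the requirement that `b` be a positive multiple of 4 is an assumption derived from the word-alignment rule in the Memory section.

```python
def expand_call(label: str, b: int) -> list:
    """Expand `call label` into the two instructions given in the spec.

    `b` is the size of the current stack frame plus the call frame size.
    Assumption: `b` is a positive multiple of 4 so offsets stay word-aligned.
    """
    if b <= 0 or b % 4 != 0:
        raise ValueError("b must be a positive multiple of 4")
    return [
        f"imm32 {-b + 8}(fp), 0, 0, 0, {-b}(fp)",
        f"jal {-b}(fp), {label}, {-b}(fp)",
    ]


def expand_ret() -> list:
    """Expand `ret` into its single underlying instruction."""
    return ["jalv -4(fp), 0(fp), 8(fp)"]


if __name__ == "__main__":
    for line in expand_call("fib", 12) + expand_ret():
        print(line)
```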
Implementing MEMCPY/SET/MOVE
Memcpy will require roughly 2 cycles per word. We can follow this memcpy implementation on RISC-V.
Example program
The (partial) stack at the time of executing the first instruction (`sw`) inside `fib` after the call from `main` (line 6 above) looks like:
Trace
Main CPU
Columns $\text{opcode}, \text{op}_a, \text{op}_b, \text{op}_c, \text{op}_d, \text{op}_e$ are specified by the program code (see the "Instruction Trace" section below).
Trace cells are also allocated to hold buffered read memory values for $\text{addr}_a$ and $\text{addr}_b$, and buffered write values for $\text{addr}_c$. We read and write 4 elements from memory at a time to the main trace. These elements are only constrained when the immediate value flags are not set (see the "Instruction Decoding" section below):
Memory
The memory table is sorted by ($\text{addr}, \text{clk}$).
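Sorting by address and then by clock places every access to a given address next to the access that preceded it, so consistency can be checked by comparing neighbouring rows. The Python sketch below illustrates the idea outside of any constraint system; the `MemOp` type and the rule that the first access to an address must be a write are assumptions of the example.

```python
from typing import NamedTuple


class MemOp(NamedTuple):
    """Hypothetical row of the memory log."""
    clk: int
    addr: int
    value: tuple  # the 4 elements of the cell
    is_write: bool


def sort_memory_log(log: list) -> list:
    """Order the chronological log by (addr, clk), as in the memory table."""
    return sorted(log, key=lambda op: (op.addr, op.clk))


def is_consistent(sorted_log: list) -> bool:
    """Check that each read returns the value of the preceding row.

    Assumption: the first access to any address must be a write.
    """
    prev = None
    for op in sorted_log:
        same_addr = prev is not None and prev.addr == op.addr
        if not op.is_write:
            if not same_addr or op.value != prev.value:
                return False
        prev = op
    return True
```

A real implementation would express the same neighbour comparison as polynomial constraints over adjacent trace rows rather than as a loop.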
Instruction decoding
Trace cells are also allocated for each selector. In each cycle, main CPU opcodes are decoded into binary selector flags, or to a single `is_bus_opcode` flag in the case that the opcode is processed by a different chip.
Instruction Trace
Each instruction is encoded as 6 field elements.
Core
a(fp), c(fp)
b(fp), c(fp)
a(fp), b, c
a(fp), b(fp), c
a, b(fp), c(fp)
a, b(fp), c
a, b(fp), c(fp)
a, b(fp), c
a, b, c, d, e
Field arithmetic
a(fp), b(fp), c(fp)
a(fp), b(fp), c(fp)
a(fp), b(fp)
a(fp), b(fp)
U32 Arithmetic
a(fp), b(fp), c(fp)
a(fp), b(fp), c
a(fp), b(fp), c(fp)
a(fp), b(fp), c
a(fp), b(fp), c(fp)
a(fp), b(fp), c
a(fp), b(fp), c(fp)
a(fp), b(fp), c(fp)
a(fp), b(fp), c
a(fp), b, c(fp)
a(fp), b(fp), c(fp)
a(fp), b(fp), c
Bitwise
a(fp), b(fp), c(fp)
a(fp), b(fp), c(fp)
a(fp), b(fp), c(fp)
Design notes
Frontend target
We are writing a compiler from LLVM IR to our ISA.
ZK stack
This is a STARK-based zkVM. We are using Plonky3 to implement the polynomial IOP and PCS.
Field choice
We plan to use the 32-bit field defined by `p = 2^31 - 1`, which should give very good performance on GPUs or with most vector instruction sets.
Registers
Our VM has no general purpose registers, since memory is cheap.
Memory
We will use a conventional R/W memory.
Tables
The CPU can do up to three memory operations per cycle, to support binary operations involving two reads and one write.
If we used a single-trace model, we could support this by adding columns for 6 memory operations in each row of our trace: 3 for the chronological memory log and 3 for the `(address, timestamp)` ordered memory log.
Instead, we make the memory a separate table (i.e. a separate STARK which gets connected with a permutation argument). We also use multi-table support to implement other coprocessors that are wasteful to include in the main CPU, as their operations may not be used during most cycles (e.g. Keccak).
Continuations
TODO: Explain the permutation-based continuation implementation.
Lookups
Initially, we will support lookups only against prover-supplied tables. The main use case is range checks. To perform a 16-bit range check, for example, we would have the prover send a table containing `[0 .. 2^16 - 1]` in order. (If the trace were shorter than 2^16 rows, we would pad it. If it were longer than 2^16 rows, the prover would include some duplicates.) We would then use constraints to enforce that this table starts at 0, ends at 2^16 - 1, and increments by 0 or 1.
Preprocessed tables can also be useful, particularly for bitwise operations like xor. However, we will not support them initially because they require non-succinct preprocessing.
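The three conditions on the prover-supplied range-check table (starts at 0, ends at the maximum, steps by 0 or 1) can be checked as in this Python sketch. It works over plain integers rather than field elements, and the function names and the choice to pad by repeating the last value are assumptions of the example.

```python
def build_range_table(bits: int, trace_len: int) -> list:
    """Build an ordered table covering [0 .. 2^bits - 1].

    If the trace is longer than 2^bits rows, the last value is repeated
    (one way of "including some duplicates") so the table matches its length.
    """
    size = 1 << bits
    table = list(range(size))
    table += [size - 1] * max(0, trace_len - size)
    return table


def check_range_table(table: list, bits: int) -> bool:
    """Enforce: first entry is 0, last is 2^bits - 1, each step is 0 or 1."""
    if not table:
        return False
    if table[0] != 0 or table[-1] != (1 << bits) - 1:
        return False
    return all(b - a in (0, 1) for a, b in zip(table, table[1:]))
```

Together the three conditions force every value in the range to appear at least once, since the table cannot skip from one value to another without stepping through everything in between.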
Floating point arithmetic
Fast floating point arithmetic doesn't seem important for our anticipated use cases, so we will convert float operations to integer ones during compilation.