Summary

I am proposing here is that we move from away from a von Neumann architecture and towards a modified Harvard architecture.

Details

At the moment we mix up decoding and execution of the guest program. (That allows us to support self-modifying code: you can eg implement a JIT that writes some new machine code to a memory address, and then jumps there to execute that. Not that we ever want to do that.)

I am suggesting that we stop supporting self-modifying code, and that we decode ahead of time.

Specifically our Program type would change as follows:

/// A RISC Zero program
 #[derive(Clone, Debug)]
 pub struct Program {
     /// The entrypoint of the program
     pub entry: u32,
     /// This is the lowest address of the program's executable code
     pub base_address: u32,
     /// The instructions of the program
-    pub instructions: Vec<u32>,
+    pub instructions: BTreeMap<WordAddr, DecodedInstruction>,
     /// The initial memory image
     pub image: BTreeMap<u32, u32>,
 }

And DecodedInstruction would become a much simpler structure with entries that are directly useful both for the emulator and for the circuits:

#[derive(Clone, Copy, Debug)]
pub struct DecodedInstruction {
    pub kind: InsnKind,
    /// The index of the first source register.
    ///
    /// Instructions that do not read a source register will have this set to 0.
    pub rs1: u32,
    /// The index of the second source register.
    ///
    /// Instructions that do not read a second source register will have this set to 0.
    pub rs2: u32,
    /// The index of the destination register.
    ///
    /// Instructions that do not write to a destination register will have this set to `RD_NULL`.
    pub rd: u32,
    /// The immediate value.
    ///
    /// This should be able to handle i32::MIN to u32::MAX, so we need i64.
    /// Conversion to a field element can happen with eg `crate::utils::i64_to_base`
    ///
    /// Instructions that don't use an immediate value will have this set to 0.
    pub imm: i64,

    /// The original encoded instruction word.
    ///
    /// This is useful for debugging and logging.
    /// TODO(Matthias): we still use this otherwise, remove those uses.
    #[allow(dead_code)]
    pub word: u32,
}

(Note: no weird fields with meanings that are too hard to explain.)

Stepping through the program in the emulator becomes simpler, but the real gain is that we no longer have to prove the relationship between the instruction as a u32-word and the decoded components of the instruction: the circuits no longer need to know that an instruction can even be encoded as a machine word. It's irrelevant to them.

See https://github.com/scroll-tech/ceno/pull/519 for a prototype

That gave me an idea btw: https://github.com/scroll-tech/ceno/pull/527

That’s basically the same result just much easier to do.

I'm glad you are engaging with the idea.

Alas marginally speeding up the emulator is LL you get from your caching, and that's the least of our concerns. That's because proving takes so much more time than running.

Thanks for writing these details. That clears it up a lot.

So I fully agree with the mechanism that is explained here. That is how it works!

Besides that, looking at the examples, well of course there are endless ways to write it. Code can be moved around. Prefer constructor versus method? On-the-fly (current) versus pre-computed (#519) versus memoized (#527)? That’s implementation details and style preferences.

Example compute in constructor (this proposal):

     /// Instructions that do not read a source register will have this set to 0.
    pub rs1: u32,

Or compute in a method (current):

    /// Get the register source 1, or zero if the instruction does not use rs1.
    pub fn rs1_or_zero(&self) -> u32 {

Same task, same total complexity, effectively same speed. I’m trying to say that it is really not critical and not worth reinventing. But I understand now the fundamental idea you were trying to explain before.

This is not at all about speed of the decoder nor emulator. Sorry, for the confusion, if I gave the impression.

This is about simplifying the circuits, and thus making them easier to verify, and faster to prove.

For that we don't want the circuits to ever come in contact with the original u32 word or any vestiges of weird case analyses. That should all be done beforehand. The representation of the program that the circuits commit to should be as simple and sanitised as possible.

(Another albeit small benefit of decoding ahead of time is that it makes error handling in the emulator simpler: many errors just can't happen at this stage anymore.)

What this approach also makes simpler is unifying eg ADD and ADDI internally, both in the emulator and more importantly in the constraints. You just have a single instruction kind for addition that adds rs1 + (rs2 + imm) (but only needs to check for a single overflow in the circuits, because by construction either at least one of rs2 or imm will be zero.

In the same vein, you can define a single internal multiplication as rs1 * (rs2 + imm) and use that to implement both MUL and immediate left shift in the circuit (and for consistency also in the emulator)).

For that we don't want the circuits to ever come in contact with the original u32 word or any vestiges of weird case analyses. That should all be done beforehand. The representation of the program that the circuits commit to should be as simple and sanitised as possible.

Sorry I should have been more clear: this is how it works already.

Huh? Currently the circuit knows about stuff like funct3, doesn't it?

I saw there are few mention points, would like to point out there are some mixed concepts here

decode ahead of time, and decode only once

This is a good point and agree with that, we might able to cache the decode result to avoid having repeating on-the-fly decoding per function call. I believe #527 motivated by this and instantiate a good move.

claim 1: supporting self-modifying code, and that we decode ahead of time claim 2: That allows us to support self-modifying code: you can eg implement a JIT that writes some new machine code to a memory address, and then jumps there to execute that. Not that we ever want to do that. claim 3: Huh? Currently the circuit knows about stuff like funct3, doesn't it?

I think you might read through the opcode circuit and see "funct3" terms appear as witness, then you claim to found some fishy and then jump to conclusion that there are allow "self-modifying" or "JIT" mechanism which allow to write to program and set pc to jump to it and execute. Based on current mechanism, I would say this probablility of this self-modify behaviour by a "malicious" prover to make cheat proof being "accepted" was "negligible", unless a guest program directly express this on purpose in its source code.

What this step https://github.com/scroll-tech/ceno/blob/master/ceno_zkvm/examples/riscv_opcodes.rs#L119-L123 do is it will decode program line by line "one time", and HASH Vec<(pc, opcode, rs1, rs2, rd, funct3, shamt, imm, ...whatever we like)> into a list of fixed commitments. You can imagine we have a encode a static "Set" here. This behaviour is done just once, and finalized together with prover/verifier key setup.

In opcode circuit, e.g. Rtype, it's known we have something like this https://github.com/scroll-tech/ceno/blob/f9f7459b9133a1e0c76de3b878eefbe260824b7b/ceno_zkvm/src/instructions/riscv/r_insn.rs#L45-L53

Here some instruction detail are like on-the-fly decoding and expose some fishy details, so you claim self-modify code JIT is possible. The answer is with-high-probability it's not possible, because what in fetch circuit do is to HASH all those fields into a digest, then verifier check whether this digest existing in static "Set" or not, yes, this is what we call "lookup" argument. The static "Set", or we call program table, is compiled and settled before key setup.

I haven't been following all the discussion closely, but I noticed there may be some mixed concepts connecting A and B, even though they might not be directly related. I’d like to share my thoughts, in case they might be helpful :)

Here's some background first. I'll reply to your specific points in a second comment.

Risc0 has questionable design choices. We took DecodedInstruction straight from Risc0. InsnRecord, which as far as I can tell we wrote ourselves, is a lot saner.

If you have a look at the RiscV spec, you'll see on page 12 how some of the various instruction formats are laid out. Specifically, you see that all instruction formats define various fields and how they are encoded in the instruction word.

Some formats have common fields. But in general, the set of fields differs between types, and bits are re-used for different fields in different formats. Especially the immediate field is encoded in whatever bits were left over. (If you go further down the spec, you'll see a few more other instruction formats described.)

SP1 is an example of a reasonably sane implementation of that design. They use the rrs-lib crate. You can find definitions like this in rrs-lib:

pub struct RType {
    pub funct7: u32,
    pub rs2: usize,
    pub rs1: usize,
    pub funct3: u32,
    pub rd: usize,
}

pub struct IType {
    pub imm: i32,
    pub rs1: usize,
    pub funct3: u32,
    pub rd: usize,
}

pub struct ITypeShamt {
    pub funct7: u32,
    pub shamt: u32,
    pub rs1: usize,
    pub funct3: u32,
    pub rd: usize,
}

Those are the fields that the spec defines for the different formats. SP1 then goes on and converts to an internal format:

/// RISC-V 32IM Instruction.
///
/// The structure of the instruction differs from the RISC-V ISA. We do not encode the instructions
/// as 32-bit words, but instead use a custom encoding that is more friendly to decode in the
/// SP1 zkVM.
#[derive(Clone, Copy, Serialize, Deserialize)]
pub struct Instruction {
    /// The operation to execute.
    pub opcode: Opcode,
    /// The first operand.
    pub op_a: u8,
    /// The second operand.
    pub op_b: u32,
    /// The third operand.
    pub op_c: u32,
    /// Whether the second operand is an immediate value.
    pub imm_b: bool,
    /// Whether the third operand is an immediate value.
    pub imm_c: bool,
}

The format they use in their proofs looks very similar. So far so sane. (Note also how they use pub fields throughout.)

Now let's have a look at Risc0's risc0/circuit/rv32im/src/prove/emu/rv32im.rs:

impl DecodedInstruction {
    fn new(insn: u32) -> Self {
        Self {
            insn,
            top_bit: (insn & 0x80000000) >> 31,
            func7: (insn & 0xfe000000) >> 25,
            rs2: (insn & 0x01f00000) >> 20,
            rs1: (insn & 0x000f8000) >> 15,
            func3: (insn & 0x00007000) >> 12,
            rd: (insn & 0x00000f80) >> 7,
            opcode: insn & 0x0000007f,
        }
    }

They define a bunch of arbitrary fields, some of which share names and positions with some fields in some formats in the spec. But they are all jumbled into one incoherent mess, instead of being separated into the different formats. (Some other fields are completely made up, and don't appear anywhere in the spec. And the spec also has some instructions where fields of the same name are encoded in different bits.)

Now to go from that mess to the fields that the spec wants for the different formats, they need to do a second round of bit fiddling:

    fn imm_b(&self) -> u32 {
        (self.top_bit * 0xfffff000)
            | ((self.rd & 1) << 11)
            | ((self.func7 & 0x3f) << 5)
            | (self.rd & 0x1e)
    }

    fn imm_i(&self) -> u32 {
        (self.top_bit * 0xfffff000) | (self.func7 << 5) | self.rs2
    }

    fn imm_s(&self) -> u32 {
        (self.top_bit * 0xfffff000) | (self.func7 << 5) | self.rd
    }

    fn imm_j(&self) -> u32 {
        (self.top_bit * 0xfff00000)
            | (self.rs1 << 15)
            | (self.func3 << 12)
            | ((self.rs2 & 1) << 11)
            | ((self.func7 & 0x3f) << 5)
            | (self.rs2 & 0x1e)
    }

    fn imm_u(&self) -> u32 {
        self.insn & 0xfffff000
    }
}

For good measure, their emulator does a third round of bit fiddling.

InsnKind::SLLI => rs1 << (imm_i & 0x1f),
InsnKind::SRLI => rs1 >> (imm_i & 0x1f),
InsnKind::SRAI => ((rs1 as i32) >> (imm_i & 0x1f)) as u32,

SP1 does a single round all via rrs, and in a way that even someone as simple minded as me can understand as a straightforward implementation of the spec.

Alas, we copied our decoding from them, and not SP1. (But notice how our InsnRecord resembles SP1's Instruction much more than it does anything in Risc0.)

Let's move on to self-modifying code, and have a look at Risc0's step function in risc0/circuit/rv32im/src/prove/emu/rv32im.rs ((the origin of our ceno_emul/src/rv32im.rs):

pub fn step<C: EmuContext>(&mut self, ctx: &mut C) -> Result<()> {
    let pc = ctx.get_pc();

    if !ctx.check_insn_load(pc) {
        ctx.trap(TrapCause::InstructionAccessFault)?;
        return Ok(());
    }

    let word = ctx.load_memory(pc.waddr())?;
    if word & 0x03 != 0x03 {
        ctx.trap(TrapCause::IllegalInstruction(word))?;
        return Ok(());
    }

    let decoded = DecodedInstruction::new(word);
    let insn = self.table.lookup(&decoded);
    ctx.on_insn_decoded(&insn, &decoded);

    if match insn.category {
        InsnCategory::Compute => self.step_compute(ctx, insn.kind, &decoded)?,
        InsnCategory::Load => self.step_load(ctx, insn.kind, &decoded)?,
        InsnCategory::Store => self.step_store(ctx, insn.kind, &decoded)?,
        InsnCategory::System => self.step_system(ctx, insn.kind, &decoded)?,
        InsnCategory::Invalid => ctx.trap(TrapCause::IllegalInstruction(word))?,
    } {
        ctx.on_normal_end(&insn, &decoded);
    };

    Ok(())
}

On the face of it, step has all the necessary complications to support self-modifying code. If it doesn't actually support it, then that's because somewhere else they piled on some further complications on top that prevent it; instead of a simpler design like SP1's that avoids the possibility in the first place.

(As an extra bonus, they have about half a dozen places there where they need to deal with potentially invalid instructions.)

Before I close out, two more 'interesting' design choices of Risc0.

First, Risc0 commits to a hash digest of the ELF for their proofs, not something like our InsnRecord or SP1's Instruction. That means they need to proof their decoding. (Both Ceno and SP1 avoid that.) Please have a look at their risc0_zkvm::compute_image_id, if you want to verify that.

/// Compute and return the ImageID of the specified ELF binary.
#[cfg(not(target_os = "zkvm"))]
pub fn compute_image_id(elf: &[u8]) -> anyhow::Result<risc0_zkp::core::digest::Digest> {
    use risc0_zkvm_platform::{memory::GUEST_MAX_MEM, PAGE_SIZE};

    let program = Program::load_elf(elf, GUEST_MAX_MEM as u32)?;
    let image = MemoryImage::new(&program, PAGE_SIZE as u32)?;
    Ok(image.compute_id())
}

Second, here's the type of Program::load_elf used above:

pub fn load_elf(input: &[u8], max_mem: u32) -> Result<Program>`

And here's the definition of Program:

/// A RISC Zero program
pub struct Program {
    /// The entrypoint of the program
    pub entry: u32,

    /// The initial memory image
    pub image: BTreeMap<u32, u32>,
}

Their Program forgets all about the different sections of the ELF. Which ones are supposed to be read-only, which one are supposed to be (non-) executable, etc. I don't know for sure whether their prover supports self-modifying code, but it certainly supports jumping into data sections. They have no way to prevent that. (To spell out the consequences: a sufficiently underhanded guest program author can bury some malicious code in the constants of the program, and then arrange for some other vulnerable part of the program to jump into that code. An audit of the (Rust) source code of the guest won't reveal the malicious code, at most it reveals the stack smashing vulnerability, but almost any piece of unsafe Rust code could hide that.

Alternatively, carelessness can look like underhandedness. Thus, a clever attacker could find an existing program that might already spell out malicious code in its data section completely by accident.)

Enough background for now. I'll reply to your specific points in a second comment later.

With Ceno we are defining and implementing an interpretation of the Risc-V spec.

Ideally, we keep that very simple and straightforward. As much as possible. We don't want to be clever. We don't want to be subtle. We don't want to be tricky. We want to be boring.

Alas, at the moment we are actually defining (at least) two interpretations of the Risc-V spec: one in the emulator, largely copied-and-pasted from Risc0, and one in the prover.

Those two ain't the same. As Ming helpfully pointed out, some of my complaints actually only apply to our emulator, and it was naive of me to assume that our prover and our emulator speak the same language. They don't.

My bold proposal is that we should make them speak the same language. As much as possible, everything that's allowed (or banned) in the prover should be allowed (or banned) in the emulator, and vice versa. They should behave the same way.

Writing a Risc-V emulator is fairly straightforward, especially if, like us, you don't have too wring out every last bit of performance out of this part of the system. (Just to remind you, the majority of the time is spent in the prover, not the emulator.)

Writing a Risc-V prover is a whole different kettle of fish. It's a lot more complicated, and a lot more subtle. It's also a lot more important to get right. I suggest we make both parts of the system speak the same language, and that we make the prover the master of that language.

Making prover and emulator speak obviously the same language helps with reviewing the constraints for correctness: 'just' check if they allow exactly the same behaviour as the emulator. I say 'just' in scare quotes, because that's already hard enough. But it's infinitely easier and less error prone than deliberately designing two different languages and then trying to figure out how they relate to each other, and which deviations are intentional and which are bugs.

https://github.com/scroll-tech/ceno/issues/539 is an interesting illustration of the difficulties here, but also how our recent efforts are already bearing fruit. Nice work, @naure!

Thanks for renaming the issue, that clears it up.

scroll-tech / ceno

Decode statically in the emulator, sync prover and emulator #524

Summary

Details