investigate efficiency of memory finalized circuit

Per discussion here https://github.com/scroll-tech/ceno/pull/251#discussion_r1797717452

Currently in memory consistency check it's relied on max_memory as circuit constant parameter, which obviously bring up a possible gap between real in-used memory since max_memory need to cover the worst case memory consumption.

One solution, as non-uniform circuit and sumcheck based design, it's naturally possible treating memory table as special case of dynamic table, such that "memory size" can be auxiliary input from prover so the size can be determined for different proof and prover just need to "pay-as-you-go". Extra cost of this design is we need to prove address column are well-formed: uniquely and sorted

Here I want to describe how to prove a set { addr_i } does not have duplicated elements (aka. every element has multiplicity one) by applying "Protocol 1" proposed in sec 4 of OLB24 in the multivariate setting (aka. using sumcheck). This can be used in memory init / finalize circuit.

Recap protocol 1 in `OLB24`

The prover encodes the set { addr_i: 0 < i < N } as a univariate polynomial $z(X) = \prod (X - a_i)$. It's easy to see that this is equivalent to the formal derivative $z'(X)$ is co-prime to $z(X)$ in the polynomial ring $F[X]$. By bezout's identity, we know there must exists two polynomials $s(X), t(X)$ of degree less than $N-1$ and $N$ respectively s.t.

$$z(X) s(X) + z'(X) t(X) = 1. $$

According to Schwartz-Zippel lemma, this is equivalent to checking $z(\alpha) s(\alpha) + z'(\alpha) t(\alpha) = 1$ for a random challenge $\alpha \in E$ with negligible soundness error.

Prove protocol 1 using sumcheck

It's easy to see

$z(\alpha) = \prod (\alpha - a_i)$ can be easily proved by our product tower prover if we encodes { a_i } as a multilinear polynomial $Z(\vec{X}) = \sum_{\vec{i}} \textrm{eq}(\vec{X}, \vec{i}) \cdot a_i$ where $\vec{i}$ is the binary decomposition of $i$.
Since $z'(\alpha) = z(\alpha) * \sum \frac{1}{\alpha - a_i}$, and the fraction sum $\sum \frac{1}{\alpha - a_i}$ can also be easily proved by our tower prover.

The problem is reduced to

how to evaluate $s(\alpha)$ and $t(\alpha)$.

evaluates $s(\alpha)$ using tower tree

Note that $s(\alpha) = s_0 + s_1 \alpha + s_2 \alpha^2 + \cdots = s{odd}(\alpha^2) * \alpha + s{even}(\alpha^2)$ where

$s_{even}(X) = s_0 + s_2 * X + \cdots$;
$s_{odd}(X) = s_1 + s_3 * X + \cdots$.

We also encodes { s_i } as a multilinear polynomial $S(\vec{X}) = \sum_{\vec{i}} \textrm{eq}(\vec{X}, \vec{i}) \cdot s_i$ . Assuming $N = 2^k$ is a power of two, let's show how to compute $s(\alpha)$ using layered circuit which can then be proved by our tower prover. We call this poly evaluate tower prover.

Notation: $B_j$ is the boolean hypercube with $j$ variables.

layer 0: $s(\alpha) = v_1[0] + v_1[1] * \alpha$;
layer 1: $v_1[i] = v_2[i, 0] + v_2[i, 1]*\alpha^2$ for $i \in B_1$;
layer 2: $v_2[\vec{i}] = v_3[\vec{i},0] + v_3[\vec{i}, 1]*\alpha^4$ for $\vec{i} \in B_2$ ;
...
layer k-1: $v_k[\vec{i}] = s[\vec{i}]$ for $\vec{i} \in B_k$.

Let's take $k = 3$ as an example.

v3[0,0,0] = s0, v3[0,0,1] = s4, ...;
v2[0,0] = v3[0,0,0] + v3[0,0,1]*alpha^4 = s0 + s4*alpha^4;
v2[1,0] = v3[1,0,0] + v3[1,0,1]*alpha^4 = s1 + s5*alpha^4;
v2[0,1] = v3[0,1,0] + v3[0,1,1]*alpha^4 = s2 + s6*alpha^4;
v2[1,1] = v3[1,1,0] + v3[1,1,1]*alpha^4 = s3 + s7*alpha^4;
v1[0] = v2[0,0] + v2[0,1]*alpha^2 = s0 + s4*alpha^4 + s2*alpha^2 + s6*alpha^6;
v1[1] = v2[1,0] + v2[1,1]*alpha^2 = s1 + s5*alpha^4 + s3*alpha^2 + s7*alpha^6;
s(alpha) = v1[0] + v1[1]*alpha = s0 + s4*alpha^4 + s2*alpha^2 + s6*alpha^6 + s1*alpha + s5*alpha^5 + s3*alpha^3 + s7*alpha^7.

Protocol 1 in multivariate setting

We can use tower prover to the reduce the correctness of $s(\alpha)$ to evaluation of $S(\vec{X})$ at a random point $\vec{r} \in E^k$. We can do the same for $t(\alpha)$.

Therefore protocol 1 in the multivariate setting works as follows:

The prover commits to $Z(X), S(X), T(X)$ and sends three commitments $c_Z, c_S, c_T$ to verifier;
The verifier samples a random challenge $\alpha \in E$;
The prover runs product tower prover for $z(\alpha)$ to get product tower proof $\pi_1$ and point evaluation $Z(\vec{r})$;
The prover runs fraction sum tower prover for $\sum \frac{1}{\alpha - a_i}$ to get fraction sum tower proof $\pi_2$ and point evaluation $Z(\vec{u})$ ;
The prover runs poly evaluate tower prover for $s(\alpha)$ to get poly evaluate tower proof $\pi_3$ and point evaluation $S(\vec{\omega})$;
The prover runs poly evaluate tower prover for $t(\alpha)$ to get poly evaluate tower proof $\pi_4$ and point evaluation $T(\vec{\zeta})$;
The prover sends mpcs opening proof $\pi$ for these evaluations to verifier.
The verifier checks the mpcs opening proof $\pi$ and all these 4 tower proofs. And then verifier checks if $z(\alpha) s(\alpha) + t(\alpha) z(\alpha) * \sum \frac{1}{\alpha - a_i} = 1$.

@kunxian-xia this idea works! but it will introduce an overhead in prover to to interpolate and derive coefficients format for s(x) and t(x).

But I just reviewed again for our original question and suddenly a idea flash into my mind: the way to make us abandon to prove address vector are well format [0, 1, 2, 3, 4, 5, .. 2^b - 1] is because we haven't figure out how to convert into sumcheck PIOP, but... what if we just pre-commit few address vectors 16: [0, 1, ... 2^16-1] 17: [0, 1, ... 2^17-1] ... 28: [0, 1, ... 2^30-1] And we allow prover to auxiliary the index and choose which pre-commit poly to use?

Few benefits for these design

those address commitment just happened in key setup phase, thus it's one-time effort
public input design in one shot. The well-format of address as [0, 1, 2, 3....] can help us retain the design to reserve a memory region for public input without worry about soundness
it greatly leverage non-uniform design for the prover still "pay-as-you go"

Wdyt :) ?

I want to explain why the memory entries that accessed by a guest program have holes (equivalently, the addresses are sparse).

The memory allocator used in RISC0 / SP1 looks like this

/// https://github.com/risc0/risc0/blob/main/risc0/zkvm/platform/src/heap/bump.rs
use crate::syscall::sys_alloc_aligned;
use core::alloc::{GlobalAlloc, Layout};

#[global_allocator]
pub static HEAP: BumpPointerAlloc = BumpPointerAlloc;

pub struct BumpPointerAlloc;

unsafe impl GlobalAlloc for BumpPointerAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        sys_alloc_aligned(layout.size(), layout.align())
    }

    unsafe fn dealloc(&self, _: *mut u8, _: Layout) {
        // this allocator never deallocates memory
    }

    unsafe fn alloc_zeroed(&self, layout: Layout) -> *mut u8 {
        // NOTE: This is safe to avoid zeroing allocated bytes, as the bump allocator does not
        //       re-use memory and the zkVM memory is zero-initialized.
        self.alloc(layout)
    }
}

As you can see the above allocator is too simple and it never deallocates memory. This can cause big problem that never happen with modern malloc libraries like jemalloc.

Let's assume your guest program has code snippets like this:

fn foo() {
  {
     let mut v = vec![];
     for i in (0..n) {
        v.push(i);
     }
  }

  // ... later
  {
     let mut v = vec![];
     for i in (0..n) {
        v.push(i);
     }
  }
}

There are two issues in the above code snippets.

With allocator that frees memory, two occurrences of v may have same memory address; while with the BumpPointerAlloc allocator, 2nd occurrence of v must have larger memory address than the 1st even though the scope of the two for-loops are disjoint.
For containers like Vec and HashMap, if we don't reserve the "right" capacity before we do the insertions, the utilized ratio may be around 50% because each time the old space is not enough it will grows by twice (ref Vector in rust std library).

Then there will holes in the memory addresses accessed by your guest program.

Like i mentioned in earlier (comment):

The max_mem_address / mem_address_used is around 1.5 in the case of reth.
The max_mem_address / mem_address_used is around 32 in the case of fibonacci with input n=2^20.
Here is a distribution of gap between address in the memory final read set in the case of reth.

Elaborate succintly address form which verifier can evaluate it without any SNARK, and leveraging a multi-variate math fact (credited from @spherel 🔥 )

With above fact, verifier can evaluate addr directly giving challenge vector $\vec{r}$ with just simple field arithmetics, without prover help. The non-uniform prover design from the fact that prover can witness n = ceil(log(max_memory)) per proof, and verifier take it to evaluate address.

This method also expose a powerful fact => we can add constant offset and scalar into the memory, so it end up that we can construct any consecutive memory sub-region starting on arbitrary offset.

Giving 2 assumptions for riscv zkVM access pattern:

heap/stack grows should be consecutively and dense, especially heap, since we dont have gc.
the outlier gap probably come from the heap/stack segment boundary. We can solve this issue by adjusting stack/heap segment easily.

One imperfect is prover still need to pay the cost [max_address, next_pow2]. I would claim this gap might be negligible. Since opcode proof might be bottleneck, and table proof with such non-uniform design already solve the issue nicely. It is like 2/8 principle: 20% effort solve 80% problem. We can left those 80% effort as future work

PR WIP #457

scroll-tech / ceno