taikoxyz / raiko

Multi-proofs for Taiko. SNARKS, STARKS and Trusted Execution Enclave. Our previous ZK-EVM circuits are deprecated.
Apache License 2.0
110 stars 85 forks source link

c-kzg failing in SP1 due to unaligned access #169

Closed CeciliaZ030 closed 2 months ago

CeciliaZ030 commented 4 months ago

Describe the feature request

Following

148

157

162

We still have to fix kzg compilation in SP1. I've included example in pipeline/examples/sp1: https://github.com/taikoxyz/raiko/blob/12826f2c820346b2c42deb757b359568e7d0f793/pipeline/examples/sp1/src/kzg.rs The goal is to get the proof generated, although it seems like extern C overloading does proves & verified:

unsafe {
        let c = calloc(100);
        let m = malloc(200);
        free(c);
        free(m);
    }

Possibly a memory or stack limit issue.

Spam policy

Brechtpd commented 4 months ago

Fails here: https://github.com/succinctlabs/sp1/blob/53c8915b99e06049f73b4e3c8f83c93296764808/core/src/runtime/mod.rs#L593

risc-v spec says this:

image

So allowed in the spec, with the warning it could be slow. Though SP1 seems to disallow it.

Brechtpd commented 4 months ago

Compiling with -mstrict-align seems to work to prevent unaligned accesses in the c code and the kzg code looks to run okay.

CeciliaZ030 commented 4 months ago

Fixed #170

puma314 commented 4 months ago

@Brechtpd thanks for this catch. I think we should note this somewhere in our docs, since it is non-obvious behavior that differs from the RISC-V spec.

Brechtpd commented 4 months ago

Reopening because seems the problem is not completely fixed for all blocks. Using block 101368 on A7, there is still an alignment issue at the same location. Even more, risc0 also fails for the block but with a different error:

SP1:

image

risc0:

image

puma314 commented 4 months ago

@Brechtpd I think that this is the same error across both VMs--seems like "LoadAccessFault" might be similar to our address not aligned error?

From a brief perusal online, it seems like in RISC-V, the hardware can implement only aligned accesses (this is true for us) and then the software can handle the unaligned accesses? (Stack overflow).

So perhaps we have to add a handler at the software layer for this. I can look into it / prioritize this since it seems like a blocker.

Brechtpd commented 4 months ago

@puma314

For risc0 it seems the error originates here: https://github.com/GregAC/rrs/blob/b23afc16b4e6a1fb5c4a73eb1e337e9400816507/rrs-lib/src/instruction_executor.rs#L185, and when going into read_mem it doesn't seem to hit the alignment error path because then the program would panic instead of returning the error that we see. So I think for some reason the memory location which is close to max int might be the problem and be corrupt (though no idea why that would be the case) but I'm not sure if risc0 manages the memory of different things so maybe the address could make sense. 4122387576 % 4 == 0 so I guess the address would also be aligned correctly.

Looking at risc0, they also do not seem to support unaligned memory accesses (unless they do the software fallback somewhere) which does seem to be very common. But it's also strange that the same blob that required -mstrict-align for sp1 did not need it for risc0. Though -mstrict-align is not enough, so who knows what's going on. :)

So hopefully for both the problem is just unaligned loads, will check with risc0 to see what they think about this error, if it ends up being an alignment problem then it would be great to have your help on this!

intoverflow commented 4 months ago

There's two additional flags you might want to pass to the builder:

I'm on mobile at the mobile so cannot test, but if you share steps to reproduce the issue I can do some more digging tonight or tomorrow 🫡

flaub commented 4 months ago

The above address 4122387576 (aka 0xf5b6a478) looks like garbage, in r0vm, it shouldn't be possible for a program to address anything above 0x0c000000 (max addressable range is 192MB). So this feels like a dangling pointer. Since we're talking about C code, it's plausible that this might be a deeper bug in c-kzg itself (or perhaps related to modifications made to get it to run in these VMs)

SchmErik commented 4 months ago

Here's the memory layout in the risc0 zkvm. We only allow guest code to use 0x400 - 0x0c00_0000 between code, stack, and heap. Within this range, the stack occupies the lowest 2 mb of this range, followed by the code, and then followed by the heap. The heap starts where the code ends and grows upward on each allocation towards 0x0c00_0000. Any address above this is illegal for the guest to access and that's what you're seeing here.

The address being 0xf5b6_a478 makes me think that this is either a dangling pointer or a massive allocation close to 4gb. Is there something large being allocated on the heap? Does this work as expected on non-zkvm machines? I also second @intoverflow's advice adding those two flags to the builder.

Brechtpd commented 4 months ago

Made a repro branch here: https://github.com/taikoxyz/raiko/pull/176. So can checkout the kzg-fun branch of this repo, and then:

export TARGET=risc0 MOCK=1
make install
make build && RUST_BACKTRACE=1 make test

change TARGET to sp1 for testing that.

Output for risc0:

image

Output for SP1:

image

There's two additional flags you might want to pass to the builder:

* `-march=rv32im`

* `-falign-functions=2`

Unfortunately they do not seem to fix it, but I added them in the test branch.

The address being 0xf5b6_a478 makes me think that this is either a dangling pointer or a massive allocation close to 4gb. Is there something large being allocated on the heap?

Nothing massive being allocated as far as I know, just the KZG trusted setup data and the blob data in memory (next to the block data, but that isn't much more than a couple of megabytes).

Does this work as expected on non-zkvm machines?

Oh good question because I think the zkVMs do use a different code path than the one taken by our "native" prover that also uses c-kzg but using the normal compilation path that I believe uses the x86 assembly hand optimized code. @smtmfft

We'll need to try that out and check it.

johntaiko commented 4 months ago

Does this work as expected on non-zkvm machines?

Works well in SGX environment.(block 101368 on A7)

johntaiko commented 4 months ago

The above address 4122387576 (aka 0xf5b6a478) looks like garbage, in r0vm, it shouldn't be possible for a program to address anything above 0x0c000000 (max addressable range is 192MB). So this feels like a dangling pointer. Since we're talking about C code, it's plausible that this might be a deeper bug in c-kzg itself (or perhaps related to modifications made to get it to run in these VMs)

It is quite possible that an arithmetic overflow issue from(c-kzg or its dependencies) was encountered on a 32-bit system(4122387576 approachs the maximum of u32), we can test it by running on other 32-bit virtual machines, like qemu.

johntaiko commented 4 months ago

we can test it by running on other 32-bit virtual machines.

So, we can build with debug info, and get the backtrace when panic

jtguibas commented 4 months ago

I did some investigating and this is what I found. To explore @johntaiko's hypothesis of the issue being related to compilation on 32-bit machines, I ran the following steps. Note that this isn't a perfect scenario to test things as it isn't being built on the RISC-V target (QEMU is a good next step IMO).

Create a 32-bit docker container and install requirements

sudo docker run --rm -it i686/ubuntu bash
apt-get update
apt-get install git
apt-get install build-essential
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Clone c-kzg taiko fork and run tests with memory/arithmetic sanitization

Clone repo

git clone https://github.com/smtmfft/c-kzg-4844.git
cd c-kzg-4844

Change bindings/rust/build.rs to compile c-kzg with C arithmetic overflow checks

cc.flag_if_supported("-ftrapv")
cc.flag_if_supported("-fsanitize=undefined")

Run tests with rust overflow checks

rustup toolchain install nightly
RUSTFLAGS="-C overflow-checks=on" cargo test --release

This seems to work FYI.

I also tried pointing raiko to this version of c-kzg-4844 to catch any arithmetic overflows, but it seems to still work on the prover when it's set to TARGET=native and make build && RUST_BACKTRACE=1 make test.

I think a good next step could be to do some more thorough debugging with GDB with the ELFs that get built for SP1/R0.

jtguibas commented 4 months ago

From further investigation, it seems like the issue is inside the call to:

blob_to_kzg_commitment
johntaiko commented 4 months ago

From further investigation, it seems like the issue is inside the call to:

blob_to_kzg_commitment

fix: https://github.com/taikoxyz/raiko/pull/201

Backgroud

Because, we can't use the libc in zkvms, so we hacked the malloc and calloc methods

Case

But, we had a mismatch with the calloc function's signature.https://www.tutorialspoint.com/c_standard_library/c_function_calloc.htm

-unsafe extern "C" fn calloc(size: usize) -> *mut c_void {
+unsafe extern "C" fn calloc(nobj: usize, size: usize) -> *mut c_void {

Call in c-kzg-4844 https://github.com/smtmfft/c-kzg-4844/blob/77e9ba0a65e10e6a470832da2914b17a968da791/src/c_kzg_4844.c#L189

static C_KZG_RET c_kzg_calloc(void **out, size_t count, size_t size) {
    *out = NULL;
    if (count == 0 || size == 0) return C_KZG_BADARGS;
    // need alloc `count * size` but only get `count(4096)` in raiko
    *out = calloc(count, size);
    return *out != NULL ? C_KZG_OK : C_KZG_MALLOC;
}

static C_KZG_RET g1_lincomb_fast(
    g1_t *out, const g1_t *p, const fr_t *coeffs, uint64_t len
) {
    C_KZG_RET ret;
    void *scratch = NULL;
    blst_p1_affine *p_affine = NULL;
    blst_scalar *scalars = NULL;

    /* Tunable parameter: must be at least 2 since blst fails for 0 or 1 */
    if (len < 8) {
        g1_lincomb_naive(out, p, coeffs, len);
    } else {
        /* blst's implementation of the Pippenger method */
        size_t scratch_size = blst_p1s_mult_pippenger_scratch_sizeof(len);
        ret = c_kzg_malloc(&scratch, scratch_size);
        if (ret != C_KZG_OK) goto out;
        ret = c_kzg_calloc((void **)&p_affine, len, sizeof(blst_p1_affine));
        if (ret != C_KZG_OK) goto out;
        ret = c_kzg_calloc((void **)&scalars, len, sizeof(blst_scalar));
        if (ret != C_KZG_OK) goto out;

        /* Transform the points to affine representation */
        const blst_p1 *p_arg[2] = {p, NULL};
        blst_p1s_to_affine(p_affine, p_arg, len);

        /* Transform the field elements to 256-bit scalars */
        for (uint64_t i = 0; i < len; i++) {
            blst_scalar_from_fr(&scalars[i], &coeffs[i]);
        }

        /* Call the Pippenger implementation */
        const byte *scalars_arg[2] = {(byte *)scalars, NULL};
        const blst_p1_affine *points_arg[2] = {p_affine, NULL};
        blst_p1s_mult_pippenger(
            out, points_arg, len, scalars_arg, 255, scratch
        );
    }

    ret = C_KZG_OK;

out:
    c_kzg_free(scratch);
    c_kzg_free(p_affine);
    c_kzg_free(scalars);
    return ret;
}

So, some of the allocated memory overlapped, resulting in a UB.

smtmfft commented 3 months ago

Reopen for we still meet unaligned access in ckzg lib in SP1. Add some findings here: https://hackmd.io/@taik0yue/B15mxaCVC

johntaiko commented 3 months ago

Reopen for we still meet unaligned access in ckzg lib in SP1. Add some findings here: hackmd.io/@taik0yue/B15mxaCVC

Fixed by https://github.com/smtmfft/c-kzg-4844/pull/2

Related update in blst: https://github.com/supranational/blst/compare/master...dyxushuai:blst:master

smtmfft commented 3 months ago

Possible, but I actually tried that...not works in my env, maybe I missed sth. will test later.

Reopen for we still meet unaligned access in ckzg lib in SP1. Add some findings here: hackmd.io/@taik0yue/B15mxaCVC

Fixed by smtmfft/c-kzg-4844#2

Related update in blst: supranational/blst@master...dyxushuai:blst:master

smtmfft commented 3 months ago

Adding flag is not enough, another reason is: https://github.com/taikoxyz/raiko/pull/291

as sp1 requires strict aligned access, potentially some more issues we will meet later, and then that unsafe memory operation in rust seems inevitable. Hopefully there is a rustflag to let all vec allocation be in a aligned manner. will explore that later. (ref global aligned alloc: https://github.com/Brechtpd/cap/blob/a63589670397f85745daa1a22d2de4d46fd742aa/src/lib.rs#L183)

VM debugging is hard, no debug info, no frame stack, and no libc support (can not print in c native library), any tool can make it easier?

smtmfft commented 2 months ago

close this issue as non-reproducible for now.