Multi-thread testing results in error

SkymanOne commented 2 weeks ago

Is there an existing issue?

[X] I have searched the existing issues

Experiencing problems? Have you tried our Discord first?

[X] This is not a support question.

Description of bug

When running multiple tests in production mode, where each test builds a prover with default_prover() and executes the same guest program on the same or different inputs, the local prover can either panic with:

thread '<unnamed>' panicked at /Users/risc0/actions-runner/_work/risc0/risc0/risc0/zkp/src/prove/prover.rs:338:33:
assertion `left == right` failed
  left: [1109155563, 76776968, 726152720, 1606742656]
 right: [0, 0, 0, 0]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at /Users/risc0/actions-runner/_work/risc0/risc0/risc0/zkp/src/prove/prover.rs:338:33:
assertion `left == right` failed
  left: [1937466798, 422672686, 84982592, 642582437]
 right: [0, 0, 0, 0]
thread '<unnamed>' panicked at /Users/risc0/actions-runner/_work/risc0/risc0/risc0/zkp/src/prove/prover.rs:338:33:
assertion `left == right` failed
  left: [916425584, 1323635903, 560063291, 61845575]
 right: [0, 0, 0, 0]
thread '<unnamed>' panicked at /Users/risc0/actions-runner/_work/risc0/risc0/risc0/zkp/src/prove/prover.rs:338:33:
assertion `left == right` failed
  left: [283844281, 1796260247, 1438206498, 1271669024]
 right: [0, 0, 0, 0]
thread '<unnamed>' panicked at /Users/risc0/actions-runner/_work/risc0/risc0/risc0/zkp/src/prove/prover.rs:338:33:
assertion `left == right` failed
  left: [1889568656, 1020291148, 1642680298, 421080635]
 right: [0, 0, 0, 0]
thread '<unnamed>' panicked at /Users/risc0/actions-runner/_work/risc0/risc0/risc0/zkp/src/prove/prover.rs:338:33:
assertion `left == right` failed
  left: [414445105, 1970834655, 939333111, 991974128]
 right: [0, 0, 0, 0]
thread '<unnamed>' panicked at /Users/risc0/actions-runner/_work/risc0/risc0/risc0/zkp/src/prove/prover.rs:338:33:
assertion `left == right` failed
  left: [420338028, 1713299230, 424368403, 1629257559]
 right: [0, 0, 0, 0]

which points to https://github.com/risc0/risc0/blob/2ba504fddd84376235d335ec4db6b2353d967fc9/risc0/zkp/src/prove/prover.rs#L338

or fail to prove the program at prover.prove(...).unwrap() with "verification indicates proof is invalid".

I suspect this is because Rust runs tests in multiple threads by default, which causes issues with constraint generation: running cargo test -- --test-threads=1 resolves the issue.

Steps to reproduce

It is difficult to provide a deterministic reproducer since the issue depends solely on the CPU runtime. However, I managed to hit a similar error with the starter template by moving all the host prover-call code into a run() function and defining the tests as:

#[cfg(test)]
mod test {
    use crate::run;

    #[test]
    fn test1() {
        run();
        run();
    }

    #[test]
    fn test2() {
        run();
    }

    #[test]
    fn test3() {
        run();
    }

    #[test]
    fn test4() {
        run();
    }

    #[test]
    fn test5() {
        run();
    }
}

and then running cargo test -r multiple times.

At some point you should get something like:

running 5 tests
test test::test3 ... FAILED
test test::test4 ... ok
test test::test5 ... ok
test test::test2 ... ok
test test::test1 ... ok

failures:

---- test::test3 stdout ----
thread 'test::test3' panicked at host/src/main.rs:36:10:
called `Result::unwrap()` on an `Err` value: verification indicates proof is invalid
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

failures:
    test::test3

linear[bot] commented 2 weeks ago

ZKVM-608 Multi-thread testing results in error

SkymanOne commented 2 weeks ago

My suspicion is that it is related to https://github.com/risc0/risc0/blob/54febb7df36f8406d9393cbc3184920a24e9db21/risc0/zkvm/src/host/server/prove/mod.rs#L241 since when a new Rc reference to a local segment prover is constructed in a separate thread, it causes a data race. Perhaps using Arc would address the issue.

austinabell commented 2 weeks ago

Which OS are you using? Assuming you are using the proving server binary and not overriding with the "prove" feature flag? Also assuming you are not running these tests in dev mode, is that correct?

My suspicion is that it is related to

https://github.com/risc0/risc0/blob/54febb7df36f8406d9393cbc3184920a24e9db21/risc0/zkvm/src/host/server/prove/mod.rs#L241

since when a new Rc reference to a local segment prover is constructed in a separate thread, it causes a data race. Perhaps using Arc would address the issue.

I don't think so. That value won't be shared across test threads, so if this were Arc it would be the same.

Rust blocks even moving the Rc across thread boundaries (example)
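
A minimal sketch of that point, using only the standard library (the function names here are illustrative and are not taken from risc0): each thread can build and clone its own Rc, but the compiler rejects any attempt to move an Rc into another thread, whereas an Arc can cross the boundary.

```rust
use std::rc::Rc;
use std::sync::Arc;
use std::thread;

// Each thread creates its own Rc, so no Rc is ever shared between threads.
fn per_thread_rc() -> Vec<usize> {
    let handles: Vec<_> = (0..4usize)
        .map(|i| {
            thread::spawn(move || {
                let local = Rc::new(i); // created and dropped inside this thread
                let alias = Rc::clone(&local);
                Rc::strong_count(&alias) // two owners, both thread-local
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

// Arc<usize> is Send, so a clone can be moved into another thread.
fn shared_arc() -> usize {
    let shared = Arc::new(42usize);
    let clone = Arc::clone(&shared);
    thread::spawn(move || *clone).join().unwrap()
}

fn main() {
    // The two lines below do not compile, because Rc<usize> is not Send:
    // let rc = Rc::new(1usize);
    // thread::spawn(move || *rc);
    println!("{:?}", per_thread_rc());
    println!("{}", shared_arc());
}
```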

SkymanOne commented 2 weeks ago

Which OS are you using? Assuming you are using the proving server binary and not overriding with the "prove" feature flag? Also assuming you are not running these tests in dev mode, is that correct?

I run macOS on an Apple M1. I'm using version 1.1.2 of the risc0 crates. No feature overriding was done. The risc0 crates are used as generated by cargo risczero.

Tests are run in production mode, so full proof generation.

SkymanOne commented 2 weeks ago

I don't think so. That value won't be shared across test threads, so if this were Arc it would be the same.

Maybe it is not necessarily related to the Rc I pointed out. However, it is very likely that some global state is shared across multiple test threads and causes a data race. Running the tests causes different ones to fail at random, even in the starter template.

SchmErik commented 1 week ago

Which OS are you using? Assuming you are using the proving server binary and not overriding with the "prove" feature flag? Also assuming you are not running these tests in dev mode, is that correct?

I run macOS on an Apple M1. I'm using version 1.1.2 of the risc0 crates. No feature overriding was done. The risc0 crates are used as generated by cargo risczero.

Tests are run in production mode, so full proof generation.

Running these in parallel causes the Apple GPU to run out of memory, which produces these errors. We run our proving tests serially to avoid this, and we suggest that you do the same for your workload.
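
One way to serialize only the proving calls while leaving the rest of the test suite parallel is a process-wide lock. The sketch below is an assumption, not something risc0 provides: PROVER_LOCK, run_serialized, and the counters are hypothetical names, and run() is a stand-in for the prover call from the reproducer rather than real proving code.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;
use std::time::Duration;

// Hypothetical global lock guarding the prover; not part of risc0.
static PROVER_LOCK: Mutex<()> = Mutex::new(());

// Counters used only to observe how many run() calls overlap.
static ACTIVE: AtomicUsize = AtomicUsize::new(0);
static MAX_ACTIVE: AtomicUsize = AtomicUsize::new(0);

// Stand-in for the issue's run(); a real version would build a prover and prove.
fn run() {
    let now = ACTIVE.fetch_add(1, Ordering::SeqCst) + 1;
    MAX_ACTIVE.fetch_max(now, Ordering::SeqCst);
    thread::sleep(Duration::from_millis(5)); // simulate work
    ACTIVE.fetch_sub(1, Ordering::SeqCst);
}

// Call this from each #[test] instead of calling run() directly.
fn run_serialized() {
    // Recover from poisoning so one panicking test does not fail the others.
    let _guard = PROVER_LOCK.lock().unwrap_or_else(|e| e.into_inner());
    run();
}

// Spawn several threads, as the test harness would, and report the peak overlap.
fn max_concurrency(threads: usize) -> usize {
    let handles: Vec<_> = (0..threads)
        .map(|_| thread::spawn(run_serialized))
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    MAX_ACTIVE.load(Ordering::SeqCst)
}

fn main() {
    println!("max concurrent run() calls: {}", max_concurrency(5));
}
```

The simpler alternative, already reported above to resolve the failures, is to run the whole suite on one thread with cargo test -r -- --test-threads=1.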

dymchenkko commented 3 days ago

I’m facing a similar issue with ECDSA signature verification inside the guest environment. Running this on a Mac with the "metal" feature, without any multithreading, results in a similar error.

To reproduce the issue, see https://github.com/dymchenkko/oyster-monorepo/commit/5f2f817ae492a9c465e223c99e4b25706def06ce (verifier-risczero directory).
