Open SkymanOne opened 2 weeks ago
My suspicion is that it is related to https://github.com/risc0/risc0/blob/54febb7df36f8406d9393cbc3184920a24e9db21/risc0/zkvm/src/host/server/prove/mod.rs#L241, since when a new `Rc` reference to a local segment prover is constructed in a separate thread, it causes a data race. Perhaps using `Arc` would address the issue.
Which OS are you using? Assuming you are using the proving server binary and not overriding with the "prove" feature flag? Also assuming you are not running these tests in dev mode, is that correct?
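> My suspicion is that it is related to https://github.com/risc0/risc0/blob/54febb7df36f8406d9393cbc3184920a24e9db21/risc0/zkvm/src/host/server/prove/mod.rs#L241, since when a new `Rc` reference to a local segment prover is constructed in a separate thread, it causes a data race. Perhaps using `Arc` would address the issue.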
I don't think so. That value isn't shared across test threads, so switching to `Arc` would make no difference.
Rust doesn't even allow moving an `Rc` across thread boundaries (example).
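For anyone following along, here is a minimal sketch of that point. It is not taken from the risc0 codebase; `read_on_other_thread` and `assert_send` are made-up helper names used only for illustration. An `Arc` can be moved into a spawned thread because it is `Send`, while the equivalent `Rc` lines are rejected at compile time.

```rust
use std::rc::Rc;
use std::sync::Arc;
use std::thread;

/// Moves an `Arc` into a spawned thread and returns the value read there.
/// (Hypothetical helper, for illustration only.)
fn read_on_other_thread(shared: Arc<u32>) -> u32 {
    thread::spawn(move || *shared)
        .join()
        .expect("worker thread panicked")
}

/// Compile-time check: this only accepts references to `Send` types.
fn assert_send<T: Send>(_value: &T) {}

fn main() {
    let arc = Arc::new(7_u32);
    assert_send(&arc); // compiles: `Arc<u32>` is `Send`
    println!("{}", read_on_other_thread(Arc::clone(&arc)));

    let rc = Rc::new(7_u32);
    // Uncommenting either line below is a compile error,
    // because `Rc<u32>` does not implement `Send`:
    // assert_send(&rc);
    // thread::spawn(move || *rc).join().unwrap();
    println!("{}", *rc); // an `Rc` stays usable on the thread that created it
}
```

So an `Rc` that is created and used entirely within one test thread cannot be the object that two threads race on, at least not without `unsafe` code somewhere.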
> Which OS are you using? Assuming you are using the proving server binary and not overriding with the "prove" feature flag? Also assuming you are not running these tests in dev mode, is that correct?
I run macOS on Apple M1. I'm using version 1.1.2 of the risc0 crates. No feature overriding was done. The risc0 crates are used as they were generated by `cargo risczero`.
Tests are run in production mode, so with full proof generation.
> I don't think so, that value won't be shared across test threads, if this was Arc it would be the same.
Maybe it is not necessarily related to the `Rc` I pointed out. However, it is very likely that some global state is shared across multiple test threads and causes a data race.
Running the tests causes different ones to fail at random, even in the starter template.
> > Which OS are you using? Assuming you are using the proving server binary and not overriding with the "prove" feature flag? Also assuming you are not running these tests in dev mode, is that correct?
> I run macOS on Apple M1. I'm using the 1.1.2 version of risc0 crates. No feature overriding was done. Risc0 crates are used as they were generated by the cargo risczero.
> Tests are run in production mode, so full prove generation.
Running these in parallel causes the Apple GPU to run out of memory, which produces these errors. We run our proving tests serially to avoid this, and we suggest that you do the same for your workload.
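The simplest way to serialize is the `cargo test -- --test-threads=1` invocation mentioned in the issue description. If only the proving tests need to be serialized, one option is a process-wide lock around the proving call. The following is a sketch under assumptions, not risc0 API: `PROVER_LOCK` and `lock_prover` are hypothetical names, and `run` is a stand-in for the host-side function that builds the prover and generates the proof.

```rust
use std::sync::{Mutex, MutexGuard};
use std::thread;

/// Process-wide lock (hypothetical name). A test holds it for as long as
/// it is using the GPU prover.
static PROVER_LOCK: Mutex<()> = Mutex::new(());

/// Acquires the lock, recovering it if an earlier test panicked while
/// holding it, so one failing test does not poison all the others.
fn lock_prover() -> MutexGuard<'static, ()> {
    PROVER_LOCK
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner())
}

/// Stand-in for the host-side proving call. The real version would build
/// the prover and generate the proof while the guard is held.
fn run(input: u32) -> u32 {
    let _guard = lock_prover(); // released when `_guard` goes out of scope
    input * 2 // placeholder for the expensive proving work
}

fn main() {
    // Even when called from several threads, only one `run` body executes
    // at a time.
    let handles: Vec<_> = (1..=4u32)
        .map(|i| thread::spawn(move || run(i)))
        .collect();
    let results: Vec<u32> = handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .collect();
    println!("{:?}", results);
}
```

With this in place the remaining tests can still run in parallel, and only the sections that hold the guard are forced to take turns.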
I’m facing a similar issue with ECDSA signature verification inside the guest environment. Running this on mac with "metal" feature, without any multithreading, results in a similar error.
To reproduce the issue, see the verifier-risczero directory at https://github.com/dymchenkko/oyster-monorepo/commit/5f2f817ae492a9c465e223c99e4b25706def06ce
Description of bug
When running multiple tests (in production mode) that each build a prover with `default_prover()` and execute the same guest program on the same or different inputs, a local prover can either panic with:
which points to https://github.com/risc0/risc0/blob/2ba504fddd84376235d335ec4db6b2353d967fc9/risc0/zkp/src/prove/prover.rs#L338
or fail to prove the program at `prover.prove(...).unwrap()` with "verification indicates proof is invalid".
I suspect this is because Rust runs tests in multiple threads by default, which causes some issue with constraint generation, since running `cargo test -- --test-threads=1` resolves the issue.
Steps to reproduce
It is difficult to provide a deterministic reproducer since the issue depends solely on CPU scheduling at runtime. However, I managed to hit a similar error with the starter template by moving all the host prover-call code to the `run()` function and setting tests as:
and then running `cargo test -r` multiple times.
At some point you should get something like: