scroll-tech / ceno

Accelerate Zero-knowledge Virtual Machine by Non-uniform Prover Based on GKR Protocol
Apache License 2.0

Sumcheck prover hang #511

Closed kunxian-xia closed 1 day ago

kunxian-xia commented 5 days ago

You can reproduce this issue by running the following command on the branch of #368.

RAYON_NUM_THREADS=64 RUST_LOG=debug cargo run --release --example fibonacci_elf -- --nocapture

The prover hangs while running the main selector sumcheck for the SLLI opcode with num_instances = 650.

The last few log lines on my machine:

2024-10-31T09:41:55.584326Z DEBUG sumcheck::tower: ceno_zkvm::scheme::prover: generated tower proof at round 13/13
2024-10-31T09:41:55.589190Z DEBUG prove_rounds: sumcheck::prover_v2: thread 0: sumcheck round 1/7
2024-10-31T09:41:55.589970Z DEBUG prove_rounds: sumcheck::prover_v2: thread 1: sumcheck round 1/7
2024-10-31T09:41:55.594713Z DEBUG prove_rounds: sumcheck::prover_v2: thread 1: sumcheck round 2/7
2024-10-31T09:41:55.596938Z DEBUG prove_rounds: sumcheck::prover_v2: thread 0: sumcheck round 2/7
2024-10-31T09:41:55.600338Z DEBUG prove_rounds: sumcheck::prover_v2: thread 0: sumcheck round 3/7
2024-10-31T09:41:55.605352Z DEBUG prove_rounds: sumcheck::prover_v2: thread 1: sumcheck round 3/7
2024-10-31T09:41:55.610686Z DEBUG prove_rounds: sumcheck::prover_v2: thread 0: sumcheck round 4/7
2024-10-31T09:41:55.610927Z DEBUG prove_rounds: sumcheck::prover_v2: thread 1: sumcheck round 4/7
2024-10-31T09:41:55.618196Z DEBUG prove_rounds: sumcheck::prover_v2: thread 1: sumcheck round 5/7
2024-10-31T09:41:55.619944Z DEBUG prove_rounds: sumcheck::prover_v2: thread 0: sumcheck round 5/7
2024-10-31T09:41:55.627636Z DEBUG prove_rounds: sumcheck::prover_v2: thread 1: sumcheck round 6/7
2024-10-31T09:41:55.627888Z DEBUG prove_rounds: sumcheck::prover_v2: thread 0: sumcheck round 6/7
2024-10-31T09:41:55.632615Z DEBUG prove_rounds: sumcheck::prover_v2: thread 1: sumcheck round 7/7
2024-10-31T09:41:55.634330Z DEBUG prove_rounds: sumcheck::prover_v2: thread 0: sumcheck round 7/7
2024-10-31T09:41:55.642064Z DEBUG ceno_zkvm::scheme::prover: tower sumcheck finished
2024-10-31T09:41:55.645229Z DEBUG sumcheck::main_sel: ceno_zkvm::scheme::prover: main sel sumcheck start

hero78119 commented 4 days ago

After adding a few tracing logs, I narrowed the problem down while looking for the root cause, using the command:

RAYON_NUM_THREADS=64 RUST_LOG=debug cargo run --release --example fibonacci_elf -- --nocapture

Tracing log changes: https://github.com/scroll-tech/ceno/compare/feat/guest-example...feat/guest-example_ming?expand=1

Here is the log just before the hang:

2024-10-31T15:46:45.594959Z DEBUG ceno_zkvm::scheme::prover: tower sumcheck finished
2024-10-31T15:46:45.594988Z DEBUG sumcheck::main_sel: ceno_zkvm::scheme::prover: main-sel sumcheck preparion with log2_num_instances=22, num_threads=64
2024-10-31T15:46:45.631714Z DEBUG sumcheck::main_sel: ceno_zkvm::scheme::prover: main sel sumcheck start
2024-10-31T15:46:45.631860Z DEBUG sumcheck::main_sel:sumcheck::prove_batch_polys: sumcheck::prover_v2: start prove_batch_polys sumcheck with max_thread_id=64, rayon::current_num_threads()=64
2024-10-31T15:46:45.657207Z DEBUG prove_rounds: sumcheck::prover_v2: thread 62: sumcheck round 1/16 appended
2024-10-31T15:46:45.666626Z DEBUG prove_rounds: sumcheck::prover_v2: thread 61: sumcheck round 1/16 appended
2024-10-31T15:46:45.670964Z DEBUG sumcheck::main_sel:sumcheck::prove_batch_polys:main_thread_prove_rounds: sumcheck::prover_v2: main thread: sumcheck round 1/16 appended
2024-10-31T15:46:45.670984Z DEBUG sumcheck::main_sel:sumcheck::prove_batch_polys:main_thread_prove_rounds: sumcheck::prover_v2: main thread: sumcheck round 1/16 read 1 msgs
2024-10-31T15:46:45.674980Z DEBUG sumcheck::main_sel:sumcheck::prove_batch_polys:main_thread_prove_rounds: sumcheck::prover_v2: main thread: sumcheck round 1/16 read 33 msgs
2024-10-31T15:46:45.677774Z DEBUG prove_rounds: sumcheck::prover_v2: thread 0: sumcheck round 1/16 appended
2024-10-31T15:46:45.687479Z DEBUG sumcheck::main_sel:sumcheck::prove_batch_polys:main_thread_prove_rounds: sumcheck::prover_v2: main thread: sumcheck round 1/16 read 61 msgs

In the example above, the working model is 63 workers on threads plus 1 worker in the main thread. The logs show that 63 workers (thread ids 0-62) have appended to the channel, but the main worker is only able to read 62 records from it. One message is missing, so the main worker waits here forever.

In most cases I get an abnormal count of 62 messages, but more precisely the count is anything below 63, and it differs from run to run.

Apparently there is some race condition, probably a problem in the channel library, which I am still trying to pin down.

So I have two preliminary plans: (a) work around the issue, probably with a less performant version, or (b) find the root cause and resolve it.

I will target (b) first, but if I still get stuck I will switch to (a). The main goal is to unblock the e2e test as soon as possible.
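The failure mode described above can be sketched in isolation. This is only an illustration under stated assumptions, not the ceno prover code: it uses `std::sync::mpsc` instead of whatever channel library the prover uses, and `collect_round` and its `lost` parameter are hypothetical names introduced here. It shows why a collector that expects exactly one message per worker blocks forever if a single message goes missing, and how `recv_timeout` turns that hang into a reportable count.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical helper (not from ceno): spawn `num_workers` threads that
/// each send one message, then collect exactly `num_workers` messages.
/// `lost` simulates a worker whose message never reaches the channel.
/// Returns Ok(num_workers) on success, or Err(received_so_far) on timeout.
fn collect_round(
    num_workers: usize,
    lost: Option<usize>,
    timeout: Duration,
) -> Result<usize, usize> {
    let (tx, rx) = mpsc::channel::<usize>();

    let handles: Vec<_> = (0..num_workers)
        .map(|id| {
            let tx = tx.clone();
            thread::spawn(move || {
                if Some(id) != lost {
                    tx.send(id).expect("receiver should still be alive");
                }
            })
        })
        .collect();

    // `tx` stays alive during collection, like long-lived workers that keep
    // their senders across rounds. A plain `rx.recv()` would therefore block
    // forever on a missing message instead of reporting a disconnect.
    let mut received = 0;
    let mut outcome = Ok(num_workers);
    while received < num_workers {
        match rx.recv_timeout(timeout) {
            Ok(_) => received += 1,
            Err(_) => {
                // Timed out: report how many messages actually arrived.
                outcome = Err(received);
                break;
            }
        }
    }

    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    drop(tx);
    outcome
}

fn main() {
    let timeout = Duration::from_millis(500);
    // All 63 workers deliver their message.
    println!("{:?}", collect_round(63, None, timeout));
    // One message is lost, so the collector times out short of 63.
    println!("{:?}", collect_round(63, Some(7), timeout));
}
```

A timeout like this does not fix a race, but it makes the shortfall visible as a count instead of a silent hang, which may help when comparing the number of messages appended against the number read.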

hero78119 commented 1 day ago

Closed as merged into #368.