Closed kunxian-xia closed 1 day ago
After adding a few tracing logs, I narrowed the problem down and am closer to finding the root cause. I ran the following command:
RAYON_NUM_THREADS=64 RUST_LOG=debug cargo run --release --example fibonacci_elf -- --nocapture
Tracing log changes: https://github.com/scroll-tech/ceno/compare/feat/guest-example...feat/guest-example_ming?expand=1
Here is the log just before the hang:
2024-10-31T15:46:45.594959Z DEBUG ceno_zkvm::scheme::prover: tower sumcheck finished
2024-10-31T15:46:45.594988Z DEBUG sumcheck::main_sel: ceno_zkvm::scheme::prover: main-sel sumcheck preparion with log2_num_instances=22, num_threads=64
2024-10-31T15:46:45.631714Z DEBUG sumcheck::main_sel: ceno_zkvm::scheme::prover: main sel sumcheck start
2024-10-31T15:46:45.631860Z DEBUG sumcheck::main_sel:sumcheck::prove_batch_polys: sumcheck::prover_v2: start prove_batch_polys sumcheck with max_thread_id=64, rayon::current_num_threads()=64
2024-10-31T15:46:45.657207Z DEBUG prove_rounds: sumcheck::prover_v2: thread 62: sumcheck round 1/16 appended
2024-10-31T15:46:45.666626Z DEBUG prove_rounds: sumcheck::prover_v2: thread 61: sumcheck round 1/16 appended
2024-10-31T15:46:45.670964Z DEBUG sumcheck::main_sel:sumcheck::prove_batch_polys:main_thread_prove_rounds: sumcheck::prover_v2: main thread: sumcheck round 1/16 appended
2024-10-31T15:46:45.670984Z DEBUG sumcheck::main_sel:sumcheck::prove_batch_polys:main_thread_prove_rounds: sumcheck::prover_v2: main thread: sumcheck round 1/16 read 1 msgs
2024-10-31T15:46:45.674980Z DEBUG sumcheck::main_sel:sumcheck::prove_batch_polys:main_thread_prove_rounds: sumcheck::prover_v2: main thread: sumcheck round 1/16 read 33 msgs
2024-10-31T15:46:45.677774Z DEBUG prove_rounds: sumcheck::prover_v2: thread 0: sumcheck round 1/16 appended
2024-10-31T15:46:45.687479Z DEBUG sumcheck::main_sel:sumcheck::prove_batch_polys:main_thread_prove_rounds: sumcheck::prover_v2: main thread: sumcheck round 1/16 read 61 msgs
In the above example, the working model is 63 workers on spawned threads plus 1 worker on the main thread. The logs show that all 63 workers (thread ids 0-62) appended to the channel, but the main worker was only able to read 62 messages from it. One message is missing, so the main worker waits here forever.
In most cases I see 62 messages, but more precisely the count is always fewer than 63, and it differs from run to run.
Apparently there is a race condition, possibly a problem in the channel library, which I am still trying to pin down.
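To make the fan-in model above concrete, here is a minimal sketch of it: N worker threads each send one message for a round, and the main thread collects exactly N messages. This is not the prover's actual code. It uses `std::sync::mpsc` as a stand-in for whichever channel library the sumcheck prover uses, and the helper name `collect_round` is hypothetical. The one deliberate difference from a plain blocking receive is `recv_timeout`, so a missing message would surface as an error carrying the received count instead of a silent hang.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Hypothetical helper (not from the ceno codebase): spawns `num_workers`
/// threads that each send one message, then collects them on the caller.
/// Returns Ok(messages) if all arrive, or Err(received_count) otherwise.
fn collect_round(num_workers: usize, timeout: Duration) -> Result<Vec<usize>, usize> {
    let (tx, rx) = mpsc::channel::<usize>();

    let handles: Vec<_> = (0..num_workers)
        .map(|id| {
            let tx = tx.clone();
            thread::spawn(move || {
                // Stand-in for a worker appending its per-round message.
                tx.send(id).expect("receiver dropped");
            })
        })
        .collect();
    // Drop the original sender so only the workers hold senders.
    drop(tx);

    let mut msgs = Vec::with_capacity(num_workers);
    while msgs.len() < num_workers {
        match rx.recv_timeout(timeout) {
            Ok(id) => msgs.push(id),
            // Timeout: a message is missing. Disconnected: all senders are
            // gone and the queue is empty. Either way, stop waiting.
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }

    for handle in handles {
        handle.join().expect("worker panicked");
    }

    if msgs.len() == num_workers {
        Ok(msgs)
    } else {
        Err(msgs.len())
    }
}

fn main() {
    // 63 workers, matching the thread ids 0-62 seen in the log above.
    match collect_round(63, Duration::from_secs(5)) {
        Ok(msgs) => println!("read {} msgs", msgs.len()),
        Err(n) => eprintln!("gave up after reading {n} of 63 msgs"),
    }
}
```

With a correctly behaving channel this always collects all 63 messages, so the sketch does not reproduce the bug. It only shows where a bounded wait would turn the hang into a reportable count.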
So I have 2 preliminary plans: (a) add a workaround that sidesteps the issue, probably a less performant version, or (b) find the root cause and resolve it.
I will pursue (b) first, but if I get stuck I will switch to (a). The major goal is to unblock the e2e test asap.
Closed as merged into #368.
You can reproduce this issue by running the command on the branch of #368.
The prover hangs while running the main selector sumcheck for the SLLI opcode with num_instances = 650. Last few logs on my machine: