Seems like there is a race condition where we reset the `ValidatorPool`, which Handel uses as an `IdentityRegistry`. In between calls, the validator pool might change. `ValidatorPool::reset_epoch` does apply all changes in one go (while holding a write lock), but not all validators are necessarily selected again for the new epoch, so a validator ID can be missing. Even if that weren't the case, the IDs would still be mismatched between epochs.
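A minimal sketch of the racy pattern (types and fields are placeholders, not the actual Nimiq code): each lookup acquires the lock separately, so a `reset_epoch` between two lookups lets one aggregation observe a mix of epoch X and epoch X + 1.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Hypothetical, simplified ValidatorPool: validator ID -> voting weight
// for the *current* epoch, behind interior mutability.
struct ValidatorPool {
    weights: RwLock<HashMap<usize, u64>>,
}

impl ValidatorPool {
    fn signer_weight(&self, id: usize) -> Option<u64> {
        // The read lock is held only for this single lookup.
        self.weights.read().unwrap().get(&id).copied()
    }

    fn reset_epoch(&self, new_weights: HashMap<usize, u64>) {
        // All changes happen in one go under the write lock...
        *self.weights.write().unwrap() = new_weights;
        // ...but a caller that queries ID by ID can still see epoch X
        // for one ID and epoch X + 1 for the next.
    }
}

fn main() {
    let pool = ValidatorPool {
        weights: RwLock::new(HashMap::from([(8, 10), (11, 5)])),
    };
    assert_eq!(pool.signer_weight(8), Some(10));

    // Validator 8 is not selected for the new epoch: an in-flight
    // aggregation that still references ID 8 now gets no answer,
    // which surfaces as "Unknown signers in signature".
    pool.reset_epoch(HashMap::from([(11, 5), (29, 7)]));
    assert_eq!(pool.signer_weight(8), None);
}
```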
## Solution
As a first fix, the `IdentityRegistry` impl can override the `signers_weight` implementation, so that at least all weights can be fetched while holding the lock only once.
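A sketch of that first fix (names and types are illustrative, not the real trait): a bulk `signers_weight` takes the read lock exactly once for the whole signer set, so all weights come from the same epoch snapshot.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Hypothetical, simplified registry over the shared weight table.
struct IdentityRegistry {
    weights: RwLock<HashMap<usize, u64>>,
}

impl IdentityRegistry {
    // Combined weight of all signers, fetched under a single lock
    // acquisition; None if any signer is unknown.
    fn signers_weight(&self, signers: &[usize]) -> Option<u64> {
        let weights = self.weights.read().unwrap(); // locked once
        signers
            .iter()
            .map(|id| weights.get(id).copied())
            .sum()
    }
}

fn main() {
    let reg = IdentityRegistry {
        weights: RwLock::new(HashMap::from([(0, 3), (1, 4), (2, 5)])),
    };
    assert_eq!(reg.signers_weight(&[0, 2]), Some(8));
    // An unknown signer makes the whole query fail consistently,
    // instead of panicking halfway through per-ID lookups.
    assert_eq!(reg.signers_weight(&[0, 8]), None);
}
```

Note this only narrows the window: a reset can still land between two separate `signers_weight` calls, which is what the second step addresses.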
In a second step, we need to remove the possibility that an aggregation for epoch X queries a `ValidatorPool` that was already reset to epoch X + 1.
The first implementation of `IdentityRegistry` actually wasn't intended to have interior mutability. We can restore that by copying all public keys and weights once, when the `IdentityRegistry` is constructed. All necessary data should be available at that point anyway.
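A sketch of that second step, under the same illustrative naming (the real key type would be a BLS public key, not bytes): the registry owns an immutable per-epoch snapshot, so a later `reset_epoch` on the pool cannot change what an in-flight aggregation sees.

```rust
use std::collections::HashMap;

// Placeholder entry; the real code would store the validator's
// BLS public key here.
struct ValidatorEntry {
    public_key: Vec<u8>,
    weight: u64,
}

// Immutable snapshot for one epoch: no locks, no interior mutability.
struct IdentityRegistry {
    validators: HashMap<usize, ValidatorEntry>,
}

impl IdentityRegistry {
    // Copies all keys and weights at construction time.
    fn new(validators: HashMap<usize, ValidatorEntry>) -> Self {
        IdentityRegistry { validators }
    }

    fn signer_weight(&self, id: usize) -> Option<u64> {
        self.validators.get(&id).map(|v| v.weight)
    }
}

fn main() {
    let mut epoch_x = HashMap::new();
    epoch_x.insert(11, ValidatorEntry { public_key: vec![0u8; 48], weight: 5 });
    let reg = IdentityRegistry::new(epoch_x);
    // The snapshot answers for epoch X regardless of later pool resets.
    assert_eq!(reg.signer_weight(11), Some(5));
    assert_eq!(reg.signer_weight(8), None);
}
```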
## Notes
Same log dump as #59.
The node kept working after this panic, and it should happen very rarely, so this is low priority.
## Log

```
DEBUG consensus | Now at block #32895
DEBUG validator_network | pBFT proposal by validator 11: 341b44023fc028f26b6beef4b8f01aadbc3e25b2f77a1a5a9ff59dfd31e9b8b8
DEBUG consensus | Now at block #32896
DEBUG validator | Setting validator to active: pk_idx=28
WARN pool | No validator info for: 8 (2 votes)
ERROR panic | thread 'tokio-runtime-worker-2' panicked at 'Unknown signers in signature: MultiSignature { signature: Signature(850d7f1bd3cef158bf45c7631db6820c0ad7801f83a9c7d178ccf8f027575225d652d8cb1c3091fb6279232c742d095f), signers: BitSet(29)[0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 29, 30] }': validator/src/signature_aggregation/voting.rs:171
stack backtrace:
   0: log_panics::init::{{closure}}
   1: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:468
   2: std::panicking::continue_panic_fmt
             at src/libstd/panicking.rs:373
   3: std::panicking::begin_panic_fmt
             at src/libstd/panicking.rs:328
   4: nimiq_validator::signature_aggregation::voting::VotingProtocol<T>::votes::{{closure}}::{{closure}}
   5: nimiq_validator::signature_aggregation::voting::VotingProtocol<T>::votes
   6: nimiq_validator::validator_network::ValidatorNetwork::on_pbft_commit_level_update
   7: <F as nimiq_utils::observer::PassThroughListener<E>>::on_event
   8: nimiq_validator::validator_agent::ValidatorAgent::on_pbft_commit_message
   9: <F as nimiq_utils::observer::PassThroughListener<E>>::on_event
  10: nimiq_messages::MessageNotifier::notify
  11: <F as nimiq_utils::observer::PassThroughListener<E>>::on_event
  12: nimiq_utils::observer::PassThroughNotifier<E>::notify
  13: <futures::stream::for_each::ForEach<S,F,U> as futures::future::Future>::poll
  14: futures::future::chain::Chain<A,B,C>::poll
  15: <futures::future::select::Select<A,B> as futures::future::Future>::poll
  16: <futures::future::map::Map<A,F> as futures::future::Future>::poll
  17: <futures::future::map_err::MapErr<A,F> as futures::future::Future>::poll
  18: futures::task_impl::Spawn<T>::poll_future_notify
  19: std::panicking::try::do_call
  20: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:79
  21: tokio_threadpool::task::Task::run
  22: tokio_threadpool::worker::Worker::run_task
  23: tokio_threadpool::worker::Worker::run
  24: std::thread::local::LocalKey<T>::with
  25: std::thread::local::LocalKey<T>::with
  26: tokio_reactor::with_default
  27: tokio::runtime::threadpool::builder::Builder::build::{{closure}}
  28: std::thread::local::LocalKey<T>::with
  29: std::thread::local::LocalKey<T>::with
  30: std::sys_common::backtrace::__rust_begin_short_backtrace
  31: std::panicking::try::do_call
  32: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:79
  33: core::ops::function::FnOnce::call_once{{vtable.shim}}
  34: <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once
             at /rustc/ac162c6abe34cdf965afc0389f6cefa79653c63b/src/liballoc/boxed.rs:942
  35: <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once
             at /rustc/ac162c6abe34cdf965afc0389f6cefa79653c63b/src/liballoc/boxed.rs:942
      std::sys_common::thread::start_thread
             at src/libstd/sys_common/thread.rs:13
      std::sys::unix::thread::Thread::new::thread_start
             at src/libstd/sys/unix/thread.rs:79
  36: start_thread
  37: __clone
DEBUG consensus | Now at block #32897
DEBUG sink | Closing connection, reason: SendFailed (Some("SendFailed"))
DEBUG sink | Error closing connection: send failed because receiver is gone
DEBUG consensus | Now at block #32898
```