Closed Eitu33 closed 1 year ago
Any suggestions to reproduce?
Any suggestions to reproduce?
in consensus-worker check_desync
you can just change the condition to true or true after some slot so that it's triggered directly.
I've found what I beleive to be the culprit: the launch function initialises the (non-async) ctrlc handler, and with a de-sync, the launch fn gets called again, and from the initialiser documentation:
/// # Errors
/// Will return an error if another `ctrlc::set_handler()` handler exists or if a
/// system error occurred while setting the handler.
///
I could put together a fix relatively easily, buuuut....
There are quite a few ctrl-c handlers speckled through the codebase. Might be worth discussing having a single, global ctrlc-signal?
let sig_int_triggered = Arc::new((Mutex::new(false), Condvar::new()));
let sigint_triggered_clone = Arc::clone(&sig_int_triggered);
ctrlc::set_handler(move || {
*sigint_triggered_clone
.0
.lock()
.expect("double-lock on interupt bool in ctrl-c handler") = true;
sigint_triggered_clone.1.notify_all();
})
.expect("Error setting Ctrl-C handler");
Then wherever the code needs to be sig-int aware, we provide that boolean as appropriate.
in the bootstrap code we have this:
let int_sig = interupted
.0
.lock()
.expect("double-lock() on interupted signal mutex");
let wake = interupted
.1
.wait_timeout(int_sig, bootstrap_config.retry_delay.to_duration())
.expect("interupt signal mutex poisoned");
if *wake.0 {
return Err(BootstrapError::Interupted(
"Sig INT during bootstrap retry-wait".to_string(),
));
}
under the hood, the sigint variable gets passed into a dedicated system thread, and is in an event loop. in our case
where we have code like this:
// interrupt signal listener
let (tx, rx) = crossbeam_channel::bounded(1);
let interrupt_signal_listener = tokio::spawn(async move {
signal::ctrl_c().await.unwrap();
tx.send(()).unwrap();
});
/* ... */
match rx.try_recv() {
Ok(_) => {
info!("interrupt signal received");
break false;
}
Err(crossbeam_channel::TryRecvError::Disconnected) => {
error!("interrupt_signal_listener disconnected");
break false;
}
_ => {}
}
we would instead have
if !global_ctrlc.0.lock().unwrap().clone() {
break false;
}
...though I would like to set things up so that it can take a RwLock instead (currently cannot)
The teams input would be valuable
Personally, I'm having hard time understanding if you can make a PR it would be more comprehensive for me I think. But the priority is for the fix
I'll get to work on the PR to see if we can have a single toggle
I'm having the same issue on first bootstrap with labnet
Still having the issue with https://github.com/massalabs/massa/pull/3971 fix, will be removing the control c handler for now
For context: I've been sick/away for the better part of the last two weeks, and this is my first day back, hence the gap.
This is the sequence of events, and work done today:
This is what I've tried to do to reproduce the bug
...no luck on any front.
Context
This was found testing the new versioning system, after experiencing a desynchronization, node tried to bootstrap again and process terminated with the following error: